
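I'm using the code below to strip empty tag nodes from a page's HTML:

    import lxml.etree
    import lxml.html

    page = lxml.etree.HTML('<span lang="en-us"><p></p>223</span>')
    for empty in page.xpath('//*[not(node())]'):
        empty.getparent().remove(empty)
    print(lxml.html.tostring(page, encoding='unicode'))

The output is:

    <html><body><span lang="en-us"></span></body></html>

The text following the empty node was removed along with it. How can I keep the "223" from the original while still removing the empty node?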
1 cute 2015-02-25 11:22:49 +08:00

    from lxml import html
    print(html.fromstring('<span lang="en-us"><p></p>223</span>').text_content())
2 gogogen OP
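3 cute 2015-02-26 10:13:34 +08:00

```
from lxml import html

doc = html.fromstring('<span lang="en-us">11<p></p>223</span>')
for elem in doc.xpath('//*[not(node())]'):
    parent = elem.getparent()
    # Text after an element is stored as its tail; move it to the parent
    # before removing the element so it is not lost.
    # This assumes elem is the first child of its parent.
    if elem.tail:
        if not parent.text:
            parent.text = elem.tail
        else:
            parent.text = parent.text + elem.tail
    parent.remove(elem)
print(html.tostring(doc, encoding='unicode'))
```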
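A note on why the reply above checks `elem.tail`: in lxml, text that comes right after an element's closing tag is stored on that element as `.tail`, not on the parent, so `remove()` takes it along. A minimal sketch to illustrate this, assuming Python 3 and the same input as the question (the variable names are only for illustration):

```python
from lxml import html

doc = html.fromstring('<span lang="en-us"><p></p>223</span>')
p = doc.find('p')

# "223" follows </p>, so lxml stores it as the tail of <p>,
# not as the text of <span>.
tail_before = p.tail
span_text_before = doc.text

# Removing <p> therefore drops its tail as well.
doc.remove(p)
after = html.tostring(doc, encoding='unicode')

print(repr(tail_before), repr(span_text_before), after)
```

This is why the original loop lost the "223": it was never a child of `<span>` in lxml's tree model.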
4 cute 2015-02-26 10:53:10 +08:00

Posting it again.

    from lxml import html

    doc = html.fromstring('<span lang="en-us">sss<p></p>223</span>')
    # Append the element's tail to the parent's text (setattr returns None).
    func = lambda x, p: setattr(p, 'text', p.text + x.tail if p.text else x.tail)
    # list() forces evaluation, since map() is lazy on Python 3.
    list(map(
        lambda x: x.tail and func(x, x.getparent()) or x.getparent().remove(x),
        doc.xpath('//*[not(node())]')
    ))
    print(html.tostring(doc, encoding='unicode'))