
Xpath - Html With A Lot Of Children

Consider the HTML in the page variable. How do I access the tds? I want to access them with something like xpath('/table/tr/td/text()'), without having to spell out the other trs. Unfortunately this

Solution 1:

Use xpath //td/text():

things = tree.xpath('//td/text()')

The //td stands for "find any td element at any depth".

Works for me.

Printing td elements grouped per table:

from lxml import html

doc = html.fromstring(page)
for table_elm in doc.xpath("//table"):
    print("another table")
    # search only inside the current table
    things = table_elm.xpath('.//td/text()')
    print(things)

Note that the leading . in the XPath is significant here: it makes the search relative to the current table element instead of the whole document.
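To see why the leading . matters, here is a minimal sketch that runs both forms on the same table element. The page string is a made-up sample (the original page is not shown here), and the variable names are only for illustration.

```python
from lxml import html

# Hypothetical sample page: two tables with two cells each.
page = """
<html><body>
  <table><tr><td>table1 td1</td><td>table1 td2</td></tr></table>
  <table><tr><td>table2 td1</td><td>table2 td2</td></tr></table>
</body></html>
"""

doc = html.fromstring(page)
first_table = doc.xpath("//table")[0]

# Relative path: only td elements that are descendants of first_table.
relative = first_table.xpath(".//td/text()")

# Absolute path: evaluated from the document root, so every td in the
# page matches, even though the call is made on first_table.
absolute = first_table.xpath("//td/text()")

print(relative)
print(absolute)
```

Without the dot, every iteration of the per-table loop would print the cells of all tables rather than just its own.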

Solution 2:

You don't have to convert the BeautifulSoup object to str:

soup = str(BeautifulSoup(page, 'html.parser'))

You can use something like this:

>>> soup = BeautifulSoup(page, 'html.parser')
>>> for td in soup.find_all('td'):
...     print(td)
... 
<td>table1 td1</td>
<td>table1 td2</td>
<td>table2 td1</td>
<td>table2 td2</td>
<td>table3 td1</td>
<td>table3 td2</td>

Or, you can also use print(td.text) if you want the text inside the element.
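If you want the cell text grouped per table with BeautifulSoup, something along these lines should work. It is only a sketch: the page string is a made-up sample and cells_per_table is a name chosen for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two tables with two cells each.
page = """
<html><body>
  <table><tr><td>table1 td1</td><td>table1 td2</td></tr></table>
  <table><tr><td>table2 td1</td><td>table2 td2</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

# One inner list per table; find_all("td") on a table tag searches
# only that table's descendants.
cells_per_table = [
    [td.get_text(strip=True) for td in table.find_all("td")]
    for table in soup.find_all("table")
]

print(cells_per_table)
```

Calling find_all on the table tag rather than on soup plays the same role as the relative path in the XPath version.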

Solution 3:

A tr nested inside another tr is invalid HTML.

The html.fromstring() parser seems to "fix" this while it builds the tree.

You can test this with this xpath:

things = tree.xpath('//table/tr/*')

And output with:

for thing in things:
    print(thing.tag)

Which generates:

td
td
td
td
td
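Another way to check where the parser actually put each cell is to print the path of every td in the parsed tree. This is a rough sketch with a made-up input that nests a tr inside a tr. The exact paths you get may depend on the lxml/libxml2 version, so no output is shown.

```python
from lxml import html

# Hypothetical input with a tr nested directly inside another tr (invalid HTML).
page = """
<html><body>
  <table>
    <tr>
      <td>outer td</td>
      <tr><td>inner td1</td><td>inner td2</td></tr>
    </tr>
  </table>
</body></html>
"""

tree = html.fromstring(page)
root_tree = tree.getroottree()

# getpath() returns the absolute XPath of an element in the parsed tree,
# which shows how the parser arranged the rows after repairing the markup.
cell_paths = [(root_tree.getpath(td), td.text) for td in tree.xpath("//td")]

for path, text in cell_paths:
    print(path, "->", text)
```

Whatever the repaired structure turns out to be, //td still finds every cell, which is why Solution 1 works without naming the intermediate trs.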
