So I’m trying to parse my school’s website for some info, and I want to get some values using XPath. I found an HTML5 parser, and it can’t properly parse the first line. Then I figured out it’s actually XHTML and not HTML. After a quick Google search I found out that XHTML can be properly parsed using any XML parser, so I found one and… it can’t parse the first line either. So I asked Llama 3.1 (like a real programmer) why I can’t parse the first line with any parser. It explained things so nicely that I did not destroy my keyboard when I was told that this document is “XHTML 1.0 Transitional”, that it’s a mix of HTML 4 and XHTML, and that it can’t be parsed with either an HTML or an XML parser. I hate the guy that invented that so much…

I can’t find a crate that parses XHTML 1.0 Transitional. Is there one, or a crate that converts XHTML to something else? Any advice?

I would try another HTML5 parser. HTML5 is somewhat of a unification of HTML and XHTML, so getting into the syntax specifics between the two with XML parsing is probably going to be an uphill battle. That said, I’m curious what the first line is; it could just be malformed entirely.

That’s the first line:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

I thought it was HTML because everything on the web is HTML. But because of the first line I figured out it was XHTML, which should be parsed with an XML parser. I did not know that Transitional is a mix which can’t be parsed with anything.

Hmm, doctype declarations are sort of like the markup equivalent of headers. Usually parsers read them to know what flavor to expect and then go parse the rest of the page separately. You shouldn’t have to do this, but if you chop off that first line and run it through a standard HTML parser it might work fine.
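
For what it's worth, here is a rough sketch of the "chop off the first line" idea using only the standard library. `strip_doctype` is a name made up for this example, not something from a crate, and it assumes the declaration has no internal DTD subset (so the first `>` really does end it). It only removes the doctype; it does not fix anything that is broken further down in the markup.

```rust
/// Hypothetical helper: returns the input with a leading `<!DOCTYPE ...>`
/// declaration removed. Assumes the declaration contains no `>` before its
/// end (i.e. no internal DTD subset). Input without a doctype is returned as-is.
fn strip_doctype(input: &str) -> &str {
    // Skip a UTF-8 byte-order mark and any leading whitespace.
    let trimmed = input.trim_start_matches('\u{feff}').trim_start();

    // The keyword is matched case-insensitively ("<!doctype" is 9 bytes).
    let is_doctype = trimmed
        .get(..9)
        .map_or(false, |head| head.eq_ignore_ascii_case("<!doctype"));
    if !is_doctype {
        return input;
    }

    // Cut everything up to and including the closing `>`.
    match trimmed.find('>') {
        Some(end) => trimmed[end + 1..].trim_start(),
        // Unterminated declaration: leave the input alone.
        None => input,
    }
}

fn main() {
    let page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" \
                \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n\
                <html><body><p>hi</p></body></html>";

    // The remainder is what you would hand to the HTML parser of your choice.
    let rest = strip_doctype(page);
    println!("{rest}");
}
```

Returning a `&str` slice of the original avoids copying the page, and the result can be passed straight to whatever parser you end up using.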

That’s the first thing I tried, and it still fails somewhere deep in the HTML, where I probably shouldn’t skip a line.
