Yesterday I translated the answer to "RegEx match open tags except XHTML self-contained tags", with its famous passage:
You can't parse [X]HTML with regular expressions because HTML can't be parsed with regex. Regex is not a tool that can be used to properly parse HTML. Since I have already answered many HTML and regex questions, using regex will not allow you to render HTML. Regular expressions are a tool that is not sophisticated enough to understand the constructs used by HTML. HTML is not a regular language and therefore cannot be parsed using regular expressions. Regular expressions are not equipped to dissect HTML into its representative parts.
which ends with a final demo of broken HTML:
appears
, thestinking regex infection will devour your HT ML parser, your application and your existence forever as mere Visual Basic or worse he comes don't fight he comes v̡im̡ie̶ne, ̕h̵u radiation destroying all brightness, tags of HTML filtering from your eyes, like a fragrant liquid, the song of parsing regular expressionsisgoing to extinguish the voices of mortal man from the sphere I can see it you can see it 's beautiful or the ending extinguishing Men's lies EVERYTHING IS LOST EVERYTHING IS LOSTDO e l pon̷and he comeshe comes he comestheíco r permeates everything M I FACE M I FACE ᵒh dos n o o NO NOO̼ OON Θ para los an*̶͑̾̾ ̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆ul͖͉̗̩̳̟o ̍ͫͥͨ ͨ Or they are rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆es za̡͊͠͝lg red e sͮ̂҉̯͈͕̹̘̱ alȳ̳ ë͖̉l ͠p̯͍̭o̚ n̐y̡ ȩ̬̩̾͛ͪ̈̀͘l ̶̧̨̱̹̭̯ͧ̾ͬvien ȇ̴̟̟͙̞ͩ͌͝ "
I agree with its assertions:
- HTML cannot be parsed with regex
- regexes are not sophisticated enough for this task
- HTML is not a regular language and therefore cannot be parsed with regular expressions.
But then I received a comment from Mariano:
I know this is a joke that became famous. However, "HTML cannot be parsed with regex" is false. "not sophisticated enough" is false. "they are not equipped to dissect HTML" is false. "is not a regular language and therefore cannot be parsed using regular expressions" is flat out false. What is true is that it will give you headaches, because it is not a tool that fits the job... I hate this post.
And that left me wondering. Further searching brought me to a 2009 blog post by Jeff Atwood, "Parsing Html The Cthulhu Way", which starts off by discussing the answer I just quoted and the sentiment behind it. However, he then analyzes the state of the matter and shows that it is not so clear that it cannot be done. He mentions a discussion in which experienced programmers defend the use of regex in certain cases.
Therefore, the question is:
- Can you parse HTML with regular expressions?
- In which cases is it recommended to do it?
- In which cases is it inadvisable?
You may have noticed that I use "parsear" and "analizar" interchangeably. I do it because one seems to be the translation of the other, but it is no less true that in Spanish-speaking environments the use of "parsear" is very widespread.
The first question is to know what we mean by "parsing HTML".
The strict interpretation is to process the document, check that it is correct HTML, work with the entire document, etc. In that sense, regular expressions are completely insufficient.
The classic example is that of elements that can be nested indefinitely. If I start doing `<div><div><div>....<div>Hola mundo</div>....</div></div></div>`, there is no regular expression that can check that I have opened the same number of `div` as I have closed (source: finite automata theory).

Now, this is where someone walks in and says, "But I'm not building a web browser/parser. I just want to know what is inside the `div`. I don't care if all the tags are closed or not, that's the problem of whoever generates the HTML. For me, regular expressions are completely enough."

Naturally, if there are changes to the HTML, regular expressions are much more brittle. The problem is not so much that they fail [1] as that they give false positives.
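The nesting argument can be illustrated with a minimal Python sketch. The function name `balanced_divs` and the two sample strings are assumptions made for illustration. A single pattern happily matches both the balanced and the unbalanced input, whereas checking the balance needs a counter, which is exactly the memory a finite automaton lacks:

```python
import re

def balanced_divs(html: str) -> bool:
    """Return True if every <div> has a matching </div>, using a depth counter."""
    depth = 0
    # The regex only tokenizes the tags; the counter does the actual matching.
    for closing in re.findall(r"<(/?)div\b[^>]*>", html):
        depth += -1 if closing else 1
        if depth < 0:  # a </div> appeared before its <div>
            return False
    return depth == 0

nested = "<div><div><div>Hola mundo</div></div></div>"
broken = "<div><div><div>Hola mundo</div></div>"  # one </div> is missing

# A single pattern pairs the first <div> with the first </div> it finds,
# so it cannot tell the two inputs apart.
pattern = re.compile(r"<div>.*?</div>")
print(bool(pattern.search(nested)), bool(pattern.search(broken)))  # True True
print(balanced_divs(nested), balanced_divs(broken))                # True False
```

Note that the regex is still useful here as a tokenizer; what it cannot do on its own is keep track of the depth.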
For example, we have our expression to find the content of `<div>` (`<div>(.*)<\/div>`), and suddenly the page changes to:

Wow... we'd better change it to (`<div>(.*)<`), right? Well, until we get:

Well, we solve that one too (I won't show the regular expression anymore), and the following week we have
In all of the above cases, the regular expression swallows the error as if nothing had happened, and the process continues until someone (possibly a human) notices that the values don't match, perhaps weeks or months later [2].
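A small Python sketch of this kind of silent false positive follows. The HTML strings and the name `extract_div` are hypothetical, made up for illustration, and are not the page changes the answer had in mind. The point is that the expression keeps matching after the page changes and simply returns the wrong value:

```python
import re

# The expression from the text: greedy capture between <div> and </div>.
DIV_CONTENT = re.compile(r"<div>(.*)</div>")

def extract_div(html: str):
    """Return the captured <div> content, or None when nothing matches."""
    match = DIV_CONTENT.search(html)
    return match.group(1) if match else None

original = "<div>42</div>"
changed = "<div>42</div><div>free shipping</div>"  # hypothetical page change
restyled = '<div class="price">42</div>'           # hypothetical page change

print(extract_div(original))  # 42
print(extract_div(changed))   # 42</div><div>free shipping
print(extract_div(restyled))  # None
```

The `None` case is the benign one, because a failure can be detected and handled. The second case is the dangerous one: a value comes back, so nothing downstream complains.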
So:

Can you parse HTML with regular expressions?

In general, NO.

In which cases is it recommended to do it?

More than "recommended", it is not too much of a problem when:

- The origin of the HTML is controlled: it is a program of mine, or it belongs to someone in my organization who will notify me when there is going to be a change.
- Related to the above, we know what structure it will have. If we know that it will be a document such that:

and that no tags, comments or JavaScript are going to be inserted in the middle, there is no problem [3].

In which cases is it inadvisable?

In all other cases.
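For those other cases, a parser-based approach can be sketched with Python's standard `html.parser` module. This is only one possible sketch: the class name `DivText`, the helper `div_texts` and the sample input are assumptions for illustration. It collects the text of each top-level `div` and tolerates attributes and nested tags that would break the simple expressions discussed in this thread:

```python
from html.parser import HTMLParser

class DivText(HTMLParser):
    """Collect the text of each top-level <div>, tracking nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # how many <div> elements are currently open
        self.contents = []   # one string per top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self.contents.append("")  # a new top-level <div> starts
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.contents[-1] += data  # text inside any open <div>

def div_texts(html: str):
    parser = DivText()
    parser.feed(html)
    parser.close()
    return parser.contents

sample = '<div class="a">Hola <b>mundo</b></div><div><div>x</div>y</div>'
print(div_texts(sample))  # ['Hola mundo', 'xy']
```

The depth counter here plays the role that a regular expression cannot: it remembers how many `div` elements are still open.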
[1] If it fails, the error is handled and the regular expression is adapted accordingly. After all, if the format of the page being parsed changes, programs that use parsers can also have problems (although they will always be more flexible).

[2] A different kind of problem would be if I want to get the content of the first `div` and they move the content to the third one. But that is unsolvable for both regexes and parsers unless `id` is used on the elements; and if `id` is used, what is sought is not the nth `div` but the element with the corresponding `id`.

[3] In fact, the subset of HTML so defined is actually a regular language, so regular expressions are quite sufficient to fully parse it.
Sure it does (not tested):
I would recommend it only for checking that the number of opening and closing tags is the same, even though they differ syntactically (this case needs a sequence check), and for extracting the content (which requires stripping the HTML tags), when you need to use data from that content.