Yesterday I translated the answer to "RegEx match open tags except XHTML self-contained tags", with its famous passage:
You can't parse [X]HTML with regular expressions because HTML can't be parsed with regex. Regex is not a tool that can be used to properly parse HTML. As I have answered in so many HTML-and-regex questions before, using regex will not allow you to consume HTML. Regular expressions are a tool that is not sophisticated enough to understand the constructs used by HTML. HTML is not a regular language and therefore cannot be parsed using regular expressions. Regular expressions are not equipped to dissect HTML into its meaningful parts.
which ends with a final demo of broken HTML:
appears
, thestinking regex infection will devour your HT ML parser, your application and your existence forever as mere Visual Basic or worse he comes don't fight he comes v̡im̡ie̶ne, ̕h̵u radiation destroying all brightness, tags of HTML filtering from your eyes, like a fragrant liquid, the song of parsing regular expressionsisgoing to extinguish the voices of mortal man from the sphere I can see it you can see it 's beautiful or the ending extinguishing Men's lies EVERYTHING IS LOST EVERYTHING IS LOSTDO e l pon̷and he comeshe comes he comestheíco r permeates everything M I FACE M I FACE ᵒh dos n o o NO NOO̼ OON Θ para los an*̶͑̾̾ ̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆ul͖͉̗̩̳̟o ̍ͫͥͨ ͨ Or they are rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆es za̡͊͠͝lg red e sͮ̂҉̯͈͕̹̘̱ alȳ̳ ë͖̉l ͠p̯͍̭o̚ n̐y̡ ȩ̬̩̾͛ͪ̈̀͘l ̶̧̨̱̹̭̯ͧ̾ͬvien ȇ̴̟̟͙̞ͩ͌͝ "
I agree with its assertions:
- HTML cannot be parsed with regex
- regular expressions are not sophisticated enough for this task
- HTML is not a regular language and therefore cannot be parsed with regular expressions
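To make the third assertion concrete, here is a minimal sketch in Python (the language is my choice; `NESTED` and `DepthTracker` are made-up names for illustration only). It assumes a plain pattern without any engine-specific recursion extensions: such a pattern has no way to count nesting depth, while a parser that keeps a counter does.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: one <div> nested inside another.
NESTED = "<div>outer <div>inner</div> tail</div>"

# The non-greedy pattern stops at the FIRST closing tag,
# so the captured "content" of the outer div is cut short.
naive = re.search(r"<div>(.*?)</div>", NESTED).group(1)


class DepthTracker(HTMLParser):
    """Illustrative helper: tracks how deeply <div> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


tracker = DepthTracker()
tracker.feed(NESTED)
tracker.close()

print(repr(naive))        # the truncated capture from the regex
print(tracker.max_depth)  # deepest nesting level seen by the parser
```

The regex capture ends at the inner closing tag, whereas the parser-based counter sees both levels and returns to depth zero at the end. This only illustrates the nesting argument; it says nothing about simpler tasks on known, flat markup.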
But then I received a comment from Mariano:
I know this is a joke that became famous. However, "HTML cannot be parsed with regex" is false. "not sophisticated enough" is false. "they are not equipped to dissect HTML" is false. "is not a regular language and therefore cannot be parsed using regular expressions" is flat out false. What is true is that it will give you headaches, because it is not a tool that fits the job... I hate this post.
And I was left wondering. Further searching led me to Jeff Atwood's 2009 blog post "Parsing Html The Cthulhu Way", which opens by discussing the answer I just quoted and the sentiment behind it. He then examines the state of the matter and shows that it is not so clear that it cannot be done, citing a discussion in which experienced programmers defend using regex in certain cases.
Therefore, my questions are:
- Can you parse HTML with regular expressions?
- In which cases is it recommended to do it?
- In which cases is it inadvisable?
You may have noticed that I use "parsear" and "analizar" interchangeably (both are rendered as "parse" in English). I do so because one seems to be the translation of the other, although it is also true that "parsear" is very widely used in Spanish-speaking environments.