Question

Asked: 2020-07-26 01:11:06 +0800 CST

Can you parse HTML with regular expressions?

Yesterday I translated the answer to "RegEx match open tags except XHTML self-contained tags", with its famous excerpt:

You can't parse [X]HTML with regular expressions because HTML can't be parsed with regex. Regex is not a tool that can be used to properly parse HTML. Since I have already answered many HTML and regex questions, using regex will not allow you to render HTML. Regular expressions are a tool that is not sophisticated enough to understand the constructs used by HTML. HTML is not a regular language and therefore cannot be parsed using regular expressions. Regular expressions are not equipped to dissect HTML into its representative parts.

which ends with a final demo of broken HTML:

appears ~~, the~~ stinking regex infection will devour your HT ML parser, your application and your existence forever as mere Visual Basic or worse he comes don't fight he comes v̡im̡ie̶ne, ̕h̵u radiation destroying all brightness, tags of HTML filtering from your eyes, like a fragrant liquid, the song of parsing regular expressions is going to extinguish the voices of mortal man from the sphere I can see it you can see it 's beautiful or the ending extinguishing Men's lies EVERYTHING IS LOST EVERYTHING IS LOSTDO e l pon̷and he comes ~~he comes he comes~~ ~~the~~ íco r permeates everything M I FACE M I FACE ᵒh dos n o o NO NOO̼ OON Θ para los an*̶͑̾̾ ̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆ul͖͉̗̩̳̟o ̍ͫͥͨ ͨ Or they are rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆es za̡͊͠͝lg red e sͮ̂҉̯͈͕̹̘̱ alȳ̳ ë͖̉l ͠p̯͍̭o̚ n̐y̡ ȩ̬̩̾͛ͪ̈̀͘l ̶̧̨̱̹̭̯ͧ̾ͬvien ȇ̴̟̟͙̞ͩ͌͝ "

I agree with its assertions:

HTML cannot be parsed with regex.
Regular expressions are not sophisticated enough for this task.
HTML is not a regular language and therefore cannot be parsed with regular expressions.

But then I received a comment from Mariano:

I know this is a joke that became famous. However, "HTML cannot be parsed with regex" is false. "not sophisticated enough" is false. "they are not equipped to dissect HTML" is false. "is not a regular language and therefore cannot be parsed using regular expressions" is flat out false. What is true is that it will give you headaches, because it is not a tool that fits the job... I hate this post.

And I was left wondering. Further searching brought me to a 2009 blog post by Jeff Atwood, "Parsing Html The Cthulhu Way", where he starts off by talking about the answer I just quoted and the sentiment that generated it. However, he then analyzes the state of the matter and shows that it is not so clear that it cannot be done. He mentions a discussion in which experienced programmers defend the use of regular expressions in certain cases.

Therefore, the question is:

Can you parse HTML with regular expressions?
In which cases is it recommended to do it?
In which cases is it inadvisable?

You may have noticed that I use "parsear" and "analizar" interchangeably. I do so because one seems to be the translation of the other, but it is no less true that in Spanish-speaking environments the use of "parsear" is very widespread.

2 Answers

SJuan76 · Answer 1 · 2020-07-26T03:21:56+08:00

The first question is to know what we mean by "parsing HTML".

The strict interpretation is to process the document, check that it is correct HTML, work with the entire document, etc. In that sense, regular expressions are completely insufficient.

The classic example is that of elements that can be nested indefinitely. If I start writing <div><div><div>....<div>Hola mundo</div>....</div></div></div>, there is no regular expression that can check that I have opened the same number of divs as I have closed (source: finite automata theory).
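To make the nesting argument concrete, here is a minimal Python sketch (not part of the original answer) of the kind of check that needs a counter, which is exactly the memory a finite automaton lacks. It assumes the standard-library `html.parser` module; the names `DivBalanceChecker` and `divs_are_balanced` are hypothetical and only for illustration.

```python
from html.parser import HTMLParser


class DivBalanceChecker(HTMLParser):
    """Tracks how deeply <div> elements are nested while the input is read."""

    def __init__(self):
        super().__init__()
        self.depth = 0        # number of currently open <div> elements
        self.balanced = True  # becomes False if a </div> has no matching <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1
            if self.depth < 0:
                self.balanced = False


def divs_are_balanced(html):
    """Return True if every <div> is closed and no </div> comes too early."""
    checker = DivBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.balanced and checker.depth == 0
```

The counter can grow without bound as the nesting gets deeper, and that is the part a single regular expression cannot reproduce.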

Now, this is where someone walks in and says, "But I'm not building a web browser/parser. I just want to know what is inside the div. I don't care if all the tags are closed or not; that's a problem for whoever generates the HTML. For me, regular expressions are completely sufficient."

Naturally, if there are changes to the HTML, regular expressions are much more brittle. The problem is not so much that they fail¹ as that they give false positives.

For example, we have our expression to find the content of a <div> (<div>(.*)<\/div>), and suddenly the page changes to:

 <div>Hola mundo<!-- Tonto el que lo lea!!--></div>

Wow... we'd better change it to (<div>(.*)<), right? Well, until we get:

 <div>Hola <a href="http://micasa.example">mundo</a></div>

Well, we solve that too (I won't write out the regular expression this time), and the following week we have:

 <!-- <div>Hola mundo</div> I'm not deleting this, just commenting it out because I don't trust SVN. Signed: the newbie -->
 <div>Adios mundo</div>

In all of the above cases, the regular expression swallows the error as if nothing had happened, and the process continues until someone (possibly a human) notices that the values don't match, perhaps weeks or months later².
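The silent failures described above can be reproduced with a short Python sketch (an illustration, not part of the original answer). It applies the first pattern from the text to the three page variants and, for comparison, extracts the same text with the standard-library `html.parser`. The helper names are hypothetical, and the comment in the third sample is abbreviated.

```python
import re
from html.parser import HTMLParser

DIV_REGEX = re.compile(r"<div>(.*)</div>")  # the first pattern from the text


def regex_div_content(html):
    """Return whatever the regex captures, or None if it does not match."""
    match = DIV_REGEX.search(html)
    return match.group(1) if match else None


class DivText(HTMLParser):
    """Collects the text found inside <div> elements; comments are skipped."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def parser_div_content(html):
    """Return the text inside the <div> elements as seen by a parser."""
    parser = DivText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


samples = [
    '<div>Hola mundo<!-- Tonto el que lo lea!!--></div>',
    '<div>Hola <a href="http://micasa.example">mundo</a></div>',
    '<!-- <div>Hola mundo</div> -->\n<div>Adios mundo</div>',
]

for html in samples:
    print(repr(regex_div_content(html)), "vs", repr(parser_div_content(html)))
```

In none of the three cases does the regex raise an error: it returns the comment, the markup of the link, and the commented-out "Hola mundo" respectively, while the parser-based version returns "Hola mundo", "Hola mundo" and "Adios mundo".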

So:

Can you parse HTML with regular expressions?

In general, NO.

In which cases is it recommended to do it?

More than "recommended", it's not too much of a problem when:

The source of the HTML is under control: either it is a program of mine, or it comes from someone in my organization who will notify me when there is going to be a change.
Also related to the above, we know what structure it will have. If we know that it will be a document such as:
```
<html><body>
<ul>
<li>Punto 1.</li>
<li>Punto 2.</li>
...
</ul>
```
and that no tags, comments, or JavaScript are going to be inserted in the middle, there is no problem.³
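For a controlled document like the one above, a regex-based extraction could look like the following Python sketch (an illustration under assumptions, not part of the original answer). It assumes the document is exactly the agreed skeleton with `<li>` items containing plain text, and it fails loudly instead of returning a false positive when that assumption breaks. The names `DOCUMENT`, `ITEM` and `list_items` are hypothetical.

```python
import re

# Strict pattern for the whole agreed document; anything else is rejected.
DOCUMENT = re.compile(r"\s*<html><body>\s*<ul>\s*((?:<li>[^<]*</li>\s*)*)</ul>\s*")
ITEM = re.compile(r"<li>([^<]*)</li>")


def list_items(html):
    """Return the text of every <li>, or raise if the structure has changed."""
    match = DOCUMENT.fullmatch(html)
    if match is None:
        raise ValueError("document does not have the agreed structure")
    return ITEM.findall(match.group(1))


page = "<html><body>\n<ul>\n<li>Punto 1.</li>\n<li>Punto 2.</li>\n</ul>\n"
print(list_items(page))
```

Matching the whole document rather than searching for a fragment is a deliberate choice here: a stray comment or nested tag makes the match fail, so the change is noticed immediately.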

In which cases is it inadvisable?

All the others.

¹ If it fails, the error is processed and the regular expression is adapted accordingly. After all, if the format of the page being parsed changes, programs that use parsers can also have problems (although they will always be more flexible).

² A different kind of problem would be if I want to get the content of the first div and they move the content to the third one. But that is unsolvable for both regexes and parsers unless id is used on the elements; and if id is used, what is sought is not the nth div but the element with the corresponding id.

³ In fact, the subset of HTML so defined is actually a regular language, so regular expressions are quite sufficient to fully parse it.

ArcanisGK507 · Answer 2 · 2020-08-19T07:19:37+08:00

Can you parse HTML with regular expressions?

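Sure you can (not tested):

// Wrapped in a function so that the early return is valid PHP.
function check_tags($string)
{
    // If the string contains nothing that looks like a tag, return it unchanged.
    if (!preg_match('#(?<=<)\w+(?=[^<]*?>)#', $string)) {
        return $string;
    }

    $patterns = array('<b>', '<p>', '<br>'); // etc.: array of tags

    // Escape each tag and join them into a regex alternation (a|b|c).
    $patterns_flattened = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '/');
    }, $patterns));

    if (preg_match('/' . $patterns_flattened . '/', $string, $matches)) {
        // $matches[0] holds the first known tag found; handle it here.
    }

    return $string;
}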

In which cases is it recommended to do it?

I would recommend it only for verifying that the number of opening and closing tags is the same, even if they differ syntactically (this case also needs a sequence check), and for extracting the content (which requires stripping the HTML tags).
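The counting check mentioned above might be sketched in Python as follows (an illustration, not the answer's own code). It only compares how many times each tag name is opened and closed. The `VOID` set is a deliberately partial list of elements that never have a closing tag, and all names here are hypothetical.

```python
import re
from collections import Counter

OPEN_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^<>]*?(/?)>")
CLOSE_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9]*)\s*>")
VOID = {"br", "hr", "img", "input", "link", "meta"}  # partial list, assumption


def tag_counts_match(html):
    """Return True if each tag name is opened as many times as it is closed."""
    opened = Counter(
        name.lower()
        for name, slash in OPEN_TAG.findall(html)
        if not slash and name.lower() not in VOID  # skip <x/> and void tags
    )
    closed = Counter(name.lower() for name in CLOSE_TAG.findall(html))
    return opened == closed
```

Note that `tag_counts_match("</b><b>")` also returns True, which is why the order of the tags needs a separate check.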

In which cases is it inadvisable?

When you need to use data from the content.
