Question

Asked: 2020-08-31 17:15:55 +0800 CST

Extract image URL in HTML using regular expression (regex)

I'm trying to extract an image URL like this:

$url = 'https://m.fa.com/perfil123'; // any profile
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, 0); // do not include response headers in the output
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_ENCODING, "gzip,deflate"); // accept gzip/deflate encoding
$url = curl_exec($ch); // stores the page response

curl_close($ch);
preg_match('#class="bo img" src=[^"]*"([^"]*)"#', $url, $datos);
$img = $datos[1];

echo $img;

This is the HTML of the image I'm looking for:

<img width="72" height="72" alt="" class="bo img" src="https://scontent-mia3-2.xx.fbcdn.net/v/t1.0-1/cp0/e15/q65/p74x74/21151613_1725782907724134_7535903357386699205_n.jpg?efg=eyJpIjoiYiJ9&amp;oh=4f22a577f965566b2016ef842f5b110f&amp;oe=5A1DE043">

I'm using the class to identify the image, but I don't know where the error is.

1 Answer

Mariano · Answer 1 · 2020-08-31T19:37:10+08:00

With regex (not recommended)

As I told you, the regular expression you are using matches the HTML in your question perfectly (see demo). However, using regex for this is not recommended. For example:

You are not checking that the match is inside an <img> tag, so <input type='text' value='class="bo img" src="url.jpg"'> would cause a problem... and it can be easily solved, but...
Another attribute between the class and the URL, for example class="bo img" data-ejemplo="bla" src="url.jpg", would cause a problem... and it can be easily solved, but...
Just swapping the order of the classes, class="img bo", would cause a problem... and it can be easily solved, but...
If that part of the HTML is commented out, for example inside <!-- -->, you would have a problem... and it can be solved, but...
There is always going to be some unconventional rule in HTML syntax that makes things difficult, so your regex matches the wrong thing, or nothing at all, because of something you didn't think could happen.
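The pitfalls above can be checked directly. This is a minimal sketch, not part of the original answer: the helper name match_original and the sample strings are made up for illustration, and only the pattern itself comes from the question.

```php
<?php
// The regex from the question, unchanged.
const ORIGINAL_REGEX = '#class="bo img" src=[^"]*"([^"]*)"#';

// Hypothetical helper: returns the captured URL, or null when nothing matches.
function match_original(string $html): ?string
{
    return preg_match(ORIGINAL_REGEX, $html, $m) === 1 ? $m[1] : null;
}

// Made-up HTML variants, one per pitfall listed above.
$variants = [
    'original'        => '<img class="bo img" src="url.jpg">',
    'inside a value'  => '<input type=\'text\' value=\'class="bo img" src="url.jpg"\'>',
    'extra attribute' => '<img class="bo img" data-ejemplo="bla" src="url.jpg">',
    'classes swapped' => '<img class="img bo" src="url.jpg">',
    'commented out'   => '<!-- <img class="bo img" src="url.jpg"> -->',
];

foreach ($variants as $name => $html) {
    printf("%-16s %s\n", $name, match_original($html) ?? 'no match');
}
```

With these samples, the pattern returns url.jpg for the <input> and the commented-out variants (false positives) and finds nothing for the extra-attribute and swapped-class variants (false negatives).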

It's probably better to modify it to something like this (view on regex101):

#<img\b(?=[^>]*\sclass\s*=\s*"(?=[^"]*\bbo\b)[^"]*\bimg\b)[^>]*\ssrc\s*=\s*"([^"]*)"#i

but still, it would fail in many cases.
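To make "it would fail in many cases" concrete, here is a small sketch that runs the improved pattern against a few inputs. The helper name match_improved and the sample strings (including the bo-large class) are assumptions for illustration; only the pattern is taken from the answer.

```php
<?php
// The improved pattern from the answer, unchanged.
const IMPROVED_REGEX = '#<img\b(?=[^>]*\sclass\s*=\s*"(?=[^"]*\bbo\b)[^"]*\bimg\b)[^>]*\ssrc\s*=\s*"([^"]*)"#i';

// Hypothetical helper: returns the captured URL, or null when nothing matches.
function match_improved(string $html): ?string
{
    return preg_match(IMPROVED_REGEX, $html, $m) === 1 ? $m[1] : null;
}

// Made-up samples: the first three behave as intended, the last two do not.
$samples = [
    'extra attribute' => '<img class="bo img" data-ejemplo="bla" src="url.jpg">',
    'classes swapped' => '<img class="img bo" src="url.jpg">',
    'missing class'   => '<img class="bo etc" src="ejemplo2.jpg">',
    'commented out'   => '<!-- <img class="bo img" src="url.jpg"> -->',
    'similar class'   => '<img class="bo-large img" src="url.jpg">',
];

foreach ($samples as $name => $html) {
    printf("%-16s %s\n", $name, match_improved($html) ?? 'no match');
}
```

The extra attribute and the swapped classes are now handled, and the image without the img class is correctly skipped. The commented-out tag still matches, and so does bo-large, because \b treats the hyphen as a word boundary.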

Using DOM (recommended)

You shouldn't use regular expressions to process HTML. An expression as specific as yours fails with even a small change to the HTML: an extra space, a change in the tag attributes, a comment, or a more complex structure will break it, and even a gigantic regex cannot cover every rule. A very advanced expression can get close to fail-safe, but there is almost always some rare case that makes it fail. It would also take an expert every time you want to modify it.

Processing HTML with the DOM is very easy; these are the tools designed for the job.

If we have an HTML like the following:

$html = '
    <img class="img" src="ejemplo1.jpg">
    <img width="72" height="72" alt="" class="bo img" src="https://scontent-mia3-2.xx.fbcdn.net/v/t1.0-1/cp0/e15/q65/p74x74/21151613_1725782907724134_7535903357386699205_n.jpg?efg=eyJpIjoiYiJ9&amp;oh=4f22a577f965566b2016ef842f5b110f&amp;oe=5A1DE043">
    <img class="bo etc" src="ejemplo2.jpg">
    <img class="bo etc img" src="ejemplo3.jpg">
';

Simply generate the DOM like this:

// Wrap the fragment in <body> so it is parsed correctly
$html = "<body>$html</body>";

// Build the DOM
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_COMPACT | LIBXML_HTML_NOIMPLIED | LIBXML_NONET | LIBXML_HTML_NODEFDTD);

And we can get all the images inside the DOM with:

$img_nodelist = $dom->getElementsByTagName('img');

and loop over them with:

foreach ($img_nodelist as $img) {
    // ...
}

getting the classes of each one with:

$clases = $img->getAttribute('class');

and the image URL with:

$urlImagen = $img->getAttribute('src');

Note: you can also search with XPath, which is much less code (and probably runs a little faster), but I preferred to explain it this way, more explicit, to make it clearer.
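For reference, the XPath variant mentioned in the note could look roughly like this. It is a sketch under assumptions: the helper find_image_urls is hypothetical, the class names are assumed to contain no quotes, and the sample HTML is shortened from the one above.

```php
<?php
// Hypothetical helper: returns the src of every <img> that carries all given classes.
// Assumes the class names contain no quote characters.
function find_image_urls(string $html, array $classes): array
{
    $dom = new DOMDocument;
    // Same <body> wrapping and libxml flags as in the answer.
    $dom->loadHTML("<body>$html</body>", LIBXML_COMPACT | LIBXML_HTML_NOIMPLIED | LIBXML_NONET | LIBXML_HTML_NODEFDTD);

    // One predicate per class; padding with spaces keeps "bo" from matching "bobo".
    $predicates = array_map(function ($class) {
        return 'contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")';
    }, $classes);
    $query = '//img[' . implode(' and ', $predicates) . ']/@src';

    $urls = [];
    foreach ((new DOMXPath($dom))->query($query) as $src) {
        $urls[] = $src->nodeValue; // $src is the src attribute node
    }
    return $urls;
}

$html = '
    <img class="img" src="ejemplo1.jpg">
    <img class="bo etc" src="ejemplo2.jpg">
    <img class="bo etc img" src="ejemplo3.jpg">
';

foreach (find_image_urls($html, ['bo', 'img']) as $url) {
    echo "URL: $url\n";
}
```

The normalize-space() call collapses tabs and line breaks inside the class attribute into single spaces, so the check does not depend on how the classes are separated.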

Code:

// Input
$html = '
    <img class="img" src="ejemplo1.jpg">
    <img width="72" height="72" alt="" class="bo img" src="https://scontent-mia3-2.xx.fbcdn.net/v/t1.0-1/cp0/e15/q65/p74x74/21151613_1725782907724134_7535903357386699205_n.jpg?efg=eyJpIjoiYiJ9&amp;oh=4f22a577f965566b2016ef842f5b110f&amp;oe=5A1DE043">
    <img class="bo etc" src="ejemplo2.jpg">
    <img class="bo etc img" src="ejemplo3.jpg">
';

// Wrap the fragment in <body> so it is parsed correctly
$html = "<body>$html</body>";

// Build the DOM
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_COMPACT | LIBXML_HTML_NOIMPLIED | LIBXML_NONET | LIBXML_HTML_NODEFDTD);

// Get all the images
$img_nodelist = $dom->getElementsByTagName('img');

// Loop over each one
foreach ($img_nodelist as $img) {
    // Get the list of classes
    $clases = $img->getAttribute('class');
    $clases_arr = explode(' ', $clases);

    // Check whether it contains both classes
    $clases_buscadas = array('bo', 'img');
    if (!array_diff($clases_buscadas, $clases_arr)) { // it has both classes
        // Get the src attribute
        $urlImagen = $img->getAttribute('src');
        echo "URL: $urlImagen\n";
    }
}

Result:

URL: https://scontent-mia3-2.xx.fbcdn.net/v/t1.0-1/cp0/e15/q65/p74x74/21151613_1725782907724134_7535903357386699205_n.jpg?efg=eyJpIjoiYiJ9&oh=4f22a577f965566b2016ef842f5b110f&oe=5A1DE043
URL: ejemplo3.jpg

Demo:

Run on 3v4l.org
