I found a few ways to pull data from another page whose content is loaded with AJAX, which means it isn't possible to use PHP or, specifically, cURL.
My problem is that I don't quite understand how to do it. Here are the snippets:
<script src='//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js'></script>
<script>
xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", "https://ycapi.org/iframe/?v=6YzGOq42zLk", false);
xmlhttp.send();
var data = JSON.parse(xmlhttp.responseText);
document.getElementById("demo").innerHTML = 'enlace:' + data + '';
</script>
<div id='demo' />
Example 2:
<!DOCTYPE html>
<html>
<head>
<script src="http://yui.yahooapis.com/3.7.2/build/yui/yui-min.js"></script>
<meta charset=utf-8 />
<title>j</title>
</head>
<body>
<script>
YUI().use('yql', function(Y) {
Y.YQL('select * from data.html.cssselect where url="https://ycapi.org/iframe/?v=6YzGOq42zLk" and css="buttons", function(response) {
var html = response.query.results.results.div.download.content; document.getElementById('muestra').innerHTML = html;
});
});
</script>
<div id="muestra"></div>
</body>
</html>
I wonder why they don't work.
As Alvaro explains, these two methods do not work due to an access control issue.
As Delcio mentions, reading a web page and extracting its content is known as "scraping".
Note: while scraping per se is not usually considered illegal, most of the time it is against the terms of use of the website/service. If there is an API, use the API; if there is an RSS feed, use RSS.
That said, scraping dynamic content with PHP alone is quite complicated: you have to impersonate a web browser with all its machinery (a JavaScript VM, an HTML parser, an HTML renderer, sessions, events, etc.).
Therefore, and in order not to reinvent the wheel, we are going to use QA and testing tools:
Headless Chrome: a browser without a GUI https://developers.google.com/web/updates/2017/04/headless-chrome
Puppeteer: a Node library to control headless Chrome https://github.com/GoogleChrome/puppeteer
Node: a JavaScript runtime built on the V8 engine https://nodejs.org/en/
Rialto: a library/package for managing Node resources from PHP https://github.com/extractr-io/rialto/
PuPHPeteer: a PHP bridge/wrapper around Puppeteer https://github.com/extractr-io/puphpeteer
Requires: PHP >= 7.1, Node >= 8, Composer >= 0.2.2, a 64-bit OS (tested on Ubuntu 16.04.4 LTS x86_64)
You can install everything separately (allow about 600 MB for a clean install, counting dependencies and patches), or, if you already have Node 8, PHP 7.1 (CLI) and Composer, install just the package:
test.php
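A minimal sketch of what test.php could look like, assuming the package has been installed with Composer (the Packagist name may be nesk/puphpeteer or extractr-io/puphpeteer depending on the version; check the repository's README):
<?php
// test.php - a minimal sketch: load the page in headless Chrome
// and print the HTML *after* the page's JavaScript has run.
require 'vendor/autoload.php';

use Nesk\Puphpeteer\Puppeteer;

$puppeteer = new Puppeteer;
$browser = $puppeteer->launch();

$page = $browser->newPage();
$page->goto('https://ycapi.org/iframe/?v=6YzGOq42zLk');

echo $page->content(); // full rendered HTML

$browser->close();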
Note/Warning/Disclaimer
The code is provided for educational purposes and as a proof of concept.
One option is to use Guzzle: http://docs.guzzlephp.org/en/latest/
A simple example would be:
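A minimal sketch, keeping in mind that Guzzle (like cURL) only fetches the raw response and does not execute the page's JavaScript:
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();

// Plain GET request to the page you want to read
$response = $client->request('GET', 'https://ycapi.org/iframe/?v=6YzGOq42zLk');

echo $response->getStatusCode() . PHP_EOL; // e.g. 200
echo (string) $response->getBody();        // the raw body returned by the server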
You can also use file_get_contents (http://php.net/manual/es/function.file-get-contents.php), which would look more or less like this:
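A sketch along those lines (it requires allow_url_fopen to be enabled in php.ini):
<?php
// Fetch the raw content of the URL; returns false on failure
$html = file_get_contents('https://ycapi.org/iframe/?v=6YzGOq42zLk');

if ($html === false) {
    die('Could not fetch the page');
}

echo $html;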
Update:
I just thought of another way to capture the page as if you were actually browsing it: implementing a bot in JavaScript right in the browser's developer-tools console. I searched around a bit and, as usual, it turns out I haven't invented anything new.
Here is a very basic example. Suppose you want to run a Google search for, say, "bot scrap javascript": open Google, paste the text into the search box, open the browser's element inspector, go to the console and type $("#[SEARCH BUTTON ID]").click(); and the search will run. In other words, you can use jQuery and JavaScript from the console, so you can build your script there.
Note: obviously this is a manual process. For example, where it says [SEARCH BUTTON ID] you have to put the ID of Google's search button, and that ID is dynamic, so you either have to look it up each time or select it some other way, for example with nth-child(x). But once you have the script, you only need to open the browser (this can be automated), go to the page (likewise) and paste in your bot script, which can even make AJAX requests with the information. A sketch of such a bot follows.
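A hypothetical sketch of a console bot (the selectors #searchBox and #searchButton are placeholders; on the real page you would replace them with whatever the inspector shows):
// Paste into the browser's console on the target page
document.querySelector('#searchBox').value = 'bot scrap javascript';
document.querySelector('#searchButton').click();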
The first method tries to read the content of the web page by loading it with an XMLHttpRequest (XHR) object, which is used to fetch information from a URL without having to reload the entire page.
The reason it doesn't work is that, by default, requests can only be made to the same origin the code runs on (domain, protocol, port, etc. must match), and that is not the case here (at least not from this site). So when you run the code you will get a message like this in the JavaScript console:
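A reconstruction of Chrome's typical message (the exact wording depends on the browser, and the origin shown will be that of the page running the code):
XMLHttpRequest cannot load https://ycapi.org/iframe/?v=6YzGOq42zLk. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'https://example.com' is therefore not allowed access.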
It would work, and the requests could be executed without problems, if the destination server implemented HTTP access control (CORS) and allowed your origin (but I don't know whether that server is yours).
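For reference, if the destination server were under your control, a minimal sketch of enabling it in PHP (allowing any origin, which you would normally restrict):
<?php
// Response header sent by the *destination* server before any output:
// it tells browsers that cross-origin XHR requests are allowed.
header('Access-Control-Allow-Origin: *');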
The second method you present is based on YUI and YQL (Yahoo! Query Language), a SQL-like language that lets you run queries and fetch data from different services.
The code has several errors that keep it from working. For example, the query is not properly wrapped in single quotes (it has an opening quote but no closing one), so you will get an "Invalid or unexpected token" error. That is easy to fix.
Then, the query seems to be trying to read the CSS of the buttons (I'm not a YQL expert and I may be reading it wrong, so take this with a grain of salt), not the HTML of the URL passed as a parameter. To read the HTML, the query would be something like:
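Presumably something along these lines, using YQL's html table (see the caveat below about that table no longer being supported):
select * from html where url="https://ycapi.org/iframe/?v=6YzGOq42zLk"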
So the code would look like this with those two corrections:
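A sketch with both fixes applied; since the exact structure under response.query.results depends on what YQL returns, it is simply dumped into the page:
YUI().use('yql', function (Y) {
    Y.YQL('select * from html where url="https://ycapi.org/iframe/?v=6YzGOq42zLk"', function (response) {
        // Dump whatever YQL returned into the page
        document.getElementById('muestra').innerHTML = JSON.stringify(response.query.results);
    });
});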
With this, the request runs, but the result contains nothing useful, just an error message: the html table is no longer supported, so the HTML of the requested URL is not returned. You can't see this well in this snippet (it shows an https/http error instead), but you can see it in this JSFiddle.
Here is an example of how you can extract data from a website (web scraping) using the php-simple-html-dom-parser library for PHP:
Example:
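A minimal sketch, assuming the Composer package sunra/php-simple-html-dom-parser (one of several packagings of the library) and an illustrative target URL:
<?php
require 'vendor/autoload.php';

use Sunra\PhpSimple\HtmlDomParser;

// Hypothetical target: replace with the page you actually want to scrape
$dom = HtmlDomParser::file_get_html('https://example.com/');

// Print the text and URL of every link on the page
foreach ($dom->find('a') as $link) {
    echo $link->plaintext . ' => ' . $link->href . PHP_EOL;
}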
Result: the text and URL of each link found on the page, printed one per line.