I found a few ways to pull data from another page whose content is loaded with AJAX, which means it isn't possible to use PHP or, specifically, cURL.
My problem is that I don't quite understand how to do it. Here are the snippets:
<script src='//ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js'></script>
<script>
xmlhttp = new XMLHttpRequest();
xmlhttp.open("GET", "https://ycapi.org/iframe/?v=6YzGOq42zLk", false);
xmlhttp.send();
var data = JSON.parse(xmlhttp.responseText);
document.getElementById("demo").innerHTML = 'enlace:' + data + '';
</script>
<div id='demo' />
Example 2:
<!DOCTYPE html>
<html>
<head>
<script src="http://yui.yahooapis.com/3.7.2/build/yui/yui-min.js"></script>
<meta charset=utf-8 />
<title>j</title>
</head>
<body>
<script>
YUI().use('yql', function(Y) {
Y.YQL('select * from data.html.cssselect where url="https://ycapi.org/iframe/?v=6YzGOq42zLk" and css="buttons", function(response) {
var html = response.query.results.results.div.download.content; document.getElementById('muestra').innerHTML = html;
});
});
</script>
<div id="muestra"></div>
</body>
</html>
I wonder why they don't work.
As Alvaro explains, these two methods do not work due to an access control issue.
As Delcio mentions, reading a web page and extracting its content is known as "scraping".
Note: while scraping per se is not usually considered illegal, most of the time it is against the terms of use of the website/service. If there is an API, use the API; if there is an RSS feed, use RSS.
That said, scraping dynamic content with PHP alone is quite complicated: you have to impersonate a web browser with all its machinery (a JavaScript VM, an HTML parser, an HTML renderer, sessions, events, etc.).
Therefore, and in order not to reinvent the wheel, we are going to use QA and testing tools:
Headless Chrome: a browser without a GUI https://developers.google.com/web/updates/2017/04/headless-chrome
Puppeteer: a Node library to control headless Chrome https://github.com/GoogleChrome/puppeteer
Node: a JavaScript runtime built on the V8 engine https://nodejs.org/en/
Rialto: a library/package for managing Node resources from PHP https://github.com/extractr-io/rialto/
PuPHPeteer: a PHP bridge/wrapper around Puppeteer https://github.com/extractr-io/puphpeteer
Requires: PHP >= 7.1, Node >= 8, Composer >= 0.2.2, a 64-bit OS (tested on Ubuntu 16.04.4 LTS x86_64)
You can install everything separately (allow about 600 MB for a clean install, counting dependencies and patches), or, if you already have Node 8, PHP 7.1 (CLI) and Composer, install just the package:
test.php
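A minimal sketch of what test.php could look like, assuming the package has been installed with Composer (the Packagist name may be nesk/puphpeteer or extractr-io/puphpeteer depending on the version; check the repository's README):
<?php
// test.php - a minimal sketch: load the page in headless Chrome
// and print the HTML *after* the page's JavaScript has run.
require 'vendor/autoload.php';

use Nesk\Puphpeteer\Puppeteer;

$puppeteer = new Puppeteer;
$browser = $puppeteer->launch();

$page = $browser->newPage();
$page->goto('https://ycapi.org/iframe/?v=6YzGOq42zLk');

echo $page->content(); // full rendered HTML

$browser->close();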
Note/Warning/Disclaimer
The code is provided for educational purposes and as a proof of concept.
One option is to use Guzzle: http://docs.guzzlephp.org/en/latest/
A simple example would be:
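A minimal sketch, keeping in mind that Guzzle (like cURL) only fetches the raw response and does not execute the page's JavaScript:
<?php
require 'vendor/autoload.php';

use GuzzleHttp\Client;

$client = new Client();

// Plain GET request to the page you want to read
$response = $client->request('GET', 'https://ycapi.org/iframe/?v=6YzGOq42zLk');

echo $response->getStatusCode() . PHP_EOL; // e.g. 200
echo (string) $response->getBody();        // the raw body returned by the server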
You can also use file_get_contents (http://php.net/manual/es/function.file-get-contents.php), which would look more or less like this:
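A sketch along those lines (it requires allow_url_fopen to be enabled in php.ini):
<?php
// Fetch the raw content of the URL; returns false on failure
$html = file_get_contents('https://ycapi.org/iframe/?v=6YzGOq42zLk');

if ($html === false) {
    die('Could not fetch the page');
}

echo $html;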
Update:
I just thought of another way to capture the page as if you were actually browsing it: implementing a bot in JavaScript right in the browser's developer-tools console. I searched around a bit and, as usual, it turns out I haven't invented anything new.
Here is a very basic example. Suppose you want to run a Google search for, say, "bot scrap javascript": open Google, paste the text into the search box, open the browser's element inspector, go to the console and type $("#[SEARCH BUTTON ID]").click(); and the search will run. In other words, you can use jQuery and JavaScript from the console, so you can build your script there.
Note: obviously this is a manual process. For example, where it says [SEARCH BUTTON ID] you have to put the ID of Google's search button, and that ID is dynamic, so you either have to look it up each time or select it some other way, for example with nth-child(x). But once you have the script, you only need to open the browser (this can be automated), go to the page (likewise) and paste in your bot script, which can even make AJAX requests with the information. A sketch of such a bot follows.
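A hypothetical sketch of a console bot (the selectors #searchBox and #searchButton are placeholders; on the real page you would replace them with whatever the inspector shows):
// Paste into the browser's console on the target page
document.querySelector('#searchBox').value = 'bot scrap javascript';
document.querySelector('#searchButton').click();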
The first method tries to read the content of the web page by loading it with an XMLHttpRequest (XHR) object, which is used to fetch information from a URL without having to reload the entire page.
The reason it doesn't work is that, by default, requests can only be made to the same origin the code runs on (domain, protocol, port, etc. must match), and that is not the case here (at least not from this site). So when you run the code you will get a message like this in the JavaScript console:
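A reconstruction of Chrome's typical message (the exact wording depends on the browser, and the origin shown will be that of the page running the code):
XMLHttpRequest cannot load https://ycapi.org/iframe/?v=6YzGOq42zLk. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'https://example.com' is therefore not allowed access.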
It would work, and the requests could be executed without problems, if the destination server implemented HTTP access control (CORS) and allowed your origin (but I don't know whether that server is yours).
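For reference, if the destination server were under your control, a minimal sketch of enabling it in PHP (allowing any origin, which you would normally restrict):
<?php
// Response header sent by the *destination* server before any output:
// it tells browsers that cross-origin XHR requests are allowed.
header('Access-Control-Allow-Origin: *');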
The second method you present is based on YUI and YQL (Yahoo! Query Language), a SQL-like language that lets you run queries and fetch data from different services.
The code has several errors that keep it from working. For example, the query is not properly wrapped in single quotes (it has an opening quote but no closing one), so you will get an "Invalid or unexpected token" error. That is easy to fix.
Then, the query seems to be trying to read the CSS of the buttons (I'm not a YQL expert and I may be reading it wrong, so take this with a grain of salt), not the HTML of the URL passed as a parameter. To read the HTML, the query would be something like:
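Presumably something along these lines, using YQL's html table (see the caveat below about that table no longer being supported):
select * from html where url="https://ycapi.org/iframe/?v=6YzGOq42zLk"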
So the code would look like this with those two corrections:
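A sketch with both fixes applied; since the exact structure under response.query.results depends on what YQL returns, it is simply dumped into the page:
YUI().use('yql', function (Y) {
    Y.YQL('select * from html where url="https://ycapi.org/iframe/?v=6YzGOq42zLk"', function (response) {
        // Dump whatever YQL returned into the page
        document.getElementById('muestra').innerHTML = JSON.stringify(response.query.results);
    });
});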
With this, the request runs, but the result contains nothing useful, just an error message: the html table is no longer supported, so the HTML of the requested URL is not returned. You can't see this well in this snippet (it shows an https/http error instead), but you can see it in this JSFiddle.
Here is an example of how you can extract data from a website (web scraping) using the php-simple-html-dom-parser library for PHP:
Example:
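A minimal sketch, assuming the Composer package sunra/php-simple-html-dom-parser (one of several packagings of the library) and an illustrative target URL:
<?php
require 'vendor/autoload.php';

use Sunra\PhpSimple\HtmlDomParser;

// Hypothetical target: replace with the page you actually want to scrape
$dom = HtmlDomParser::file_get_html('https://example.com/');

// Print the text and URL of every link on the page
foreach ($dom->find('a') as $link) {
    echo $link->plaintext . ' => ' . $link->href . PHP_EOL;
}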
Result: the text and URL of each link found on the page, printed one per line.