What options exist in Python for processing an XML document? I propose to review the most prominent ones available today, using XML documents that resist being processed. Let's see:
I will take as XML source the one on this page:
It lets us query a database and, depending on the parameters applied to the URL, it returns XML files with the requested information; in this case, a list of sporting events:
For each event (<event></event>) we have, at a minimum, an <id>, a name (<name>), and chronological information that describes its start (<start>).
I take it for granted that these XML documents are well-formed, since any errors in them would in turn cause errors on the page that uses them.
There are many different ways of handling an XML document, but to keep this short I will settle for being able to relate each event to its ID and start time; that is, the end result could be a Python dictionary like this:
dic_eventos = {1076553300890015:{'name':'Manuel Guinard vs Evgeny Donskoy',
'start':datetime.strptime('2019-03-27T10:23:00.000Z', '%Y-%m-%dT%H:%M:%S.000Z')},
1075839532180015:{'name':'Jurgen Zopp vs Mikael Torpegaard',
'start':datetime.strptime('2019-03-27T10:40:00.000Z', '%Y-%m-%dT%H:%M:%S.000Z')},
....}
But the problem is that I never even get to parse the XML document; that is, I can't operate on it because I can't convert it into Python objects of any kind.
The process of downloading the XML document is trivial:
import requests
print(requests.get('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis').text)
First a Response object is obtained and then, through its .text attribute, the content of the XML as plain text (str).
Below I present the 5 ways of processing these documents that I have tried, and the reason each one of them failed.
Requests_XML+XPath:
Code used:
from requests_xml import XMLSession
session = XMLSession()
maio = session.get('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis')
eventos = maio.xml.xpath('//event')
Exception thrown:
Traceback (most recent call last):
File "C:\Users\usuario\maio.py", line 37, in <module>
event = maio.xml.xpath('//event')
File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_xml.py", line 224, in xpath
selected = self.lxml.xpath(selector)
File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_xml.py", line 120, in lxml
self._lxml = etree.fromstring(self.raw_xml)
File "src\lxml\etree.pyx", line 3222, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1765, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
BeautifulSoup+lxml:
Code used:
import requests
from bs4 import BeautifulSoup
muno = requests.get("https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis")
maio = BeautifulSoup(muno.text, "xml")
print(maio)
Result:
<?xml version="1.0" encoding="utf-8"?>
That doesn't help us, because we can't retrieve the information about the events, which was what we were looking for.
XML.ElementTree:
SO-ES question I'm basing it on
Code used:
import urllib.request
import xml.etree.ElementTree as ET
url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
uh = urllib.request.urlopen(url)
data = uh.read()
commentinfo = ET.fromstring(data)
Exception thrown:
Traceback (most recent call last):
File "C:/Users/usuario/maio.py", line 48, in <module>
commentinfo = ET.fromstring(data)
File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\xml\etree\ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
eval():
Since we have a str object that is 'almost' like a Python dictionary, we could consider turning it into an actual dictionary:
import requests
maio = requests.get("https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=mlb&in-running-flag=false")
print((eval(maio.text)))
But no, unfortunately it is not possible either:
Traceback (most recent call last):
File "C:/Users/usuario/maio.py", line 56, in <module>
print((eval(maio.text)))
File "<string>", line 1, in <module>
NameError: name 'false' is not defined
XMLtodict:
SO-ES question I'm basing it on
Code used:
import urllib.request
import xmltodict
url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
data = urllib.request.urlopen(url)
parsed_data = xmltodict.parse(data.read())
Exception thrown:
Traceback (most recent call last):
File "C:/Users/usuario/maio.py", line 36, in <module>
parsed_data = xmltodict.parse(data.read())
File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\xmltodict.py", line 327, in parse
parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0
After all this ordeal, you will understand that processing XML documents in Python looks like quite a feat to me. I don't think this proves it is impossible, though; surely it can be done, and surely there are several details I am overlooking. Are they obvious to you?
The problem is that the server you are accessing examines the Accept header that your client sends in order to decide whether to respond with JSON or XML. When you test from the browser, it sends an Accept header along with the request in which you express your preference to receive the response in HTML, failing that in XHTML, failing that in XML, and failing that in any other format. The server obeys these preferences and responds with an XML document.
Instead, when you test from Python using requests, this library sends a different Accept header by default (*/*), which indicates that you have no preference regarding the format and any format will do. In that case the server decides to send the response as JSON, which is why all your attempts to parse it as XML have failed.
You just need to send the appropriate Accept header to receive the document in XML; then r.content (r being the response object) would hold the XML document.
However, as you have already been told in another answer, JSON can be much simpler to process. If you are not forced to work with the XML, I would prefer to use the JSON response.
My answer goes along the lines of what you have already been told. What you are receiving is JSON and not XML, so handling it is even a bit simpler.
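As a rough sketch of that simpler route, the snippet below builds the dictionary the question asks for from a JSON payload. The sample text and its field names (events, id, name, start, in-running-flag) are assumptions about the shape of the response, not a copy of the real one. Note how the JSON literal false, which broke eval(), becomes Python's False.

```python
import json
from datetime import datetime

# Hypothetical sample; the real response may be structured differently.
SAMPLE_JSON = """
{"total": 2, "events": [
  {"id": 1076553300890015, "name": "Manuel Guinard vs Evgeny Donskoy",
   "start": "2019-03-27T10:23:00.000Z", "in-running-flag": false},
  {"id": 1075839532180015, "name": "Jurgen Zopp vs Mikael Torpegaard",
   "start": "2019-03-27T10:40:00.000Z", "in-running-flag": false}
]}
"""


def build_event_index(json_text):
    """Map each event ID to its name and start time."""
    payload = json.loads(json_text)  # JSON false/true/null -> False/True/None
    return {
        event["id"]: {
            "name": event["name"],
            "start": datetime.strptime(event["start"], "%Y-%m-%dT%H:%M:%S.%fZ"),
        }
        for event in payload["events"]
    }


dic_eventos = build_event_index(SAMPLE_JSON)
for event_id, info in dic_eventos.items():
    print(event_id, info["name"], info["start"])
```

With requests, calling .json() on the response object decodes the body in the same way as json.loads on its text.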
Any of your other options should work, always on the assumption that we receive JSON, which we interpret by means of json.loads(<json data>); the return value in this case is a dictionary that we can access with any of the usual techniques.
It is always a good idea to check the "raw" content and see what it is, for example by looking at the first 20 bytes received. Doing so makes it clear that this is JSON and not XML.
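One possible way to automate that check is sketched below. The helper name sniff_format and the sample bytes are hypothetical; with requests, the bytes to inspect would be the start of r.content.

```python
def sniff_format(raw: bytes) -> str:
    """Guess 'json', 'xml' or 'unknown' from the first significant byte."""
    # Skip a UTF-8 byte order mark and leading whitespace, if present.
    head = raw.lstrip(b"\xef\xbb\xbf \t\r\n")[:1]
    if head in (b"{", b"["):
        return "json"
    if head == b"<":
        return "xml"
    return "unknown"


# Hypothetical payload standing in for the bytes received from the server.
sample = b'{"total":2,"events":[{"id":1076553300890015}]}'
print(sample[:20])           # the first 20 bytes, as suggested above
print(sniff_format(sample))
```

This is only a heuristic: it looks at how the payload begins and does not validate the rest of it.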
According to what was discussed.
You have the option of using dicttoxml to convert from JSON to XML in Python. I have Anaconda installed to work with Python, which lets me use the Anaconda Prompt to install modules.
The command is:
The page has an example that is valid for Python 2.x; I have version 3.x installed, so the code would look like this:
If you check the terminal, it will first show you the content of the page the data comes from and then the conversion to XML with all the tags. You can try some of your alternatives by passing them the xml variable from this code.
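As a rough illustration of the same idea using only the standard library (a stand-in, not dicttoxml itself or its API), the sketch below turns decoded JSON into XML text and then feeds that text to one of the parsers tried in the question. The function to_xml, the root and item tag names, and the sample data are all hypothetical.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Recursively turn a JSON-like value into an Element named `tag`."""
    node = ET.Element(tag)
    if isinstance(value, dict):
        # Assumption: every key is a valid XML tag name.
        for key, child in value.items():
            node.append(to_xml(key, child))
    elif isinstance(value, list):
        for child in value:
            node.append(to_xml("item", child))
    elif isinstance(value, bool):
        node.text = "true" if value else "false"
    elif value is not None:
        node.text = str(value)
    return node


# Hypothetical sample standing in for the JSON the server returns.
SAMPLE_JSON = ('{"events": [{"id": 1076553300890015, '
               '"name": "Manuel Guinard vs Evgeny Donskoy", '
               '"start": "2019-03-27T10:23:00.000Z"}]}')

data = json.loads(SAMPLE_JSON)
xml = ET.tostring(to_xml("root", data), encoding="unicode")
print(xml)

# The resulting text can now be handed to an XML parser.
root = ET.fromstring(xml)
names = [item.findtext("name") for item in root.iter("item")]
print(names)
```

Going through XML this way is a detour when the data already arrives as JSON, so it mainly makes sense if later steps really require XML.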
Cheers