What options does Python offer for processing an XML document? I propose to review the most prominent ones, using XML documents that resist being processed. Let's see:
I will take as XML source the one on this page:
It lets us query a database and, depending on the parameters applied to the URL, it returns XML files with the requested information; in this case, a list of sporting events:
For each event (<event></event>) we have, at a minimum, an <id>, a name (<name>), and a timestamp that describes when it starts (<start>).
I take it for granted that these XML documents are correctly formed, since any errors in them would in turn cause errors on the page that uses them.
There are many ways to handle an XML document but, to keep this short, I will settle for mapping each event's ID to its name and start time; that is, the end result could be a Python dictionary like this:
from datetime import datetime

dic_eventos = {1076553300890015: {'name': 'Manuel Guinard vs Evgeny Donskoy',
                                  'start': datetime.strptime('2019-03-27T10:23:00.000Z', '%Y-%m-%dT%H:%M:%S.000Z')},
               1075839532180015: {'name': 'Jurgen Zopp vs Mikael Torpegaard',
                                  'start': datetime.strptime('2019-03-27T10:40:00.000Z', '%Y-%m-%dT%H:%M:%S.000Z')},
               ....}
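To make the target concrete, here is a minimal sketch of how such a dictionary could be built, assuming a well-formed document in which each <event> carries <id>, <name> and <start> children. The sample XML string, its <events> root element, and the function name build_event_dict are invented for illustration; the real response may be laid out differently.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

# Hypothetical sample: the layout of the real feed is an assumption.
SAMPLE_XML = """<events>
  <event>
    <id>1076553300890015</id>
    <name>Manuel Guinard vs Evgeny Donskoy</name>
    <start>2019-03-27T10:23:00.000Z</start>
  </event>
  <event>
    <id>1075839532180015</id>
    <name>Jurgen Zopp vs Mikael Torpegaard</name>
    <start>2019-03-27T10:40:00.000Z</start>
  </event>
</events>"""

START_FORMAT = '%Y-%m-%dT%H:%M:%S.000Z'


def build_event_dict(xml_text):
    """Map each event ID to its name and start time."""
    root = ET.fromstring(xml_text)
    events = {}
    # iter() finds <event> elements at any depth below the root.
    for event in root.iter('event'):
        event_id = int(event.findtext('id'))
        events[event_id] = {
            'name': event.findtext('name'),
            'start': datetime.strptime(event.findtext('start'), START_FORMAT),
        }
    return events


dic_eventos = build_event_dict(SAMPLE_XML)
```

This only works once the text really is XML, which is precisely the step that fails in the attempts described here.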
But the problem is that I can't even get as far as parsing the XML document; that is, I can't operate on it because I can't convert it into Python objects of any kind.
The process of downloading the XML document is trivial:
import requests
print(requests.get('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis').text)
First a Response object is obtained and then, through its .text attribute, the content of the XML as plain text (str).
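Since the downloaded text is an ordinary str, it can be inspected before any parser sees it. The following is only a rough sketch of such a check: the helper name guess_format is hypothetical, the sample strings are made up, and looking at the first character is a heuristic rather than a real format detection.

```python
def guess_format(text):
    """Guess a payload format from its first meaningful character (heuristic only)."""
    # Skip a possible byte-order mark and leading whitespace.
    stripped = text.lstrip('\ufeff \t\r\n')
    if stripped.startswith('<'):
        return 'xml'
    if stripped.startswith(('{', '[')):
        return 'json'
    return 'unknown'


print(guess_format('<events><event/></events>'))  # xml
print(guess_format('{"events": []}'))             # json
```

With requests, the same function could be applied to the .text of the response, and the Content-Type entry of its .headers mapping gives a second opinion on what the server claims to be sending.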
Below I present the five approaches I have tried to process these documents, along with the failure each one produced.
Requests_XML+XPath:
Code used:
from requests_xml import XMLSession
session = XMLSession()
maio = session.get('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis')
eventos = maio.xml.xpath('//event')
Exception thrown:
Traceback (most recent call last):
File "C:\Users\usuario\maio.py", line 37, in <module>
event = maio.xml.xpath('//event')
File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_xml.py", line 224, in xpath
selected = self.lxml.xpath(selector)
File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_xml.py", line 120, in lxml
self._lxml = etree.fromstring(self.raw_xml)
File "src\lxml\etree.pyx", line 3222, in lxml.etree.fromstring
File "src\lxml\parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
File "src\lxml\parser.pxi", line 1765, in lxml.etree._parseDoc
File "src\lxml\parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
BeautifulSoup+lxml:
Code used:
import requests
from bs4 import BeautifulSoup
muno = requests.get("https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis")
maio = BeautifulSoup(muno.text, "xml")
print(maio)
Result:
<?xml version="1.0" encoding="utf-8"?>
That doesn't help us, because we can't retrieve the information about the events, which was what we were looking for.
xml.etree.ElementTree:
SO-ES question I'm basing it on
Code used:
import urllib.request
import xml.etree.ElementTree as ET
url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
uh = urllib.request.urlopen(url)
data = uh.read()
commentinfo = ET.fromstring(data)
Exception thrown:
Traceback (most recent call last):
File "C:/Users/usuario/maio.py", line 48, in <module>
commentinfo = ET.fromstring(data)
File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\xml\etree\ElementTree.py", line 1314, in XML
parser.feed(text)
File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0
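An error reported at line 1, column 0 means the parser rejected the input at its very first character. As a minimal sketch, using made-up strings rather than the real response, the standard library parser can be seen to fail this way whenever the text does not begin with markup. The helper name try_parse is hypothetical.

```python
import xml.etree.ElementTree as ET


def try_parse(text):
    """Return (True, root_tag) on success or (False, error_position) on failure."""
    try:
        return True, ET.fromstring(text).tag
    except ET.ParseError as exc:
        # exc.position is a (line, column) tuple pointing at the failure.
        return False, exc.position


# Made-up inputs: one well-formed XML snippet, one non-XML string.
print(try_parse('<events><event/></events>'))
print(try_parse('{"events": []}'))
```

If the second call reports the same position as the traceback above, that would suggest the downloaded text does not start with an XML tag at all, which is worth checking before blaming the parser.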
eval():
Since we have a str object that looks 'almost' like a Python dictionary, we could consider turning it into an actual dictionary:
import requests
maio = requests.get("https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=mlb&in-running-flag=false")
print((eval(maio.text)))
But no, unfortunately it is not possible either:
Traceback (most recent call last):
File "C:/Users/usuario/maio.py", line 56, in <module>
print((eval(maio.text)))
File "<string>", line 1, in <module>
NameError: name 'false' is not defined
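This NameError is what eval() raises for any text containing a lowercase false, which is a JSON literal and not a Python name (true and null behave the same way). As a hedged aside, text shaped like that can be read with the standard json module instead. The sample string below is invented, and whether the real response is JSON is an assumption that has not been verified here.

```python
import json

# Invented sample resembling "almost a dictionary" text; not the real response.
SAMPLE_TEXT = '{"total": 1, "events": [{"id": 1076553300890015, "in-running-flag": false}]}'

# eval() treats the text as Python source, so `false` is looked up as a name.
# (eval on downloaded text is also unsafe, since it executes arbitrary code.)
try:
    eval(SAMPLE_TEXT)
    eval_error = None
except NameError as exc:
    eval_error = str(exc)

# json.loads() understands false/true/null and maps them to False/True/None.
data = json.loads(SAMPLE_TEXT)

print(eval_error)
print(data['events'][0]['in-running-flag'])
```

With requests, the equivalent shortcut would be the .json() method of the Response object, which raises an error if the body is not valid JSON.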
xmltodict:
SO-ES question I'm basing it on
Code used:
import urllib.request
import xmltodict
url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
data = urllib.request.urlopen(url)
parsed_data = xmltodict.parse(data.read())
Exception thrown:
Traceback (most recent call last):
File "C:/Users/usuario/maio.py", line 36, in <module>
parsed_data = xmltodict.parse(data.read())
File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\xmltodict.py", line 327, in parse
parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0
After all this ordeal, you will understand why processing XML documents in Python strikes me as quite a feat. I don't think this shows that it is impossible, though; surely it can be done, and surely there are several details I am overlooking. Are they obvious to you?