
Question

Asked: 2020-03-28 06:08:55 +0800 CST

How to process XML in Python: 5 possible alternatives and they all fail?

What options exist for processing an XML document in Python? I propose to review the most prominent ones available today, using XML documents that resist being processed. Let's see:

I will take as XML source the one on this page:

tennis matches

It lets us query a database and, depending on the parameters applied to the URL, returns XML files with the requested information; in this case, a list of sporting events:

For each event (<event></event>) we have, at a minimum, an <id>, a name (<name>), and chronological information describing its start (<start>).

I take it for granted that these XML documents are well-formed, since if they were not, errors would also appear on the page that uses them.

There are many different ways to handle an XML document, but to keep things short, I will settle for relating each event to its ID and start time; that is, the end result could be a Python dictionary like this:

dic_eventos = {1076553300890015:{'name':'Manuel Guinard vs Evgeny Donskoy',
                                 'start':datetime.strptime('2019-03-27T10:23:00.000Z', '%Y-%m-%dT%H:%M:%S.000Z')},
               1075839532180015:{'name':'Jurgen Zopp vs Mikael Torpegaard',
                                 'start':datetime.strptime('2019-03-27T10:40:00.000Z', '%Y-%m-%dT%H:%M:%S.000Z')},
               ....}

But the problem is that we don't even get as far as parsing the XML document; that is, we can't operate on it because we can't convert it to Python objects of any kind.

The process of downloading the XML document is trivial:

import requests
print(requests.get('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis').text)

First a Response object is obtained and then, by reading its .text attribute, the content of the XML as plain text (str).

Below I present the 5 approaches to processing these documents that I have tried, and the reason each one failed.

Requests_XML+XPath:

author page

XPath Usage Examples

Code used:

from requests_xml import XMLSession

session = XMLSession()

maio = session.get('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis')

eventos = maio.xml.xpath('//event')

Exception thrown:

Traceback (most recent call last):
  File "C:\Users\usuario\maio.py", line 37, in <module>
    event = maio.xml.xpath('//event')
  File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_xml.py", line 224, in xpath
    selected = self.lxml.xpath(selector)
  File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_xml.py", line 120, in lxml
    self._lxml = etree.fromstring(self.raw_xml)
  File "src\lxml\etree.pyx", line 3222, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1765, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Beautifulsoup+lxml:

author page

Code used:

import requests
from bs4 import BeautifulSoup

muno = requests.get("https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis")

maio = BeautifulSoup(muno.text, "xml")

print(maio)

Result:

<?xml version="1.0" encoding="utf-8"?>

That doesn't help us, because we can't retrieve the information about the events, which was what we were looking for.

XML.ElementTree:

SO-ES question I'm basing it on

Code used:

import urllib.request
import xml.etree.ElementTree as ET

url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
uh = urllib.request.urlopen(url)
data = uh.read()
commentinfo = ET.fromstring(data)

Exception thrown:

Traceback (most recent call last):
  File "C:/Users/usuario/maio.py", line 48, in <module>
    commentinfo = ET.fromstring(data)
  File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\xml\etree\ElementTree.py", line 1314, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

eval():

Since we have a str object that is 'almost' like a Python dictionary, we could consider actually turning it into one:

import requests

maio = requests.get("https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=mlb&in-running-flag=false")

print((eval(maio.text)))

But no, unfortunately it is not possible either:

Traceback (most recent call last):
  File "C:/Users/usuario/maio.py", line 56, in <module>
    print((eval(maio.text)))
  File "<string>", line 1, in <module>
NameError: name 'false' is not defined

XMLtodict:

SO-ES question I'm basing it on

Code used:

import urllib.request
import xmltodict

url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
data = urllib.request.urlopen(url)
parsed_data = xmltodict.parse(data.read())

Exception thrown:

Traceback (most recent call last):
  File "C:/Users/usuario/maio.py", line 36, in <module>
    parsed_data = xmltodict.parse(data.read())
  File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0

After all this ordeal, you will understand that processing XML documents in Python seems to me quite a feat, to say the least. But I don't think this demonstrates that it is impossible; surely it can be done, and surely there are several details that I am overlooking. Are they obvious to you?

3 Answers

abulafia · Answer 1 · 2020-03-28T09:16:55+08:00

The problem is that the server you are accessing examines the Accept header your client sends in order to decide whether to respond with JSON or XML.

When you test from the browser, it sends this header along with the request:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

in which it declares a preference for HTML or XHTML, failing that XML, and failing that any other format. The server honors these preferences and responds with an XML document.

In contrast, when you test from Python using requests, the library sends this other header by default:

Accept: */*

which indicates that you have no format preference and anything will do. In that case the server evidently chooses to send the response as JSON, which is why all your attempts to parse it as XML failed.
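To make the preference order in those two headers concrete, here is a minimal sketch that splits an Accept header into media types and their q weights. The function name parse_accept is made up for illustration, the parsing is deliberately simplified, and nothing here is meant to describe how this particular server actually negotiates content.

```python
def parse_accept(header):
    """Return (media_type, q) pairs, most preferred first (simplified)."""
    prefs = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0]
        q = 1.0  # a media type without an explicit q parameter defaults to 1.0
        for param in pieces[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(prefs, key=lambda item: item[1], reverse=True)


browser_prefs = parse_accept(
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
)
requests_prefs = parse_accept("*/*")
```

With the browser header, application/xml ranks above the */* wildcard; with the default requests header there is only the wildcard, so the server is free to pick whatever format it likes.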

You just need to send the appropriate header to receive the document in XML, that is:

import requests

url = 'https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis'
r = requests.get(url, headers={"Accept": "application/xml"})

and r.content will then hold the XML document.
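Once the XML bytes are in hand, the dictionary described in the question could be built roughly as follows. This is only a sketch: SAMPLE_XML is a hypothetical document whose layout (<event> elements with <id>, <name> and <start> children, and no XML namespaces) is assumed from the description in the question rather than copied from the real response, and events_from_xml is an illustrative name.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample; the real response may be structured differently.
SAMPLE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<events>
  <event>
    <id>1076553300890015</id>
    <name>Manuel Guinard vs Evgeny Donskoy</name>
    <start>2019-03-27T10:23:00.000Z</start>
  </event>
  <event>
    <id>1075839532180015</id>
    <name>Jurgen Zopp vs Mikael Torpegaard</name>
    <start>2019-03-27T10:40:00.000Z</start>
  </event>
</events>"""

START_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"  # same format string as in the question


def events_from_xml(xml_bytes):
    """Map each event id to its name and start time."""
    root = ET.fromstring(xml_bytes)
    result = {}
    for event in root.iter("event"):
        event_id = int(event.findtext("id"))
        result[event_id] = {
            "name": event.findtext("name"),
            "start": datetime.strptime(event.findtext("start"), START_FORMAT),
        }
    return result


dic_eventos = events_from_xml(SAMPLE_XML)
```

With a real response you would pass r.content instead of SAMPLE_XML. If the actual document declares a namespace, the tag names in iter() and findtext() would need to include it.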

However, as another answer has already pointed out, JSON can be much simpler to process. Unless you are forced to work with XML, I would prefer to use the JSON response.

Patricio Moracho · Answer 2 · 2020-03-28T07:59:15+08:00

My answer goes along the same lines as what you have already been told. What you are receiving is JSON, not XML, so handling it is even a bit simpler:

import urllib.request
import json

url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
data = urllib.request.urlopen(url)

d = json.loads(data.read())
for e in d['events']:
  print(e['id'], e['name'])

1076553296870016 Aslan Karatsev vs Quentin Halys
1075756125600015 Jurgen Melzer vs Pablo Andujar
1076553356490016 Facundo Bagnis vs Gianluca Mager
1076701387430016 Stefano Travaglia vs Nicolas Mahut
1077302532250015 Miami Open Double
1076177074730015 Qiang Wang vs Simona Halep
1069506693330015 WTA Miami 2019
1076413142330015 Gregoire Barrere vs Jelle Sels
1075839531500015 Viktor Troicki vs Mirza Basic
1075978769150016 Daniil Medvedev vs Roger Federer
1070308857800015 ATP Miami 2019
1076864676570015 Roberto Bautista Agut vs John Isner
1076183849220015 Karolina Pliskova vs Marketa Vondrousova
1076739116680016 Felix Auger Aliassime vs Borna Coric
1077338546190016 Alejandro Davidovich Fokina vs Carlos Taberner
1076828485460015 Alessandro Giannessi vs Raul Brancaccio
1077431606420015 Benoit Paire vs Steven Diez
1076828449700016 Dennis Novak vs Antoine Hoang
1076828446190015 Filip Horansky vs Maxime Janvier
1077338563450016 Jiri Vesely vs Andrea Arnaboldi
1077338494290015 Mikael Torpegaard vs Kamil Majchrzak
1076828488980016 Pedro Martinez vs Guillermo Garcia-Lopez
1076828452010015 Ricardas Berankis vs Sebastian Ofner
1077338497410016 Roman Safiullin vs Evgeny Donskoy
1077082658670015 Denis Shapovalov vs Frances Tiafoe
1077047365530015 Anett Kontaveit vs Ashleigh Barty

Any of your other options should work too, as long as we assume that what we receive is JSON, which we interpret with json.loads(<json data>). The return value in this case is a dictionary that we can access with any of the usual techniques.
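As an illustration of that last point, the dictionary the question asks for could be built from the decoded JSON along these lines. This is a sketch under assumptions: SAMPLE_JSON is a hypothetical excerpt, the "start" and "in-running-flag" fields are assumed (only "events", "id" and "name" appear in the output above), and events_from_json is an illustrative name.

```python
import json
from datetime import datetime

# Hypothetical excerpt; field names other than events/id/name are assumed.
SAMPLE_JSON = """{"offset": 0, "per-page": 100, "events": [
  {"id": 1076553300890015, "name": "Manuel Guinard vs Evgeny Donskoy",
   "start": "2019-03-27T10:23:00.000Z", "in-running-flag": false},
  {"id": 1075839532180015, "name": "Jurgen Zopp vs Mikael Torpegaard",
   "start": "2019-03-27T10:40:00.000Z", "in-running-flag": false}
]}"""

START_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"  # same format string as in the question


def events_from_json(text):
    """Map each event id to its name and start time."""
    payload = json.loads(text)  # JSON false/true/null become False/True/None
    return {
        event["id"]: {
            "name": event["name"],
            "start": datetime.strptime(event["start"], START_FORMAT),
        }
        for event in payload["events"]
    }


dic_eventos = events_from_json(SAMPLE_JSON)
```

The lowercase false in the sample is the kind of literal that made eval() raise NameError in the question; json.loads handles it because it implements the JSON grammar rather than Python's.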

It is always a good idea to check the "raw" content and see what it is; for example, to look at the first 30 bytes received:

print(data.read()[:30])
b'{"offset":0,"per-page":100,"to'

We can clearly see that it is JSON and not XML.
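The same check can be wrapped in a small helper that is run before choosing a parser. This is a rough heuristic rather than real format detection, and sniff_format is a made-up name; it only looks at the first non-whitespace byte.

```python
def sniff_format(body):
    """Guess whether a response body (bytes) looks like JSON or XML."""
    head = body.lstrip()[:1]  # first non-whitespace byte, or b"" if empty
    if head in (b"{", b"["):
        return "json"
    if head == b"<":
        return "xml"
    return "unknown"


json_guess = sniff_format(b'{"offset":0,"per-page":100,"to')
xml_guess = sniff_format(b'<?xml version="1.0" encoding="utf-8"?>')
```

Checking the Content-Type header of the response is the more reliable signal when the server sets it; peeking at the body is a quick fallback while debugging.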

Sebastián Miranda · Answer 3 · 2020-03-28T08:00:43+08:00

OK, following on from what was discussed:

You have the option of using dicttoxml to convert from JSON to XML in Python. I have Anaconda installed to work with Python, which lets me use the Anaconda Prompt to install modules.

The command is:

pip install dicttoxml

The project page has an example that is valid for Python 2.x; I have version 3.x installed, so the code looks like this:

import json
import urllib.request
import dicttoxml  # third-party module, installed with pip as shown above
page = urllib.request.urlopen('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis')
content = page.read()
obj = json.loads(content)  # decode the JSON response into a Python dict
print(obj)
xml = dicttoxml.dicttoxml(obj)  # convert the dict into an XML document
print(xml)

If you check the terminal, it will first show the data returned by the page and then the conversion to XML with all the tags. You can try some of your alternatives by passing them the xml variable from this code.

Cheers
