
Question

Ruben García Tutor

Asked: 2020-03-28 06:08:55 +0800 CST

How to process XML in Python: 5 possible alternatives, and they all fail?


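What options exist in Python for processing XML documents? I propose reviewing the most prominent ones today by way of an XML document that refuses to be processed. Let's see:

I will use the one on this page as the XML source:

Tennis matches

It lets us query a database and, depending on the parameters applied to the URL, returns an XML file with the requested information; in this case, we get a list of sporting events:

For each event (<event></event>) we have at least an <id>, a name (<name>) and the time information describing when it starts (<start>).

I take it for granted that these XML documents are well formed, because if they were not, the pages that consume them would show errors in turn.

There are many different ways to process an XML document, but to avoid going on unnecessarily I will settle for being able to associate each event with its ID and start time; that is, the end result could be a Python dictionary like this: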

dic_eventos = {1076553300890015:{'name':'Manuel Guinard vs Evgeny Donskoy',
                                 'start':datetime.strptime('2019-03-27T10:23:00.000Z', '%Y-%m-%dT%H:%M:%S.000Z')},
               1075839532180015:{'name':'Jurgen Zopp vs Mikael Torpegaard',
                                 'start':datetime.strptime('2019-03-27T10:40:00.000Z', '%Y-%m-%dT%H:%M:%S.000Z')},
               ....}

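But the problem is that we cannot even parse the XML document; that is, we cannot work with it because we cannot convert it into any kind of Python object.

Downloading the XML document is simple: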

import requests
print(requests.get('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis').text)

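First we obtain a Response object, then read its .text attribute to get the XML content as plain text (str).

Below are the 5 options I have tried for processing these documents, and why each one failed.

Requests-XML + XPath:

Author's page

XPath usage example

Code used: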

from requests_xml import XMLSession

session = XMLSession()

maio = session.get('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis')

eventos = maio.xml.xpath('//event')

It raises an exception:

Traceback (most recent call last):
  File "C:\Users\usuario\maio.py", line 37, in <module>
    event = maio.xml.xpath('//event')
  File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_xml.py", line 224, in xpath
    selected = self.lxml.xpath(selector)
  File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\requests_xml.py", line 120, in lxml
    self._lxml = etree.fromstring(self.raw_xml)
  File "src\lxml\etree.pyx", line 3222, in lxml.etree.fromstring
  File "src\lxml\parser.pxi", line 1877, in lxml.etree._parseMemoryDocument
  File "src\lxml\parser.pxi", line 1765, in lxml.etree._parseDoc
  File "src\lxml\parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src\lxml\parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

Beautiful Soup + lxml:

Author's page

Code used:

import requests
from bs4 import BeautifulSoup

muno = requests.get("https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis")

maio = BeautifulSoup(muno.text, "xml")

print(maio)

Result:

<?xml version="1.0" encoding="utf-8"?>

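This does not help us, because we cannot retrieve the event information, which is exactly what we are looking for.

xml.etree.ElementTree:

SO-ES question I based this on

Code used: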

import urllib.request
import xml.etree.ElementTree as ET

url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
uh = urllib.request.urlopen(url)
data = uh.read()
commentinfo = ET.fromstring(data)

It raises an exception:

Traceback (most recent call last):
  File "C:/Users/usuario/maio.py", line 48, in <module>
    commentinfo = ET.fromstring(data)
  File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\xml\etree\ElementTree.py", line 1314, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 0

eval():

Since we have a str object that looks "almost" like a Python dictionary, we might think of turning it into a real one:

import requests

maio = requests.get("https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=mlb&in-running-flag=false")

print((eval(maio.text)))

But no, unfortunately, this does not work either:

Traceback (most recent call last):
  File "C:/Users/usuario/maio.py", line 56, in <module>
    print((eval(maio.text)))
  File "<string>", line 1, in <module>
NameError: name 'false' is not defined

xmltodict:

SO-ES question I based this on

Code used:

import urllib.request
import xmltodict

url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
data = urllib.request.urlopen(url)
parsed_data = xmltodict.parse(data.read())

It raises an exception:

Traceback (most recent call last):
  File "C:/Users/usuario/maio.py", line 36, in <module>
    parsed_data = xmltodict.parse(data.read())
  File "C:\Users\usuario\AppData\Local\Programs\Python\Python36\lib\site-packages\xmltodict.py", line 327, in parse
    parser.Parse(xml_input, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 0

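After all these ordeals, you will understand that, to me, processing an XML document in Python seems a feat, to say the least. But I do not think this proves it is impossible. Of course it is possible. There must be many details I am not noticing. Are they obvious to you?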

3 Answers


abulafia · Answer 1 · 2020-03-28T09:16:55+08:00

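The problem is that the server you are accessing checks the Accept header sent by your client to decide whether to respond with JSON or XML.

When you test from a browser, it sends this header along with the request: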

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

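In it you express your preferences: HTML or XHTML first, then XML (q=0.9), and failing those, any other format (q=0.8). The server follows these preferences and responds with an XML document.

In contrast, when you test from Python with requests, the library sends a different header by default: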

Accept: */*

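This indicates that you have no preference about the format; anything will do. In that case, as you can see, the server chooses to send the response as JSON, which is why all your attempts to parse it as XML failed.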
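To make the two headers above concrete, here is a minimal sketch that splits an Accept header into media types and their q weights. The function name parse_accept is made up for illustration, and this only models the header format; how the Matchbook server actually chooses a format is not known from this thread.

```python
def parse_accept(header):
    """Return (media_type, q) pairs, highest preference first."""
    prefs = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0]
        q = 1.0  # a media type without an explicit q weighs 1.0
        for param in pieces[1:]:
            if param.startswith("q="):
                q = float(param[2:])
        prefs.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(prefs, key=lambda item: item[1], reverse=True)


browser_prefs = parse_accept(
    "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
)
default_prefs = parse_accept("*/*")

print(browser_prefs)
print(default_prefs)
```

With the browser header, application/xml ranks above the catch-all */*; with the bare */* header there is nothing to rank, so the server is free to pick.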

To receive the document as XML you only need to send the appropriate header, that is:

import requests

url = 'https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis'
r = requests.get(url, headers={"Accept": "application/xml"})

And in r.content you will have the XML document.
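Once the body really is XML, building the dictionary from the question could look roughly like this. It is a sketch under assumptions: SAMPLE_XML is a hand-written stand-in that only uses the <event>, <id>, <name> and <start> tags mentioned in the question, and the real document may nest them differently or use namespaces. The helper name events_from_xml is also made up.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample; the real document's layout is an assumption.
SAMPLE_XML = b"""<?xml version="1.0" encoding="utf-8"?>
<events>
  <event>
    <id>1076553300890015</id>
    <name>Manuel Guinard vs Evgeny Donskoy</name>
    <start>2019-03-27T10:23:00.000Z</start>
  </event>
  <event>
    <id>1075839532180015</id>
    <name>Jurgen Zopp vs Mikael Torpegaard</name>
    <start>2019-03-27T10:40:00.000Z</start>
  </event>
</events>"""

START_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"  # same format string as in the question


def events_from_xml(xml_bytes):
    """Map each event id to its name and start time."""
    root = ET.fromstring(xml_bytes)
    events = {}
    for event in root.iter("event"):  # finds <event> at any depth
        start = datetime.strptime(event.findtext("start"), START_FORMAT)
        events[int(event.findtext("id"))] = {
            "name": event.findtext("name"),
            "start": start,
        }
    return events


dic_eventos = events_from_xml(SAMPLE_XML)
print(dic_eventos)
```

With the real response, the call would be events_from_xml(r.content), provided the tag names match.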

However, as the other answer already told you, JSON is simpler to process. If you are not forced to use XML, I would prefer to use the JSON response.

Patricio Moracho · Answer 2 · 2020-03-28T07:59:15+08:00

My answer is in line with what you have already been told. What you are receiving is JSON, not XML, so processing it is a bit simpler:

import urllib.request
import json

url = "https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis"
data = urllib.request.urlopen(url)

d = json.loads(data.read())
for e in d['events']:
  print(e['id'], e['name'])

1076553296870016 Aslan Karatsev vs Quentin Halys
1075756125600015 Jurgen Melzer vs Pablo Andujar
1076553356490016 Facundo Bagnis vs Gianluca Mager
1076701387430016 Stefano Travaglia vs Nicolas Mahut
1077302532250015 Miami Open Double
1076177074730015 Qiang Wang vs Simona Halep
1069506693330015 WTA Miami 2019
1076413142330015 Gregoire Barrere vs Jelle Sels
1075839531500015 Viktor Troicki vs Mirza Basic
1075978769150016 Daniil Medvedev vs Roger Federer
1070308857800015 ATP Miami 2019
1076864676570015 Roberto Bautista Agut vs John Isner
1076183849220015 Karolina Pliskova vs Marketa Vondrousova
1076739116680016 Felix Auger Aliassime vs Borna Coric
1077338546190016 Alejandro Davidovich Fokina vs Carlos Taberner
1076828485460015 Alessandro Giannessi vs Raul Brancaccio
1077431606420015 Benoit Paire vs Steven Diez
1076828449700016 Dennis Novak vs Antoine Hoang
1076828446190015 Filip Horansky vs Maxime Janvier
1077338563450016 Jiri Vesely vs Andrea Arnaboldi
1077338494290015 Mikael Torpegaard vs Kamil Majchrzak
1076828488980016 Pedro Martinez vs Guillermo Garcia-Lopez
1076828452010015 Ricardas Berankis vs Sebastian Ofner
1077338497410016 Roman Safiullin vs Evgeny Donskoy
1077082658670015 Denis Shapovalov vs Frances Tiafoe
1077047365530015 Anett Kontaveit vs Ashleigh Barty

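Any of your other options should work as well, always assuming that what we receive is JSON, which we interpret with json.loads(<json data>); in this case it returns a dictionary that we can access with any of the usual techniques.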
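As a rough sketch of getting from that dictionary to the structure asked for in the question: the 'events', 'id' and 'name' keys appear in the loop above, but the 'start' key and its timestamp format are assumptions carried over from the question, and SAMPLE_JSON is a hand-written stand-in for the real response. The helper name events_from_json is made up.

```python
import json
from datetime import datetime

# Hypothetical payload; only 'events', 'id' and 'name' are confirmed above.
SAMPLE_JSON = """{"offset": 0, "per-page": 100, "events": [
  {"id": 1076553300890015, "name": "Manuel Guinard vs Evgeny Donskoy",
   "start": "2019-03-27T10:23:00.000Z"},
  {"id": 1075839532180015, "name": "Jurgen Zopp vs Mikael Torpegaard",
   "start": "2019-03-27T10:40:00.000Z"}
]}"""

START_FORMAT = "%Y-%m-%dT%H:%M:%S.000Z"  # same format string as in the question


def events_from_json(text):
    """Map each event id to its name and start time."""
    payload = json.loads(text)
    return {
        event["id"]: {
            "name": event["name"],
            "start": datetime.strptime(event["start"], START_FORMAT),
        }
        for event in payload["events"]
    }


dic_eventos = events_from_json(SAMPLE_JSON)
print(dic_eventos)
```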

It is always handy to inspect the "raw" content and see what it is, for example by looking at the first 30 bytes received:

print(data.read()[:30])
b'{"offset":0,"per-page":100,"to'

This makes it clear that it is JSON and not XML.
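The same check can be automated before choosing a parser. This is a minimal sketch: sniff_format is a hypothetical helper, and looking at the first byte is only a heuristic, not a substitute for reading the response's Content-Type header.

```python
def sniff_format(body: bytes) -> str:
    """Guess 'json', 'xml' or 'unknown' from the first meaningful byte."""
    if body.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        body = body[3:]
    head = body.lstrip()[:1]
    if head in (b"{", b"["):
        return "json"
    if head == b"<":
        return "xml"
    return "unknown"


print(sniff_format(b'{"offset":0,"per-page":100,"to'))
print(sniff_format(b'<?xml version="1.0" encoding="utf-8"?>'))
```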

Sebastián Miranda · Answer 3 · 2020-03-28T08:00:43+08:00

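Right.

Going by what has been discussed:

You can choose to use dicttoxml to convert from JSON to XML in Python. I installed Anaconda to work with Python, which lets me use the Anaconda Prompt to install modules.

The command is:

pip install dicttoxml

The page has an example for Python 2.x; I installed version 3.x, so the code looks like this: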

import json
import urllib.request
import dicttoxml

page = urllib.request.urlopen('https://www.matchbook.com/edge/rest/events?language=en&currency=GBP&price-mode=aggregated&exchange-type=back-lay&odds-type=DECIMAL&price-depth=3&price-order=price%20desc&include-event-participants=true&offset=0&per-page=100&market-states=open,suspended,closed&runner-states=open,suspended,closed&tag-url-names=tennis')
content = page.read()
# Decode the JSON response into Python objects
obj = json.loads(content)
print(obj)
# Convert the Python objects to an XML document
xml = dicttoxml.dicttoxml(obj)
print(xml)

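If you watch the terminal, it first shows the data obtained from the page, and then the conversion to XML with all its tags. You can try some of your alternatives by passing them this code's xml variable.

Cheers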
