I'm trying to extract data from a piece of HTML whose list can contain a varying set of fields.
Below are the possible HTML fragments, each with its expected output.
Example 1
<ul>
<li class="contentnode">
<dl><dt>País</dt><dd>Uganda</dd></dl>
</li>
<li class="contentnode">
<dl><dt>Ciudad</dt><dd>Foo</dd></dl>
</li>
<li class="contentnode">
<dl><dt>Email</dt><dd>[email protected]</dd></dl>
</li>
</ul>
Expected output:
{'country': 'Uganda', 'city': 'Foo', 'email': '[email protected]'}
Example 2
<ul>
<li class="contentnode">
<dl><dt>País</dt><dd>Uganda</dd></dl>
</li>
<li class="contentnode">
<dl><dt>Ciudad</dt><dd>Foo</dd></dl>
</li>
</ul>
Expected output:
{'country': 'Uganda', 'city': 'Foo', 'email': None}
# or alternatively
{'country': 'Uganda', 'city': 'Foo'}
Example 3
<ul>
<li class="contentnode">
<dl><dt>País</dt><dd>Uganda</dd></dl>
</li>
</ul>
Expected output:
{'country': 'Uganda', 'city': None, 'email': None}
# or alternatively
{'country': 'Uganda'}
Example 4
<ul>
<li class="contentnode">
<dl><dt>Email</dt><dd>[email protected]</dd></dl>
</li>
</ul>
Expected output:
{'country': None, 'city': None, 'email': '[email protected]'}
# or alternatively
{'email': '[email protected]'}
Details
- I know in advance which fields I want to extract, but not whether each of them is present.
- For reasons that aren't relevant here, the dictionary must be built completely and in a single step. That is, if possible, there should be no prior check of whether a field exists followed by extracting its value and adding it to the dictionary.
- The task should rely on the regular expression as much as possible.
- The input (html) that the regular expression receives does not have line breaks. Example:
<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li><li class="contentnode"><dl><dt>Ciudad</dt><dd>Foo</dd></dl></li><li class="contentnode"><dl><dt>Email</dt><dd>[email protected]</dd></dl></li></ul>
Attempt
I have tried with the following regular expression:
import re
# input
html = '<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li><li class="contentnode"><dl><dt>Ciudad</dt><dd>Foo</dd></dl></li><li class="contentnode"><dl><dt>Email</dt><dd>[email protected]</dd></dl></li></ul>'
# regex
pattern = r'<dt>(?:País</dt><dd>(?P<country>\w+)|Ciudad</dt><dd>(?P<city>\w+)|Email</dt><dd>(?P<email>[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+))</dd>'
m = re.search(pattern, html)
print(m.groupdict())
my output: {'country': 'Uganda', 'city': None, 'email': None}
expected output: {'country': 'Uganda', 'city': 'Foo', 'email': '[email protected]'}
Thanks in advance. Greetings.
Assuming the li elements in your HTML always appear in that order (País, Ciudad, Email), you can wrap each li in a non-capturing group (?: ) and apply the quantifier ? (zero or one) to it, which makes each li optional.

If you don't want to include the None values, you can filter the dictionary.

Normally it's not a good idea to use regex to parse HTML/XML, but I guess you have your reasons for using regular expressions instead of a dedicated parser.
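As a rough illustration of the optional-group approach and the None filter described above, here is a minimal sketch. It is not necessarily the exact pattern originally posted. The LI template and the helper names extract and drop_missing are hypothetical, the email sub-pattern is the one from the question, and foo@example.com is a placeholder address. It assumes the input has no line breaks and that the fields, when present, keep the order País, Ciudad, Email.

```python
import re

# Hypothetical template: one <li> wrapped in an optional non-capturing group, (?: ... )?
LI = r'(?:<li class="contentnode"><dl><dt>{label}</dt><dd>(?P<{name}>{value})</dd></dl></li>)?'

# Email sub-pattern copied from the question.
EMAIL = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

# Anchoring on <ul> ... </ul> stops the all-optional pattern from
# matching the empty string at the start of the input.
pattern = re.compile(
    '<ul>'
    + LI.format(label='País', name='country', value=r'\w+')
    + LI.format(label='Ciudad', name='city', value=r'\w+')
    + LI.format(label='Email', name='email', value=EMAIL)
    + '</ul>'
)


def extract(html):
    """Return the dict of named groups (missing fields are None), or None if nothing matches."""
    m = pattern.search(html)
    return m.groupdict() if m else None


def drop_missing(fields):
    """Filter out the keys whose value is None."""
    return {key: value for key, value in fields.items() if value is not None}


html = ('<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li>'
        '<li class="contentnode"><dl><dt>Ciudad</dt><dd>Foo</dd></dl></li></ul>')

print(extract(html))                # {'country': 'Uganda', 'city': 'Foo', 'email': None}
print(drop_missing(extract(html)))  # {'country': 'Uganda', 'city': 'Foo'}
```

Because the groups are tried in a fixed sequence, a list whose items appear in a different order will not match at all, and extract returns None in that case.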