
Question

Isaac Bruno Quiroga

Asked: 2020-05-14 16:51:02 +0800 CST

How to create a dictionary with a regex knowing what text to extract, but not if it appears in a given input?

I'm trying to extract data from a piece of HTML in which the information contained in a list can vary.

Below are the possible HTML fragments, each with its expected output.

Example 1

<ul>
    <li class="contentnode">
        <dl><dt>País</dt><dd>Uganda</dd></dl>
    </li>
    <li class="contentnode">
        <dl><dt>Ciudad</dt><dd>Foo</dd></dl>
    </li>
    <li class="contentnode">
        <dl><dt>Email</dt><dd>[email protected]</dd></dl>
    </li>
</ul>

Expected output:

{'country': 'Uganda', 'city': 'Foo', 'email': '[email protected]'}

Example 2

<ul>
    <li class="contentnode">
        <dl><dt>País</dt><dd>Uganda</dd></dl>
    </li>
    <li class="contentnode">
        <dl><dt>Ciudad</dt><dd>Foo</dd></dl>
    </li>
</ul>

Expected output:

{'country': 'Uganda', 'city': 'Foo', 'email': None}
# or alternatively
{'country': 'Uganda', 'city': 'Foo'}

Example 3

<ul>
    <li class="contentnode">
        <dl><dt>País</dt><dd>Uganda</dd></dl>
    </li>
</ul>

Expected output:

{'country': 'Uganda', 'city': None, 'email': None}
# or alternatively
{'country': 'Uganda'}

Example 4

<ul>
    <li class="contentnode">
        <dl><dt>Email</dt><dd>[email protected]</dd></dl>
    </li>
</ul>

Expected output:

{'country': None, 'city': None, 'email': '[email protected]'}
# or alternatively
{'email': '[email protected]'}

Details

I know which fields I want to extract, but not whether they are present in a given input.
The dictionary must be built completely in a single step, for reasons that are not relevant here. That is, if possible, there should be no prior check of whether a field is present, followed by extracting the value and adding it to the dictionary.
The task should rely on the regular expression as much as possible.
The input (HTML) that the regular expression receives has no line breaks. Example:

<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li><li class="contentnode"><dl><dt>Ciudad</dt><dd>Foo</dd></dl></li><li class="contentnode"><dl><dt>Email</dt><dd>[email protected]</dd></dl></li></ul>

Attempt

I have tried the following regular expression:

import re
# input
html = '<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li><li class="contentnode"><dl><dt>Ciudad</dt><dd>Foo</dd></dl></li><li class="contentnode"><dl><dt>Email</dt><dd>[email protected]</dd></dl></li></ul>'
# regex
pattern = r'<dt>(?:País</dt><dd>(?P<country>\w+)|Ciudad</dt><dd>(?P<city>\w+)|Email</dt><dd>(?P<email>[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+))</dd>'
m = re.search(pattern, html)
print(m.groupdict())

my output:       {'country': 'Uganda', 'city': None, 'email': None}
expected output: {'country': 'Uganda', 'city': 'Foo', 'email': '[email protected]'}

Thanks in advance. Greetings.

1 Answer

FJSevilla · Answer 1 · 2020-05-15T05:36:12+08:00

Assuming the li tags in your HTML always appear in that order (any of the three may be missing), you can wrap each li in a non-capturing group (?:...) and apply the ? quantifier (zero or one) to make it optional:

import re


html1 = '<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li><li class="contentnode"><dl><dt>Ciudad</dt><dd>Foo</dd></dl></li><li class="contentnode"><dl><dt>Email</dt><dd>[email protected]</dd></dl></li></ul>'
html2 = '<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li><li class="contentnode"><dl><dt>Ciudad</dt><dd>Foo</dd></dl></li></ul>'
html3 = '<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li></ul>'
html4 = '<ul><li class="contentnode"><dl><dt>Email</dt><dd>[email protected]</dd></dl></li></ul>'


pattern = re.compile(r'''
<ul>
    (?:<li\ class=\"contentnode\">
        <dl>
            <dt>País</dt>
            <dd>(?P<country>\w+)</dd>
        </dl>
    </li>)?.*?
    (?:<li\ class=\"contentnode\">
        <dl>
            <dt>Ciudad</dt>
            <dd>(?P<city>\w+)</dd>
        </dl>
    </li>)?.*?
    (?:<li\ class=\"contentnode\">
        <dl>
            <dt>Email</dt>
            <dd>(?P<email>[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)</dd>
        </dl>
    </li>)?
</ul>''', flags=re.VERBOSE)


for html in (html1, html2, html3, html4):
    m = pattern.search(html)
    print(m.groupdict())

{'country': 'Uganda', 'city': 'Foo', 'email': '[email protected]'}
{'country': 'Uganda', 'city': 'Foo', 'email': None}
{'country': 'Uganda', 'city': None, 'email': None}
{'country': None, 'city': None, 'email': '[email protected]'}

If you do not want to include the None values, you can filter the dictionary:

for html in (html1, html2, html3, html4):
    m = pattern.search(html)
    print({group: value for group, value in m.groupdict().items() if value is not None})

{'country': 'Uganda', 'city': 'Foo', 'email': '[email protected]'}
{'country': 'Uganda', 'city': 'Foo'}
{'country': 'Uganda'}
{'email': '[email protected]'}
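
The pattern above depends on the li elements appearing in a fixed order. If the order can vary, one possible variation (a sketch, not part of the original answer) is to put each field inside its own optional group within a positive lookahead, so that a single match call still fills the whole dictionary. The names FIELDS and extract_fields are hypothetical, and foo@example.com is a placeholder address used only for illustration.

```python
import re

# Hypothetical mapping: group name -> (label shown in <dt>, regex for the <dd> value)
FIELDS = {
    'country': ('País', r'\w+'),
    'city': ('Ciudad', r'\w+'),
    'email': ('Email', r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'),
}

# One zero-width lookahead per field. Each lookahead scans forward from the
# start of the input; the inner group is optional, so a missing field simply
# leaves its named group as None instead of making the match fail.
pattern = re.compile(''.join(
    rf'(?=(?:.*?<dt>{re.escape(label)}</dt><dd>(?P<{name}>{value})</dd>)?)'
    for name, (label, value) in FIELDS.items()
))


def extract_fields(html):
    """Return a dict with every field; absent fields map to None."""
    # The pattern is made only of optional lookaheads, so match() always succeeds.
    return pattern.match(html).groupdict()


# Email appears before País here, and Ciudad is missing.
html = ('<ul><li class="contentnode"><dl><dt>Email</dt><dd>foo@example.com</dd></dl></li>'
        '<li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li></ul>')
print(extract_fields(html))
# {'country': 'Uganda', 'city': None, 'email': 'foo@example.com'}
```

Each lookahead rescans the input from the beginning, so this trades some efficiency for order independence. It also assumes the input has no line breaks, as stated in the question, because . does not match a newline by default.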

Normally it is not a good idea to use regular expressions to parse HTML/XML, but I assume you have your reasons for using them instead of a dedicated parser.
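
For comparison, here is a rough sketch of the parser route using html.parser from the standard library. It is only an illustration under the assumption that each dt holds the label and the following dd holds the value; the names LABELS, FieldParser and parse_fields are hypothetical, and foo@example.com is a placeholder address.

```python
from html.parser import HTMLParser

# Hypothetical mapping from the <dt> label to the dictionary key
LABELS = {'País': 'country', 'Ciudad': 'city', 'Email': 'email'}


class FieldParser(HTMLParser):
    """Collects <dt>label</dt><dd>value</dd> pairs whose label is in LABELS."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._tag = None    # 'dt' or 'dd' while inside one of those tags
        self._label = None  # text of the most recent <dt>

    def handle_starttag(self, tag, attrs):
        if tag in ('dt', 'dd'):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ('dt', 'dd'):
            self._tag = None

    def handle_data(self, data):
        if self._tag == 'dt':
            self._label = data.strip()
        elif self._tag == 'dd' and self._label in LABELS:
            self.fields[LABELS[self._label]] = data.strip()


def parse_fields(html):
    """Return only the fields that are present in the HTML."""
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


html = ('<ul><li class="contentnode"><dl><dt>País</dt><dd>Uganda</dd></dl></li>'
        '<li class="contentnode"><dl><dt>Email</dt><dd>foo@example.com</dd></dl></li></ul>')
print(parse_fields(html))
# {'country': 'Uganda', 'email': 'foo@example.com'}
```

Unlike the regular expression, this version does not validate the value format and does not care about the order or the exact attributes of the li elements.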
