To give you some context: I have written a couple of scripts that use regular expressions through Python's re module.
    def get_document_emails(pdf_format_content):
        """ The function runs through the document and extracts all the email addresses it finds.
        This method returns an ordered list of document emails without repeating. """
        new_list = []
        document_emails = re.findall(r'[w.\w]*@w*.[w.\w]*', pdf_format_content)
        for i in document_emails:
            if i not in new_list and not str(i).endswith('.'):
                new_list.append(i)
        return sorted(new_list)

    def get_document_provider(pdf_format_content):
        """ Return the name of the provider """
        return re.match(r'(PROVEEDOR:)+(.*?\\n)', pdf_format_content)
The problem arises when executing the second function. I use the tika library to extract the text from the PDFs and then call these functions to extract the emails and the supplier data from the document.
I have tested both regular expressions in Regex101, and there they capture what I need. When I run my scripts in the console with ipython, the first one works fine but the second one doesn't. I've tried the findall(), match() and search() functions, and they all return None or [].
The first function works on a unicode string and returns the list of emails from the document without problems, in unicode format, but the second function does not. I have also tried encoding the text as utf-8, but that gives me an encoding error on some characters, and converting it to a str with the following statement: fpdf = fpdf.encode('utf-8').strip(). The result is the same: either an empty list or None.
I've read the documentation for the re module, searched multiple sites on the web and tried a bunch of lines of code, but I always get the same result.
What bothers me the most is that I can't understand why it works with the emails but not with the other regex.
The image shows that the pattern works correctly.
In this other image I show the console output. I have removed the sensitive information, but I think you can still see what I want to show. If anyone can help me fix this I would be very grateful.
Thank you all!!!
EDIT
After the colleagues' comments I corrected the regular expression, leaving it as r'(PROVEEDOR:)+(.*?\n)', and tried again to see if that solved it.
It did not. I still don't understand why the first expression works and finds results while the second one doesn't, on the same variable that contains the text in unicode format.
I attach the complete code that I use:
    # -*- coding: UTF-8 -*-
    import re
    from tika import parser

    def get_pdf_content(path):
        """ Return the text content from the file given through path variable. """
        pdf = parser.from_file(path)
        return pdf['content']

    def format_pdf_content(pdf_content):
        """ The method formats the content of the pdf, removes the line breaks and returns a unicode string. """
        variable = filter(lambda i: i != '\r', pdf_content)
        return "".join(variable)

    def get_document_emails(pdf_format_content):
        """ The function runs through the document and extracts all the email addresses it finds.
        This method returns an ordered list of document emails without repeating. """
        new_list = []
        document_emails = re.findall(r'[w.\w]*@w*.[w.\w]*', pdf_format_content)
        for i in document_emails:
            if i not in new_list and not str(i).endswith('.'):
                new_list.append(i)
        return sorted(new_list)

    def get_document_provider(pdf_format_content):
        """ Return the name of the provider """
        return re.match(r'(PROVEEDOR:)+(.*?\r)', pdf_format_content)
Thank you very much again!!
Aside from the initial confusion about whether or not the backslash should be doubled, you also have a problem with the regular expression you use to find the provider, and with the function of the `re` package you use to apply it. So let's go step by step.
The character `\`
Although I think this is already clear, I explain it in more detail below just in case.
Within a quoted string, the character `\` is considered special, and its meaning depends on which character follows it. If an `n` follows, the pair `\n` actually represents a single character called "newline" (or line feed), whose ASCII code is 10. If an `r` follows, the pair `\r` is a single character called "carriage return", whose ASCII code is 13. These two characters usually appear together in the order `\r\n`, also called "CRLF", but that depends on the operating system (on Unix it is more usual for `\n` to appear alone).
If another `\` appears after the `\`, then the pair `\\` represents a single character, the backslash itself (ASCII 92). Because folders in Windows are separated by `\`, you have to double it when a path appears inside a Python string.
In regular expressions the character `\` is also special to the regular expression itself, since it is commonly used to express categories of characters. For example, `\d` represents "a digit". Since regular expressions in Python are stored in strings, the `\` has to be doubled when it is used to specify such categories. Thus, the regular expression "one or more digits", which is `\d+`, would be written in a Python string as `ejemplo = "\\d+"` (doubled to remove its special meaning within the string, so that only one is stored). Thus `len(ejemplo)` would be 3, `ejemplo[0]` would be the character `\`, and `ejemplo[1]` would be the `d`.
If we don't want `\` to have any special meaning in a Python string, we can use raw strings, which are preceded by an `r`. This avoids having to double that character each time it appears, which can be useful for Windows paths: `ruta = r'C:\Users\abulafia\Documentos\Mi Carpeta\Otra carpeta'`. In return we lose the ability to express a newline, since `r"\n"` stores the sequence of two characters `\` and `n`, instead of just one (ASCII 10).
Since `\` is used a lot in regular expressions, raw strings are often used to hold them, which avoids having to double it. Thus, the regular expression "a sequence of one or more digits" can also be written as `ejemplo = r"\d+"`, and the variable would store exactly the same as in the case `ejemplo = "\\d+"`.
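To make the difference between ordinary and raw strings concrete, here is a small sketch that can be pasted into a Python session. The variable names `normal` and `raw` are only illustrative; they do not come from the original code.

```python
# Ordinary string: the doubled backslash is stored as a single character.
normal = "\\d+"
# Raw string: the backslash is stored exactly as typed.
raw = r"\d+"

same_content = (normal == raw)      # both hold the three characters \ d +
length = len(normal)
first_char = normal[0]              # the backslash
second_char = normal[1]             # the letter d

# In an ordinary string "\n" and "\r" are single characters.
newline_code = ord("\n")            # line feed
carriage_return_code = ord("\r")    # carriage return

# In a raw string the same text is two characters: a backslash and an n.
raw_newline_length = len(r"\n")

print(same_content, length, newline_code, carriage_return_code, raw_newline_length)
# prints: True 3 10 13 2
```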
To further complicate matters, the regular expression `r"\n"` actually contains two characters, but the regular expression engine treats the two together as representing a newline (in the same way that it takes `\d` to mean "a digit"), so there is no problem in always using raw strings for regular expressions.
An additional source of confusion appears when you dump a Python string to the console. Python chooses to represent strings in the output in a format that can be copied and pasted back into code. So if you make an assignment such as `ejemplo = r"\d+"` and then dump that variable in the console to see what it contains, Python displays it as a normal string (never raw), delimited by `'` and therefore with each `\` doubled: `'\\d+'`. You could copy that and assign it to another variable, which would then have the same value as `ejemplo`. But this output format is very confusing to users, who think that the string contains the `\` character twice, when in fact it contains it only once and merely its representation shows it doubled.
The regular expression `r'(PROVEEDOR:)+(.*?\r)'`
The first parentheses are unnecessary, since I understand that you do not want to capture the word "PROVEEDOR", only to find it. What you want to capture is what comes after it.
The `+` sign after those parentheses also seems to be wrong. That sign means "one or more repetitions of what precedes it". But what precedes it is the group `(PROVEEDOR:)`, which means you would be looking for "one or more repetitions of the text `"PROVEEDOR:"` in a row", that is, something like `"PROVEEDOR:PROVEEDOR:PROVEEDOR:"`. Since the case "one" is also allowed, it would accept a single `"PROVEEDOR:"` as valid, but I guess that repeated string never appears, so the `+` is superfluous.
Perhaps you meant "the string `"PROVEEDOR:"` followed by one or more spaces", but in that case you forgot to put a space in front of the `+`, like so: `(PROVEEDOR:) +`. This, however, requires at least one space. If there may be none, it is better to use `*` instead of `+`.
Finally comes the capturing group that interests you, which is "any sequence of characters, non-greedy, until the first `\r` appears". This is also wrong. First, because `\r` may not appear at all (whether that character is present depends on the operating system), so it would be safer to use `\n`. Second, because you don't want (I guess) the line break to be part of the result, so it is better placed outside the capturing group. And while you are at it, keep the possible spaces after the colon out of the capturing group too. Therefore, something like `r'PROVEEDOR: *(.*?)\n'`.
The function to use
You have used `re.match()`, but this function only returns a match object if the string matches the regular expression from its very beginning, which is not your case, since what you are looking for is in the middle of the text.
For that case it is better to use `re.search()`, assuming there is only one provider, or `re.findall()` if there may be several.
Suppose there is only one. `re.search()` looks for the first occurrence of the regular expression in the string and returns a match object. This contains the full text that matched (which also includes the text `"PROVEEDOR: "` and the spaces and line break we don't want), and it also contains the capturing groups, which is what you are interested in here, since the capturing group holds just the name of the provider. So your function should call `re.search()` and return that capturing group rather than the whole match.
Additional details
I don't see the purpose of removing the `\r`.
It is also not really necessary to include the `\n` inside the regular expression, since in the group `.*?` the dot by default means "any character except a newline", so the match and the capture group would stop as soon as a newline is found. So I think an expression without the `\n` will work just as well (or maybe better).
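A sketch of that variant follows. One assumption is added here: once nothing comes after the group, the non-greedy `.*?` is satisfied with an empty match, so the example uses the greedy `.*`, which still stops at the end of the line because the dot does not match a newline by default. The sample text is invented.

```python
import re

# Invented sample with two providers, for illustration only.
sample = "PROVEEDOR: ACME S.L.\nFECHA: 01/01/2020\nPROVEEDOR: Otra Empresa S.A.\n"

# Non-greedy group with nothing after it: matches as little as possible.
lazy = re.search(r'PROVEEDOR: *(.*?)', sample).group(1)

# Greedy group: runs to the end of the line, since "." excludes "\n".
greedy = re.search(r'PROVEEDOR: *(.*)', sample).group(1)

# findall() returns the captured group for every occurrence.
all_providers = re.findall(r'PROVEEDOR: *(.*)', sample)

print(repr(lazy))      # ''
print(greedy)          # ACME S.L.
print(all_providers)   # ['ACME S.L.', 'Otra Empresa S.A.']
```

If the text still contained `\r\n` line endings, the captured name would end in `\r`, because the dot does match that character; calling `.strip()` on the result would take care of it.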
You can fix it like this. You are grouping the provider label in the result and escaping the line break `\n` as `\\n`. In addition, you put the `\n` inside the result group rather than after it as a final character; it must go outside the group.
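The effect of the doubled backslash can be checked with a short sketch. In a raw string, `\\n` asks the regular expression engine for a literal backslash followed by the letter n, which does not occur in the text, whereas `\n` asks for a line break. The sample text is invented for the example.

```python
import re

# Invented sample text containing a real line break after the name.
sample = "PROVEEDOR: ACME S.L.\nFECHA: 01/01/2020\n"

# r'\\n' is the regex for a literal backslash followed by "n": no match.
escaped = re.search(r'PROVEEDOR:(.*?\\n)', sample)

# r'\n' is the regex for a line break; here it sits outside the group.
fixed = re.search(r'PROVEEDOR:(.*?)\n', sample)

print(escaped)                  # None
print(fixed.group(1).strip())   # ACME S.L.
```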