Another regex question. I have written a regex to run against the text of a PDF extracted with the tika library, that is, the PDF text stored in a variable as a unicode string.
'^[A-Z]\S{2,} *(?:\n+ *\S+ *)*?\n*.*?\d+ +\d+(?:[.,]\d+)?%'
With it I want to get:
Analista programador-DyD 1 49,54%
Programador-DyD 1 50,46%
TOTAL 2 100%
When shown with print(), the text looks like this:
If we display the content of the variable without print(), we get this:
That is, wherever \n appears, it is actually a line break, as the first image shows, where the content of the variable is displayed through the print() function.
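A minimal sketch of that difference, using a made-up two-line string rather than the actual PDF text (the variable names are only for illustration):

```python
# Made-up sample; the real text extracted by tika is much longer.
text = "TOTAL 2\n100%"

# print() renders the newline character as an actual line break.
print(text)

# repr() is what the interpreter shows when echoing the variable:
# the same newline appears as the two-character escape \n.
shown = repr(text)
print(shown)
```

Either way the string holds a single newline character; only the display differs.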
When I paste this text into regex101.com, it is captured the way I want, but when I run the script it always returns an empty list (I use the findall method of the re module).
Both in this link and in the one above you can see how it matches. Note that on regex101.com I replaced the \n shown in the raw variable (without using the print() function, converting to str, or anything else, pure unicode) with real line breaks, so that regex101.com doesn't treat \n as literal text.
Now the question: why does it work on the website but not when I pass the unicode text in the script?
Thank you very much for your time!!
If you look at the regex101 page, the regular expression has certain flags enabled: specifically, the "Global" and "Multiline" options. The "Global" option is irrelevant when you use findall() (although it does matter for match()), but the "Multiline" option is essential: with it, ^ matches at the beginning of every line, whereas without it ^ matches only at the beginning of the string. If you turn it off on regex101 you will see that it no longer finds anything.
In Python these flags are enabled through an additional parameter of findall(). In this case that is re.MULTILINE, passed as the third argument. With it, findall() no longer returns an empty list on the text I copied from the regex101 page.
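As a rough sketch of that call: the pattern is the one from the question, but the sample text is a hypothetical stand-in for the PDF content (including the made-up header line), and the variable names are only for illustration.

```python
import re

# Pattern taken from the question, written as a raw string.
pattern = r'^[A-Z]\S{2,} *(?:\n+ *\S+ *)*?\n*.*?\d+ +\d+(?:[.,]\d+)?%'

# Hypothetical sample text: a header line followed by the three rows.
text = (
    "Puesto Total Porcentaje\n"
    "Analista programador-DyD 1 49,54%\n"
    "Programador-DyD 1 50,46%\n"
    "TOTAL 2 100%\n"
)

# Without the flag, ^ can only match at the very start of the string.
without_flag = re.findall(pattern, text)

# With re.MULTILINE, ^ also matches right after every newline.
with_flag = re.findall(pattern, text, re.MULTILINE)

# The same flag can be set inline by prefixing the pattern with (?m).
with_inline = re.findall('(?m)' + pattern, text)

print(without_flag)
for row in with_flag:
    print(row)
```

On this sample, the first call returns an empty list, while the other two return the three rows. Since the pattern has no capturing groups, findall() returns the full text of each match.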