Another regex question. I wrote a regex to run against the text of a PDF extracted with the tika library; that is, the PDF text is stored in a variable as a unicode string.
'^[A-Z]\S{2,} *(?:\n+ *\S+ *)*?\n*.*?\d+ +\d+(?:[.,]\d+)?%'
With it I want to get:
Analista programador-DyD 1 49,54%
Programador-DyD 1 50,46%
TOTAL 2 100%
When printed with print(), the text looks like this: [image]
If we display the content of the variable without print(), we get this: [image]
That is, wherever \n appears it is actually a line break, as can be seen in the first image, where the content of the variable is shown with the print() function.
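To make the difference between the two displays concrete, here is a minimal sketch. The sample string is made up for illustration and is not the real tika output; it only shows that the same string prints as several lines but shows the escape \n when viewed through repr(), which is what the interpreter echoes.

```python
# Hypothetical two-line sample, not the real PDF text.
text = "Analista programador-DyD 1 49,54%\nTOTAL 2 100%"

# repr() is what the interpreter echoes when you type the variable name:
# the newline is displayed as the two characters backslash + n.
shown_by_repr = repr(text)

print(text)           # the newline is rendered as a real line break
print(shown_by_repr)  # the newline is rendered as the escape \n
```

In both cases the variable holds the same single newline character; only the display changes.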
When I paste this text into regex101.com, it is captured as I want, but when I run the script it always returns an empty list (I use findall from the re module).
Both this link and the one above show how it matches. Note that on regex101.com I replaced the \n sequences shown in the raw variable (without print(), without converting to str, just the plain unicode value) with real line breaks, so that regex101.com doesn't treat \n as literal text.
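For reference, a minimal sketch of the findall call described above. The sample text is an assumption (the real tika output is not reproduced here), and the pattern is written as a raw string. One thing that may explain the difference, assuming the flags on regex101.com were left at their defaults (global and multiline, as far as I know): without re.MULTILINE, ^ only matches at the very start of the whole string, not at the start of each line.

```python
import re

# Hypothetical sample standing in for the tika output; the real text differs.
text = (
    "Staff report\n"
    "Analista programador-DyD 1 49,54%\n"
    "Programador-DyD 1 50,46%\n"
    "TOTAL 2 100%\n"
)

pattern = r'^[A-Z]\S{2,} *(?:\n+ *\S+ *)*?\n*.*?\d+ +\d+(?:[.,]\d+)?%'

# Without flags, ^ can only match at position 0 of the string.
without_flag = re.findall(pattern, text)

# With re.MULTILINE, ^ also matches right after each newline.
with_flag = re.findall(pattern, text, flags=re.MULTILINE)

print(without_flag)
print(with_flag)
```

On this sample the first call returns an empty list and the second returns the three lines I am after. Whether this is the cause with the real text depends on how that text begins and on which flags were active on regex101.com.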
Now the question: why does it work on the website but not when I pass the unicode text in my script?
Thank you very much for your time!!