I have a fairly long text containing names and other data. Let's say the data looks like this:
[spaces]Elnombre1[spaces](237)
[spaces]Elnombre3(237)
[spaces]Elnombre4(17)
I just need to get the names. Normally a name has spaces before and after it, followed by parentheses with a number inside.
Also, I need to add some text inside parentheses (any text will do).
Expected result:
nombremio123textoquepuse(losparentesis)
I tried with:
with open("e.txt", 'r+') as f:
    texto = re.sub('^\s+([a-zA-Z-0-9]+)\s*', f.read())
    f.seek(0)
    f.write(texto)
    f.truncate()
Is there a way to do this by reading a text file and rewriting it with the corrected data?
One idea: split the line at the opening parenthesis and then call strip() on the first element, to remove any spaces it may have at the beginning and at the end.
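A minimal sketch of this idea, assuming every line looks like the samples in the question. The function name `extract_name` is only illustrative:

```python
def extract_name(line):
    # Split at the opening parenthesis and keep the first piece,
    # then strip the whitespace around it.
    return line.split("(")[0].strip()


# Sample lines modelled on the question's data.
lines = ["   Elnombre1   (237)", "   Elnombre3(237)", "   Elnombre4(17)"]
names = [extract_name(line) for line in lines]
print(names)  # ['Elnombre1', 'Elnombre3', 'Elnombre4']
```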
Update
To work on a file, if the number of lines is not huge, one approach is to read it first, process the lines while accumulating the results in a list, and write them out afterwards.
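A rough sketch of that read-then-write approach. The function name `rewrite_names_in_place` is made up for illustration, and skipping blank lines is an assumption of mine rather than something the question asks for:

```python
def rewrite_names_in_place(path):
    # First pass: read every line and accumulate the cleaned names in a list.
    with open(path, "r", encoding="utf-8") as f:
        results = [line.split("(")[0].strip() for line in f if line.strip()]

    # Second pass: reopen the same file for writing (this truncates it)
    # and write one name per line.
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(name + "\n" for name in results)
```

Calling `rewrite_names_in_place("e.txt")` would then leave only the names in the file.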
The drawback of this approach is that you have to have the results in memory before writing them. This shouldn't be a problem unless the file is monstrous, but if it were a problem then it would be better to open two files (the original for reading and the results for writing) and write the lines as they are processed instead of storing them in a list. In the end, once the files are closed, you could rename the output file and give it the same name as the input.
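The two-file variant could look roughly like this. The temporary file name (the original path plus `.tmp`) and the function name are assumptions chosen for the example; `os.replace` does the final rename over the original:

```python
import os


def rewrite_names_streaming(path):
    tmp_path = path + ".tmp"  # hypothetical name for the output file
    with open(path, "r", encoding="utf-8") as src, \
         open(tmp_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines (an assumption)
                # Write each result as soon as it is computed,
                # so nothing accumulates in memory.
                dst.write(line.split("(")[0].strip() + "\n")
    # Both files are closed here; give the output the input's name.
    os.replace(tmp_path, path)
```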
Update 2
In a later edit of the question, the asker also wants to extract what is inside the parentheses and add extra text to it (as a prefix, as I understand it).
For this type of processing, it becomes preferable to build a regular expression that captures the parts of the line that are of interest. However, regular expressions are known to be a touchy subject, and there is already another answer that shows how to use them, so I will show the "handmade" solution here (although it is not the one I would recommend in general).
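A sketch of what such a handmade solution might look like. The names `parse_line` and `rebuild`, the default prefix, and the exact output format are assumptions, since the question does not pin them down:

```python
def parse_line(line):
    # Split once at the first "(": [0] is the name, [1] is the rest.
    parts = line.split("(", 1)
    name = parts[0].strip()
    # The rest ends with ")"; strip whitespace, then drop that last character.
    inside = parts[1].strip()[:-1]
    return name, inside


def rebuild(line, extra="extra-"):
    # Assumed output format: name followed by "(prefix + original content)".
    name, inside = parse_line(line)
    return f"{name}({extra}{inside})"


print(rebuild("   Elnombre1   (237)\n"))  # Elnombre1(extra-237)
```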
To extract what is inside the parentheses, we can take advantage of the fact that we have already split the line at the `(`, so the rest of the line will be in element `[1]`. Just remove the last character (which will be the `)`) to get what was inside the parentheses.

Another way to get the name is to use a regular expression.
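As a hedged sketch, the pattern from the question can be applied line by line with `re.match`. The helper name `name_from_line` is invented for the example, and returning `None` for non-matching lines is my own choice:

```python
import re

# Pattern taken from the question: leading whitespace, then the captured name.
NAME_RE = re.compile(r"^\s+([a-zA-Z-0-9]+)\s*")


def name_from_line(line):
    m = NAME_RE.match(line)
    # group(1) is the text captured by the parentheses in the pattern.
    return m.group(1) if m else None


for line in ["   Elnombre1   (237)", "   Elnombre3(237)", "   Elnombre4(17)"]:
    print(name_from_line(line))
```

Note that `\s+` requires at least one leading whitespace character, so a line that starts directly with the name would not match this pattern.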
Explanation
What the regular expression `^\s+([a-zA-Z-0-9]+)\s*` does is look, at the start of the line, for a whitespace character `\s` at least once (`+`), continue with letters, digits, or hyphens (the parentheses `()` capture the name), and finally match zero or more (`*`) whitespace characters `\s`.

Replace capturing a part of the text
Taking a line such as "El nombre 1 (237)" as an example, the pattern matches, in order: the beginning of the text `^`, optional spaces ` *`, any number of characters `.*?`, optional spaces ` *`, the parentheses with the number `\(\d+\)`, and the end of the text `$`.
I used `.*?` with the `?` at the end to tell it to match "as little as possible". This is a lazy quantifier, and this way it does not consume the spaces that come after the name.
In that same construction I used a dot (`.`), which matches any character except a newline, but you could perfectly well restrict it to whatever you want, for example `[\w .,;!áéíóúüñ]*?`, or any character except spaces, `[^ ]*?`, etc.
What we're going to do is capture the name. When parentheses are used in a regular expression, the matched text is captured and saved, so that text can be used in the replacement with `\1`.
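To illustrate the effect of the lazy quantifier, here is a small comparison. The pattern is pieced together from the parts described above, and the sample line is an assumption:

```python
import re

line = "   El nombre 1   (237)"

# Lazy: the group stops as early as possible, leaving the
# trailing spaces to be matched by the " *" that follows it.
lazy = re.match(r"^ *(.*?) *\(\d+\)$", line).group(1)

# Greedy: the group grabs as much as it can, trailing spaces included.
greedy = re.match(r"^ *(.*) *\(\d+\)$", line).group(1)

print(repr(lazy))    # 'El nombre 1'
print(repr(greedy))  # 'El nombre 1   '
```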
Regex: `^ *(.*?) *\(\d+\)$`
Replacement: `\1(otrotexto)`
Here `otrotexto` should not contain any `\` (or you should escape them as `\\`).
Demonstration: https://ideone.com/s16Wrx
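A sketch of how such a replacement might be applied with `re.sub`. The pattern is reconstructed from the description above, and the `re.MULTILINE` flag (so that `^` and `$` apply to each line of a longer text) and the function name are assumptions of mine:

```python
import re

# Beginning of line, optional spaces, captured name (lazy),
# optional spaces, "(digits)", end of line.
PATTERN = re.compile(r"^ *(.*?) *\(\d+\)$", re.MULTILINE)


def replace_parenthesised(text):
    # \1 inserts the captured name; "(otrotexto)" is literal replacement text.
    return PATTERN.sub(r"\1(otrotexto)", text)


sample = "   Elnombre1   (237)\n   Elnombre3(237)\n   Elnombre4(17)"
print(replace_parenthesised(sample))
```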
Replace inner spaces with `_`
To convert "El nombre 1 (237)" to "El_nombre_1(otrotexto)", we pass a function as the replacement argument of re.sub(). Let's use a lambda.
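One possible shape for that lambda, as a sketch. The pattern and the `re.MULTILINE` flag are the same assumptions as before, and `underscore_names` is an illustrative name:

```python
import re

PATTERN = re.compile(r"^ *(.*?) *\(\d+\)$", re.MULTILINE)


def underscore_names(text, extra="otrotexto"):
    # The lambda receives each match object: it takes the captured name,
    # swaps its inner spaces for underscores, and appends the new parentheses.
    return PATTERN.sub(
        lambda m: m.group(1).replace(" ", "_") + "(" + extra + ")",
        text,
    )


print(underscore_names("El nombre 1 (237)"))  # El_nombre_1(otrotexto)
```

Because the replacement is built by a function rather than a template string, backslashes in `extra` are not interpreted as escapes here.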