I'm trying to get Python to pick out certain lines (and slices of them) from a text file.
The text is as follows (the first 60 lines of 13000):
Linea 0 A 213992,"A 114416","05/01/2021","19/01/2021","N","E","1005","* "," 0"," 0"," 0"," 0"," 0",
Linea 1 A 114416,"CIOTOLO NORMA TELMA ","* ","POLA 1438 ","CAPITAL ","4682-1534 "," 74","F","1041","* "," 0"," "," ","17/01/2013","17/01/2013","1202","15052545070100 ","DNI",
Linea 2 C,"3755162 ","01/05/1938","* ","2","* ",
Linea 3 G
Linea 4 A 213992," "," "," "," "," ",
Linea 5 13
Linea 6 1 475I 1
Linea 7 2 941I 1
Linea 8 3 190I 1
Linea 9 4 192I 1
Linea 10 5 412I 1
Linea 11 6 4811I 1
Linea 12 7 865I 1
Linea 13 8 867I 1
Linea 14 9 902I 1
Linea 15 10 8298I 1
Linea 16 11 546I 1
Linea 17 12 711I 1
Linea 18 13 120N 1
Linea 19 A 213993,"A 129320","05/01/2021","12/01/2021","N","E","1005","* "," 0"," 0"," 0"," 0"," 0",
Linea 20 A 129320,"LOPREITO ALICIA MIRTA ","* ","PIEDRABUENA 3841 ","CAPITAL ","4601-5620 1150631906 "," 73","F"," 0","DNI 5694842 "," 0"," "," ","05/01/2021","05/01/2021","1005","15046761690600 ","DNI",
Linea 21 C,"5694842 ","11/10/1947","* ","2","* ",
Linea 22 G
Linea 23 A 213993," "," "," "," "," ",
Linea 24 12
Linea 25 1 475I 1
Linea 26 2 653I 1
Linea 27 3 746I 1
Linea 28 4 133I 1
Linea 29 5 192I 1
Linea 30 6 362I 1
Linea 31 7 412I 1
Linea 32 8 4811I 1
Linea 33 9 546I 1
Linea 34 10 902I 1
Linea 35 11 948I 1
Linea 36 12 120N 1
Linea 37 A 214012,"A 129321","04/01/2021","18/01/2021","N","E","1005","* "," 0"," 0"," 0"," 0"," 0",
Linea 38 A 129321,"SERRANO MARIA DOLORES ","* ","LARRAZABAL 1551 ","CAPITAL ","1123101950 "," 86","F"," 0","DNI 16561081 "," 0"," "," ","04/01/2021","04/01/2021","1005","15053746050100 ","DNI",
Linea 39 C,"16561081 ","02/03/1934","* ","2","* ",
Linea 40 G
Linea 41 A 214012," "," "," "," "," ",
Linea 42 11
Linea 43 1 475I 1
Linea 44 2 192I 1
Linea 45 3 297I 1
Linea 46 4 412I 1
Linea 47 5 4811I 1
Linea 48 6 546I 1
Linea 49 7 865I 1
Linea 50 8 866I 1
Linea 51 9 867I 1
Linea 52 10 500I 1
Linea 53 11 8298I 1
Linea 54 A 214013,"A 125271","04/01/2021","13/01/2021","N","E","1005"," 136 "," 0"," 0"," 0"," 0"," 0",
Linea 55 A 125271,"IMPRENTA MIGUEL ARTURO ","* ","P.GARCIA 5887 ","CAPITAL ","4605-5813 "," 69","M"," 136","DNI 6151369 "," 0"," "," ","27/10/2017","27/10/2017","1005","15060320150300 ","DNI",
Linea 56 C,"6151369 ","01/11/1948","* ","2","* ",
Linea 57 G
Linea 58 A 214013," "," "," "," "," ",
Linea 59 2
Linea 60 1 412I 1
Linea 61 2 500I 1
I need to take information from lines 0, 1 and 2, which should give this result:
DATA
CIATTLO MARIA TELMA;DNI;3666162;;;;;;;;;;01/05/1900;F;;;05/01/2021;158888450701;00;;;;;;;;
The code so far is:
with open(r"E:\Test.txt", "r") as f:
    linea = f.readlines()

def persona(linea):
    data = []  # Lista con los campos a mostrar en una línea
    data.append(linea[1][16:49].rstrip())  # Nombre y Apellido
    data.append(linea[1][375:378])  # Tipo Documento
    data.append(linea[2][5:20].replace(" ", ""))  # Nro Documento
    data.append(";;;;;;;;")
    data.append(linea[2][23:33])  # Fecha de Nacimiento
    data.append(linea[1][217:218])  # Sexo
    data.append(";")
    data.append(linea[0][31:41])  # fecha
    data.append(linea[1][342:354])  # Beneficio
    data.append(linea[1][354:356])  # Parentesco
    data.append(";;;;;;;")
    print(";".join(data))

print("DATA")
persona(linea)
I don't know how to extract the same information from lines 19, 20 and 21, then 37, 38 and 39, and so on, through the 13,000 lines of the file.
The key to knowing which 3 lines to read next is the value on line 5 (13 in this case) plus 6 (the number of lines that carry the record information): 0 + 13 + 6 = 19. The result for this example would be:
DATA
CIATTLO MARIA TELMA;DNI;3666162;;;;;;;;;;01/05/1900;F;;;05/01/2021;158888450701;00;;;;;;;;
LOPEZITO ALACIA MIRNA;DNI;5688842;;;;;;;;;;11/10/1900;F;;;05/01/2021;150646464906;00;;;;;;;;
SICRANO MARIA DOLORES;DNI;16001081;;;;;;;;;;02/03/1900;F;;;04/01/2021;150538888501;00;;;;;;;;
IMPARNTA MIGUEL ARTURO;DNI;6152269;;;;;;;;;;01/11/1900;M;;;04/01/2021;150669691503;00;;;;;;;;
Thanks for the help.
First, functions should be defined at the outermost level, so I moved the function persona out of the main loop. The function itself is unchanged.
For processes of this type, readlines() is not a good fit. It reads the entire file and converts it to a list, which is slow and inefficient and, on a large enough file, can crash the program by running out of memory.
What is used for massive files, and in general for any process where you have to compute positions as you go, as here, is readline(), which reads only one line at a time.
So the process boils down to reading the lines of one record, taking the count from line 5 of that record, and skipping that many detail lines to reach the next record.
It is not clear how the file ends. For these purposes, I will assume that the end is marked by salto == 0, where salto is the count read from line 5 of each record.
With the three lines of each record in hand, you can adjust the output to the format you want.
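The readline() approach described above could be sketched roughly as follows. This is a reconstruction under assumptions, not the original code: the names iter_records and HEADER_LINES are hypothetical, the "Linea N" labels in the listing are assumed not to be part of the file, and the count line is assumed to hold only the integer. An empty count line (end of file) or salto == 0 stops the loop.

```python
HEADER_LINES = 6  # lines 0-5 of each record; line 5 holds the detail count


def iter_records(f):
    """Yield the first three lines of every record in an open text file.

    Hypothetical helper: reads one line at a time with readline(),
    so the whole file is never held in memory.
    """
    while True:
        header = [f.readline() for _ in range(HEADER_LINES)]
        count_line = header[-1].strip()
        if not count_line:  # readline() returns "" at end of file
            break
        salto = int(count_line)  # number of detail lines to skip
        if salto == 0:  # assumed end-of-data marker
            break
        for _ in range(salto):  # skip the detail lines
            f.readline()
        yield header[:3]


# Usage, assuming persona() from the question is already defined:
# with open(r"E:\Test.txt", "r") as f:
#     print("DATA")
#     for lineas in iter_records(f):
#         persona(lineas)
```

Because persona only indexes linea[0], linea[1] and linea[2], it can receive the three-line list of each record unchanged.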
What if you treat the file as a CSV?
In this approach, each line is parsed as CSV with , as the separator.
Each line is traversed, discarding those with one element or less, which leaves 4 lines for each entry.
On each cycle, sublist is extended with the fields of the current line. On reaching the 4th line (i%4 == 0), sublist is appended to data and reset.
Once the cycle is finished, data holds a list of sublists, one per record, and print(data) would show them. All the records are now inside the variable data, and it is just a matter of picking the required items out of each sublist.
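A minimal sketch of this CSV idea, under assumptions: the names group_records and to_output are hypothetical, the "Linea N" labels in the listing are assumed not to be part of the file, and the field positions in to_output were counted by hand from the sample in the question, so they should be checked against the real file. The output layout copies the join used in the question's persona function.

```python
import csv

LINES_PER_RECORD = 4  # comma-separated lines kept for each entry


def group_records(f):
    """Return one flat list of fields per record from an open text file."""
    data = []
    sublist = []
    kept = 0
    for row in csv.reader(f, delimiter=","):
        if len(row) <= 1:  # "G", the count line and the detail lines
            continue
        sublist.extend(row)
        kept += 1
        if kept % LINES_PER_RECORD == 0:  # 4th kept line closes the record
            data.append(sublist)
            sublist = []
    return data


def to_output(fields):
    """Build one ';'-separated line (indices assumed from the sample data)."""
    return ";".join([
        fields[15].rstrip(),  # name
        fields[31],           # document type
        fields[34].strip(),   # document number
        ";;;;;;;;",
        fields[35],           # date of birth
        fields[21],           # sex
        ";",
        fields[2],            # date
        fields[30][:12],      # benefit
        fields[30][12:14],    # kinship
        ";;;;;;;",
    ])


# Usage (the csv module recommends opening the file with newline=""):
# with open(r"E:\Test.txt", "r", newline="") as f:
#     print("DATA")
#     for fields in group_records(f):
#         print(to_output(fields))
```

Working with parsed fields instead of character offsets avoids depending on the exact column widths, at the cost of depending on every record having the same number of fields per line.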