Updating my idea: I now want to create a txt file named after the first word of the sentence, containing the second word. Then I drop the first word, create another txt file named after the new first word, store the word that follows it, and so on until the sentence ends.
Example:
(Yo soy Lola.)
Yo.txt=soy
soy.txt=Lola.
Lola..txt=(empty, because the sentence has ended)
If a later sentence contains words whose files already exist, only the word that follows is appended to the existing file; if that word is already in the file, it is not added again.
Example:
(Yo seré Lola.)
Yo.txt= soy seré
seré.txt= Lola.
Lola..txt=(empty, the sentence has ended)
With this function I get the first word of the sentence.
def primera_pal(oracion):
    for palabra in oracion.split():
        print("llege a la funcion: ",palabra)
        return palabra
You can ignore this one:
def procesar_parrafo(parrafo):
    completo = ' '.join(parrafo)
    #completo = completo.replace(",", ".")
    completo = completo.replace(";", ".")
    completo = completo.replace("—","")
    completo = completo.replace("«", "")
    completo = completo.replace("»", "")
    lista_punto = completo.split(".")
    return [x.strip() for x in lista_punto]
parrafo=[]
activar_af=0
with open(ruta_libros.format("quijote"), "r", encoding="utf-8") as libro:
    parrafo = []
    for line in libro:
        line = line.strip() # Strip leading/trailing whitespace.
        if line == '':
            for oracion in procesar_parrafo(parrafo):
                #print(oracion)
                with open(ruta_libros.format("quijote2"), "a", encoding="utf-8") as librox:
And here is my attempt (I need to add the words to their respective files, with a line break):
                    ### FOCUS FROM HERE DOWN #####
                    pal_en1=oracion
                    pal_en2=pal_en1
                    print("-----Pal 2 Antes: ",pal_en2)
                    activar_af=0
                    for oracionx in pal_en2.split():
                        #print("Oracionx: ", oracionx)
                        pr_pal = primera_pal(pal_en2)
                        #pr_pal=' '.join(pal_en2.split()[1:])
                        with open(ruta_conocimientos.format(pr_pal), "a", encoding="utf-8") as datox:
                            if oracionx not in "" and activar_af <=2:
                                print("La oracionx: ",oracionx)
                                print("Dentro-----------------------------------------")
                                print("primera_palabra: ",pr_pal)
                                datox.write(oracionx+" ")
                                pal_en2=pal_en2.replace(pr_pal,"",1)
                                activar_af+=1
                            if activar_af>=2:
                                datox.write("\n")
                                datox.close()
                        print("Pal 2 despues: ",pal_en2)
                    #if oracion not in "":
                    #    librox.write(oracion+".")
            parrafo = []
        else:
            #print(line)
            parrafo.append(line)
Underneath, the data structure you want to store is a list of words (the files), each of which refers to another list of words (the contents of that file).
I think you can store all that information much more efficiently in Python with a dictionary whose keys are the words and whose values are lists of the words that can follow them.
Following your same example, instead of creating files called Yo.txt, soy.txt, seré.txt and Lola..txt that contain words, you would have a single dictionary with the keys Yo, soy, seré and Lola., each mapped to the list of words that would otherwise go in its file.
Once you have built this dictionary in memory (which will be much faster than creating the equivalent structure on disk), you can also save it to a file if you are concerned about persistence (that is, that the data continues to exist on disk once the program has finished).
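For the two example sentences in the question, the in-memory dictionary described above could look like the following sketch. The variable name sucesores is only my choice for illustration.

```python
# Successor dictionary for "Yo soy Lola." and "Yo seré Lola."
# Each key is a word; its value lists the words seen right after it.
sucesores = {
    "Yo": ["soy", "seré"],
    "soy": ["Lola."],
    "seré": ["Lola."],
    "Lola.": [],  # last word of the sentence: nothing follows it
}

# Looking up a key replaces opening and reading the matching .txt file.
print(sucesores["Yo"])  # ['soy', 'seré']
```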
Saving it to a file can be extremely simple if you use the pickle module: pickle.dump writes the dictionary to a file and pickle.load reads it back.
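A minimal sketch of saving and retrieving the dictionary with pickle. The file name conocimientos.pickle and the variable names are assumptions made for this example, not something taken from your program.

```python
import pickle

# Example data (hypothetical variable name).
sucesores = {"Yo": ["soy", "seré"], "soy": ["Lola."], "seré": ["Lola."], "Lola.": []}

# Save: pickle needs the file opened in binary mode ("wb").
with open("conocimientos.pickle", "wb") as f:
    pickle.dump(sucesores, f)

# Retrieve: binary mode again ("rb"); the result is an equal dictionary.
with open("conocimientos.pickle", "rb") as f:
    recuperado = pickle.load(f)

print(recuperado == sucesores)  # True
```

Keep in mind that pickle files should only be loaded when they come from a source you trust.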
Using pickle is quite simple. The drawback is that the resulting file is not editable. If you open it with a text editor you will see "garbage" mixed in with your data (that "garbage" is actually what tells Python what kind of data is stored there, which allows it to retrieve it later).
If you prefer an "editable" format (if only to read it in an editor, without needing to load it into Python) you can use json, saving with json.dump and retrieving with json.load.
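The json variant could be sketched like this. Again, the file name conocimientos.json and the variable names are assumptions for illustration; ensure_ascii=False is an optional choice so that accented words stay readable in the file.

```python
import json

# Example data (hypothetical variable name).
sucesores = {"Yo": ["soy", "seré"], "soy": ["Lola."], "seré": ["Lola."], "Lola.": []}

# Save as text. ensure_ascii=False writes "seré" instead of "ser\u00e9".
with open("conocimientos.json", "w", encoding="utf-8") as f:
    json.dump(sucesores, f, ensure_ascii=False, indent=2)

# Retrieve: json.load rebuilds the dictionary of lists of strings.
with open("conocimientos.json", "r", encoding="utf-8") as f:
    recuperado = json.load(f)

print(recuperado == sucesores)  # True
```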
The mechanics are practically the same, but the content of the file is now readable, and in fact it looks like a Python dictionary (the JSON format, although it is not exactly the same as Python's data syntax, resembles it closely in many cases, and in this particular case, where the data is all of type string, list, and dictionary, the syntax would be identical).
Note
Since the purpose of all this is to store a data structure that captures the pairs of words that can appear one after the other, I think you need some kind of "start of sentence" and "end of sentence" markers as additional pseudowords. That way the start marker would be one more key in the dictionary, and its associated list would tell you which words can start a sentence. Similarly, if a word can be the last of a sentence, the end-of-sentence marker would appear among the items in its list.
For example, suppose the start marker is "START" and the end marker is "END" (any other strings that cannot appear as words would do). The dictionary corresponding to your example would then also have a "START" key whose list contains "Yo", and the list for "Lola." would contain "END".
This lets you know where to start a sentence, and it also allows some words to appear both at the end of a sentence and before other words (if the special "END" marker appears in their list of successors alongside ordinary words).
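With those markers, the dictionary for the two example sentences could look like the sketch below. The variable names are my own, and "START" and "END" are simply the marker strings suggested above.

```python
# Successor dictionary with sentence-boundary pseudowords.
sucesores = {
    "START": ["Yo"],          # words that can begin a sentence
    "Yo": ["soy", "seré"],
    "soy": ["Lola."],
    "seré": ["Lola."],
    "Lola.": ["END"],         # "END" in the list: this word can close a sentence
}

# Words that can start a sentence.
print(sucesores["START"])  # ['Yo']

# Words that can end a sentence: those with "END" among their successors.
palabras_finales = [p for p, sig in sucesores.items() if "END" in sig]
print(palabras_finales)  # ['Lola.']
```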
Bonus
A dictionary like the one above can be built with a few lines of code, assuming the list oraciones contains a series of phrases. While building the dictionary I use sets (set()) as an efficient way to avoid putting repeated words in the lists. At the end of the loop I convert those sets to alphabetically sorted lists to make them easier to browse.
As a curiosity, the same code can be run on phrases taken from Quixote to see what the resulting dictionary looks like.
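A minimal sketch of that construction, under a few assumptions of mine: the function name construir_sucesores is hypothetical, the phrases are split on whitespace only, and the two sentences from the question stand in for the contents of oraciones.

```python
def construir_sucesores(oraciones, inicio="START", fin="END"):
    """Map each word to the sorted list of words seen right after it."""
    sucesores = {}
    for oracion in oraciones:
        palabras = oracion.split()
        if not palabras:
            continue  # skip empty phrases so START never points straight to END
        # Wrap the phrase with the boundary pseudowords.
        palabras = [inicio] + palabras + [fin]
        # Pair every word with the one that follows it.
        for actual, siguiente in zip(palabras, palabras[1:]):
            # A set silently ignores successors that were already recorded.
            sucesores.setdefault(actual, set()).add(siguiente)
    # Convert the sets to sorted lists (sorted by code point, not by locale).
    return {palabra: sorted(sig) for palabra, sig in sucesores.items()}


oraciones = ["Yo soy Lola.", "Yo seré Lola."]
sucesores = construir_sucesores(oraciones)
print(sucesores["Yo"])     # ['seré', 'soy']
print(sucesores["Lola."])  # ['END']
```

The returned dictionary contains only strings and lists, so it can be saved with either pickle or json.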