I am trying to write a program that counts the words of a text made up of 4 sentences, but each word should be counted only once per sentence. My problem is that it ends up counting each word only once in total. How can I solve this?
Here is the text I have to run this check on.
Podador que podas la parra, que parra podas?
Podas mi parra o tu parra podas?
Ni podo tu parra, ni mi parra podo,
que podo la parra de mi tio Bartolo que apodase tolo.
In my case "parra" is counted 1 time, when it should be 4 (repetitions within the same sentence are not counted).
Thanks.
Attached code.
def lim(x):
    x=x.lower()
    x=x.rstrip('?')
    x=x.rstrip('.')
    x=x.rstrip(',')
    x=x.rstrip(':')
    x=x.rstrip(';')
    return x
a=open('discurso.txt','r')
palabras={}
for i in a:
    p=i.split()
    print(p)
    for j in p:
        for k in p:
            j=lim(j)
            if len(j)>4 and j!=k:
                if j not in palabras:
                    palabras[j]=1
print(palabras)
Reading your code I can't understand what you are trying to do, but I think you are looking for something like this:
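A minimal sketch of one possible fix, staying close to the question's approach of splitting on whitespace and stripping punctuation. Each line is reduced to a set of its cleaned words, so a repeated word counts once per line, and the counter is incremented rather than set to 1. The names limpiar and contar_por_linea and the inline sample text are assumptions made for illustration; they are not taken from the original answer.

```python
def limpiar(palabra):
    # Lowercase and strip surrounding punctuation (same idea as lim() in the question).
    return palabra.lower().strip('?.,:;')

def contar_por_linea(lineas):
    palabras = {}
    for linea in lineas:
        # A set keeps each cleaned word at most once per line.
        unicas = {limpiar(p) for p in linea.split()}
        for palabra in unicas:
            if len(palabra) > 4:
                # Increment the counter instead of always assigning 1.
                palabras[palabra] = palabras.get(palabra, 0) + 1
    return palabras

# Sample text from the question, inlined so the sketch runs without discurso.txt.
texto = """Podador que podas la parra, que parra podas?
Podas mi parra o tu parra podas?
Ni podo tu parra, ni mi parra podo,
que podo la parra de mi tio Bartolo que apodase tolo."""

conteo = contar_por_linea(texto.splitlines())
print(conteo['parra'])  # 4
```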
You are not really counting: the += 1 is missing, so you only ever assign 1 to the words found with more than 4 characters. The j != k comparison will not work as expected either: what if a word appears three times on the same line?
I hope this is the solution you are looking for. The result I get is:
This solution uses regular expressions to extract the words, removing the punctuation marks. It does this by defining a pattern with a capture group, ([a-zA-Z]{5,}), that recognizes only words of five letters or more.
We compile this pattern and then use it to break a phrase into its words:
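As a small illustration of this step, the sketch below compiles the pattern and applies it to the first phrase of the text. The variable names patron, frase and encontradas are assumptions chosen for the example.

```python
import re

# Pattern from the answer: runs of five or more ASCII letters.
patron = re.compile(r'([a-zA-Z]{5,})')

frase = "Podador que podas la parra, que parra podas?"

# With a single capture group, findall returns the captured strings.
encontradas = patron.findall(frase)
print(encontradas)  # ['Podador', 'podas', 'parra', 'parra', 'podas']
```

Note that this character class only covers unaccented letters, which is enough for the text in the question.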
That produces a list of words with repetitions and mixed upper/lower case. We use a list comprehension to convert everything to lowercase and then build a set to remove duplicates from each phrase:
Then all that is left is to loop through palabras and update the counters. For that we use a defaultdict. It is just like a standard dictionary, except that it automatically creates the entry when the key doesn't exist.
This dictionary uses the word as its key and keeps track of how many times it appears in total:
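A rough sketch of these two steps on the first two phrases only, starting from the word lists that the pattern would extract (written out by hand here as an assumption, so the example does not depend on the regular expression). The names frases and contadores are illustrative.

```python
from collections import defaultdict

# Words of five letters or more from the first two phrases (assumed input).
frases = [
    ['Podador', 'podas', 'parra', 'parra', 'podas'],
    ['Podas', 'parra', 'parra', 'podas'],
]

# defaultdict(int) creates missing keys with the value 0.
contadores = defaultdict(int)

for encontradas in frases:
    # Lowercase everything, then drop duplicates within the phrase.
    palabras = set([p.lower() for p in encontradas])
    for palabra in palabras:
        contadores[palabra] += 1

print(sorted(contadores.items()))
# [('parra', 2), ('podador', 1), ('podas', 2)]
```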
In short, the complete code is:
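One way the pieces might fit together is sketched below. The function name contar_palabras and the inlined text are assumptions; to read from the question's discurso.txt instead, an open file object could be passed to the function in place of texto.splitlines().

```python
import re
from collections import defaultdict

# Words of five or more ASCII letters, as in the pattern described above.
PATRON = re.compile(r'([a-zA-Z]{5,})')

def contar_palabras(lineas):
    contadores = defaultdict(int)
    for linea in lineas:
        # Unique lowercase words of this line: each counts once per line.
        palabras = set(p.lower() for p in PATRON.findall(linea))
        for palabra in palabras:
            contadores[palabra] += 1
    return contadores

texto = """Podador que podas la parra, que parra podas?
Podas mi parra o tu parra podas?
Ni podo tu parra, ni mi parra podo,
que podo la parra de mi tio Bartolo que apodase tolo."""

contadores = contar_palabras(texto.splitlines())

# Print by descending count, then alphabetically.
for palabra, veces in sorted(contadores.items(), key=lambda par: (-par[1], par[0])):
    print(palabra, veces)
# parra 4
# podas 2
# apodase 1
# bartolo 1
# podador 1
```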
produces: