I need to build a data.frame containing the words of the Spanish language (or at least a significant number of them). The idea is to use it later to "clean" other data.frames by removing patterns that do not correspond to valid words.
The RAE offers a resource called the Reference Corpus of Current Spanish (CREA), a set of some 140,000 documents made up of books, press material and other sources. The document describing it also mentions a Frequent Forms Report, and I am particularly interested in the Total List of Frequencies which, as far as I understand, is a complete list of the words in this corpus ordered by frequency.
My specific question is: how can I load this resource into a data.frame? And, more generally, is this a valid resource for what I'm looking for?
Let's start with the more general question: is this a valid resource? According to this note, the RAE dictionary contains 88,000 terms and the dictionary of Americanisms about 70,000, almost 160,000 terms in total. It is usually estimated that the language has about 30% more words than that, so Spanish would have roughly 210,000 words. The linked resource is a compressed file containing a text file with one word per line, 737,799 words in total, more than 3 times our base number. Keep in mind that this resource includes all kinds of verb conjugations. So I would say that, in principle, yes: it seems a valid resource, consistent with the idea of having as complete a list as possible of Spanish words.
The next step is to import this file and transform it into a data.frame.

A more moderate option is to take the list of the 10,000 most frequent forms: http://corpus.rae.es/frec/10000_formas.TXT
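As a rough illustration of the import step, here is a minimal sketch in R. It assumes the frequency file is tab-separated, starts with one header line, and lists rank, word, absolute frequency and normalized frequency in that order; it also assumes Latin-1 encoding. None of this is confirmed by the thread, so check it against the downloaded file. `parse_crea_lines` is a hypothetical helper, and the sample lines are made up so the example runs offline.

```r
# Hypothetical helper: turns raw lines of a CREA-style frequency list into a data.frame.
# Assumed layout (verify against the real file): one header line, then
# tab-separated fields: rank, word, absolute frequency, normalized frequency.
parse_crea_lines <- function(lines) {
  lines  <- lines[-1]                      # drop the assumed header line
  lines  <- lines[nzchar(trimws(lines))]   # drop empty lines
  fields <- strsplit(lines, "\t", fixed = TRUE)
  data.frame(
    word      = trimws(vapply(fields, `[`, character(1), 2)),
    # absolute frequencies are assumed to use "," as a thousands separator
    freq_abs  = as.numeric(gsub(",", "", trimws(vapply(fields, `[`, character(1), 3)))),
    freq_norm = as.numeric(trimws(vapply(fields, `[`, character(1), 4))),
    stringsAsFactors = FALSE
  )
}

# Made-up sample in the assumed layout (not real corpus figures).
sample_lines <- c(
  "Orden\tForma\tFrec.absoluta\tFrec.normalizada",
  "1.\tde\t1,200\t12.5",
  "2.\tla\t950\t9.9"
)
palabras <- parse_crea_lines(sample_lines)

# With the real file, something like this could replace sample_lines
# (the "latin1" encoding is an assumption):
# con <- url("http://corpus.rae.es/frec/10000_formas.TXT", encoding = "latin1")
# palabras <- parse_crea_lines(readLines(con))
# close(con)
```

Parsing from a character vector rather than directly from the URL keeps the helper usable with either the 10,000-form list or a locally unzipped copy of the full list, provided both share the assumed layout.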
The problem I find with the other option is that the corpus contains many misspelled words and typos, which makes the resulting data frame unreliable as a reference to check against. Even among these ten thousand forms it is easy to find entries that are not real words or that are misspelled.
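One possible way to soften this problem, offered here only as a suggestion and not as something the corpus documentation prescribes, is to keep just the forms above a minimum frequency before using the list as a reference. The sketch below assumes a data.frame with `word` and `freq_abs` columns; the column names, the sample words and the cutoff `min_freq` are all hypothetical.

```r
# Hypothetical reference list; in practice this would come from the imported file.
palabras <- data.frame(
  word     = c("de", "la", "casa", "csaa"),
  freq_abs = c(1200, 950, 40, 1),
  stringsAsFactors = FALSE
)

# Keep only forms seen at least min_freq times (the cutoff value is arbitrary).
min_freq  <- 5
reference <- palabras$word[palabras$freq_abs >= min_freq]

# Data to be cleaned: keep rows whose lowercased token appears in the reference list.
datos <- data.frame(token = c("Casa", "csaa", "xyz", "la"), stringsAsFactors = FALSE)
datos$valid   <- tolower(datos$token) %in% reference
datos_limpios <- datos[datos$valid, , drop = FALSE]
```

A cutoff like this is a trade-off: it discards rare typos but also rare valid words, so the threshold would need tuning against the data being cleaned.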