Question

Patricio Moracho

Asked: 2020-07-02 20:04:45 +0800 CST

How to build a data.frame with all the words of the Spanish language?

I need to build a data.frame with the words of the Spanish language (or at least a significant number of them). The idea is to use it later to "clean" other data.frames by removing patterns that do not correspond to valid words.

The RAE publishes the Reference Corpus of Current Spanish (CREA), a set of some 140,000 documents made up of books, press material and other sources. The document mentioned also describes a Frequent Forms Report, and I am particularly interested in working with the Total List of Frequencies, which, as I understand it, is a complete list of the words in this corpus ordered by frequency.

The specific question is: how can I load this resource into a data.frame? The more general one is: is this a valid resource for what I'm looking for?

2 Answers

Patricio Moracho · Answer 1 · 2020-07-02T20:04:45+08:00

Let's start with the more general question: is this a valid resource? According to this note, the RAE dictionary contains 88,000 terms and the dictionary of Americanisms about 70,000, almost 160,000 terms in total. It is usually estimated that the language has about 30% more words than that, so Spanish would have roughly 210,000 words. The linked resource is a compressed file containing a text file in which each line is a word, 737,799 in total, more than 3 times our base figure. Keep in mind that this resource includes all kinds of verb conjugations. So I would say that, in principle, yes: it seems a valid resource, consistent with the idea of having as complete a list as possible of the words in the Spanish language.
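The arithmetic behind that estimate can be checked with a few lines of R. This is only a sketch using the figures quoted above; the variable names are illustrative and not part of any dataset.

```r
# Figures quoted in the answer above
rae_terms  <- 88000    # terms in the RAE dictionary
american   <- 70000    # terms in the dictionary of Americanisms
crea_forms <- 737799   # lines in the CREA total frequency list

base_total <- rae_terms + american   # dictionary terms combined
estimated  <- base_total * 1.30      # plus the usual 30% estimate
ratio      <- crea_forms / estimated # CREA forms per estimated word

print(base_total)
print(estimated)
print(round(ratio, 1))
```

The ratio comes out above 3, which fits a list that stores every conjugated and inflected form as a separate entry.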

The next thing is to see how to import this file, and transform it into a data.frame. The following may be one way:

# Download the zipped frequency list into a temporary directory
tmppath <- tempdir()
tmpfile <- file.path(tmppath, "CREA_total.zip")
url <- "http://corpus.rae.es/frec/CREA_total.zip"
# mode = "wb" keeps the zip intact on Windows
download.file(url, tmpfile, mode = "wb")
unzip(tmpfile, exdir = tmppath)

# Read the tab-separated file, skipping its header line and naming the columns
RAE_words <- read.table(file = file.path(tmppath, "CREA_total.TXT"),
                        sep = "\t",
                        quote = "",
                        stringsAsFactors = FALSE,
                        nrows = -1,
                        skip = 1,
                        dec = ".",
                        strip.white = TRUE,
                        fileEncoding = "Latin1",
                        col.names = c("X", "token", "Freq.A", "Freq.N")
)

# Structure of the data.frame
str(RAE_words)

'data.frame':   737799 obs. of  4 variables:
 $ X     : num  1 2 3 4 5 6 7 8 9 10 ...
 $ token : chr  "de" "la" "que" "el" ...
 $ Freq.A: chr  "9,999,518" "6,277,560" "4,681,839" "4,569,652" ...
 $ Freq.N: num  65546 41149 30689 29953 27755 ...

# First rows
head(RAE_words)

  X token    Freq.A   Freq.N
1 1    de 9,999,518 65545.55
2 2    la 6,277,560 41148.59
3 3   que 4,681,839 30688.85
4 4    el 4,569,652 29953.48
5 5    en 4,234,281 27755.16
6 6     y 4,180,279 27401.19
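Note that in the str() output Freq.A is read as character, because the absolute frequencies use commas as thousands separators. If a numeric column is needed (for sorting or filtering by frequency, say), one possible approach is to strip the commas and convert. The sketch below uses a small stand-in data.frame copied from the head() output rather than the downloaded file.

```r
# Small stand-in for RAE_words, copied from the head() output above
RAE_words <- data.frame(
  X      = 1:6,
  token  = c("de", "la", "que", "el", "en", "y"),
  Freq.A = c("9,999,518", "6,277,560", "4,681,839",
             "4,569,652", "4,234,281", "4,180,279"),
  Freq.N = c(65545.55, 41148.59, 30688.85, 29953.48, 27755.16, 27401.19),
  stringsAsFactors = FALSE
)

# Drop the thousands separators, then convert the column to numeric
RAE_words$Freq.A <- as.numeric(gsub(",", "", RAE_words$Freq.A, fixed = TRUE))

str(RAE_words$Freq.A)
```

The same line should work unchanged on the full data.frame, assuming every value in the column follows the same comma-separated format.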

Colibri · Answer 2 · 2020-08-14T14:29:06+08:00

Best Answer

A more modest option is to take the list of the 10,000 most frequent forms: http://corpus.rae.es/frec/10000_formas.TXT

listapalabras <- read.table("http://corpus.rae.es/frec/10000_formas.TXT")
head(listapalabras)
   Orden Frec.absoluta Frec.normalizada
1.    de     9,999,518         65545.55
2.    la     6,277,560         41148.59
3.   que     4,681,839         30688.85
4.    el     4,569,652         29953.48
5.    en     4,234,281         27755.16
6.     y     4,180,279         27401.19

The problem I find with the other option is that the corpus contains many misspelled words and typing errors, which makes the data frame unreliable as a reference to check against. Even among these ten thousand forms it is easy to find entries that are not words or that are misspelled.
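One way to reduce that noise, offered only as a sketch, is to keep the forms above some minimum absolute frequency and then use `%in%` to filter the data to be cleaned. Everything here is illustrative: the "qeu" row and its frequency are made up to stand in for a typo, `min_freq` is an arbitrary cutoff, and `to_clean` is a hypothetical data.frame.

```r
# Toy word list; "qeu" and its frequency are invented to represent a typo
word_list <- data.frame(
  token  = c("de", "la", "que", "qeu"),
  Freq.A = c(9999518, 6277560, 4681839, 3),
  stringsAsFactors = FALSE
)

min_freq <- 100   # hypothetical cutoff; tune it against your own data
valid_words <- word_list$token[word_list$Freq.A >= min_freq]

# Hypothetical data.frame to be cleaned
to_clean <- data.frame(term = c("que", "qeu", "xyz", "La"),
                       stringsAsFactors = FALSE)

# Keep only rows whose lowercased term appears in the filtered word list
cleaned <- to_clean[tolower(to_clean$term) %in% valid_words, , drop = FALSE]
print(cleaned)
```

A frequency cutoff will not catch common misspellings and will also drop rare but valid words, so the threshold is a trade-off rather than a fix.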
