Before presenting my problem, here is the link where you can download the document I am working with:
https://drive.google.com/file/d/178s_tfbqbXmnxsknxF8DP154_N1DYjgf/view
Here is the code I am using (don't forget to set your own file path before running it):
library(rJava)
library(tm)
library(qdap)
library(tidyverse)
library(pdftools)
library(stringr)
library(tidytext)
library(stringi)
library(wordcloud)
stop_es <- c(stopwords("es")) # Vector of "unnecessary" words (Spanish stopwords)
cce <- pdf_text("ruta_del_archivo") # Read the file
corpus <- Corpus(VectorSource(cce)) # Create the corpus
# Cleaning and preprocessing
CCE <- tm_map(corpus, tolower) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stop_es) %>%
  stri_trans_general("Latin-ASCII") # This removes Spanish accents
## Rebuild the corpus. For some reason I don't understand, stri_trans_general
## produces an object that cannot be read unless it is turned back into a corpus
CCEGTO <- Corpus(VectorSource(CCE))
After the previous steps, we create the table of the most frequent terms:
ft <- freq_terms(CCEGTO, 50, stopwords=stop_es)
ft
The output shows words that seem to be incomplete or don't make sense (I removed some rows to focus on the problem):
   WORD      FREQ
2  ca        105 ## ??
3  guanajuato 94
5  vo         86
6  ufb        75 ## ????
9  va         69
10 propuestas 68
11 nivel      64
12 par        58 # This could be the root of "parte", "participacion", or something like that
27 ins        42 # This could be the start of "instituto", "institucion", or something similar
28 n          42 # A single letter as a frequent term... ??
30 vos        41
33 numero     40
34 vas        40
35 l          39
38 d          37
39 s          37
42 poli       35 # This could stand for "policia", "politica", "politicas"
43 vidad      35 # This could stand for "vida" or "actividad"
44 cas        34
45 r          34 # An isolated character
46 cipacion   33 # This could be the final part of the word "participacion"
47 i          33 # Another isolated letter...
My fundamental question is whether I am doing something wrong in the preprocessing of the text, or whether it is the structure of the PDF file itself that prevents a good parse.
Any comments and suggestions are greatly appreciated.
Alejandro, the problem is not with R or with what you are doing; the problem is that this PDF is not "friendly" for text extraction. What I could see is that many passages insert symbols or graphics that "break" certain words. An interesting example is the text "ti": in the passages set in the smallest font it is not letters but some kind of graphic, which "breaks" the word, for example in words like "definitiva" or "permitido". Let's look at an example on page 13. If we select the text, we can already see that the "ti" seems to be invisible in the selection, evidence that it is some kind of graphic and not text. When we paste the copied result, we see that the "ti" has indeed disappeared. Something similar happens with "fi", but there it seems that some Unicode symbol is used, which you are already handling with:
stri_trans_general("Latin-ASCII")
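(A side note: the "fi" ligature is the single Unicode character U+FB01. If transliteration still leaves residue, which the "ufb" token in your frequency table might suggest, an explicit replacement on the raw text is a possible fallback; this is only a sketch:)
# Hypothetical fallback: expand the U+FB01 "fi" ligature by hand on the
# raw pdf_text output, before any other cleaning
cce <- gsub("\ufb01", "fi", cce, fixed = TRUE)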
I don't know if there are other similar cases, but this is repeated throughout the document; the only valid "ti" are those set in a larger font, which makes me think it is some kind of visual enhancement decided by the designer.
Solutions:
Exporting the PDF to images and doing optical character recognition (OCR) might eventually work, but it can also lead to other problems.
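A minimal sketch of the OCR route, assuming the tesseract R package and its Spanish trained data ("spa") are installed:
library(pdftools)
library(tesseract)
# Render each page to a high-resolution PNG; OCR quality depends heavily on DPI
paginas <- pdf_convert("ruta_del_archivo", format = "png", dpi = 300)
# Recognize the text with the Spanish language model
# (tesseract_download("spa") fetches the trained data if it is missing)
motor <- tesseract("spa")
texto_ocr <- ocr(paginas, engine = motor)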
If the "ti" cases are more or less limited, we can make a list of word replacements and apply it after stri_trans_general("Latin-ASCII"), something like this:
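(A minimal sketch with str_replace_all from stringr; the fragment-to-word pairs below are hypothetical examples based on your frequency table, and the real list would have to be built by inspecting the document.)
library(stringr)
# Hypothetical fragment -> word replacements for words the "ti" graphic breaks
reemplazos <- c(
  "par cipacion" = "participacion",
  "ac vidad"     = "actividad",
  "ins tuto"     = "instituto"
)
# CCE is the character vector returned by stri_trans_general above
CCE_corregido <- str_replace_all(CCE, reemplazos)
CCEGTO <- Corpus(VectorSource(CCE_corregido))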
Additional note: I didn't mention it before, but another issue you should be aware of is hyphenation at the end of a line, which splits words across line breaks.
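One way to handle that, assuming pdf_text preserves the hyphen and the line break, is to rejoin the pieces on the raw text before any other cleaning:
# Rejoin words split by end-of-line hyphenation, e.g. "participa-\ncion"
cce <- gsub("-\\s*\\n\\s*", "", cce)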