Before presenting my problem, here is the link where you can download the document I am working with:
https://drive.google.com/file/d/178s_tfbqbXmnxsknxF8DP154_N1DYjgf/view
Here is the code I am using (don't forget to set your own file path before running it):
library(rJava)
library(tm)
library(qdap)
library(tidyverse)
library(pdftools)
library(stringr)
library(tidytext)
library(stringi)
library(wordcloud)
stop_es <- c(stopwords("es")) # Vector of "unnecessary" words (Spanish stopwords)
cce <- pdf_text("ruta_del_archivo") # Read the file
corpus <- Corpus(VectorSource(cce)) # Create the corpus
# Cleaning and preprocessing
CCE <- tm_map(corpus, tolower) %>%
  tm_map(stripWhitespace) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stop_es) %>%
  stri_trans_general("Latin-ASCII") # This removes Spanish accents
## Rebuild the corpus. For some reason I don't understand, stri_trans_general
## produces an object that cannot be read unless it is turned back into a corpus
CCEGTO <- Corpus(VectorSource(CCE))
After the previous steps, we create the table of the most frequent terms:
ft <- freq_terms(CCEGTO, 50, stopwords=stop_es)
ft
The output shows words that seem to be incomplete or don't make sense (I removed some rows to focus on the problem):
   WORD      FREQ
2  ca        105 ## ??
3  guanajuato 94
5  vo         86
6  ufb        75 ## ????
9  va         69
10 propuestas 68
11 nivel      64
12 par        58 # This could be the root of "parte", "participacion", or something like that
27 ins        42 # This could be the start of "instituto", "institucion", or something similar
28 n          42 # A single letter as a frequent term... ??
30 vos        41
33 numero     40
34 vas        40
35 l          39
38 d          37
39 s          37
42 poli       35 # This could stand for "policia", "politica", "politicas"
43 vidad      35 # This could stand for "vida" or "actividad"
44 cas        34
45 r          34 # An isolated character
46 cipacion   33 # This could be the final part of the word "participacion"
47 i          33 # Another isolated letter...
My fundamental question is whether I am doing something wrong in the preprocessing of the text, or whether it is the structure of the PDF file itself that prevents a good parse.
Any comments and suggestions are greatly appreciated.
Alejandro, the problem is not with R or with what you are doing; the problem is that this PDF is not "friendly" for text extraction. What I could see is that many passages insert symbols or graphics that "break" certain words. An interesting example is the text "ti": in the passages set in the smallest font it is not letters but some kind of graphic, which "breaks" the word, for example in words like "definitiva" or "permitido". Let's look at an example on page 13. If we select the text, we can already see that the "ti" seems to be invisible in the selection, evidence that it is some kind of graphic and not text. When we paste the copied result, we see that the "ti" has indeed disappeared. Something similar happens with "fi", but there it seems that some Unicode symbol is used, which you are already handling with:
stri_trans_general("Latin-ASCII")
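(A side note: the "fi" ligature is the single Unicode character U+FB01. If transliteration still leaves residue, which the "ufb" token in your frequency table might suggest, an explicit replacement on the raw text is a possible fallback; this is only a sketch:)
# Hypothetical fallback: expand the U+FB01 "fi" ligature by hand on the
# raw pdf_text output, before any other cleaning
cce <- gsub("\ufb01", "fi", cce, fixed = TRUE)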
I don't know if there are other similar cases, but this is repeated throughout the document; the only valid "ti" are those set in a larger font, which makes me think it is some kind of visual enhancement decided by the designer.
Solutions:
Exporting the PDF to images and doing optical character recognition (OCR) might eventually work, but it can also lead to other problems.
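A minimal sketch of the OCR route, assuming the tesseract R package and its Spanish trained data ("spa") are installed:
library(pdftools)
library(tesseract)
# Render each page to a high-resolution PNG; OCR quality depends heavily on DPI
paginas <- pdf_convert("ruta_del_archivo", format = "png", dpi = 300)
# Recognize the text with the Spanish language model
# (tesseract_download("spa") fetches the trained data if it is missing)
motor <- tesseract("spa")
texto_ocr <- ocr(paginas, engine = motor)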
If the "ti" cases are more or less limited, we can make a list of word replacements and apply it after stri_trans_general("Latin-ASCII"), something like this:
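(A minimal sketch with str_replace_all from stringr; the fragment-to-word pairs below are hypothetical examples based on your frequency table, and the real list would have to be built by inspecting the document.)
library(stringr)
# Hypothetical fragment -> word replacements for words the "ti" graphic breaks
reemplazos <- c(
  "par cipacion" = "participacion",
  "ac vidad"     = "actividad",
  "ins tuto"     = "instituto"
)
# CCE is the character vector returned by stri_trans_general above
CCE_corregido <- str_replace_all(CCE, reemplazos)
CCEGTO <- Corpus(VectorSource(CCE_corregido))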
Additional note: I didn't mention it before, but another issue you should be aware of is hyphenation at the end of a line, which splits words across line breaks.
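One way to handle that, assuming pdf_text preserves the hyphen and the line break, is to rejoin the pieces on the raw text before any other cleaning:
# Rejoin words split by end-of-line hyphenation, e.g. "participa-\ncion"
cce <- gsub("-\\s*\\n\\s*", "", cce)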