
Question

kev

Asked: 2022-06-27 06:30:26 +0800 CST

How to import data in R when it is not sorted?


I need to import a dataset into R that is in CSV format. The problem is that the file is laid out in a way that prevents it from being imported directly. Of the four rows that contain text, I only need to keep the year of each table in the file.

What I finally need to get is something like the following:

Mes Día Año  Precipitaciones
Ene 1   2018  10
Ene 2   2018  23
Ene 3   2018  22
Ene 4   2018  11
......
Ene 1   2019  13
Ene 2   2019  31.3
Ene 3   2019
......

The link to the data is: https://www.dropbox.com/s/rq8cql40r5dk413/190068%20B.%20Juarez.txt?dl=0

2 Answers


mpaladino · Answer 1 · 2022-06-27T10:32:49+08:00

Quite an elegant solution with tidyr.

The idea is to read the entire text file with readLines, so that each line is an element in a vector of character strings. Then convert that vector to a data.frame and use dplyr + tidyr to clean and separate the data.

library(tidyverse)

# Column names for the data.frame
nombres_columna <- c("Día","Ene","Feb","Mar","Abr","May","Jun","Jul","Ago","Sep","Oct","Nov","Dic")

con <- file("190068 B. Juarez.txt", blocking = FALSE) # Open the file as a connection; otherwise finding the end of the file gets complicated.
readLines(con, encoding = "UTF-8") -> foo  # And read it line by line
close(con)

tibble(crudo = foo) %>% 
  filter(str_detect(crudo, "^A|^[0-9]")) %>% # Keep only the rows of interest
  mutate(año = str_detect(crudo, "^A"), 
         dato = str_detect(crudo, "^[0-9]")) %>%  # Two logical vectors that flag year rows and data rows
  group_by(cumsum(año)) %>%                        # The cumsum creates one group per year
    mutate(año = str_match(crudo[1], "(\\d{4})")[1]) %>%   # Extract the year (the numeric part of that string) into a new column; because of group_by it changes for each group
  ungroup() %>% 
  filter(dato) %>%     # Drop the year rows, they are no longer needed
  separate(crudo, into = nombres_columna, sep = ";") %>%   # Here is the magic: split into columns on ";"
  select(Día:año)  %>%                                     # Drop the columns that are no longer needed
  pivot_longer(Ene:Dic,             # Pivot to long format
               names_to = "Mes", 
               values_to = "Precipitaciones")
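
The cumsum() grouping is probably the least obvious step, so here is a minimal base R sketch of it on a few made-up lines. The sample lines and the names sample_lines, is_year and group are hypothetical; they only imitate the general shape of the file (a line starting with "A" that carries the year, followed by ";"-separated data lines) and are not taken from the real data.

```r
# Made-up lines imitating the file layout (assumption, not real data)
sample_lines <- c("Año 2018", "1;10;23", "2;22;11",
                  "Año 2019", "1;13;31")

is_year <- grepl("^A", sample_lines)  # TRUE on the lines that carry a year
group   <- cumsum(is_year)            # counter that increases at each year line

# Pull the four-digit year out of each year line
years <- regmatches(sample_lines[is_year],
                    regexpr("[0-9]{4}", sample_lines[is_year]))

# Index the years by group so every data line gets the year of its block
result <- data.frame(year = years[group][!is_year],
                     line = sample_lines[!is_year])
print(result)
```

Because the counter only increases on year lines, every data line inherits the number of the most recent year line above it, which is what group_by(cumsum(año)) relies on in the pipeline.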

1st solution, much more complicated

I found a solution. The files used to distribute this data are horrible, so the solution has to be something of a Frankenstein: it mixes tidyverse with base R, and it could be standardized on one of the two to make the code more maintainable.

I try to explain it in the comments, but it's still complicated because it uses regular expressions and iteration over lists.

The idea is to read each line of the file as a vector, then separate the data (rainfall per day and month) from the years to which they correspond, then within each line/day with data separate the data for each month and finally put everything back together.

library(tidyverse)
con <- file("190068 B. Juarez.txt", blocking = FALSE) # Open the file as a connection; otherwise finding the end of the file gets complicated. Change the path if yours is different.
readLines(con, encoding = "UTF-8") -> foo  # And read it line by line
close(con)                                 # Close the connection

foo

# These will be the column names of the data.frame
nombres_columna <- c("Día","Ene","Feb","Mar","Abr","May","Jun","Jul","Ago","Sep","Oct","Nov","Dic")

# Keep the body of the data: every line that starts with a number.
foo [grep("^[0-9]", foo)] -> cuerpo

# And here the lines that carry the year; they start with a capital A

foo [grep("^A", foo)] -> año

# Keep only the numeric part, the year.

gsub(".*?([0-9]+).*", "\\1", año) -> año

# The tricky part: detect each line that starts with "1;" (the first day in each year's table)
# Then use it as the start of a cumulative sum (lines that do not match are FALSE and add 0)
# And use that vector to split the data for each year into a list

split(cuerpo, cumsum(grepl("^1;", cuerpo))) -> lista_años

# Name the list: each block belongs to one year and is in the same order as the cumulative sum
names(lista_años) <- año

# Take the list

lista_años %>% 
  # Iterate to split each line (vector) into separate elements using the pattern ";"
  # This leaves a list inside a list
  map(~str_split(.x, pattern = ";") %>%  
  # Inside the map, apply rbind to that inner list to turn it into a matrix.
  # do.call() passes each list element to rbind() as a separate argument, so each line becomes a row.
  do.call("rbind", .) %>% 
  # Coerce it to data.frame
    as.data.frame %>% 
  # Then give it the column names defined earlier
    setNames(nombres_columna)) %>% 
  # Once outside the map, bind all those data.frames into one
  
  bind_rows(.id = "año") -> df_ancho

# For the final result, pivot the data and that's it.
df_ancho %>% 
  select(-...14) %>% 
  pivot_longer(cols = Ene:Dic, names_to = "Mes", values_to = "Precipitaciones")

And you get this:

# A tibble: 1,860 × 4
   año   Día   Mes   Precipitaciones
   <chr> <chr> <chr> <chr>          
 1 2018  1     Ene   S/P            
 2 2018  1     Feb   S/P            
 3 2018  1     Mar   S/P            
 4 2018  1     Abr   S/P            
 5 2018  1     May   S/P            
 6 2018  1     Jun   1,0            
 7 2018  1     Jul   0,2            
 8 2018  1     Ago   S/P            
 9 2018  1     Sep   S/P            
10 2018  1     Oct   S/P            
# … with 1,850 more rows
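
In the output above, Precipitaciones is still a character column: it uses a decimal comma and the marker S/P. If numeric values are needed, as in the table in the question, one possible follow-up is sketched below. It assumes, without confirmation from the data source, that S/P marks days with no recorded precipitation and maps it to NA; mapping it to 0 instead may be more appropriate depending on what the marker really means. The vector valores is a hypothetical stand-in for the column.

```r
# Hypothetical stand-in for the Precipitaciones column
valores <- c("S/P", "1,0", "0,2", "S/P")

# Assumption: "S/P" is a no-data marker, so it becomes NA
valores[valores == "S/P"] <- NA

# Swap the decimal comma for a point and convert to numeric
precip <- as.numeric(sub(",", ".", valores, fixed = TRUE))
print(precip)
```

Inside the pipeline, the same two steps could be placed in a mutate() call on Precipitaciones after pivot_longer().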

It might be simpler at some intermediate step to write a decent .csv and read it, but with this solution you don't rely on write privileges.

Patricio Moracho · Answer 2 · 2022-06-27T15:02:23+08:00

An alternative with base R. The idea is relatively simple:

1. We read the entire file, removing blank rows.
2. We eliminate some lines/rows that do not take part in the process.
3. We are left with a list of lines that we can divide into blocks of 33, one for each year.
4. We get the year by reading the appropriate rows.
5. We split the lines into one block per year and apply read.csv() to each block to read an individual data.frame.
6. We combine everything into one final data.frame and add the año column.

file <- '~/../../Downloads/190068 B. Juarez.txt'

# Read the whole file, one line per row
df <- read.delim(file, 
                 sep = "\n", 
                 header = FALSE, 
                 skip = 1, 
                 col.names="linea",
                 strip.white = TRUE,
                 blank.lines.skip = TRUE)

# Remove the rows we are not interested in
lineas <- df[-grep('Estación Meteorológica|Datos Preliminares|Total;', df$linea), ]

# What remains is a set of 33-line blocks; locate the first line of each block and extract its year
bloques <- length(lineas) / 33
años_idx <- (((1:bloques) - 1) * 33) + 1
años <- regmatches(lineas[años_idx],regexpr("[0-9]+",lineas[años_idx]))

# Split the lines into a list with one block per year and parse each block as CSV
lapply(split(lineas, rep(años, each=33)),
       FUN=function(x) {read.csv(text=paste0(x, collapse = '\n'),
                                 header = TRUE, skip = 1, sep=";")
       }
) -> l

# Combine the list elements into one data.frame and assign the year
final <- do.call(rbind, l)
final$año <- substr(rownames(final), 1, 4)

rownames(final) <- NULL
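
The block arithmetic and the read.csv(text = ...) call can be tried in isolation. The sketch below uses made-up lines and a block size of 4 instead of 33 (one year line, one header line, two data lines). The names demo_lines, block_size, n_blocks, year_idx, blocks and demo are hypothetical, and the layout only approximates the real file.

```r
# Made-up lines: each block is a year line, a header line and two data lines
demo_lines <- c("Año 2018", "Dia;Ene;Feb", "1;10;23", "2;22;11",
                "Año 2019", "Dia;Ene;Feb", "1;13;31", "2;5;7")
block_size <- 4                                   # 33 in the answer above

n_blocks <- length(demo_lines) / block_size       # number of yearly blocks
year_idx <- (seq_len(n_blocks) - 1) * block_size + 1  # first line of each block
years <- regmatches(demo_lines[year_idx],
                    regexpr("[0-9]+", demo_lines[year_idx]))

# Parse each block as a small CSV, skipping its year line
blocks <- lapply(split(demo_lines, rep(years, each = block_size)),
                 function(x) read.csv(text = paste0(x, collapse = "\n"),
                                      header = TRUE, skip = 1, sep = ";"))

# rbind prefixes the row names with the list names, which carry the year
demo <- do.call(rbind, blocks)
demo$year <- substr(rownames(demo), 1, 4)
rownames(demo) <- NULL
print(demo)
```

This approach depends on every block having exactly the same number of lines; if a year were missing a row, the index arithmetic would drift, whereas the cumsum-based grouping in the first answer would not.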
