I am scraping a web page to get two variables, precio and area. It turns out that precio has a greater length() than area, because not all listings have an area. I need to build a data frame from this information, but that fails because the vectors have different lengths. Even if two vectors of different lengths could be combined, there would be no way to match each precio with its area. The question is: how can the data be scraped so that, when area does not exist for an element of the list, a placeholder such as NA is returned instead? I am using the rvest package.
This question describes the same problem, but it has no solution: https://stackoverflow.com/questions/29996952/r-rvest-getting-2-elements-nodes-at-the-same-time
The code is:
library(rvest)
library(robotstxt)
library(selectr)
library(xml2)
library(dplyr)
library(stringr)
library(forcats)
library(magrittr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(tibble)
library(purrr)
url = "https://capital-federal.properati.com.ar/nf/propiedades/venta/"
paths_allowed(paths = c(url))
# Read the HTML
leahtml <- read_html(url)
leahtml %>%
  html_nodes(".price") %>%
  html_text() -> price
leahtml %>%
  html_nodes(".area") %>%
  html_text() -> area
leahtml %>%
  html_nodes(".location") %>%
  html_text() -> location
precio = gsub("\n","",price)
precio = gsub("exp","",precio)
precio_a = gsub("\\$","",precio)
precio_b = gsub("US","",precio_a)
precio_limpio = gsub("\\.","",precio_b)
precio_limpio = str_trim(precio_limpio)
precio_su= substr(precio_limpio,1,5)
precio_su= as.numeric(precio_su)
area_a = gsub("\n","",area)
area_b = gsub("m²","",area_a)
area_limpia = as.numeric(area_b)
# Fails when the vectors have different lengths (some listings have no area)
dataset <- data.frame(location, area_limpia, precio_su)
The problem, as you may have already noticed, is that some properties have no area, so capturing precio and area separately is not useful. You need to work on the node that represents each property and search within each of those elements for the data. A quick way to do this, in detail:
- Select the .item-description nodes, each of which encompasses one property.
- With map_df, extract the area and the price from each of those elements. I added a replacement of some characters with str_replace_all() simply to make the output look clearer.
- map_df returns a tibble with one row per property and one column per variable.
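The approach described above can be sketched roughly as follows. This is a minimal illustration, not the exact code used against the live site: the HTML snippet is made up to stand in for the real page, and it assumes that each listing is wrapped in a .item-description node containing the .price and .area nodes from the question. The helper limpiar() is a hypothetical name introduced here. The key point is that html_node() (singular) returns a "missing" node when nothing matches, and html_text() turns that into NA.

```r
library(rvest)    # also provides read_html() and %>%
library(purrr)    # map_df() (needs dplyr installed to bind the rows)
library(stringr)
library(tibble)

# Hypothetical markup standing in for the real page:
# the second listing has no .area node.
pagina <- read_html('
<html><body>
  <div class="item-description">
    <p class="price">
      US$ 120.000
    </p>
    <p class="area">45 m²</p>
  </div>
  <div class="item-description">
    <p class="price">US$ 95.000</p>
  </div>
</body></html>', encoding = "UTF-8")

# Hypothetical helper: collapse runs of whitespace and trim the ends.
# NA values pass through unchanged.
limpiar <- function(x) str_trim(str_replace_all(x, "\\s+", " "))

propiedades <- pagina %>%
  html_nodes(".item-description") %>%      # one node per property
  map_df(function(item) {
    tibble(
      precio = item %>% html_node(".price") %>% html_text() %>% limpiar(),
      area   = item %>% html_node(".area")  %>% html_text() %>% limpiar()
    )
  })

print(propiedades)
```

Because every row comes from a single property node, precio and area always stay aligned, and listings without an area simply get NA in that column. The numeric cleanup from the question (gsub(), as.numeric()) can then be applied to the columns of the resulting tibble.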