
Question

kev

Asked: 2020-04-30 05:48:33 +0800 CST

Problem with Web Scraping when creating a Data Frame in R


I am doing web scraping to get two variables, precio and area. It turns out that precio has a greater length() than area, because not all listings have an area. I need to build a data frame from this information, but since the vectors have different lengths that is impossible. Even if two variables of different lengths could be combined, there would be no way to match each precio with its area. The question is: how can the data be scraped so that, when an element of the list has no area, a placeholder such as NA is stored instead? I am using the rvest package.

In this link they have the same problem but it is not solved. https://stackoverflow.com/questions/29996952/r-rvest-getting-2-elements-nodes-at-the-same-time

The code is

library(rvest)
library(robotstxt)
library(selectr)
library(xml2)
library(dplyr)
library(stringr)
library(forcats)
library(magrittr)
library(tidyr)
library(ggplot2)
library(lubridate)
library(tibble)
library(purrr)

url = "https://capital-federal.properati.com.ar/nf/propiedades/venta/"
paths_allowed(paths = c(url))

# Read the HTML
leahtml <- read_html(url)

leahtml %>%
  html_nodes(".price") %>%
  html_text() -> price

leahtml %>%
  html_nodes(".area") %>%
  html_text() -> area

leahtml %>%
  html_nodes(".location") %>%
  html_text() -> location

precio = gsub("\n","",price)
precio = gsub("exp","",precio)
precio_a = gsub("\\$","",precio)
precio_b = gsub("US","",precio_a)
precio_limpio = gsub("\\.","",precio_b)
precio_limpio = str_trim(precio_limpio)
precio_su= substr(precio_limpio,1,5)
precio_su= as.numeric(precio_su)

area_a = gsub("\n","",area)
area_b = gsub("m²","",area_a)
area_limpia = as.numeric(area_b)

dataset <- data.frame(location, area_limpia, precio_su)

1 Answer


Patricio Moracho · Answer 1 · 2020-04-30T10:14:52+08:00

The problem, as you may have already noticed, is that some properties have no area, so capturing precio and area separately is not useful. You have to work on the node that represents each property and search for each piece of data within it. A quick way would be:

library(rvest)
library(tidyverse)

url = "https://capital-federal.properati.com.ar/nf/propiedades/venta/"

read_html(url) %>%
  html_nodes(".item-description") %>%   # one node per property
  map_df(~ {
    # Search only inside the current property (.x)
    precio <- html_nodes(.x, ".price") %>% html_text() %>% str_replace_all(pattern = "\n| ", replacement = "")
    area <- html_nodes(.x, ".area") %>% html_text() %>% str_replace_all(pattern = "[\n| ] ", replacement = "")
    # A property without an .area node yields character(0); store NA instead
    area <- ifelse(length(area) > 0, area, NA_character_)
    list(precio = precio, area = area)
  })


# A tibble: 18 x 2
   precio                area    
   <chr>                 <chr>   
 1 U$S930.000$16082exp   " 285m²"
 2 U$S990.000            " 116m²"
 3 U$S160.000            " 120m²"
 4 U$S120.000            " 55m²" 
 5 U$S2.200.000           NA     
 6 U$S178.359            " 65m²" 
 7 U$S258.000$11500exp   " 157m²"
 8 U$S749.000$25500exp   " 111m²"
 9 U$S230.000$2157138exp " 134m²"
10 U$S112.000$4500exp    " 61m²" 
11 U$S280.000             NA     
12 U$S170.000$4000exp    " 72m²" 
13 U$S480.000             NA     
14 U$S250.000             NA     
15 U$S125.000$3500exp    " 66m²" 
16 U$S340.000             NA     
17 U$S258.000$15000exp   " 86m²" 
18 U$S120.000            " 55m²"

Detail:

We iterate over the .item-description nodes, each of which encompasses one property.
With map_df, for each element we now do extract the area and the price. I added a replacement of some characters with str_replace_all() simply to make the output look clearer.
Finally, map_df returns a tibble with each property and its variables.
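
As a complement to the steps above, here is a minimal sketch of the same per-property idea without map_df. It relies on html_node() (singular), which returns exactly one result per input node, so a property with no .area comes back as NA instead of being dropped. The inline HTML is made up for illustration and only imitates the .item-description, .price and .area classes used above; the real page markup may differ. The numeric cleanup of the area column is likewise just one possible approach.

```r
library(rvest)

# Hypothetical markup standing in for the listing page.
# The second property deliberately has no .area element.
html <- read_html('<html><body>
  <div class="item-description"><span class="price">U$S 930.000</span><span class="area">285 m²</span></div>
  <div class="item-description"><span class="price">U$S 2.200.000</span></div>
  <div class="item-description"><span class="price">U$S 160.000</span><span class="area">120 m²</span></div>
</body></html>')

# One node per property
items <- html_nodes(html, ".item-description")

# html_node() keeps the output the same length as `items`;
# html_text() turns a missing node into NA_character_.
precio <- items %>% html_node(".price") %>% html_text()
area   <- items %>% html_node(".area")  %>% html_text()

# Both vectors now have one entry per property, so they line up.
dataset <- data.frame(
  precio  = trimws(precio),
  area_m2 = as.numeric(gsub("[^0-9]", "", area)),  # keep digits only; NA stays NA
  stringsAsFactors = FALSE
)
print(dataset)
```

Because both columns are derived from the same items node set, each precio stays paired with the area of the same property, which is what the separate page-wide html_nodes() calls in the question could not guarantee.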
