I am plotting a histogram of a very important data set using geom_histogram(), and I have noticed that as I increase its "definition" by raising the number of bins (bars), it gets slower and slower. Compared with a base R histogram, the ratio is at least 10 to 1 in time. Example:
library("ggplot2")
library("microbenchmark")
set.seed(2019)
x <- rnorm(100000)
df <- data.frame(x=x)
ggplot_hist <- function(data, bins=100000){
  print(ggplot(data, aes(x=x)) + geom_histogram(bins=bins))
}
base_hist <- function(x, breaks=100000){
  print(hist(x, breaks=breaks))
}
microbenchmark(
  base_hist(x),
  ggplot_hist(df),
  times=3L
)
Unit: seconds
expr min lq mean median uq max neval
base_hist(x) 4.503556 4.632358 4.680143 4.761159 4.768436 4.775713 3
ggplot_hist(df) 56.330033 57.249490 60.182923 58.168946 62.109369 66.049791 3
Is there a way to optimize a histogram in ggplot?
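One way to probe where the time goes is to benchmark both approaches at increasing bin counts. A minimal sketch (the bin counts and `times = 1L` are illustrative choices, and a null device is used so that drawing to the screen is not measured):

```r
library(ggplot2)
library(microbenchmark)

set.seed(2019)
x <- rnorm(100000)
df <- data.frame(x = x)

pdf(NULL)  # render to a null device: we want computation time, not screen time
for (n_bins in c(100, 1000, 10000)) {
  cat("bins =", n_bins, "\n")
  print(microbenchmark(
    ggplot = print(ggplot(df, aes(x = x)) + geom_histogram(bins = n_bins)),
    base   = hist(x, breaks = n_bins),  # note: base hist() treats breaks as a suggestion
    times  = 1L
  ))
}
dev.off()
```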
According to the hypothesis of this interesting answer, the bottleneck would be in the calculation of the bins, that is, of the bars. We can try to test it: as we incorporate more detail by increasing the number of bins, the time grows rapidly. On the other hand, if we study the base histogram in the same way, we see that its growth in time as the bins increase is minimal.

So the idea proposed in the mentioned answer is to replace the calculation of the bins with the base function hist() and then draw the bars using geom_rect(). And we see that with such an ad-hoc function, quick_hist(), we manage to improve the performance of the ggplot histogram radically, and with a very similar visual result.
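The original quick_hist() code did not survive the formatting here; a sketch of the idea as described above (the function name comes from the answer, the column names and fill colour are my own choices): let base hist() compute the bins without plotting, then hand the pre-computed rectangles to ggplot via geom_rect().

```r
library(ggplot2)

# Compute the bins with base hist() (fast), then draw them with geom_rect(),
# bypassing geom_histogram()'s slow bin calculation.
quick_hist <- function(data, breaks = 100000) {
  h <- hist(data$x, breaks = breaks, plot = FALSE)
  bars <- data.frame(
    xmin = head(h$breaks, -1),  # left edge of each bar
    xmax = tail(h$breaks, -1),  # right edge of each bar
    ymax = h$counts             # bar height = count in the bin
  )
  ggplot(bars) +
    geom_rect(aes(xmin = xmin, xmax = xmax, ymin = 0, ymax = ymax),
              fill = "grey35")
}

set.seed(2019)
df <- data.frame(x = rnorm(100000))
p <- quick_hist(df, breaks = 1000)
# print(p)  # renders far faster than geom_histogram() with many bins
```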