I am trying to fit a quadratic polynomial to my data using the following code:
#Polynomial regression for y ~ x
model <- lm(y ~ x + I(x^2))
summary(model)
#Box and whisker plot + polynomial
boxplot(y ~ x, dat,
        col = c("white", "lightgray"), ylab = "y", xlab = "x")
means <- tapply(y,x,mean)
points(means,col="red",pch=18)
predicted.intervals <- predict(model,data.frame(x=x),interval='confidence',
level=0.99)
lines(x,predicted.intervals[,1],col='green',lwd=3)
lines(x,predicted.intervals[,2],col='black',lwd=1)
lines(x,predicted.intervals[,3],col='black',lwd=1)
The thing is that when I run the program, the box plot appears, along with the red points representing the means and the green line of the polynomial fitted to the data. However, there is also a strange straight line joining the means of levels 1 and 11, and I have no idea where it comes from. Here is the graph:
I have fitted polynomials to my data many times before in nonlinear regressions, but this has never happened to me.
Any solution?
Edit 1:
Graph obtained with data B.
Finally, I have managed to fit the polynomial to the data. The code used is the following:
#Convert the variable x to a factor
x <- as.factor(x)
#Transform the variable back to numeric
x <- as.numeric(x)
#Quadratic regression
model <- lm(y ~ x + I(x^2))
summary(model)
#Polynomial fit over the boxplot
boxplot(y ~ x, dat,
        col = c("white", "lightgray"), ylab = "y", xlab = "x")
means <- tapply(y,x,mean)
points(means,col="red",pch=18)
predicted.intervals <- predict(model,data.frame(x=x),interval='confidence',
level=0.99)
lines(x,predicted.intervals[,1],col='green',lwd=3)
The result for data B is this:
However, I have some questions:
My variable x takes values from 0 to 1 (11 levels, in steps of 0.1).
Why did I have to convert my original variable x to a factor and then transform it back to numeric (so that it takes discrete values from 1 to 11)? Only then can I fit the polynomial to the data, but the regression is then run on the numeric values 1 to 11.
Why, in the case of data set B, should I use
lines(x,predicted.intervals[,1],col='green',lwd=3)
...while in the case of data set A I should use
lines(predicted.intervals[,1],col='green',lwd=3)
?
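To illustrate the recoding in the first question (the x values here are made up, not the real data): as.factor() followed by as.numeric() maps the 11 levels 0, 0.1, ..., 1 to the integers 1 to 11, which are exactly the positions boxplot() uses on its axis.

```r
x <- rep(seq(0, 1, by = 0.1), each = 3)   # 11 levels: 0, 0.1, ..., 1
x2 <- as.numeric(as.factor(x))            # recoded as the level indices 1, 2, ..., 11
range(x2)                                 # 1 11
```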
Let's start with the first problem, data set A. I'm going to work only with the regression, let's see:
Now let's plot just the points and the curve for the regression:
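A minimal sketch of this step (the data here are simulated as a stand-in for data set A; the key property is that the rows are not ordered by x):

```r
# Simulated stand-in for data set A: rows in random order
set.seed(1)
datA <- data.frame(x = sample(rep(seq(0, 1, by = 0.1), each = 5)))
datA$y <- 2 + 3 * datA$x - 2 * datA$x^2 + rnorm(nrow(datA), sd = 0.2)

# Quadratic fit, as in the question
model <- lm(y ~ x + I(x^2), data = datA)

# Joining the fitted values in data order produces the stray lines
plot(datA$x, datA$y, pch = 16, col = "gray")
lines(datA$x, fitted(model), col = "green", lwd = 3)
```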
The result is surely familiar to you:
This is explainable: datA is not ordered by x. When the lines are drawn from the points x and fitted, we could eventually have a point (1, ?) followed by a point (0, ?), so the line goes back toward the origin, making the graph look "circular". To solve this, we simply order by x:

Now the result is more in line with what was sought:
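The ordering step can be sketched like this (simulated data again standing in for datA and model):

```r
# Simulated stand-in data, not ordered by x
set.seed(1)
datA <- data.frame(x = sample(rep(seq(0, 1, by = 0.1), each = 5)))
datA$y <- 2 + 3 * datA$x - 2 * datA$x^2 + rnorm(nrow(datA), sd = 0.2)
model <- lm(y ~ x + I(x^2), data = datA)

# Sort by x before joining the fitted values with lines()
ord <- order(datA$x)
plot(datA$x, datA$y, pch = 16, col = "gray")
lines(datA$x[ord], fitted(model)[ord], col = "green", lwd = 3)
```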
Is this the solution and explanation of the problem? Yes and no. Let's see: if we add this curve to the boxplot, we can see this result:

What can we notice? The original curve was "compressed" between the values 0 and 1. The explanation is that the two coordinate systems are not compatible: boxplot treats the values of x as discrete, while our values of datA$x are not, and this is where factor() comes in, like so:

Now the values of x are consistent with the x positions of the boxplot:

This explains it, and it has worked for me for both data sets; I do not publish the results so as not to make the answer longer. In any case, you may wonder why the two data sets originally behaved differently. The explanation is simple: set A is unordered and set B is ordered (always speaking of the values of x).

I also recommend using spline() to draw these curves: it avoids having to order the data beforehand and, more importantly, you do not pass lines() the complete data set but only the minimum points needed to interpolate the curve in the graph:
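A sketch of the spline() approach (simulated data again; only the 11 level values are passed as support points):

```r
# Simulated stand-in data
set.seed(1)
x <- rep(seq(0, 1, by = 0.1), each = 5)
y <- 2 + 3 * x - 2 * x^2 + rnorm(length(x), sd = 0.2)
model <- lm(y ~ x + I(x^2))

# Predict only at the 11 distinct levels; spline() returns an ordered,
# smoothly interpolated set of (x, y) coordinates for lines()
xs <- seq(0, 1, by = 0.1)
pred <- predict(model, data.frame(x = xs))
plot(x, y, pch = 16, col = "gray")
lines(spline(xs, pred), col = "green", lwd = 3)
```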