\frametitle {Caterpillar case: suggested steps}\par {\footnotesize \begin{itemize} \item Load the data and give the variables some names, e.g. \begin{verbatim} cat.dat <- read.table("...") colnames(cat.dat) <- c("Altitude", "Slope", "Nb.pines.in.area", "Height.tree.center", "Diameter.tree.center", "Density.index", "Orientation", "Height.dominant.tree", "Nb.vegetation.strata", "Mixing.index", "Nb.nests") \end{verbatim} \item Explore graphically the dataset (the R functions {\tt plot, pairs} and {\tt hist} are your best friends).\\ You may want to print the correlation coefficients on the pairs plot with:\\ \begin{verbatim} panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...) { usr <- par("usr"); on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r <- abs(cor(x, y)) txt <- format(c(r, 0.123456789), digits=digits)[1] txt <- paste(prefix, txt, sep="") if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt) text(0.5, 0.5, txt, cex = cex.cor * r) } pairs(cat.dat, lower.panel=panel.smooth, upper.panel=panel.cor) \end{verbatim} \end{itemize} } %% \end{frame} %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \begin{frame}[fragile] %% \frametitle{} {\footnotesize \begin{itemize} \item The raw correlation coefficient can be misleading as some of the explanatory variables are correlated with each other. One can ``get rid'' of the effect of one variable using the partial correlation coefficient. It is defined as the correlation coefficent bewteen the residuals of a regression on the variable one wants to get rid of. It is obtained as:\\ \begin{verbatim} pcor <- function(v1, v2, v3) { c12 <- cor(v1, v2) c23 <- cor(v2, v3) c13 <- cor(v1, v3) partial <- (c12-(c13*c23))/(sqrt(1-(c13^2)) * sqrt(1-(c23^2))) return(partial) } \end{verbatim} Use this to evaluate the correlation between the response variable and the explanatory variables given {\tt Nb.vegetation.strata}. Compare to the plain correlation. Which variable would you expect to be useless in a multiple regression once {\tt Nb.vegetation.strata} is included in the model? \end{itemize} } %% \end{frame} %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \begin{frame}[fragile] %% \frametitle{} {\footnotesize \begin{itemize} \item Fit a linear model explaining the number of nests with all the other variables as predictors.\\ %Hints: {\tt lm.all = lm(formula = Nb.nests \~ \; . , data=cat.dat)}\\ Give the exact math equation that describes the model fitted by the R command above. What are the parameters involved? What are the main assumptions involved? \item Can you reject with confidence the null hypothesis in a global test? \item Plot the residuals againtst the predicted values. Does it seem reasonable to assume that the variance of the noise is the same for all observations?\\ Hints: The residuals are computed by the {\tt lm } function and stored as a component of the returned list named residuals. The predicted values can be obtained by {\tt predict(lm.all)} \item Can the model quality be improved by a regression of the logarithm of the number of nests (instead of the number of nests)? \end{itemize} } %% \end{frame} %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \begin{frame}[fragile] %% \frametitle{} {\small From now on, we work with the logarithm of the number of nests as response variable.\\} {\footnotesize \begin{itemize} \item The last line printed by the R command {\tt summary(lm.all)} contains the p-value for the null hypothesis of absence of correlation with any explanatoty variable. Can you accept safely this null hypothesis? \item Plot the predicted values against the observed response values and compute their coefficient of correlation. Does this seem to be a useful model? \item Print the esimated coefficients and associated p-values. What variables stand out as good or poor predictors? Compare this to a linear model with Density.index as single predictor. How do you explain the difference? \end{itemize} } %% \end{frame} %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \begin{frame}[fragile] %% \frametitle{} {\tiny \begin{itemize} \item From the above, there are obsviously some variables that are useless in the model. Compute leave-one-out prediction error with all variables then with all variables except Mixing.index. \\ Hints:\\ \begin{verbatim} ## leave one out prediction with full model mse.1 <- 0 loo.pred1 <- numeric(nrow(cat.dat)) for(i in 1:nrow(cat.dat)) { ## fitting a model with all variables ## to the dataset without obs. i lm.res.loo <- lm(formula=log(Nb.nests) ~. , data=cat.dat[-i,]) ## predicting Log.nb.nests[i] with this model ## (just a linear combination with estimated coef.): loo.pred1[i] <- sum(c(1,as.numeric(cat.dat[i,1:10]))*lm.res.loo$coefficients) ## squre error in prediction mse.1 <- mse.1 + (log(cat.dat$Nb.nests[i]) - loo.pred1[i])^2 } mse.1 <- mse.1/(nrow(cat.dat)-1) rmse.1 <- sqrt(mse.1) \end{verbatim} \end{itemize} } %% \end{frame} %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% \begin{frame}[fragile] %% \frametitle{} {\tiny \begin{verbatim} ## leave one out prediction with all variables exept Mixing.index mse.2 <- 0 loo.pred2 <- numeric(nrow(cat.dat)) for(i in 1:nrow(cat.dat)) { ## fitting a model with all variables except Mixing.index ## to the dataset without obs. i lm.res.loo <- lm(formula=(log(Nb.nests) ~. - Mixing.index), data=cat.dat[-i,]) ## predicting Log.nb.nests[i] with this model ## (just a linear combination with estimated coef.): loo.pred2[i] <- sum(c(1,as.numeric(cat.dat[i,1:9]))*lm.res.loo$coefficients) ## squre error in prediction mse.2 <- mse.2 + (log(cat.dat$Nb.nests[i]) - loo.pred2[i])^2 } mse.2 <- mse.2/(nrow(cat.dat)-1) rmse.2 <- sqrt(mse.2) rmse.1 rmse.2 ## plot(log(cat.dat$Nb.nests),loo.pred1,xlab="Log nb of nest", ylab="LOO prediction"); abline(0,1, col=3) points(log(cat.dat$Nb.nests),loo.pred2,col=2,pch=2) \end{verbatim} On the basis of the above, would you include {\tt Mixing.index} in the model? } %% {\footnotesize %% \begin{itemize} %% %% \item We want now to perform variable selection. We will minimize the Akaike information criterion (AIC) defined e.g. in [Bingham and Fry p. 140].\\ %% %% This is performed in R with the function {\tt step()} that takes the output of the {\tt lm ()} function as an argument.\\ %% %% What is the final model retained? %% \item What leave-one-out Root Mean Square Error (RMSE) do you obtain with this ``best'' model? %% \end{itemize} %% }