In this assignment, you will use R (within RStudio) to load a data set and perform statistical modeling.
All file paths should be relative, starting from your Assignment_8 directory!
This means that you need to create a new R Project named “Assignment_8.Rproj” in your Assignment_8 directory and work from scripts within that project.
Which variable is your response?
Which variables are explanatory?
Are the explanatory variables continuous, categorical, or both?
1. All continuous: Regression
2. All categorical: ANOVA
3. Mix: ANCOVA
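In R, all three cases use the same formula machinery. Here is a quick sketch with the built-in mtcars data (wt standing in for a continuous predictor and cyl for a categorical one; the variable choices are purely illustrative):

```r
# Illustrative only -- wt and cyl stand in for "your" variables

# 1. Continuous explanatory variable -> regression
reg_mod <- lm(mpg ~ wt, data = mtcars)

# 2. Categorical explanatory variable -> ANOVA
aov_mod <- aov(mpg ~ factor(cyl), data = mtcars)

# 3. Mix of both -> ANCOVA (still just lm(), with mixed predictor types)
anc_mod <- lm(mpg ~ wt + factor(cyl), data = mtcars)

summary(reg_mod)  # regression table
summary(aov_mod)  # ANOVA table
summary(anc_mod)  # ANCOVA-style table
```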
Here’s a handy website: https://stats.idre.ucla.edu/other/mult-pkg/whatstat/
Y ~ X This means “Y is modeled as a function of X”
Y ~ X1 + X2 This means “Y is modeled as a function of X1 AND X2” (two explanatory variables)
Y ~ X1:X2 This means “Y is modeled as a function of THE INTERACTION BETWEEN X1 AND X2” (only the interaction term)
Y ~ X1*X2 This means “Y is modeled as a function of X1 AND X2 AND THE INTERACTION BETWEEN X1 AND X2”
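You can check that * really is shorthand for the main effects plus the interaction. The two fits below (using mtcars columns purely for illustration) produce identical coefficients:

```r
# The * operator expands to main effects + interaction
m_star <- lm(mpg ~ hp * wt,         data = mtcars)
m_long <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)

all.equal(coef(m_star), coef(m_long))  # TRUE -- these are the same model
```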
R comes with a large variety of models you can choose from. Other packages provide hundreds more. Or you could write your own, if you are some sort of maniac!
The basic idea, though, is to fit your data to a given model and then see how well it fits. If the fit is good, you can use that model to make meaningful (but not perfect) predictions.
library(modelr)
library(easystats)
library(broom)
library(tidyverse)
library(fitdistrplus)
data("mtcars")
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
mod1 = lm(mpg ~ disp, data = mtcars)
summary(mod1)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
If you want a deeper explanation, see this website: https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
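If you would rather work with those numbers programmatically than read them off the printout, the broom package (loaded above) turns the summary into tidy data frames. A small sketch:

```r
library(broom)

mod1 <- lm(mpg ~ disp, data = mtcars)

tidy(mod1)    # coefficient table: term, estimate, std.error, statistic, p.value
glance(mod1)  # one-row model summary: r.squared, adj.r.squared, AIC, p.value, ...

# e.g., pull out just the slope on disp:
tidy(mod1)$estimate[2]  # -0.041215, matching the summary above
```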
It may help to look at it visually…
ggplot(mtcars, aes(x=disp,y=mpg)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
Let’s look at one that uses quarter-mile time (qsec) instead
mod2 = lm(mpg ~ qsec, data = mtcars)
ggplot(mtcars, aes(x=qsec,y=mpg)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
## `geom_smooth()` using formula 'y ~ x'
First, you can visually inspect the scatter around each line. By this measure, it seems clear our second model is a poorer fit than the first.
We can explicitly measure that scatter. We call this the mean squared error (MSE) of a model. The smaller, the better, so compare the two values from our models…
mean(mod1$residuals^2)
## [1] 9.911209
mean(mod2$residuals^2)
## [1] 29.02048
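The modelr package (loaded above) has a helper for the closely related root mean squared error; rmse() is just the square root of the MSE we computed by hand, so the comparison comes out the same way:

```r
library(modelr)

mod1 <- lm(mpg ~ disp, data = mtcars)
mod2 <- lm(mpg ~ qsec, data = mtcars)

rmse(mod1, mtcars)  # sqrt(9.911209)  ~ 3.148
rmse(mod2, mtcars)  # sqrt(29.02048)  ~ 5.387
```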
The add_predictions() function from the modelr package takes our data frame and our model and shows what values the model assigns to our response variable (mpg). This lets us compare ACTUAL vs PREDICTED values. If they are close enough for comfort, we can move on and make predictions for unknown values with our model.
df <- mtcars %>%
add_predictions(mod1)
df %>% dplyr::select("mpg","pred")
## mpg pred
## Mazda RX4 21.0 23.00544
## Mazda RX4 Wag 21.0 23.00544
## Datsun 710 22.8 25.14862
## Hornet 4 Drive 21.4 18.96635
## Hornet Sportabout 18.7 14.76241
## Valiant 18.1 20.32645
## Duster 360 14.3 14.76241
## Merc 240D 24.4 23.55360
## Merc 230 22.8 23.79677
## Merc 280 19.2 22.69220
## Merc 280C 17.8 22.69220
## Merc 450SE 16.4 18.23272
## Merc 450SL 17.3 18.23272
## Merc 450SLC 15.2 18.23272
## Cadillac Fleetwood 10.4 10.14632
## Lincoln Continental 10.4 10.64090
## Chrysler Imperial 14.7 11.46520
## Fiat 128 32.4 26.35622
## Honda Civic 30.4 26.47987
## Toyota Corolla 33.9 26.66946
## Toyota Corona 21.5 24.64992
## Dodge Challenger 15.5 16.49345
## AMC Javelin 15.2 17.07046
## Camaro Z28 13.3 15.17456
## Pontiac Firebird 19.2 13.11381
## Fiat X1-9 27.3 26.34386
## Porsche 914-2 26.0 24.64168
## Lotus Europa 30.4 25.68030
## Ford Pantera L 15.8 15.13335
## Ferrari Dino 19.7 23.62366
## Maserati Bora 15.0 17.19410
## Volvo 142E 21.4 24.61283
We can also ask R to predict dependent values based on hypothetical independent values. Let’s say we wanted to know what mpg our model “thinks” a car would have if it had an engine displacement of 500 cubic inches. What about a 900 cubic inch engine? What does our model say about the mpg in such a car?
# Make a new dataframe with the predictor values we want to assess
# mod1 only has "disp" as a predictor so that's what we want to add here
newdf = data.frame(disp = c(500,600,700,800,900)) # anything specified in the model needs to be here with exact matching column names
# making predictions
pred = predict(mod1, newdata = newdf)
# combining hypothetical input data with hypothetical predictions into one new data frame
hyp_preds <- data.frame(disp = newdf$disp,
pred = pred)
# Add new column showing whether a data point is real or hypothetical
df$PredictionType <- "Real"
hyp_preds$PredictionType <- "Hypothetical"
# joining our real data and hypothetical data (with model predictions)
fullpreds <- full_join(df,hyp_preds)
## Joining, by = c("disp", "pred", "PredictionType")
# plot those predictions on our original graph
ggplot(fullpreds,aes(x=disp,y=pred,color=PredictionType)) +
geom_point() +
geom_point(aes(y=mpg),color="Black") +
theme_minimal()
## Warning: Removed 5 rows containing missing values (geom_point).
One more thing for now is how we can compare predictions from several models simultaneously…
# Define a 3rd model
mod3 <- glm(data=mtcars,
formula = mpg ~ hp + disp + factor(am) + qsec)
# put all models into a list
mods <- list(mod1=mod1,mod2=mod2,mod3=mod3)
# apply "performance" function on all in the list and combine
map(mods,performance) %>% reduce(full_join)
## # Indices of model performance
##
## AIC | BIC | R2 | R2 (adj.) | RMSE | Sigma
## -----------------------------------------------------
## 170.209 | 174.607 | 0.718 | 0.709 | 3.148 | 3.251
## 204.588 | 208.985 | 0.175 | 0.148 | 5.387 | 5.564
## 164.863 | 173.658 | 0.802 | | 2.637 | 2.871
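The performance package (part of easystats) can also run this comparison in a single call, which avoids the list-and-map step; a sketch:

```r
library(performance)

mod1 <- lm(mpg ~ disp, data = mtcars)
mod2 <- lm(mpg ~ qsec, data = mtcars)
mod3 <- glm(mpg ~ hp + disp + factor(am) + qsec, data = mtcars)

# rank = TRUE adds a composite performance score and sorts the models by it
compare_performance(mod1, mod2, mod3, rank = TRUE)
```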
# gather residuals from all 3 models
mtcars %>%
gather_residuals(mod1,mod2,mod3) %>%
ggplot(aes(x=model,y=resid,fill=model)) +
geom_boxplot(alpha=.5) +
geom_point() +
theme_minimal()
# gather predictions from all 3 models
mtcars %>%
gather_predictions(mod1,mod2,mod3) %>%
ggplot(aes(x=disp,y=mpg)) +
geom_point(size=3) +
geom_point(aes(y=pred,color=model)) +
geom_smooth(aes(y=pred,color=model)) +
theme_minimal() +
annotate("text",x=250,y=32,label=mod1$call) +
annotate("text",x=250,y=30,label=mod2$call) +
annotate("text",x=250,y=28,label=mod3$call)
To my eye, the 3rd model is the best of these three. There’s probably a better model out there, but this one fits well and is useful for understanding mpg. There’s also an easy way to put the model into interpretable English:
report(mod3)
## We fitted a linear model (estimated using ML) to predict mpg with hp, disp, am and qsec (formula: mpg ~ hp + disp + factor(am) + qsec). The model's explanatory power is substantial (R2 = 0.80). The model's intercept, corresponding to hp = 0, disp = 0, am = 0 and qsec = 0, is at 20.44 (95% CI [-1.74, 42.62], t(27) = 1.81, p = 0.082). Within this model:
##
## - The effect of hp is statistically significant and negative (beta = -0.04, 95% CI [-0.07, -3.36e-03], t(27) = -2.16, p = 0.039; Std. beta = -0.40, 95% CI [-0.77, -0.04])
## - The effect of disp is statistically non-significant and negative (beta = -0.01, 95% CI [-0.03, 5.25e-03], t(27) = -1.40, p = 0.174; Std. beta = -0.27, 95% CI [-0.64, 0.11])
## - The effect of am [1] is statistically significant and positive (beta = 4.43, 95% CI [1.05, 7.81], t(27) = 2.57, p = 0.016; Std. beta = 0.73, 95% CI [0.17, 1.30])
## - The effect of qsec is statistically non-significant and positive (beta = 0.34, 95% CI [-0.66, 1.34], t(27) = 0.66, p = 0.513; Std. beta = 0.10, 95% CI [-0.20, 0.40])
##
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using
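If you want the same numbers report() narrates, but as a table you can manipulate, model_parameters() from the parameters package (also part of easystats) is one option:

```r
library(parameters)

mod3 <- glm(mpg ~ hp + disp + factor(am) + qsec, data = mtcars)
model_parameters(mod3)  # coefficient, CI, test statistic, and p for each term
```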