Assignment 8

In this assignment, you will use R (within RStudio) to load a data set and perform some basic statistical modeling.

All file paths should be relative, starting from your Assignment_8 directory!!

This means that you need to create a new R-Project named “Assignment_8.Rproj” in your Assignment_8 directory, and work from scripts within that.

For credit…

  1. Push a completed version of your Rproj and R-script (details at end of this assignment) to GitHub
  2. Submit a plaintext file to Canvas answering the questions at the end of the assignment.

Your tasks:


Models

"In what way does variable Y (The response) depend on other variables (X1 …. Xn) in the study?

The model attempts to approximate this relationship

Statistical modeling is a very complex topic so we are just scratching the surface

Decisions, decisions…

  • Which variable is your response?

  • Which variables are explanatory?

  • Are the explanatory variables continuous, categorical, or both?

      1. All continuous: Regression
      2. All categorical: ANOVA
      3. Mix:            ANCOVA

Is the response variable continuous, a count, a proportion, or a category?

  • Continuous: Regression, ANOVA, ANCOVA
  • Categorical: ANOVA
  • Proportion: Logistic regression
  • Count: Log-linear model
  • Binary: Binary logistic regression
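
As a rough map from those response types to R's model-fitting functions, here's a sketch with made-up data (the data frame `df` and its variables are purely illustrative):

```r
set.seed(1)
df <- data.frame(
  x1 = rnorm(30),                       # continuous predictor
  g  = factor(rep(c("a", "b"), 15))     # categorical predictor
)
df$y    <- 2 * df$x1 + rnorm(30)        # continuous response
df$ybin <- rbinom(30, 1, 0.5)           # binary response
df$ycnt <- rpois(30, 3)                 # count response

lm(y ~ x1, data = df)                         # regression (continuous predictors)
aov(y ~ g, data = df)                         # ANOVA (categorical predictors)
lm(y ~ x1 + g, data = df)                     # ANCOVA (a mix of both)
glm(ybin ~ x1, family = binomial, data = df)  # binary logistic regression
glm(ycnt ~ x1, family = poisson, data = df)   # log-linear (count) model
```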

Here’s a handy website: https://stats.idre.ucla.edu/other/mult-pkg/whatstat/


What models look like in R (the very basics)

Y ~ X This means “Y is modeled as a function of X”

Y ~ X1 + X2 This means “Y is modeled as a function of X1 AND X2” (two explanatory variables)

Y ~ X1:X2 This means “Y is modeled as a function of THE INTERACTION BETWEEN X1 AND X2” (only the interaction term)

Y ~ X1*X2 This means “Y is modeled as a function of X1 AND X2 AND THE INTERACTION BETWEEN X1 AND X2”
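
You don't have to fit anything to see how R expands these formulas; terms() spells out the individual terms (Y, X1, and X2 here are just symbols, no data needed):

```r
# X1*X2 is shorthand for both main effects plus their interaction
attr(terms(Y ~ X1 * X2), "term.labels")
## [1] "X1"    "X2"    "X1:X2"

# X1:X2 is the interaction term only
attr(terms(Y ~ X1:X2), "term.labels")
## [1] "X1:X2"
```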


R comes with a large variety of models you can choose from. Other packages provide hundreds more. Or you could write your own, if you are some sort of maniac!

The basic idea though, is to try to fit your data to a given model and then see how well it fits. If the fit is good, you can use that model to make meaningful (but not perfect) predictions.

Let’s look at an example:


First, load some useful packages…

library(modelr)
library(easystats)
library(broom)
library(tidyverse)
library(fitdistrplus)

Next, load some data…

data("mtcars")
glimpse(mtcars)
## Rows: 32
## Columns: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…

Our dependent variable is mpg (it’s the thing we want to know about).

Any or all of the other independent variables might help explain why it varies between different cars.

Let’s try a simple linear model with displacement as the explanatory variable…

mod1 = lm(mpg ~ disp, data = mtcars)
summary(mod1)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

That’s a lot of summary information about our model, and it’s kinda confusing.

  • “Call” shows what model you ran
  • “Residuals” are essentially measures of how well your actual data fit the model
  • “Coefficients” can be thought of as the intercept and slope of the best-fit line

If you want a deeper explanation, see this website: https://feliperego.github.io/blog/2015/10/23/Interpreting-Model-Output-In-R
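
Since broom is already loaded, you can also pull that same information out as tidy data frames instead of reading it off the summary() printout (using mod1 from above):

```r
tidy(mod1)    # the coefficient table as a data frame (term, estimate, std.error, statistic, p.value)
glance(mod1)  # a one-row model summary (r.squared, AIC, etc.)
coef(mod1)    # just the intercept and slope, as a named vector
```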

It may help to look at it visually…

ggplot(mtcars, aes(x=disp,y=mpg)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

On this figure:

  • The Residuals are the distances of each point from the best-fit line
  • The Coefficient for displacement is the slope of that line

In other words, looking at our model summary tells us that this model predicts that for every increase in displacement of 1 cu.in. we can expect our fuel efficiency to drop by 0.04 mpg

Perhaps there’s a better model in our data, however…

Let’s look at one that uses quarter-mile time (qsec) instead…

mod2 = lm(mpg ~ qsec, data = mtcars)
ggplot(mtcars, aes(x=qsec,y=mpg)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal()
## `geom_smooth()` using formula 'y ~ x'

How to compare these two different models?

First, you can visually compare the scatter around each line. By this measure, it seems clear that our second model is a poorer fit than the first.

We can explicitly measure that scatter. We call this the mean-squared error of a model. The smaller, the better, so compare the two values from our models…

mean(mod1$residuals^2)
## [1] 9.911209
mean(mod2$residuals^2)
## [1] 29.02048
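
modelr also ships helpers for this, so you don't have to square the residuals by hand; rmse() (and, in recent modelr versions, mse()) take the model and the data directly:

```r
mse(mod1, mtcars)    # same value as mean(mod1$residuals^2)
mse(mod2, mtcars)
rmse(mod1, mtcars)   # root-mean-squared error, in the units of mpg
```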

We’ll abandon our second model for now because it sucks (relative to the first one) and look at how to make predictions using mod1.

The add_predictions() function from the modelr package lets us take our data frame and our model and look at what values our model assigns to our response variable (mpg). This is looking at ACTUAL vs PREDICTED values. If they are close enough for comfort we can move on and make predictions for unknown values in our model.

df <- mtcars %>% 
  add_predictions(mod1) 
df %>% dplyr::select("mpg","pred")
##                      mpg     pred
## Mazda RX4           21.0 23.00544
## Mazda RX4 Wag       21.0 23.00544
## Datsun 710          22.8 25.14862
## Hornet 4 Drive      21.4 18.96635
## Hornet Sportabout   18.7 14.76241
## Valiant             18.1 20.32645
## Duster 360          14.3 14.76241
## Merc 240D           24.4 23.55360
## Merc 230            22.8 23.79677
## Merc 280            19.2 22.69220
## Merc 280C           17.8 22.69220
## Merc 450SE          16.4 18.23272
## Merc 450SL          17.3 18.23272
## Merc 450SLC         15.2 18.23272
## Cadillac Fleetwood  10.4 10.14632
## Lincoln Continental 10.4 10.64090
## Chrysler Imperial   14.7 11.46520
## Fiat 128            32.4 26.35622
## Honda Civic         30.4 26.47987
## Toyota Corolla      33.9 26.66946
## Toyota Corona       21.5 24.64992
## Dodge Challenger    15.5 16.49345
## AMC Javelin         15.2 17.07046
## Camaro Z28          13.3 15.17456
## Pontiac Firebird    19.2 13.11381
## Fiat X1-9           27.3 26.34386
## Porsche 914-2       26.0 24.64168
## Lotus Europa        30.4 25.68030
## Ford Pantera L      15.8 15.13335
## Ferrari Dino        19.7 23.62366
## Maserati Bora       15.0 17.19410
## Volvo 142E          21.4 24.61283

We can also ask R to predict dependent values based on hypothetical independent values. Let’s say we wanted to know what mpg our model “thinks” a car would have if it had an engine displacement of 500 cubic inches. What about a 900 cubic inch engine? What does our model say about the mpg in such a car?

# Make a new dataframe with the predictor values we want to assess
# mod1 only has "disp" as a predictor so that's what we want to add here
newdf = data.frame(disp = c(500,600,700,800,900)) # anything specified in the model needs to be here with exact matching column names

# making predictions
pred = predict(mod1, newdata = newdf)

# combining hypothetical input data with hypothetical predictions into one new data frame
hyp_preds <- data.frame(disp = newdf$disp,
                        pred = pred)

# Add new column showing whether a data point is real or hypothetical
df$PredictionType <- "Real"
hyp_preds$PredictionType <- "Hypothetical"

# joining our real data and hypothetical data (with model predictions)
fullpreds <- full_join(df,hyp_preds)
## Joining, by = c("disp", "pred", "PredictionType")
# plot those predictions on our original graph
ggplot(fullpreds,aes(x=disp,y=pred,color=PredictionType)) +
  geom_point() +
  geom_point(aes(y=mpg),color="Black") +
  theme_minimal()
## Warning: Removed 5 rows containing missing values (geom_point).

Note a few things about our model’s predictions (colored points). They fall right on the prediction line, as expected… but some of them are negative! Can a car have negative mpg?!

So, even a model that is good at explaining things within the bounds of your existing data might lose all relevance when you extrapolate out to hypothetical situations.
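
With mod1 you can see exactly where the extrapolation goes bad by solving the fitted line for the displacement at which predicted mpg reaches zero (the intercept divided by the negative of the slope):

```r
b <- coef(mod1)                        # named vector: "(Intercept)" and "disp"
unname(-b["(Intercept)"] / b["disp"])  # displacement at which predicted mpg hits 0
## roughly 718 cubic inches -- any hypothetical engine bigger than that gets a negative mpg
```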

One more thing for now is how we can compare predictions from several models simultaneously…

# Define a 3rd model
mod3 <- glm(data=mtcars,
            formula = mpg ~ hp + disp + factor(am) + qsec)

# put all models into a list
mods <- list(mod1=mod1,mod2=mod2,mod3=mod3)
# apply "performance" function on all in the list and combine 
map(mods,performance) %>% reduce(full_join)
## # Indices of model performance
## 
## AIC     |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
## -----------------------------------------------------
## 170.209 | 174.607 | 0.718 |     0.709 | 3.148 | 3.251
## 204.588 | 208.985 | 0.175 |     0.148 | 5.387 | 5.564
## 164.863 | 173.658 | 0.802 |           | 2.637 | 2.871
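
The performance package (part of easystats, which is loaded above) can also do this comparison in one step; compare_performance() takes the models directly and can rank them for you:

```r
compare_performance(mod1, mod2, mod3, rank = TRUE)
```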
# gather residuals from all 3 models
mtcars %>% 
  gather_residuals(mod1,mod2,mod3) %>% 
  ggplot(aes(x=model,y=resid,fill=model)) +
  geom_boxplot(alpha=.5) +
  geom_point() + 
  theme_minimal()

# gather predictions from all 3 models
mtcars %>% 
  gather_predictions(mod1,mod2,mod3) %>% 
  ggplot(aes(x=disp,y=mpg)) +
  geom_point(size=3) +
  geom_point(aes(y=pred,color=model)) +
  geom_smooth(aes(y=pred,color=model)) +
  theme_minimal() +
  annotate("text",x=250,y=32,label=mod1$call) +
  annotate("text",x=250,y=30,label=mod2$call) +
  annotate("text",x=250,y=28,label=mod3$call)

I think the 3rd model is the best of these three. There’s probably a better model out there, but this one is a good fit and useful for understanding mpg. There’s an easy way to help put the model into interpretable English:

report(mod3)
## We fitted a linear model (estimated using ML) to predict mpg with hp, disp, am and qsec (formula: mpg ~ hp + disp + factor(am) + qsec). The model's explanatory power is substantial (R2 = 0.80). The model's intercept, corresponding to hp = 0, disp = 0, am = 0 and qsec = 0, is at 20.44 (95% CI [-1.74, 42.62], t(27) = 1.81, p = 0.082). Within this model:
## 
##   - The effect of hp is statistically significant and negative (beta = -0.04, 95% CI [-0.07, -3.36e-03], t(27) = -2.16, p = 0.039; Std. beta = -0.40, 95% CI [-0.77, -0.04])
##   - The effect of disp is statistically non-significant and negative (beta = -0.01, 95% CI [-0.03, 5.25e-03], t(27) = -1.40, p = 0.174; Std. beta = -0.27, 95% CI [-0.64, 0.11])
##   - The effect of am [1] is statistically significant and positive (beta = 4.43, 95% CI [1.05, 7.81], t(27) = 2.57, p = 0.016; Std. beta = 0.73, 95% CI [0.17, 1.30])
##   - The effect of qsec is statistically non-significant and positive (beta = 0.34, 95% CI [-0.66, 1.34], t(27) = 0.66, p = 0.513; Std. beta = 0.10, 95% CI [-0.20, 0.40])
## 
## Standardized parameters were obtained by fitting the model on a standardized version of the dataset. 95% Confidence Intervals (CIs) and p-values were computed using

At last, your assignment!