Example model exploration and interpretation (for assignment 9)

I will be looking at the Palmer Penguins dataset from the palmerpenguins package.

Some setup, first:

library(tidyverse)
library(modelr)
library(easystats)
library(palmerpenguins)
library(GGally)

Now, a quick glance at the data structure:

glimpse(penguins)
## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
penguins %>% 
  select(species,body_mass_g,sex) %>% 
  ggpairs()

I’ll focus on modeling body mass of these penguins. From the ggpairs plot above, it seems likely that species and sex have significant effects, but I will explore other ways of accurately predicting body mass as well.

Let’s take a look at the distribution of body mass amongst these various penguin groups:

penguins %>% 
  ggplot(aes(x=body_mass_g,fill=sex)) +
  geom_density(alpha=.5) +
  facet_grid(island~species) +
  theme_minimal()

# quick test of normality
ggpubr::ggqqplot(penguins$body_mass_g)

Looks good enough for me!


First I’ll define a few potential models with varying complexity:

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
mod1 <- glm(data=penguins,
            formula=body_mass_g ~ species + sex)

mod2 <- glm(data=penguins,
            formula=body_mass_g ~ species * sex)

mod3 <- glm(data=penguins,
            formula=body_mass_g ~ species * sex * island)

mod4 <- glm(data=penguins,
            formula=body_mass_g ~ species * sex * island + year)

full_mod <- glm(data=penguins,
            formula=body_mass_g ~ species * sex * island * year * flipper_length_mm * bill_length_mm * bill_depth_mm)

Next, I’m going to take that overly complex full_mod and see if I can winnow it down using stewise AIC:

step <- MASS::stepAIC(full_mod,trace=0) # trace=0 suppresses lengthy output to console
mod5 <- glm(data=penguins,
            formula=step$formula) # use that AIC-selected model as mod5 (it's too big to print here)

Now, I’ll compare the performance of each of these models using the performance package:

comps <- compare_performance(mod1,mod2,mod3,mod4,mod5,
                    rank=TRUE)
comps
## # Comparison of Model Performance Indices
## 
## Name | Model |      AIC |      BIC |    R2 |    RMSE |   Sigma | Performance-Score
## ----------------------------------------------------------------------------------
## mod5 |   glm | 4739.429 | 5352.540 | 0.948 | 183.798 | 255.000 |            80.00%
## mod2 |   glm | 4772.219 | 4798.876 | 0.855 | 306.597 | 309.397 |            30.92%
## mod3 |   glm | 4779.832 | 4821.722 | 0.855 | 306.419 | 311.126 |            26.29%
## mod4 |   glm | 4781.800 | 4827.497 | 0.855 | 306.404 | 311.594 |            25.08%
## mod1 |   glm | 4785.594 | 4804.634 | 0.847 | 314.701 | 316.608 |            19.79%
# and a plot of that table
comps %>% plot()

It sure looks like the complex mod5 outperforms the other models I tried. I have my worries that it’s overfitting though, because the formula is so cluttered. Perhaps I’ll try a scaled-back version of that model without so many interaction terms:

names(penguins)
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"
mod6 <- glm(data=penguins,
            formula = body_mass_g ~ (species * sex * island) + bill_length_mm + bill_depth_mm + flipper_length_mm + year)
  
compare_performance(mod1,mod2,mod3,mod4,mod5,mod6,
                    rank=TRUE)
## # Comparison of Model Performance Indices
## 
## Name | Model |      AIC |      BIC |    R2 |    RMSE |   Sigma | Performance-Score
## ----------------------------------------------------------------------------------
## mod5 |   glm | 4739.429 | 5352.540 | 0.948 | 183.798 | 255.000 |            71.36%
## mod6 |   glm | 4704.306 | 4761.428 | 0.887 | 270.302 | 276.169 |            67.87%
## mod2 |   glm | 4772.219 | 4798.876 | 0.855 | 306.597 | 309.397 |            27.15%
## mod3 |   glm | 4779.832 | 4821.722 | 0.855 | 306.419 | 311.126 |            24.00%
## mod4 |   glm | 4781.800 | 4827.497 | 0.855 | 306.404 | 311.594 |            23.17%
## mod1 |   glm | 4785.594 | 4804.634 | 0.847 | 314.701 | 316.608 |            18.54%

Though mod6 doesn’t do a good of a job fitting the data, I feel better about it due to it’s simplicity and flexibility in comparison to mod5. If you want to see the mod5 formula, I’m including it at the end of the document.

Let’s take a look at predictions:

penguins %>% 
  gather_predictions(mod1,mod2,mod3,mod4,mod5,mod6) %>% 
  ggplot(aes(x=body_mass_g,y=pred,color=model)) +
  geom_segment(aes(x=0,y=0,xend=6000,yend=6000),linetype=2, color="black",alpha=.5) +
  geom_smooth(method = "lm",se=FALSE) +
  facet_wrap(~species) +
  theme_minimal() +
  scale_color_viridis_d() +
  labs(title = "Predictions vs observations",
       subtitle = "dashed line indicates perfect overlap between observed values and model predictions")





Looking at this plot, I see that mod5 is doing a really good job of making predictions within this data set. But I want to test it and mod6 in a cross-validation framework to see how it performs when it can’t “see” the whole data set. Here, I will split the penguins data into a training and testing set, randomly.

set.seed(123)
training_samples <- caret::createDataPartition(seq_along(penguins$body_mass_g),
                           p=.8)
train_data <- penguins[training_samples$Resample1,]
test_data <- penguins[-training_samples$Resample1,] # all rows not in training_samples

Re-training my models:

mod5_formula <- mod5$formula
mod6_formula <- mod6$formula

mod5 <- glm(data=train_data,
            formula = mod5_formula)

mod6 <- glm(data=train_data,
            formula = mod6_formula)

Now, testing their predictive ability against the test data set:

gather_predictions(test_data, mod5, mod6) %>% 
  ggplot(aes(x=body_mass_g,y=pred,color=model)) +
  geom_segment(aes(x=0,y=0,xend=6000,yend=6000),linetype=2, color="black",alpha=.5) +
  geom_smooth(method = "lm",se=FALSE) +
  facet_wrap(~species,scales = "free") +
  theme_minimal() +
  scale_color_viridis_d() +
  labs(title = "Predictions vs observations",
       subtitle = "dashed line indicates perfect overlap between observed values and model predictions")

Okay, mod5 failed spectacularly for the Adelie penguins. So badly, it’s throwing off the scale and predicting some penguins to be heavier than the largest blue whales. I’m definitely going with mod6. This is a common problem when dealing with models that are overfit to a data set! Once they’re free to predict outside of their training, they look really stupid.

Here’s the model summary info for mod6, explaining body mass of these penguins:

mod6 %>% model_parameters()
## Parameter                        | Coefficient |       SE |              95% CI | t(253) |      p
## -------------------------------------------------------------------------------------------------
## (Intercept)                      |    95423.96 | 47240.43 | [2834.42, 1.88e+05] |   2.02 | 0.044 
## species [Chinstrap]              |     -123.24 |   109.52 | [-337.90,    91.42] |  -1.13 | 0.262 
## species [Gentoo]                 |      869.51 |   166.05 | [ 544.05,  1194.96] |   5.24 | < .001
## sex [male]                       |      446.57 |   104.50 | [ 241.76,   651.38] |   4.27 | < .001
## island [Dream]                   |      -27.30 |    89.44 | [-202.60,   148.00] |  -0.31 | 0.760 
## island [Torgersen]               |       25.09 |    91.80 | [-154.84,   205.02] |   0.27 | 0.785 
## bill_length_mm                   |       23.05 |     7.79 | [   7.77,    38.32] |   2.96 | 0.003 
## bill_depth_mm                    |       65.16 |    21.58 | [  22.86,   107.46] |   3.02 | 0.003 
## flipper_length_mm                |       18.71 |     3.52 | [  11.82,    25.60] |   5.32 | < .001
## year                             |      -48.59 |    23.63 | [ -94.90,    -2.28] |  -2.06 | 0.041 
## species [Chinstrap] * sex [male] |     -375.34 |   116.90 | [-604.46,  -146.22] |  -3.21 | 0.001 
## species [Gentoo] * sex [male]    |       10.67 |   116.56 | [-217.78,   239.12] |   0.09 | 0.927 
## sex [male] * island [Dream]      |        4.67 |   130.97 | [-252.03,   261.38] |   0.04 | 0.972 
## sex [male] * island [Torgersen]  |     -119.68 |   135.77 | [-385.79,   146.44] |  -0.88 | 0.379

Takeaway messages:

  1. Being a Gentoo penguin means you’re probably 869 g heavier than an equivalent Adelie
  2. Being male adds an expected 446 g to body mass
  3. Increase of 1 mm bill length adds an expected 23 g body mass
  4. Increase of 1 mm bill depth adds 65 g of expected body mass
  5. Each increase in 1 mm flipper length is expected to indicate an additional 18 g of body mass
  6. If you’re a male Chinstrap penguin, go ahead and knock off 375 g of expected body mass
  7. Each additional year in this series subtracts 48 g of expected body mass
  8. If we know the year and your flipper length, bill length and depth, sex, and species, we can make a decent guess as to your body mass

One more thing, for a bonus… I’ll take a look at dimensional reduction (NMDS) to see if we can find the same pattern from our model from another direction (for more confirmation of important parameters)

NMDS_data <- penguins %>% 
  select(bill_length_mm,bill_depth_mm,flipper_length_mm)
comp_cases <- complete.cases(NMDS_data)

NMDS_data <- NMDS_data[comp_cases,]

MDS <- vegan::metaMDS(comm=NMDS_data,distance = "bray",autotransform = TRUE,k = 2,trace=0)

penguins[comp_cases,] %>% 
  mutate(MDS1=MDS$points[,1],
         MDS2=MDS$points[,2]) %>% 
  ggplot(aes(x=MDS1,y=MDS2,color=body_mass_g)) +
  geom_point() +
  scale_color_viridis_c() +
  theme_minimal() + 
  facet_wrap(~species)

The color scheme relates to body mass, the location in the plot relates to overall composition of bill_length_mm, bill_depth_mm, and flipper_length_mm. Again, we see that those characteristics, combined with species identity, are important for predicting body mass.



Just for fun, here’s the horribly overfit mod5 formula:

mod5$formula
## body_mass_g ~ species + sex + island + year + flipper_length_mm + 
##     bill_length_mm + bill_depth_mm + species:sex + sex:island + 
##     species:year + sex:year + island:year + species:flipper_length_mm + 
##     sex:flipper_length_mm + island:flipper_length_mm + year:flipper_length_mm + 
##     species:bill_length_mm + sex:bill_length_mm + island:bill_length_mm + 
##     year:bill_length_mm + flipper_length_mm:bill_length_mm + 
##     species:bill_depth_mm + sex:bill_depth_mm + island:bill_depth_mm + 
##     year:bill_depth_mm + flipper_length_mm:bill_depth_mm + bill_length_mm:bill_depth_mm + 
##     species:sex:year + sex:island:year + species:sex:flipper_length_mm + 
##     sex:island:flipper_length_mm + species:year:flipper_length_mm + 
##     sex:year:flipper_length_mm + island:year:flipper_length_mm + 
##     species:sex:bill_length_mm + sex:island:bill_length_mm + 
##     species:year:bill_length_mm + sex:year:bill_length_mm + island:year:bill_length_mm + 
##     species:flipper_length_mm:bill_length_mm + sex:flipper_length_mm:bill_length_mm + 
##     island:flipper_length_mm:bill_length_mm + year:flipper_length_mm:bill_length_mm + 
##     species:sex:bill_depth_mm + sex:island:bill_depth_mm + species:year:bill_depth_mm + 
##     sex:year:bill_depth_mm + island:year:bill_depth_mm + species:flipper_length_mm:bill_depth_mm + 
##     sex:flipper_length_mm:bill_depth_mm + island:flipper_length_mm:bill_depth_mm + 
##     year:flipper_length_mm:bill_depth_mm + species:bill_length_mm:bill_depth_mm + 
##     sex:bill_length_mm:bill_depth_mm + island:bill_length_mm:bill_depth_mm + 
##     year:bill_length_mm:bill_depth_mm + flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     species:sex:year:flipper_length_mm + sex:island:year:flipper_length_mm + 
##     species:sex:year:bill_length_mm + sex:island:year:bill_length_mm + 
##     species:sex:flipper_length_mm:bill_length_mm + sex:island:flipper_length_mm:bill_length_mm + 
##     species:year:flipper_length_mm:bill_length_mm + sex:year:flipper_length_mm:bill_length_mm + 
##     island:year:flipper_length_mm:bill_length_mm + species:sex:year:bill_depth_mm + 
##     sex:island:year:bill_depth_mm + species:sex:flipper_length_mm:bill_depth_mm + 
##     sex:island:flipper_length_mm:bill_depth_mm + species:year:flipper_length_mm:bill_depth_mm + 
##     sex:year:flipper_length_mm:bill_depth_mm + island:year:flipper_length_mm:bill_depth_mm + 
##     species:sex:bill_length_mm:bill_depth_mm + sex:island:bill_length_mm:bill_depth_mm + 
##     species:year:bill_length_mm:bill_depth_mm + sex:year:bill_length_mm:bill_depth_mm + 
##     island:year:bill_length_mm:bill_depth_mm + species:flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     sex:flipper_length_mm:bill_length_mm:bill_depth_mm + island:flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     year:flipper_length_mm:bill_length_mm:bill_depth_mm + species:sex:year:flipper_length_mm:bill_length_mm + 
##     sex:island:year:flipper_length_mm:bill_length_mm + species:sex:year:flipper_length_mm:bill_depth_mm + 
##     sex:island:year:flipper_length_mm:bill_depth_mm + species:sex:year:bill_length_mm:bill_depth_mm + 
##     sex:island:year:bill_length_mm:bill_depth_mm + species:sex:flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     sex:island:flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     species:year:flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     sex:year:flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     island:year:flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     species:sex:year:flipper_length_mm:bill_length_mm:bill_depth_mm + 
##     sex:island:year:flipper_length_mm:bill_length_mm:bill_depth_mm