A quick analysis

This is just a glimpse of a simple analysis. It may not look simple right now, but pay attention to how just a few lines of code can give us professional-looking figures and solid statistical results. I’m showing you this early on so you get an idea of the sorts of things you’ll be able to create in a few short weeks.

This is an analysis of a famous dataset that measured sepal and petal lengths and widths in three species of Iris flowers.

If you were to open this dataset in Excel, the first few rows would look something like this:

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
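If you'd like to poke at the dataset a little more before plotting, here is a minimal sketch using only functions that come with base R. Nothing here is required for the rest of the analysis; it just shows how big the table is and how the rows are split across species.

```r
# iris ships with base R, so no packages are needed here
dim(iris)            # number of rows and columns
str(iris)            # column names and types
table(iris$Species)  # how many flowers were measured per species
```

You should see 150 rows and 5 columns, with 50 flowers for each of the three species.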

And a plot of Sepal Width vs Petal Width might look something like this:

But, honestly, that plot leaves a lot to be desired. If I were going to tell you how to recreate it, I’d basically need a video showing every option I clicked in Excel and how I had to rearrange and copy columns just to get it to work…and even then, it’s a bit horrible.

Contrast the above with this:

library(tidyverse)
# Scatter plot with one straight-line fit per species
ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal()
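Because color is mapped to Species, geom_smooth(method = "lm") fits a separate straight line for each species. If you're curious what those lines are, here is a rough sketch that recovers the three slopes with base R. The object name slopes is just something I made up for illustration.

```r
# Split the data by species, fit a simple linear model to each piece,
# and keep only the slope for Sepal.Width
slopes <- sapply(split(iris, iris$Species), function(d) {
  coef(lm(Petal.Width ~ Sepal.Width, data = d))[["Sepal.Width"]]
})

slopes  # one slope per species, named by species
```

Each number is the slope of one colored line in the plot.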

Just a few lines of code and we’ve already got a MUCH better plot, one that even shows how the relationship changes across the three Iris species. Now, if you want to run a statistical test to see whether Petal Width is really associated with Sepal Width and Iris Species, it’s pretty simple once you know a few functions:

# Model Petal.Width as a function of Sepal.Width and Species
mod <- aov(Petal.Width ~ Sepal.Width + Species, data = iris)
summary(mod) # This is an ANOVA model; summary() prints its table

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Sepal.Width   1  11.60   11.60   353.4 <2e-16 ***
## Species       2  70.17   35.09  1068.6 <2e-16 ***
## Residuals   146   4.79    0.03                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

If you want a text-based summary of what that statistical table is telling you, that’s just another short bit of reproducible code:

report::report(mod)

## The ANOVA (formula: Petal.Width ~ Sepal.Width + Species) suggests that:
## 
##   - The main effect of Sepal.Width is statistically significant and large (F(1, 146) = 353.45, p < .001; Eta2 (partial) = 0.71, 90% CI [0.65, 0.76])
##   - The main effect of Species is statistically significant and large (F(2, 146) = 1068.64, p < .001; Eta2 (partial) = 0.94, 90% CI [0.92, 0.95])
## 
## Effect sizes were labelled following Field's (2013) recommendations.
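The "Eta2 (partial)" numbers in that report aren't magic. As a rough check, and assuming the usual definition of partial eta squared (the effect's sum of squares divided by that same sum plus the residual sum of squares), you can rebuild them from the ANOVA table yourself. The names tab, ss, and partial_eta2 are my own, not something the report package provides.

```r
mod <- aov(Petal.Width ~ Sepal.Width + Species, data = iris)

# summary() on an aov model returns a list; its first element is the table
tab <- summary(mod)[[1]]
ss  <- tab[["Sum Sq"]]          # sums of squares: Sepal.Width, Species, Residuals

ss_resid  <- ss[length(ss)]     # the last row is the residual sum of squares
ss_effect <- ss[-length(ss)]    # everything else is an effect

# Partial eta squared = SS_effect / (SS_effect + SS_residual)
partial_eta2 <- ss_effect / (ss_effect + ss_resid)
round(partial_eta2, 2)
```

The two values should round to the 0.71 and 0.94 shown in the report.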

Come to class, do your assignments, and put in your practice time every day, and this will be second nature to you. Your lab reports for other classes are going to be painfully impressive, and you are going to be able to get a pay bump in whatever job you have.

Go a bit further by taking advantage of the resources on the course web page, and things are going to get even cooler, like interactive HTML plots or animations:

p <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point() +
  theme_minimal()

plotly::ggplotly(p)
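If you want to hand that interactive plot to someone else, one option is to save it as an HTML file. Here is a minimal sketch, assuming the plotly and htmlwidgets packages are installed. The file name iris_interactive.html and the temporary folder are placeholders I picked for illustration.

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Sepal.Width, y = Petal.Width, color = Species)) +
  geom_point() +
  theme_minimal()

widget <- plotly::ggplotly(p)

# Hypothetical output location; swap in any path you like
out_file <- file.path(tempdir(), "iris_interactive.html")

# selfcontained = FALSE writes the supporting JS/CSS into a folder next to
# the HTML file, so pandoc is not needed to bundle everything into one file
htmlwidgets::saveWidget(widget, out_file, selfcontained = FALSE)
```

Open the saved file in any web browser and you can hover, zoom, and toggle species just like in the live version.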