Let’s now use lasso regression to select predictors for the Fish data.
rm(list=ls()) # clean up workspace
fish.data <- read.csv("Fish.csv")
# Before fitting your model, check the Weight values to see whether they all make sense.
# You might want to discard the row with Weight = 0.
fish.data <- fish.data[fish.data[, "Weight"] > 0, ]
Now let’s divide the data into a training set (~80% of original observations) and a validation set (~20% of original observations).
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-3
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(6040)
training_samples <- createDataPartition(fish.data$Weight, p = 0.8, list = FALSE)
Let’s start from Model 8 from last lab session: log(Weight) ~ Species * (Length1 + Length2 + Length3 + Height + Width). The glmnet package takes a matrix of predictors and a vector of responses as input. We use the model.matrix function to create them. glmnet adds an intercept by default, so the intercept column produced by model.matrix is redundant; glmnet simply shrinks it to zero (it shows up as a second "(Intercept)" row in the coefficient output below). You could also drop it explicitly with [, -1] when forming the x matrix.
# X and y from original data
x_all <- model.matrix(
log(Weight) ~ Species * (Length1 + Length2 + Length3 + Height + Width),
data = fish.data)
y_all <- log(fish.data$Weight)
# training X and y
x_train <- x_all[training_samples[, 1], ]
y_train <- y_all[training_samples[, 1]]
# validation X and y
x_val <- x_all[-training_samples[, 1], ]
y_val <- y_all[-training_samples[, 1]]
set.seed(7260)
cv_lasso <- cv.glmnet(x_all, y_all, alpha = 1, type.measure = "deviance", nfolds = 5)
# The documentation of type.measure notes that "deviance" corresponds to squared error for Gaussian models.
# Note: this fits on all observations; use x_train/y_train if you want an honest validation error later.
plot(cv_lasso)
The plot displays the cross-validation error (deviance by default) against log(lambda). The left dashed vertical line marks the optimal value of log(lambda), approximately -7.8, i.e. the value that minimizes the average deviance. This lambda value will give the most accurate model. The exact value of lambda and the corresponding model can be viewed as follows:
cv_lasso$lambda.min
## [1] 0.0004156859
coef(cv_lasso, cv_lasso$lambda.min)
## 43 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 3.026683e+00
## (Intercept) .
## SpeciesParkki -9.455128e-01
## SpeciesPerch -7.465284e-01
## SpeciesPike 1.722408e-01
## SpeciesRoach -9.428091e-01
## SpeciesSmelt -2.731597e+00
## SpeciesWhitefish .
## Length1 .
## Length2 .
## Length3 5.735172e-02
## Height 5.038639e-02
## Width 6.826585e-02
## SpeciesParkki:Length1 .
## SpeciesPerch:Length1 1.144083e-05
## SpeciesPike:Length1 .
## SpeciesRoach:Length1 2.022911e-03
## SpeciesSmelt:Length1 .
## SpeciesWhitefish:Length1 .
## SpeciesParkki:Length2 .
## SpeciesPerch:Length2 2.739257e-02
## SpeciesPike:Length2 -1.780415e-03
## SpeciesRoach:Length2 .
## SpeciesSmelt:Length2 1.097543e-04
## SpeciesWhitefish:Length2 .
## SpeciesParkki:Length3 .
## SpeciesPerch:Length3 8.787732e-06
## SpeciesPike:Length3 -1.037714e-03
## SpeciesRoach:Length3 .
## SpeciesSmelt:Length3 .
## SpeciesWhitefish:Length3 .
## SpeciesParkki:Height .
## SpeciesPerch:Height .
## SpeciesPike:Height .
## SpeciesRoach:Height 2.018311e-02
## SpeciesSmelt:Height 3.708011e-01
## SpeciesWhitefish:Height .
## SpeciesParkki:Width 2.663791e-01
## SpeciesPerch:Width .
## SpeciesPike:Width -4.492663e-02
## SpeciesRoach:Width 1.805385e-01
## SpeciesSmelt:Width 2.196758e-01
## SpeciesWhitefish:Width 4.774716e-02
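Having fitted the model, we can check how well it predicts on the held-out rows. A minimal sketch of out-of-sample evaluation, continuing from the objects defined above (x_val, y_val, cv_lasso); note that because cv_lasso was fitted on all observations here, this RMSE is somewhat optimistic — refit on x_train/y_train for an honest estimate:

```r
# predict log(Weight) on the validation set at the lambda minimizing CV error
pred_val <- predict(cv_lasso, newx = x_val, s = "lambda.min")
# root-mean-squared error on the log scale
sqrt(mean((y_val - pred_val)^2))

# compare with the sparser model chosen by the one-standard-error rule
pred_1se <- predict(cv_lasso, newx = x_val, s = "lambda.1se")
sqrt(mean((y_val - pred_1se)^2))
```

lambda.1se typically zeroes out more coefficients than lambda.min at a small cost in accuracy, which is often a worthwhile trade for interpretability.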
Reference: Two-way ANOVA test in R
Let’s use R to perform a Two-way ANOVA test on the honeybee data to test what you have calculated in class.
Data preparation:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.6 ✓ dplyr 1.0.8
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x tidyr::pack() masks Matrix::pack()
## x tidyr::unpack() masks Matrix::unpack()
honey.bee <- matrix(c(20, 20, 3.1, 3.7, 4.7,
20, 40, 5.5, 6.7, 7.3,
20, 60, 7.9, 9.2, 9.3,
30, 20, 6, 6.9, 7.5,
30, 40, 11.5, 12.9, 13.4,
30, 60, 17.5, 15.8, 14.7,
40, 20, 7.7, 8.3, 9.5,
40, 40, 15.7, 14.3, 15.9,
40, 60, 19.1, 18.0, 19.9), 9, 5, byrow = TRUE)
colnames(honey.bee) <- c("Temp", "Suc", "rep.1", "rep.2", "rep.3")
honey.bee.df <- as_tibble(honey.bee) %>% # %>% is the pipe operator, read more about it here: https://r4ds.had.co.nz/pipes.html
pivot_longer(cols = c("rep.1", "rep.2", "rep.3"), names_to = NULL, values_to = "Consumption") %>%
mutate(Temp = as.factor(Temp), Suc = as.factor(Suc)) # change variable types to factor
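Before running the ANOVA, it can help to look at the cell means that the test is built on. A quick sketch, continuing from honey.bee.df above:

```r
# mean consumption for each Temp x Suc cell (3 replicates each)
honey.bee.df %>%
  group_by(Temp, Suc) %>%
  summarise(mean.consumption = mean(Consumption), .groups = "drop")
```

These nine cell means are what you compared by hand in class; the interaction term in the ANOVA asks whether the effect of sucrose concentration changes with temperature.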
Use the aov function to perform the ANOVA test. You may refer to the reference link above.
res.aov <- aov(Consumption ~ Temp * Suc, data = honey.bee.df)
summary(res.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## Temp 2 293.16 146.58 161.999 3.10e-12 ***
## Suc 2 309.96 154.98 171.283 1.93e-12 ***
## Temp:Suc 4 27.13 6.78 7.496 0.000974 ***
## Residuals 18 16.29 0.90
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Use interaction.plot to generate an interaction plot with Temp as the x factor and Suc as the trace factor. You may also use ggline from the ggpubr package to generate the same interaction plot.
library("ggpubr")
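A minimal sketch of both plots, continuing from honey.bee.df above (argument choices here are one reasonable option, not the only one):

```r
# base-R interaction plot: Temp on the x axis, one trace per Suc level
with(honey.bee.df,
     interaction.plot(x.factor = Temp, trace.factor = Suc,
                      response = Consumption, fun = mean,
                      xlab = "Temp", trace.label = "Suc",
                      ylab = "Mean consumption"))

# the same plot with ggpubr: mean per cell with standard-error bars
ggline(honey.bee.df, x = "Temp", y = "Consumption",
       color = "Suc", add = "mean_se")
```

Roughly parallel traces would suggest no interaction; here the lines fan out at higher temperatures, consistent with the significant Temp:Suc term in the ANOVA table.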
Requested by Sang-Eun (last year).
You may practice implementing the Metropolis-Hastings Algorithm and apply it to sample from some distributions.
Dr. Vladimir Minin has a great lab session on doing so.
You should first read about the lab session here.
You can then fill out the code in this R script here for sampling from the standard normal distribution and here for sampling from the distribution of the time of infection.
Check your implementation and results against the filled version here for sampling from the standard normal distribution and here for sampling from the distribution of the time of infection.
For more materials, you can visit their course site here.
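To get you started, here is a minimal sketch of a random-walk Metropolis-Hastings sampler targeting the standard normal (this is my own illustration, not the code from the linked lab; sigma_prop and n_iter are arbitrary choices):

```r
# Random-walk Metropolis-Hastings for the standard normal target.
# With a symmetric (normal) proposal, the acceptance ratio reduces to
# the ratio of target densities.
mh_normal <- function(n_iter = 20000, sigma_prop = 1, x0 = 0) {
  x <- numeric(n_iter)
  x[1] <- x0
  for (i in 2:n_iter) {
    prop <- rnorm(1, mean = x[i - 1], sd = sigma_prop)
    # log acceptance ratio: log target(prop) - log target(current)
    log_ratio <- dnorm(prop, log = TRUE) - dnorm(x[i - 1], log = TRUE)
    if (log(runif(1)) < log_ratio) {
      x[i] <- prop       # accept the proposal
    } else {
      x[i] <- x[i - 1]   # reject: stay at the current state
    }
  }
  x
}

set.seed(1)
draws <- mh_normal()
mean(draws)  # should be close to 0
sd(draws)    # should be close to 1
```

For the time-of-infection example, only the log-target changes; the accept/reject loop stays the same.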