Lasso Regression

Let’s now use lasso regression to select predictors for the Fish data.

rm(list=ls())  # clean up workspace
fish.data <- read.csv("Fish.csv")
# Before fitting your model, check the Weight values to make sure they all make sense.
# A fish cannot weigh 0, so discard that row.
fish.data <- fish.data[fish.data[, "Weight"] > 0, ]

Now let’s divide the data into a training set (~80% of original observations) and a validation set (~20% of original observations).

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-3
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(6040)

training_samples <- createDataPartition(fish.data$Weight, p = 0.8, list = FALSE)

Let’s start from Model 8 from the last lab session: log(Weight) ~ Species * (Length1 + Length2 + Length3 + Height + Width). The glmnet package takes a matrix of predictors and a vector of responses as input; we create them with the model.matrix function. glmnet fits its own (unpenalized) intercept by default, so the intercept column produced by model.matrix is redundant: its coefficient will simply be shrunk to zero, and you could equivalently drop it when forming the x matrix.

# X and y from original data
x_all <- model.matrix(
  log(Weight) ~ Species * (Length1 + Length2 + Length3 + Height + Width), 
  data = fish.data)
y_all <- log(fish.data$Weight)
# training X and y
x_train <- x_all[training_samples[, 1], ]
y_train <- y_all[training_samples[, 1]]
# validation X and y
x_val <- x_all[-training_samples[, 1], ]
y_val <- y_all[-training_samples[, 1]]

Now fit the lasso regression and plot the solution path:

# fill the code here
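One possible solution, assuming the training objects `x_train` and `y_train` created above:

```r
# Fit the lasso (alpha = 1) on the training data; glmnet computes the
# entire solution path over a decreasing grid of lambda values.
fit_lasso <- glmnet(x_train, y_train, alpha = 1)

# Plot the coefficient paths against log(lambda);
# each curve traces one coefficient as the penalty relaxes.
plot(fit_lasso, xvar = "lambda", label = TRUE)
```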
Now we can evaluate the performance of the models (corresponding to different \(\lambda\) values) on the validation set.
# fill your code here
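A possible sketch, assuming the lasso path was fit on the training data as in the previous step:

```r
# Lasso path fit on the training data (as in the previous step)
fit_lasso <- glmnet(x_train, y_train, alpha = 1)

# Predict on the validation set: one column of predictions per lambda
pred_val <- predict(fit_lasso, newx = x_val)

# Validation MSE for each lambda on the path
val_mse <- colMeans((pred_val - y_val)^2)

# Lambda with the smallest validation error
fit_lasso$lambda[which.min(val_mse)]

# Validation error as a function of log(lambda)
plot(log(fit_lasso$lambda), val_mse, type = "l",
     xlab = "log(lambda)", ylab = "validation MSE")
```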
Now the question is: should we base our decision on this single training–validation split? A common strategy is to use k-fold cross-validation instead.
set.seed(7260)
cv_lasso <- cv.glmnet(x_all, y_all, alpha = 1, type.measure = "deviance", nfolds = 5)
# See the documentation of the argument type.measure:
# for gaussian models, "deviance" corresponds to the squared error.
plot(cv_lasso)

The plot displays the cross-validation error (deviance by default) as a function of log(lambda). The left dashed vertical line marks the optimal value of log(lambda), approximately -7.8, which minimizes the average deviance; this lambda value gives the most accurate model. The exact value of lambda and the corresponding model can be viewed as follows:

cv_lasso$lambda.min
## [1] 0.0004156859
coef(cv_lasso, cv_lasso$lambda.min)
## 43 x 1 sparse Matrix of class "dgCMatrix"
##                                     s1
## (Intercept)               3.026683e+00
## (Intercept)               .           
## SpeciesParkki            -9.455128e-01
## SpeciesPerch             -7.465284e-01
## SpeciesPike               1.722408e-01
## SpeciesRoach             -9.428091e-01
## SpeciesSmelt             -2.731597e+00
## SpeciesWhitefish          .           
## Length1                   .           
## Length2                   .           
## Length3                   5.735172e-02
## Height                    5.038639e-02
## Width                     6.826585e-02
## SpeciesParkki:Length1     .           
## SpeciesPerch:Length1      1.144083e-05
## SpeciesPike:Length1       .           
## SpeciesRoach:Length1      2.022911e-03
## SpeciesSmelt:Length1      .           
## SpeciesWhitefish:Length1  .           
## SpeciesParkki:Length2     .           
## SpeciesPerch:Length2      2.739257e-02
## SpeciesPike:Length2      -1.780415e-03
## SpeciesRoach:Length2      .           
## SpeciesSmelt:Length2      1.097543e-04
## SpeciesWhitefish:Length2  .           
## SpeciesParkki:Length3     .           
## SpeciesPerch:Length3      8.787732e-06
## SpeciesPike:Length3      -1.037714e-03
## SpeciesRoach:Length3      .           
## SpeciesSmelt:Length3      .           
## SpeciesWhitefish:Length3  .           
## SpeciesParkki:Height      .           
## SpeciesPerch:Height       .           
## SpeciesPike:Height        .           
## SpeciesRoach:Height       2.018311e-02
## SpeciesSmelt:Height       3.708011e-01
## SpeciesWhitefish:Height   .           
## SpeciesParkki:Width       2.663791e-01
## SpeciesPerch:Width        .           
## SpeciesPike:Width        -4.492663e-02
## SpeciesRoach:Width        1.805385e-01
## SpeciesSmelt:Width        2.196758e-01
## SpeciesWhitefish:Width    4.774716e-02

What is the difference between the above two procedures?


ANOVA

Reference: Two-way ANOVA test in R

Let’s use R to perform a two-way ANOVA test on the honeybee data and check what you calculated in class.

Data preparation:

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.6     ✓ dplyr   1.0.8
## ✓ tidyr   1.2.0     ✓ stringr 1.4.0
## ✓ readr   2.1.2     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
## x tidyr::pack()   masks Matrix::pack()
## x tidyr::unpack() masks Matrix::unpack()
honey.bee <- matrix(c(20, 20, 3.1, 3.7, 4.7,
        20, 40, 5.5, 6.7, 7.3,
        20, 60, 7.9, 9.2, 9.3,
        30, 20, 6, 6.9, 7.5,
        30, 40, 11.5, 12.9, 13.4,
        30, 60, 17.5, 15.8, 14.7,
        40, 20, 7.7, 8.3, 9.5,
        40, 40, 15.7, 14.3, 15.9,
        40, 60, 19.1, 18.0, 19.9), 9, 5, byrow = TRUE)
colnames(honey.bee) <- c("Temp", "Suc", "rep.1", "rep.2", "rep.3")

honey.bee.df <- as_tibble(honey.bee) %>%  # %>% is the pipe operator, read more about it here: https://r4ds.had.co.nz/pipes.html
  pivot_longer(cols = c("rep.1", "rep.2", "rep.3"), names_to = NULL, values_to = "Consumption") %>%
  mutate(Temp = as.factor(Temp), Suc = as.factor(Suc))  # change variable types to factor
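With the data in long format, the two-way ANOVA itself can be fit with the base-R `aov` function, for example:

```r
# Two-way ANOVA with interaction: Consumption ~ Temp + Suc + Temp:Suc
honey.aov <- aov(Consumption ~ Temp * Suc, data = honey.bee.df)
summary(honey.aov)  # F tests for both main effects and the interaction
```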

Use the function interaction.plot to generate an interaction plot with Temp as the x factor and Suc as the trace factor.
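One way to do this with base R (axis labels are illustrative choices):

```r
# x.factor = Temp, trace.factor = Suc, response = Consumption;
# fun = mean plots the cell means for each Temp-Suc combination.
with(honey.bee.df,
     interaction.plot(x.factor = Temp, trace.factor = Suc,
                      response = Consumption, fun = mean,
                      xlab = "Temperature", ylab = "Mean consumption",
                      trace.label = "Sucrose"))
```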

Use the function ggline from the ggpubr package to generate the same interaction plot.

library("ggpubr")
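A possible sketch using ggline, plotting the mean consumption per group with standard-error bars:

```r
# x and y are column names; color = "Suc" draws one line per sucrose level,
# and add = "mean_se" plots group means with standard-error bars.
ggline(honey.bee.df, x = "Temp", y = "Consumption",
       color = "Suc", add = "mean_se")
```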

[Optional practice] Implementing the Metropolis-Hastings algorithm

Requested by Sang-Eun (last year).

You may practice implementing the Metropolis-Hastings algorithm and applying it to sample from some distributions.

Dr. Vladimir Minin has a great lab session on doing so.
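As a starting point, here is a minimal random-walk Metropolis-Hastings sampler; the standard-normal target and the proposal scale are illustrative choices, not part of the lab:

```r
# Log of the (unnormalized) target density: standard normal
log_target <- function(x) -x^2 / 2

# Random-walk Metropolis-Hastings with a symmetric normal proposal,
# so the acceptance ratio reduces to the ratio of target densities.
mh_sample <- function(n, init = 0, prop_sd = 1) {
  chain <- numeric(n)
  chain[1] <- init
  for (i in 2:n) {
    prop <- chain[i - 1] + rnorm(1, sd = prop_sd)   # propose a move
    log_alpha <- log_target(prop) - log_target(chain[i - 1])
    if (log(runif(1)) < log_alpha) {
      chain[i] <- prop          # accept the proposal
    } else {
      chain[i] <- chain[i - 1]  # reject: stay at the current state
    }
  }
  chain
}

set.seed(1)
draws <- mh_sample(10000)
mean(draws); sd(draws)  # should be close to 0 and 1 for this target
```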