Formulas in R

Reference: R for Data Science (R4DS).

R uses formulas for most modelling functions.

A simple example: y ~ x translates to \(y = \beta_0 + x\beta_1\).

Below is a summary of some common uses:

| Use | Example | Translation |
|---|---|---|
| Simple example | y ~ x | \(y = \beta_0 + x\beta_1\) |
| Exclude intercept | y ~ x - 1 or y ~ x + 0 | \(y = x\beta\) |
| Add a continuous variable | y ~ x1 + x2 | \(y = \beta_0 + x_1 \beta_1 + x_2 \beta_2\) |
| Add a categorical variable | y ~ x1 + sex | \(y = \beta_0 + x_1 \beta_1 + \mbox{sex\_male} \beta_2\) |
| Interaction (continuous and categorical) | y ~ x1 * sex | \(y = \beta_0 + x_1 \beta_1 + \mbox{sex\_male} \beta_2 + x_1 \mbox{sex\_male} \beta_{12}\) |
| Interaction (continuous and continuous) | y ~ x1 * x2 | \(y = \beta_0 + x_1 \beta_1 + x_2 \beta_2 + x_1 x_2 \beta_{12}\) |

You can specify transformations inside the model formula too (besides creating new transformed variables in your data frame).

For example, log(y) ~ sqrt(x1) + x2 is transformed to \(\log(y) = \beta_0 + \sqrt{x_1} \beta_1 + x_2 \beta_2\).

One thing to note is that if your transformation involves +, *, ^ or -, you will need to wrap it in I() so R doesn’t treat it like part of the model specification.

For example, y ~ x + I(x ^ 2) is translated to \(y = \beta_0 + x \beta_1 + x^2 \beta_2\).

And remember, R automatically drops redundant variables, so x + x becomes just x.
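One quick way to check how R translates a formula is to look at the columns of the design matrix it builds. The sketch below uses a small made-up data frame (the names df, x and y are placeholders, not part of the fish data used later):

# A toy data frame just for inspecting formula translation.
df <- data.frame(y = rnorm(5), x = 1:5)
# Without I(), the duplicated x is dropped: the columns are (Intercept) and x.
colnames(model.matrix(y ~ x + x, data = df))
# With I(), the squared term is kept as its own column: (Intercept), x, I(x^2).
colnames(model.matrix(y ~ x + I(x^2), data = df))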

Q1 Write out the translations for the formulas below:

  • y ~ x ^ 2 + x

  • y ~ x + x ^ 2 + x ^ 3

  • y ~ poly(x, 3)

  • y ~ (x1 + x2 + x3) ^ 2 + (x2 + x3) * (x4 + x5)

Model selection

Let’s put more thought into modeling the fish data and try out several formulas. From physics, we know that weight is proportional to density multiplied by volume. As in the mid-term exam, let’s include two “basic” models first: model 1 regresses Weight on the body measurements (Length1, Length2, Length3, Height, Width), and model 2 adds Species (see the code below).

Inspired by what we learned in physics, we may expect Species to be a good variable for modeling the differences in density (and probably other unintended effects) among species of fish. How do we model volume?

fish.data <- read.csv("Fish.csv")
# Before fitting your models, check the Weight values to see whether all of them make sense.
# You might want to discard the row with Weight = 0.
fish.data <- fish.data[fish.data[, "Weight"] > 0, ]
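As a quick sanity check (a sketch; it only uses columns that already appear in the code and models above), you can confirm that no non-positive weights remain and see how many observations are left:

# Weight should now be strictly positive; also check the sample size
# and how the observations split across species.
summary(fish.data$Weight)
nrow(fish.data)
table(fish.data$Species)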

Q2a Fit the above models and record their adjusted-\(R^2\) values.

Which model obtains the highest adjusted-\(R^2\) value? Do all coefficient estimates make sense? Do you get any estimates as NA? Why? (Hint: you may want to check the rank of the fitted model. Also, compare the number of parameters in the model with the number of observations.)

Finish the code below.

lmod.model.1 <- lm(Weight ~ Length1 + Length2 + Length3 + Height + Width, data = fish.data)
(summary(lmod.model.1))
## 
## Call:
## lm(formula = Weight ~ Length1 + Length2 + Length3 + Height + 
##     Width, data = fish.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -243.96  -63.57  -25.82   57.90  448.69 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -496.802     29.616 -16.775  < 2e-16 ***
## Length1       63.969     40.169   1.592  0.11335    
## Length2       -9.109     41.749  -0.218  0.82759    
## Length3      -28.119     17.343  -1.621  0.10701    
## Height        27.926      8.721   3.202  0.00166 ** 
## Width         23.412     20.355   1.150  0.25188    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 123 on 152 degrees of freedom
## Multiple R-squared:  0.8855, Adjusted R-squared:  0.8817 
## F-statistic: 235.1 on 5 and 152 DF,  p-value: < 2.2e-16
lmod.model.2 <- lm(Weight ~ Species + Length1 + Length2 + Length3 + Height + Width, data = fish.data)
(summary(lmod.model.2))
## 
## Call:
## lm(formula = Weight ~ Species + Length1 + Length2 + Length3 + 
##     Height + Width, data = fish.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -212.56  -52.52  -11.70   36.55  419.97 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -912.7110   127.4597  -7.161 3.64e-11 ***
## SpeciesParkki     160.9212    75.9591   2.119 0.035824 *  
## SpeciesPerch      133.5542   120.6083   1.107 0.269969    
## SpeciesPike      -209.0262   135.4911  -1.543 0.125061    
## SpeciesRoach      104.9243    91.4636   1.147 0.253188    
## SpeciesSmelt      442.2125   119.6944   3.695 0.000311 ***
## SpeciesWhitefish   91.5688    96.8338   0.946 0.345901    
## Length1           -79.8443    36.3322  -2.198 0.029552 *  
## Length2            81.7091    45.8395   1.783 0.076746 .  
## Length3            30.2726    29.4837   1.027 0.306233    
## Height              5.8069    13.0931   0.444 0.658057    
## Width              -0.7819    23.9477  -0.033 0.974000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 93.95 on 146 degrees of freedom
## Multiple R-squared:  0.9358, Adjusted R-squared:  0.931 
## F-statistic: 193.6 on 11 and 146 DF,  p-value: < 2.2e-16

Yes, models 4, 5, and 6 have estimates that are NA (not available). By checking the rank of the models (for example, lmod.model.4$rank), we see that these models suffer from rank deficiency (in other words, linear dependencies between the explanatory variables). At the same time, with only 158 observations in the data, these models contain so many parameters that we simply may not have enough data to estimate them all. Given the size of our data, we probably should not have considered models 4, 5, and 6 in the first place.
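For reference, this is what the rank check looks like on a fitted model. It is shown here on lmod.model.2 (which is not rank deficient); for models 4, 5, and 6 the rank would fall short of the number of coefficients, and the difference shows up as NA estimates:

# The fitted object stores the rank of the design matrix; compare it
# with the number of coefficients and count any NA estimates.
lmod.model.2$rank
length(coef(lmod.model.2))
sum(is.na(coef(lmod.model.2)))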

Q2b Use the function extractAIC to obtain the AIC score for the above models (the second element of its output). Which model has the lowest AIC score?

Finish the code below.

(lmod.model.1.AIC <- extractAIC(lmod.model.1))
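If you want to compare several fitted models at once, one possible sketch (assuming the fitted models are stored in objects named as above; only models 1 and 2 appear in this document) is:

# extractAIC() returns c(equivalent degrees of freedom, AIC);
# the second element is the AIC score used for ranking.
models <- list(model.1 = lmod.model.1, model.2 = lmod.model.2)
sapply(models, function(m) extractAIC(m)[2])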

Model 8 has the lowest AIC value.


Now let’s play with sequential methods.

Just recall…

  1. Forward selection:

    1. Determine a threshold significance level for entering predictors into the model.

    2. Begin with a model without any predictors, i.e., \(y = \beta_0 + \epsilon\)

    3. For each predictor not in the model, fit a model that adds that predictor to the working model. Record the t-statistic and p-value for the added predictor.

    4. Do any of the predictors considered in step 3 have a p-value less than the threshold specified in step 1?

      • Yes: Add the predictor with the smallest p-value to the working model and return to step 3

      • No: Stop. The working model is the final model.

  2. Backwards elimination

    1. Determine a threshold significance level for removing predictors from the model.

    2. Begin with a model that includes all predictors.

    3. Examine the t-statistic and p-value of the partial regression coefficients associated with each predictor.

    4. Do any of the predictors have a p-value greater than the threshold specified in step 1?

      • Yes: Remove the predictor with the largest p-value from the working model and return to step 3.

      • No: Stop. The working model is the final model.

  3. Stepwise selection

    Combines forward selection and backwards elimination steps (a generic step() sketch in R follows this list).
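As a minimal, generic sketch of how these strategies look in R: note that step() ranks candidate models by AIC rather than by the t-test p-values described above, and this example starts from the already-fitted lmod.model.1 rather than from the models asked about in Q2c and Q2e.

# Backward elimination: start from a fitted model and let step()
# drop terms as long as doing so lowers the AIC.
step(lmod.model.1, direction = "backward")
# Forward selection: start from the intercept-only model and give
# step() a scope describing the largest model to consider.
lmod.null <- lm(Weight ~ 1, data = fish.data)
step(lmod.null, scope = formula(lmod.model.1), direction = "forward")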

Now perform model selection on the Fish dataset.

Q2c Perform backward elimination starting from model 3, using AIC as your ranking function.

Check out the documentation of the function step in R. Model 3: Weight ~ Species * (Length1 + Length2 + Length3 + Height + Width)

Insert your code below.

Q2d What terms are in your final model?

Q2e Perform backward model selection starting from model 8:

log(Weight) ~ Species * (Length1 + Length2 + Length3 + Height + Width)

Insert your code below.

Q2f How different is your result for Q2e from that for Q2d?