Acknowledgement

Dr. Hua Zhou’s slides

Goal of this lab

Please try to run the code below and understand them.

GA 2000 US Presidential Election Data

The gavote data contains the voting data of Georgia (GA) in the 2000 presidential election. It is available as a dataframe.

# equivalent to head(gavote, 10)
gavote %>% head(10)
##          equip   econ perAA rural    atlanta gore  bush other votes ballots
## APPLING  LEVER   poor 0.182 rural notAtlanta 2093  3940    66  6099    6617
## ATKINSON LEVER   poor 0.230 rural notAtlanta  821  1228    22  2071    2149
## BACON    LEVER   poor 0.131 rural notAtlanta  956  2010    29  2995    3347
## BAKER    OS-CC   poor 0.476 rural notAtlanta  893   615    11  1519    1607
## BALDWIN  LEVER middle 0.359 rural notAtlanta 5893  6041   192 12126   12785
## BANKS    LEVER middle 0.024 rural notAtlanta 1220  3202   111  4533    4773
## BARROW   OS-CC middle 0.079 urban notAtlanta 3657  7925   520 12102   12522
## BARTOW   OS-PC middle 0.079 urban    Atlanta 7508 14720   552 22780   23735
## BEN.HILL OS-PC   poor 0.282 rural notAtlanta 2234  2381    46  4661    5741
## BERRIEN  OS-CC   poor 0.107 rural notAtlanta 1640  2718    52  4410    4475

We convert it into a tibble for easy handling by tidyverse.

gavote <- gavote %>% 
  as_tibble(rownames = "county") %>%
  print(width = Inf)
## # A tibble: 159 × 11
##    county   equip econ   perAA rural atlanta     gore  bush other votes ballots
##    <chr>    <fct> <fct>  <dbl> <fct> <fct>      <int> <int> <int> <int>   <int>
##  1 APPLING  LEVER poor   0.182 rural notAtlanta  2093  3940    66  6099    6617
##  2 ATKINSON LEVER poor   0.23  rural notAtlanta   821  1228    22  2071    2149
##  3 BACON    LEVER poor   0.131 rural notAtlanta   956  2010    29  2995    3347
##  4 BAKER    OS-CC poor   0.476 rural notAtlanta   893   615    11  1519    1607
##  5 BALDWIN  LEVER middle 0.359 rural notAtlanta  5893  6041   192 12126   12785
##  6 BANKS    LEVER middle 0.024 rural notAtlanta  1220  3202   111  4533    4773
##  7 BARROW   OS-CC middle 0.079 urban notAtlanta  3657  7925   520 12102   12522
##  8 BARTOW   OS-PC middle 0.079 urban Atlanta     7508 14720   552 22780   23735
##  9 BEN.HILL OS-PC poor   0.282 rural notAtlanta  2234  2381    46  4661    5741
## 10 BERRIEN  OS-CC poor   0.107 rural notAtlanta  1640  2718    52  4410    4475
## # … with 149 more rows
gavote <- gavote %>%
  mutate(undercount = (ballots - votes) / ballots) %>%
  print(width = Inf)
## # A tibble: 159 × 12
##    county   equip econ   perAA rural atlanta     gore  bush other votes ballots
##    <chr>    <fct> <fct>  <dbl> <fct> <fct>      <int> <int> <int> <int>   <int>
##  1 APPLING  LEVER poor   0.182 rural notAtlanta  2093  3940    66  6099    6617
##  2 ATKINSON LEVER poor   0.23  rural notAtlanta   821  1228    22  2071    2149
##  3 BACON    LEVER poor   0.131 rural notAtlanta   956  2010    29  2995    3347
##  4 BAKER    OS-CC poor   0.476 rural notAtlanta   893   615    11  1519    1607
##  5 BALDWIN  LEVER middle 0.359 rural notAtlanta  5893  6041   192 12126   12785
##  6 BANKS    LEVER middle 0.024 rural notAtlanta  1220  3202   111  4533    4773
##  7 BARROW   OS-CC middle 0.079 urban notAtlanta  3657  7925   520 12102   12522
##  8 BARTOW   OS-PC middle 0.079 urban Atlanta     7508 14720   552 22780   23735
##  9 BEN.HILL OS-PC poor   0.282 rural notAtlanta  2234  2381    46  4661    5741
## 10 BERRIEN  OS-CC poor   0.107 rural notAtlanta  1640  2718    52  4410    4475
##    undercount
##         <dbl>
##  1     0.0783
##  2     0.0363
##  3     0.105 
##  4     0.0548
##  5     0.0515
##  6     0.0503
##  7     0.0335
##  8     0.0402
##  9     0.188 
## 10     0.0145
## # … with 149 more rows
(gavote <- gavote %>%
  rename(usage = rural) %>%
  mutate(pergore = gore / votes, perbush = bush / votes)) %>%
  print(width = Inf)
## # A tibble: 159 × 14
##    county   equip econ   perAA usage atlanta     gore  bush other votes ballots
##    <chr>    <fct> <fct>  <dbl> <fct> <fct>      <int> <int> <int> <int>   <int>
##  1 APPLING  LEVER poor   0.182 rural notAtlanta  2093  3940    66  6099    6617
##  2 ATKINSON LEVER poor   0.23  rural notAtlanta   821  1228    22  2071    2149
##  3 BACON    LEVER poor   0.131 rural notAtlanta   956  2010    29  2995    3347
##  4 BAKER    OS-CC poor   0.476 rural notAtlanta   893   615    11  1519    1607
##  5 BALDWIN  LEVER middle 0.359 rural notAtlanta  5893  6041   192 12126   12785
##  6 BANKS    LEVER middle 0.024 rural notAtlanta  1220  3202   111  4533    4773
##  7 BARROW   OS-CC middle 0.079 urban notAtlanta  3657  7925   520 12102   12522
##  8 BARTOW   OS-PC middle 0.079 urban Atlanta     7508 14720   552 22780   23735
##  9 BEN.HILL OS-PC poor   0.282 rural notAtlanta  2234  2381    46  4661    5741
## 10 BERRIEN  OS-CC poor   0.107 rural notAtlanta  1640  2718    52  4410    4475
##    undercount pergore perbush
##         <dbl>   <dbl>   <dbl>
##  1     0.0783   0.343   0.646
##  2     0.0363   0.396   0.593
##  3     0.105    0.319   0.671
##  4     0.0548   0.588   0.405
##  5     0.0515   0.486   0.498
##  6     0.0503   0.269   0.706
##  7     0.0335   0.302   0.655
##  8     0.0402   0.330   0.646
##  9     0.188    0.479   0.511
## 10     0.0145   0.372   0.616
## # … with 149 more rows

A model with two quantitative predictors

lm

(lmod <- lm(undercount ~ pergore + perAA, gavote))
## 
## Call:
## lm(formula = undercount ~ pergore + perAA, data = gavote)
## 
## Coefficients:
## (Intercept)      pergore        perAA  
##     0.03238      0.01098      0.02853
  • The regression coefficient \(\widehat{\boldsymbol{\beta}}\) can be retrieved by
# same lmod$coefficients
coef(lmod)
## (Intercept)     pergore       perAA 
##  0.03237600  0.01097872  0.02853314
  • interpreting the partial regression coefficients:

    • Geometric interpretation: The partial regression coefficient \(\beta_j\) associated with the predictor \(x_j\) is the slope of the regression plane in the \(x_j\) direction. Imagine taking a “slice” of the regression plane. In the terminology of calculus, \(\beta_j\) is also the partial derivative of the regression plane with respect to the predictor \(x_j\).

    • Verbal interpretation: The partial regression coefficient \(\beta_j\) associated with predictor \(x_j\) is the slope of the linear association between \(y\) and \(x_j\) while controlling for the other predictors in the model (i.e., holding them constant).

  • The fitted values or predicted values are \[ \widehat{\mathbf{y}} = \mathbf{X} \widehat{\boldsymbol{\beta}} \]

# same as lmod$fitted.values
predict(lmod) %>% head()
##          1          2          3          4          5          6 
## 0.04133661 0.04329088 0.03961823 0.05241202 0.04795484 0.03601558

and the residuals are \[ \widehat{\boldsymbol{\epsilon}} = \mathbf{y} - \widehat{\mathbf{y}} = \mathbf{y} - \mathbf{X} \widehat{\boldsymbol{\beta}}. \]

# same as lmod$residuals
residuals(lmod) %>% head()
##            1            2            3            4            5            6 
##  0.036946603 -0.006994927  0.065550577  0.002348407  0.003589940  0.014267264
  • The residual sum of squares (RSS), also called deviance, is \(\|\widehat{\boldsymbol{\epsilon}}\|^2\).
deviance(lmod)
## [1] 0.09324918
  • The degree of freedom of a linear model is \(n-p\).
nrow(gavote) - length(coef(lmod))
## [1] 156
df.residual(lmod)
## [1] 156

summary

  • The summary command computes some more regression quantities.
(lmodsum <- summary(lmod))
## 
## Call:
## lm(formula = undercount ~ pergore + perAA, data = gavote)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.046013 -0.014995 -0.003539  0.011784  0.142436 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.03238    0.01276   2.537   0.0122 *
## pergore      0.01098    0.04692   0.234   0.8153  
## perAA        0.02853    0.03074   0.928   0.3547  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02445 on 156 degrees of freedom
## Multiple R-squared:  0.05309,    Adjusted R-squared:  0.04095 
## F-statistic: 4.373 on 2 and 156 DF,  p-value: 0.01419
  • An unbiased estimate of the error variance \(\sigma^2\) is \[ \widehat{\sigma} = \sqrt{\frac{\text{RSS}}{\text{df}}} \]
sqrt(deviance(lmod) / df.residual(lmod))
## [1] 0.02444895
lmodsum$sigma
## [1] 0.02444895
  • A commonly used goodness of fit mesure is \(R^2\), or coefficient of determination or percentage of variance explained \[ R^2 = 1 - \frac{\sum_i (y_i - \widehat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} = 1 - \frac{\text{RSS}}{\text{TSS}}, \] where \(\text{TSS} = \sum_i (y_i - \bar{y})^2\) is the total sum of squares.
lmodsum$r.squared
## [1] 0.05308861

An \(R^2\) of about 5% indicates the model has a poor fit. \(R^2\) can also be interpreted as the (squared) correlation between the predicted values and the response

cor(predict(lmod), gavote$undercount)^2
## [1] 0.05308861

A model with both quantitative and qualitative predictors

gavote <- gavote %>%
   mutate(cpergore = pergore - mean(pergore), cperAA = perAA - mean(perAA)) %>%
   print(width = Inf)
## # A tibble: 159 × 16
##    county   equip econ   perAA usage atlanta     gore  bush other votes ballots
##    <chr>    <fct> <fct>  <dbl> <fct> <fct>      <int> <int> <int> <int>   <int>
##  1 APPLING  LEVER poor   0.182 rural notAtlanta  2093  3940    66  6099    6617
##  2 ATKINSON LEVER poor   0.23  rural notAtlanta   821  1228    22  2071    2149
##  3 BACON    LEVER poor   0.131 rural notAtlanta   956  2010    29  2995    3347
##  4 BAKER    OS-CC poor   0.476 rural notAtlanta   893   615    11  1519    1607
##  5 BALDWIN  LEVER middle 0.359 rural notAtlanta  5893  6041   192 12126   12785
##  6 BANKS    LEVER middle 0.024 rural notAtlanta  1220  3202   111  4533    4773
##  7 BARROW   OS-CC middle 0.079 urban notAtlanta  3657  7925   520 12102   12522
##  8 BARTOW   OS-PC middle 0.079 urban Atlanta     7508 14720   552 22780   23735
##  9 BEN.HILL OS-PC poor   0.282 rural notAtlanta  2234  2381    46  4661    5741
## 10 BERRIEN  OS-CC poor   0.107 rural notAtlanta  1640  2718    52  4410    4475
##    undercount pergore perbush cpergore  cperAA
##         <dbl>   <dbl>   <dbl>    <dbl>   <dbl>
##  1     0.0783   0.343   0.646  -0.0652 -0.0610
##  2     0.0363   0.396   0.593  -0.0119 -0.0130
##  3     0.105    0.319   0.671  -0.0891 -0.112 
##  4     0.0548   0.588   0.405   0.180   0.233 
##  5     0.0515   0.486   0.498   0.0777  0.116 
##  6     0.0503   0.269   0.706  -0.139  -0.219 
##  7     0.0335   0.302   0.655  -0.106  -0.164 
##  8     0.0402   0.330   0.646  -0.0787 -0.164 
##  9     0.188    0.479   0.511   0.0710  0.0390
## 10     0.0145   0.372   0.616  -0.0364 -0.136 
## # … with 149 more rows
lmodi <- lm(undercount ~ cperAA + cpergore * usage + equip, gavote)
summary(lmodi)
## 
## Call:
## lm(formula = undercount ~ cperAA + cpergore * usage + equip, 
##     data = gavote)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.059530 -0.012904 -0.002180  0.009013  0.127496 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          0.043297   0.002839  15.253  < 2e-16 ***
## cperAA               0.028264   0.031092   0.909   0.3648    
## cpergore             0.008237   0.051156   0.161   0.8723    
## usageurban          -0.018637   0.004648  -4.009 9.56e-05 ***
## equipOS-CC           0.006482   0.004680   1.385   0.1681    
## equipOS-PC           0.015640   0.005827   2.684   0.0081 ** 
## equipPAPER          -0.009092   0.016926  -0.537   0.5920    
## equipPUNCH           0.014150   0.006783   2.086   0.0387 *  
## cpergore:usageurban -0.008799   0.038716  -0.227   0.8205    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02335 on 150 degrees of freedom
## Multiple R-squared:  0.1696, Adjusted R-squared:  0.1253 
## F-statistic: 3.829 on 8 and 150 DF,  p-value: 0.0004001

The gtsummary package offers a more sensible diplay of regression results.

library(gtsummary)
## #BlackLivesMatter
lmodi %>%
  tbl_regression() %>%
  bold_labels() %>%
  bold_p(t = 0.05)
Characteristic Beta 95% CI1 p-value
cperAA 0.03 -0.03, 0.09 0.4
cpergore 0.01 -0.09, 0.11 0.9
usage
rural
urban -0.02 -0.03, -0.01 <0.001
equip
LEVER
OS-CC 0.01 0.00, 0.02 0.2
OS-PC 0.02 0.00, 0.03 0.008
PAPER -0.01 -0.04, 0.02 0.6
PUNCH 0.01 0.00, 0.03 0.039
cpergore * usage
cpergore * urban -0.01 -0.09, 0.07 0.8
1 CI = Confidence Interval
gavote %>%
  select(cperAA, cpergore, equip, usage) %>%
  head(10)
## # A tibble: 10 × 4
##     cperAA cpergore equip usage
##      <dbl>    <dbl> <fct> <fct>
##  1 -0.0610  -0.0652 LEVER rural
##  2 -0.0130  -0.0119 LEVER rural
##  3 -0.112   -0.0891 LEVER rural
##  4  0.233    0.180  OS-CC rural
##  5  0.116    0.0777 LEVER rural
##  6 -0.219   -0.139  LEVER rural
##  7 -0.164   -0.106  OS-CC urban
##  8 -0.164   -0.0787 OS-PC urban
##  9  0.0390   0.0710 OS-PC rural
## 10 -0.136   -0.0364 OS-CC rural
model.matrix(lmodi) %>% head(10)
##    (Intercept)      cperAA    cpergore usageurban equipOS-CC equipOS-PC
## 1            1 -0.06098113 -0.06515076          0          0          0
## 2            1 -0.01298113 -0.01189493          0          0          0
## 3            1 -0.11198113 -0.08912311          0          0          0
## 4            1  0.23301887  0.17956499          0          1          0
## 5            1  0.11601887  0.07765876          0          0          0
## 6            1 -0.21898113 -0.13918434          0          0          0
## 7            1 -0.16398113 -0.10614032          1          1          0
## 8            1 -0.16398113 -0.07873442          1          0          1
## 9            1  0.03901887  0.07097452          0          0          1
## 10           1 -0.13598113 -0.03643969          0          1          0
##    equipPAPER equipPUNCH cpergore:usageurban
## 1           0          0          0.00000000
## 2           0          0          0.00000000
## 3           0          0          0.00000000
## 4           0          0          0.00000000
## 5           0          0          0.00000000
## 6           0          0          0.00000000
## 7           0          0         -0.10614032
## 8           0          0         -0.07873442
## 9           0          0          0.00000000
## 10          0          0          0.00000000

Hypothesis testing

anova(lmod, lmodi)
## Analysis of Variance Table
## 
## Model 1: undercount ~ pergore + perAA
## Model 2: undercount ~ cperAA + cpergore * usage + equip
##   Res.Df      RSS Df Sum of Sq      F   Pr(>F)   
## 1    156 0.093249                                
## 2    150 0.081775  6  0.011474 3.5077 0.002823 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lmodi, test = "F")
## Single term deletions
## 
## Model:
## undercount ~ cperAA + cpergore * usage + equip
##                Df Sum of Sq      RSS     AIC F value  Pr(>F)  
## <none>                      0.081775 -1186.1                  
## cperAA          1 0.0004505 0.082226 -1187.2  0.8264 0.36479  
## equip           4 0.0054438 0.087219 -1183.8  2.4964 0.04521 *
## cpergore:usage  1 0.0000282 0.081804 -1188.0  0.0517 0.82051  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We also see \(F\)-test for quantitative variables, e.g., cperAA, conincides with the \(t\)-test reported by the lm function. Question: why drop1 function does not drop predictors cpergore and usage?

Confidence intervals

confint(lmodi)
##                             2.5 %       97.5 %
## (Intercept)          0.0376884415  0.048906189
## cperAA              -0.0331710614  0.089699222
## cpergore            -0.0928429315  0.109316616
## usageurban          -0.0278208965 -0.009452268
## equipOS-CC          -0.0027646444  0.015729555
## equipOS-PC           0.0041252334  0.027153973
## equipPAPER          -0.0425368415  0.024352767
## equipPUNCH           0.0007477196  0.027551488
## cpergore:usageurban -0.0852990903  0.067700182

Diagnostics

plot(lmodi)

gavote %>% 
  mutate(cook = cooks.distance(lmodi)) %>%
  filter(cook >= 0.1) %>%
  print(width = Inf)
## # A tibble: 2 × 17
##   county   equip econ  perAA usage atlanta     gore  bush other votes ballots
##   <chr>    <fct> <fct> <dbl> <fct> <fct>      <int> <int> <int> <int>   <int>
## 1 BEN.HILL OS-PC poor  0.282 rural notAtlanta  2234  2381    46  4661    5741
## 2 RANDOLPH OS-PC poor  0.527 rural notAtlanta  1381  1174    14  2569    3021
##   undercount pergore perbush cpergore cperAA  cook
##        <dbl>   <dbl>   <dbl>    <dbl>  <dbl> <dbl>
## 1      0.188   0.479   0.511   0.0710 0.0390 0.234
## 2      0.150   0.538   0.457   0.129  0.284  0.138

Let’s plot a bubble plot using car package

influencePlot(lmodi)

##        StudRes        Hat      CookD
## 9    6.3305112 0.06215999 0.23413887
## 103 -0.5714242 0.51689352 0.03899306
## 120  3.8143062 0.08489464 0.13754389
## 131  0.5714242 0.51689352 0.03899306

The bubble plot combines the display of Studentized residuals, hat-values, and Cook’s distances, with the areas of the circles proportional to Cook’s \(D_i\).

Another way to generate a Q-Q plot using the car package. The default of the aaPlot() function in the car package plots Studentized residuals against the corresponding quantiles of \(t(n - p - 2)\) and generates a 95% pointwise confidence envelope for the Studentized residuals, using a parametric version of the bootstrap.

qqPlot(lmodi)
## Warning in rlm.default(x, y, weights, method = method, wt.method = wt.method, :
## 'rlm' failed to converge in 20 steps

## [1]   9 120
# this function is available from faraway package
halfnorm(hatvalues(lmodi), ylab = "Sorted leverages")

These two counties have unusually large leverages. They are actually the only counties that use paper ballot.

gavote %>%
  # mutate(hi = hatvalues(lmodi)) %>%
  # filter(hi > 0.4) %>%
  slice(c(103, 131)) %>%
  print(width = Inf)
## # A tibble: 2 × 16
##   county     equip econ  perAA usage atlanta     gore  bush other votes ballots
##   <chr>      <fct> <fct> <dbl> <fct> <fct>      <int> <int> <int> <int>   <int>
## 1 MONTGOMERY PAPER poor  0.243 rural notAtlanta  1013  1465    31  2509    2573
## 2 TALIAFERRO PAPER poor  0.596 rural notAtlanta   556   271     5   832     881
##   undercount pergore perbush cpergore    cperAA
##        <dbl>   <dbl>   <dbl>    <dbl>     <dbl>
## 1     0.0249   0.404   0.584 -0.00458 0.0000189
## 2     0.0556   0.668   0.326  0.260   0.353

Added-variable plots

# avPlots(lmodi)
# change layout
avPlots(lmodi, layout = c(4, 2))

Component-plus-residuals plot

We can generate component-plus-residual plots by crPlots() function

crPlots(lmod)  # from car package