Run your “first” R code

Open RStudio and you should see a window similar to the one below:

The big window on the left with “Console” is the R console that functions the same as the consoles you get from opening R (not RStudio). You can run R command in the console directly. Try with

2 + 3
## [1] 5

Now you see that you can use R as a calculator!

a <- 2 + 3
If you just run the above line the console, you will not see any pop-up. However, your calculation is stored in the element ‘a’. You can find it in the upper-right window of your RStudio under the Environment tab. Take a look:

And, of course you can output the value of a by

a
## [1] 5

Let’s plot some figures (the saying “a figure is worth a thousand words” should echo now). We will draw \(1000\) samples from a normal distribution with mean \(10\) and variance \(100\) (i.e., N(10, 100)).

set.seed(7260)
#Let's take a look at the PDF (probability density function) first
x <- 0:2000 * 0.05 - 50
y <- dnorm(x, mean = 10, sd = sqrt(100))
# X spans over [-50, 50] and y store the pdf value for each element in x.
# The following command generates a line plot
plot(x, y, type = "l")

# search the R Documentation for this "plot" function in the lower right window.
Now, let’s search the help documentation of the default “plot” function in R. Locate the “Help” tab in the lower-right window of your RStudio app.

# Let's draw some random variables from the distribution
normal.samples <- rnorm(length(x), 10, 10)  # avoid using sample as the variable name because it's already used for a built-in function.  The bad behavior of R (python shares this overriding problem too)
# Produce a histogram of the random samples
hist(normal.samples)

The above histogram has the default “Frecuency” as the Y axis. Let’s change it to “Density”

hist(normal.samples, probability = TRUE)

Now let’s place the line plot over the histogram.

hist(normal.samples, probability = TRUE)
lines(x, y, col="red", lwd=4)

#you may want to try the function "curve" too.  Go to its help page to find the usage.

Let’s re-plot the above figures using the ggplot2 package

library(ggplot2)
library(tibble)
# first group data into a data frame (an upgraded one, i.e., tibble)
sampled.data <- tibble(x = x, y = y, normal.samples = normal.samples)

ggplot(data = sampled.data) +
  geom_line(mapping = aes(x = x, y = y)) +
  theme_bw()

ggplot(data = sampled.data) +
  geom_histogram(mapping = aes(x = normal.samples)) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = sampled.data) +
  geom_histogram(mapping = aes(x = normal.samples, y = ..density..)) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = sampled.data) +
  geom_histogram(mapping = aes(x = normal.samples, y = ..density..)) +
  geom_line(mapping = aes(x = x, y = y), col = "red", size = 2) +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This concludes our lab 1. Please let me know if you run into any trouble.


replicate figure of Davis’s data

Davis’s data on the measured and reported weight in kilograms of 101 women who were engaged in regular exercise.

# read in data first
weights.data <- read.table("./Davis.txt")
weights.data[12, "height"] <- 166
weights.data[12, "weight"] <- 57
ggplot(data = weights.data, mapping = aes(x = reportedWeight, y = weight)) +
  geom_point(shape = 1) +
  geom_abline(slope = 1.0, intercept = 0.0, linetype = 2) + 
  geom_abline(slope = 0.97722, intercept = 1.7776) +
  coord_cartesian(xlim = c(38, 80), ylim = c(35, 80)) +
  theme_bw()
## Warning: Removed 17 rows containing missing values (geom_point).

ggsave("figure_5_1.pdf", width = 8, height = 6)
## Warning: Removed 17 rows containing missing values (geom_point).

How do we do linear regression and automatically add estimates of intercept and slope into the plot commands?

Do not worry too much about the code syntax right now, we will discuss them in the future.

# read about the documentation of lm() function
slr.mod <- lm(weight ~ reportedWeight, data = weights.data)
summary(slr.mod)
## 
## Call:
## lm(formula = weight ~ reportedWeight, data = weights.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5270 -1.0942 -0.1151  1.1133  6.3834 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.7480     0.8109   3.389 0.000862 ***
## reportedWeight   0.9582     0.0121  79.222  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.248 on 181 degrees of freedom
##   (17 observations deleted due to missingness)
## Multiple R-squared:  0.972,  Adjusted R-squared:  0.9718 
## F-statistic:  6276 on 1 and 181 DF,  p-value: < 2.2e-16
ggplot(data = weights.data, mapping = aes(x = reportedWeight, y = weight)) +
  geom_point(shape = 1) +
  geom_abline(slope = 1.0, intercept = 0.0, linetype = 2) + 
  geom_abline(slope = slr.mod$coefficients[2], intercept = slr.mod$coefficients[1]) +
  coord_cartesian(xlim = c(38, 80), ylim = c(35, 80)) +
  labs(title = "Include linear regression line from lm() function estimates") +
  theme_bw()
## Warning: Removed 17 rows containing missing values (geom_point).