Introduction

Previously, we discussed simple linear regression and multiple linear regression using ordinary least squares. We introduced diagnostic plots, \(F\) and partial \(F\) tests, \(t\) tests (Wald tests), as well as interval estimation. In this lesson, we will continue our discussion of linear regression and extend it to the functional form of a model, matrix representation and contrasts, and tools and techniques that can help us satisfy the model assumptions.

The Functional Form

The functional form of a model is the relationship between the outcome or dependent variable \(Y\) and the predictor or independent variables \(x_1, x_2, \dots, x_k\), i.e., the right-hand side of the equation \(Y=f(x_1, \dots, x_k)\), where \(f(\cdot)\) is a function. The functional form is important for satisfying assumption 1, linearity (correct model structure), and is a mathematical representation of how we believe the process we observed in our experiment or observational study works. For example, if we have the functional form \(\log(Y) = \beta_0 + \beta_1 x_1\), then it is \(\log(Y)\) that has a linear relationship with \(x_1\). If instead we have the functional form \(Y = \beta_0 + \beta_1 \log(x_1)\), then it is \(Y\) that has a linear relationship with \(\log(x_1)\).
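As a small illustration of how these two functional forms translate into code, the sketch below fits both models on simulated data (the variables x1 and Y here are made up purely for illustration):

#simulated data purely for illustration
set.seed(1)
x1 <- runif(100, min = 1, max = 10)
Y <- exp(0.5 + 0.3 * x1 + rnorm(100, sd = 0.2)) #here log(Y) is linear in x1

#log(Y) = beta0 + beta1*x1: log(Y) has a linear relationship with x1
m_logy <- lm(log(Y) ~ x1)

#Y = beta0 + beta1*log(x1): Y has a linear relationship with log(x1)
m_logx <- lm(Y ~ log(x1))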

Let the general form for a model with a quantitative response \(Y\) and quantitative predictor \(x\) take the form:

\[Y = \mu(x) + \varepsilon,\]

where \(\mu\) is the mean function and \(\varepsilon\) is the error (the difference between the observed outcome \(Y\) and the mean response \(\mu(x)\)). The mean function is \(\mu(x) = E(Y|x)\), the mean of the distribution of the responses for the population with predictor value \(x\).

Polynomial Regression Model

So far we have explored linear relationships between an outcome variable \(y\) and predictor variable \(x\). For example, when we are interested in modeling tumor thickness as a function of age, our model is \(thickness = \beta_0 + \beta_1 age + \varepsilon\). This model assumes a linear relationship between age and thickness - for every one-year increase in age, we expect a \(\hat{\beta}_1\) increase in thickness. In other words, the increase in tumor thickness is expected to be the same from age 9 to 10 as it is from age 90 to 91, which may not be scientifically appropriate. We can allow for more flexibility in the effect of age in several ways, including using polynomial regression.

We will first explore a quadratic regression model. We can assume that the mean function is a quadratic function of \(x\): \(\mu(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2\), which gives the model:

\[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon\]

The quadratic regression model above is still a linear model. This is because the unknown parameters (\(\beta_0, \beta_1, \beta_2\)) enter the formula linearly.

Example 1

We will again work with the Melanoma data set from the R package MASS

library(MASS)
data("Melanoma")

Recall, the Melanoma data set contains data collected on 205 patients in Denmark on malignant melanoma. The variables are defined as below:

  • time: survival time in days (possibly censored)
  • status: 1=died from melanoma; 2=alive; 3=dead from other causes
  • sex: 1=male; 0=female
  • age: age in years
  • year: year of operation
  • thickness: tumor thickness in \(mm\)
  • ulcer: 1=presence; 0=absence

Suppose our outcome variable of interest is thickness and our predictor of interest is age. We can visually explore the relationship between thickness and age. Using the ggplot2 and patchwork packages, we create the plots below. There are three plots: (1) the raw data, (2) an imposed linear relationship, and (3) an imposed quadratic relationship:

#load packages
library(ggplot2)
library(patchwork)

#raw data
p1 <- ggplot(data=Melanoma, aes(x=age, y=thickness)) + 
  geom_point() +
  theme_bw() +
  xlab("Thickness (mm)") +
  ylab("Age (years)") +
  ggtitle("Raw Data")

#imposed linear relationship
p2 <- ggplot(data=Melanoma, aes(x=age, y=thickness)) + 
  geom_point() +
  geom_smooth(method="lm", se = FALSE)+
  theme_bw() +
  xlab("Thickness (mm)") +
  ylab("Age (years)") +
  ggtitle("Linear Relationship")

#imposed quadratic relationship
p3 <- ggplot(data=Melanoma, aes(x=age, y=thickness)) + 
  geom_point() +
  geom_smooth(method="lm", se = FALSE,formula = "y~I(x^2)", )+
  theme_bw() +
  xlab("Thickness (mm)") +
  ylab("Age (years)") +
  ggtitle("Quadratic Relationship")


p1/p2/p3
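Beyond visualizing the quadratic relationship, we can fit it directly with lm() by wrapping the squared term in I() so that it is treated as an additional predictor. A minimal sketch (the model name m_quad is just a placeholder):

#fit the quadratic regression model: thickness = b0 + b1*age + b2*age^2
m_quad <- lm(thickness ~ age + I(age^2), data = Melanoma)
summary(m_quad)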

Splines

Splines are piecewise polynomial functions that are used in fitting curves. They are polynomials within certain intervals of a predictor variable (for example, age), and are connected at points called knots. Splines are extremely flexible at modeling curves, with the one disadvantage being that they can be challenging to interpret.

Suppose that variable \(x\) is divided into intervals with knots at points \(a\), \(b\), and \(c\). We could use a linear spline function:

\[Y = \beta_0 + \beta_1 X + \beta_2 (X -a)_+ + \beta_3(X-b)_+ + \beta_4(X-c)_+,\]

where \((u)_+ = u\) if \(u > 0\) and \((u)_+ = 0\) if \(u \leq 0\).

In addition to linear spline functions, cubic spline functions are also commonly used to model a non-linear relationship between a predictor and outcome. Cubic polynomial functions are particularly good for fitting curves. A smooth cubic spline with knots again at \(a\), \(b\), and \(c\) is given by:

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \beta_4(X-a)^3_+ + \beta_5(X-b)^3_+ + \beta_6(X-c)^3_+\]

One issue with cubic splines is that they do not behave well in the tails (before the first knot and after the last knot). For this reason, restricted cubic splines are often preferred, as they are constructed to be linear in the tails instead.

As a reminder, even though we will be using splines, these are all still linear models.
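To see why, note that each truncated term such as \((X-a)_+\) is simply another column in the design matrix, so the coefficients still enter the model linearly. As a sketch (the knot locations 40, 55, and 70 below are arbitrary choices for illustration, not recommendations), a linear spline for age could be fit by constructing those columns by hand:

#truncated power function: (u)+ = u if u > 0, and 0 otherwise
pos <- function(u) pmax(u, 0)

#linear spline for age with hypothetical knots at ages 40, 55, and 70
m_lspline_manual <- lm(thickness ~ age + pos(age - 40) + pos(age - 55) + pos(age - 70),
                       data = Melanoma)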

Example 1

We will first explore restricted1 linear splines to get a better understanding of splines. We will use the lspline package and the lspline() function. To decide on the change-points or knots, quantiles are typically used (for example, below: the 5%, 27.5%, 50%, 77.5%, and 95% quantiles), though the knots should also be driven by scientific interest.

library(lspline)
#fit the linear model with restricted linear splines for age
mlinspline <- lm(thickness ~ lspline(age, quantile(age, c(0, .05, .275, .5, .775, .95, 1),
                             include.lowest = TRUE)), data=Melanoma)
#extract fitted values from linear model
Melanoma$lspline_fitted <- fitted(mlinspline)

#plot
ggplot() +
  geom_point(data=Melanoma,aes(x=age, y=thickness)) +
  geom_line(data=Melanoma,aes(x=age, y=lspline_fitted), col="red", size=1)+
  theme_bw() +
  xlab("Thickness (mm)") +
  ylab("Age (years)") +
  ggtitle("Restricted Linear Splines")

From the plot above, we observe a linear piecewise function, with knots at the 5%, 27.5%, 50%, 77.5%, and 95% quantiles.

We can get even more flexibility by using restricted cubic splines. For restricted cubic splines, we will use the rcs() function from the rms package:

library(rms)

#fit the linear model with restricted cubic splines for age
mcubspline <- lm(thickness ~ rcs(age, quantile(age, c(0, .05, .275, .5, .775, .95, 1),
                             include.lowest = TRUE)), data=Melanoma)
#extract fitted values from linear model
Melanoma$cspline_fitted <- fitted(mcubspline)

#plot
ggplot() +
  geom_point(data=Melanoma,aes(x=age, y=thickness)) +
  geom_line(data=Melanoma,aes(x=age, y=cspline_fitted), col="red", size=1)+
  theme_bw() +
  xlab("Thickness (mm)") +
  ylab("Age (years)") +
  ggtitle("Restricted Cubic Splines")

Notice that with the restricted cubic splines, instead of piece-wise lines, we now have piece-wise cubic curves, giving the entire function a lot of flexibility as well as smoothness.

Partial \(F\) Test

Sometimes we are interested in simultaneously testing whether a set of coefficients are equal to zero or not. For example, in the model \(Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \varepsilon\), we may be interested in testing if \(\beta_2 = \beta_3 =0\). To do this, we use the partial F-Test.

The hypothesis in the example above would be:

\[H_0: \beta_2 = \beta_3 =0\] vs. 

\[H_A: \text{at least one of } \beta_2 \text{ and } \beta_3 \text{ is not equal to 0}\]
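The test statistic compares the residual sums of squares of the reduced and full models. With \(q\) coefficients tested (here \(q = 2\)), \(n\) observations, and \(p\) coefficients (including the intercept) in the full model,

\[F = \frac{\left(SS_{residual}^{reduced} - SS_{residual}^{full}\right)/q}{SS_{residual}^{full}/(n-p)},\]

which follows an \(F_{q,\, n-p}\) distribution under \(H_0\).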

Example 1

We would like to know whether using a cubic regression provides us with a better fit than using a linear model with age. To perform the partial F-test, we will create a full model (that has all of the variables we are interested in) and a reduced model (that has the variables being tested removed):

#cubic regression model (full model):
mfull <- lm(thickness ~ age + I(age^2) + I(age^3), data=Melanoma)
summary(mfull)
## 
## Call:
## lm(formula = thickness ~ age + I(age^2) + I(age^3), data = Melanoma)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2243 -1.6867 -0.9796  0.8813 14.1878 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  2.409e+00  2.378e+00   1.013    0.312
## age          1.041e-02  1.616e-01   0.064    0.949
## I(age^2)    -8.501e-04  3.420e-03  -0.249    0.804
## I(age^3)     1.361e-05  2.255e-05   0.603    0.547
## 
## Residual standard error: 2.878 on 201 degrees of freedom
## Multiple R-squared:  0.06792,    Adjusted R-squared:  0.05401 
## F-statistic: 4.882 on 3 and 201 DF,  p-value: 0.00268
#simple linear regression model (reduced model):
mreduced <- lm(thickness ~ age, data=Melanoma)
summary(mreduced)
## 
## Call:
## lm(formula = thickness ~ age, data = Melanoma)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6853 -1.7727 -0.9155  0.9558 14.0273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  0.94105    0.67004   1.404  0.16170   
## age          0.03772    0.01217   3.098  0.00222 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.899 on 203 degrees of freedom
## Multiple R-squared:  0.04515,    Adjusted R-squared:  0.04044 
## F-statistic: 9.598 on 1 and 203 DF,  p-value: 0.002223

To perform the partial F-test, we will use the anova() function, which will first take the reduced model and then the full model:

anova(mreduced, mfull)
## Analysis of Variance Table
## 
## Model 1: thickness ~ age
## Model 2: thickness ~ age + I(age^2) + I(age^3)
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    203 1706.0                              
## 2    201 1665.3  2    40.683 2.4551 0.08842 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output above, we can see that Df is 2 as expected because we are testing if \(\beta_2\) from \(age^2\) and \(\beta_3\) from \(age^3\) are equal to 0 - two parameters. The F-statistic is 2.4551 and the associated p-value is 0.08842. At the 5% level, there is no evidence against \(H_0\), that \(\beta_2 = \beta_3 = 0\). We fail to reject the null and therefore prefer the more parsimonious model, mreduced.
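We can verify the anova() output by hand from the residual sums of squares printed in the table (a quick check using the rounded values above):

#RSS of the reduced and full models, from the ANOVA table above
rss_reduced <- 1706.0
rss_full <- 1665.3

#partial F statistic: change in RSS over the 2 tested coefficients,
#divided by the full model's RSS over its residual df (201)
f_stat <- ((rss_reduced - rss_full) / 2) / (rss_full / 201)
f_stat #approximately 2.455
pf(f_stat, df1 = 2, df2 = 201, lower.tail = FALSE) #approximately 0.088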

Centering & Scaling

Centering

The intercept \(\beta_0\) is the mean outcome when all other predictors are zero. For example, from our model above, \(thickness_i = \beta_0 + \beta_1 age_i\), \(\beta_0\) is the mean thickness when age is 0. The estimated mean thickness at age 0 is 0.94 \(mm\).

This, however, is not particularly useful because having a patient of age 0 does not make sense. In order to make the intercept more interpretable, we can center \(age\) by its mean so that the intercept will be the mean thickness at the mean age.

Example 1

We first center age and create agec, which will be centered by its mean. We next rerun the same model as mreduced:

Melanoma$agec <- Melanoma$age - mean(Melanoma$age)

mreduced2 <- lm(thickness ~ agec, data=Melanoma)
summary(mreduced2)
## 
## Call:
## lm(formula = thickness ~ agec, data = Melanoma)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6853 -1.7727 -0.9155  0.9558 14.0273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.91985    0.20247  14.421  < 2e-16 ***
## agec         0.03772    0.01217   3.098  0.00222 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.899 on 203 degrees of freedom
## Multiple R-squared:  0.04515,    Adjusted R-squared:  0.04044 
## F-statistic: 9.598 on 1 and 203 DF,  p-value: 0.002223

The estimated mean thickness at the mean age of 52.46 years is 2.92 \(mm\). For every one-year increase in age, we can expect a 0.038 \(mm\) increase in tumor thickness - the slope is unchanged by centering.

Scaling

In addition to centering, we can also rescale predictors so that a one-unit change in \(x\) represents a meaningful difference. For example, if \(x\) is a distance in kilometers, you can multiply \(x\) by 1000 to instead have \(x\) in meters. If you scale a predictor by its standard deviation, then the units are in standard deviation units. This can be especially useful if the \(x\) values are very small or very large, as computations can sometimes become numerically unstable.

Example 1

We will scale age by its standard deviation. For this, we will use the age-centered variable, agec. We again rerun the same model as mreduced:

Melanoma$agec_scaled <- Melanoma$agec/sd(Melanoma$age)

mreduced3 <- lm(thickness ~ agec_scaled, data=Melanoma)
summary(mreduced3)
## 
## Call:
## lm(formula = thickness ~ agec_scaled, data = Melanoma)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6853 -1.7727 -0.9155  0.9558 14.0273 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9199     0.2025  14.421  < 2e-16 ***
## agec_scaled   0.6288     0.2030   3.098  0.00222 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.899 on 203 degrees of freedom
## Multiple R-squared:  0.04515,    Adjusted R-squared:  0.04044 
## F-statistic: 9.598 on 1 and 203 DF,  p-value: 0.002223

The estimated mean thickness at the mean age of 52.46 years is the same as in the model before (mreduced2): 2.92 \(mm\). For every one standard deviation increase in age, we can expect a 0.63 \(mm\) increase in tumor thickness.

Interactions

Interaction, also known as effect modification, means that the effect of one predictor \(x_1\) on outcome \(y\) depends on the level of another predictor \(x_2\). In all of the models we have explored so far, we have not yet considered any interactions between predictors, meaning that the effect of each predictor has thus far been assumed to be constant over the other variables in the model.

Interactions are symmetric in that the effect of \(x_1\) on \(y\) is modified by \(x_2\), and conversely, the effect of \(x_2\) on \(y\) is modified by \(x_1\). Interactions are also scale-dependent, meaning that you may have evidence of an interaction on the original scale, but none after a log transformation.
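To make this concrete, consider a model with two predictors and their interaction:

\[Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon\]

The effect of a one-unit increase in \(x_1\) is \(\beta_1 + \beta_3 x_2\), which depends on the value of \(x_2\); by symmetry, the effect of a one-unit increase in \(x_2\) is \(\beta_2 + \beta_3 x_1\).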

The figure below shows three types of interactions. Panel A) shows no interaction. The lines for the exposed and unexposed groups are parallel. The effect of age on the relative risk of disease X is constant and does not differ by exposure status. Panel B) shows an interaction. The lines for the exposed and unexposed groups are no longer parallel; if they were extended in either direction of age, they would eventually cross. The effect of age on the relative risk differs based on exposure status. Lastly, in panel C), we have the same data as in panel B), except that we have now taken the log of the response (the log of the relative risk). We can see that on the log scale there is little to no evidence of an interaction, as the lines for the exposed and unexposed groups have now become parallel.

Three types of interactions.

Example 1

When considering interactions, interaction plots are a useful tool. In this example, we will continue using the Melanoma data set and will explore an interaction between age and ulcer on thickness.

We create these plots in two ways: first using the base R function interaction.plot(), and then showing how to create the same plot using the ggplot2 package.

We want to visualize the interaction between ulcer (yes = 1, no = 0) and patient age and how they relate to tumor thickness. In the interaction.plot() function, the first two arguments are our interaction terms of interest, Melanoma$age and Melanoma$ulcer. The third argument is the outcome variable of interest, Melanoma$thickness. The final argument specifies colors 1 and 2, which correspond to black and red:

interaction.plot(Melanoma$age, Melanoma$ulcer, Melanoma$thickness, col = 1:2)

We create the same plot using ggplot2. We specify x=age, y=thickness, color=factor(ulcer), and we group by ulcer status with group=ulcer. We also use the stat_summary() function, which calculates summary statistics within ggplot. We want the mean tumor thickness, which we specify with fun.y=mean, and we add two stat_summary() layers so that we plot both the points and the connecting lines:

ggplot(data=Melanoma,  aes(x = age, y = thickness, color = factor(ulcer), group = ulcer)) +
  stat_summary(fun.y = mean, geom = "point") +
  stat_summary(fun.y = mean, geom = "line") +
  theme_bw() +
  labs(color="Ulcer Status")

From these two plots above, we can see that patients with no ulcers (0) had smaller tumor thickness, while those with ulcers (1) had thicker tumors. The effect of age on tumor thickness also appears to differ by ulcer status.

We have visually explored the data, and there is some evidence of an interaction between age and ulcer status. Scientifically, this makes sense as well, as ulcers tend to occur later in life. We next want to test the interaction. We will do this with the partial F-test, comparing a reduced model (with only age + ulcer as main effects) and a full model (with both main effects and their interaction: age + ulcer + age:ulcer).

mreduced2 <- lm(thickness ~ age + ulcer, data=Melanoma)
mfull2 <- lm(thickness ~ age*ulcer, data=Melanoma)
#the above is equivalent to: mfull2 <- lm(thickness ~ age + ulcer + age:ulcer, data=Melanoma)


#use partial F-test 
anova(mreduced2, mfull2)
## Analysis of Variance Table
## 
## Model 1: thickness ~ age + ulcer
## Model 2: thickness ~ age * ulcer
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    202 1418.9                           
## 2    201 1417.8  1    1.1521 0.1633 0.6865

From the partial F-test, we see that Df=1, which is what we expect because the only parameter that differs between the reduced and full models is the interaction term. The p-value is 0.6865. There is no evidence against the null hypothesis that the coefficient on the interaction term between age and ulcer status is equal to zero.

Satisfying Assumptions

In Linear Regression Part 1, we introduced the various assumptions that must hold. In the following section, these assumptions are further discussed as well as corrective techniques.

Correlated Errors

One of our assumptions is that the residuals follow \(N(0, \sigma^2)\) and are independent. Errors, however, can be correlated, which would violate this assumption. This can occur in many situations. Suppose you perform a study with 100 patients and draw blood to check blood lead levels for each patient twice, once in the morning and once in the evening. If you do not appropriately account for the repeated measures, your errors will likely be correlated within each person. This is because each person has their own propensity for certain lead levels at baseline; for any given person, their evening measurement is likely highly correlated with their morning measurement. Similarly, consider several measurements over the course of a year - these serial measurements are likely correlated across time. Someone’s blood lead measurement in July likely depends on what their level was in June. Next, consider that several of your patients come from a particular census block with known lead exposure - these patients may now be correlated in space.

It is difficult to check for correlated errors in a model, which is why it is important to think critically about the data, how it was obtained, who was in the study population, who was excluded from the study, and what potential biases may impact your study and how. Correlation can be adjusted for either in the model structure or in the error. If you know that the temporal correlation is due to the 4 time points of your study, you can include time as a predictor in your model. Mixed effects models can also be useful in adjusting for correlated errors. You can also model the error structure using generalized least squares or multi-level models.
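As a rough sketch of what these approaches look like in code (the data frame long_dat and its variables below are simulated stand-ins for the repeated blood lead measurements described above, not real study data):

library(nlme)

#simulated long-format data: 100 patients with 2 blood draws each (illustration only)
set.seed(1)
long_dat <- data.frame(id = rep(1:100, each = 2),
                       draw = rep(1:2, times = 100))
long_dat$lead <- 3 + 0.5*long_dat$draw + rep(rnorm(100), each = 2) + rnorm(200, sd = 0.3)

#mixed effects model with a random intercept for each patient
m_mixed <- lme(lead ~ draw, random = ~ 1 | id, data = long_dat)

#generalized least squares with within-patient correlation in the errors
m_gls <- gls(lead ~ draw, correlation = corCompSymm(form = ~ 1 | id), data = long_dat)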

Normality

The assumption of normality is particularly important for inference with small samples or for prediction of individual responses. In larger samples, however, the assumption can be relaxed, as the Central Limit Theorem ensures that the sampling distribution of the coefficient estimates will be approximately normal.

When the assumption of normality is not satisfied, there are two ways to address the issue: 1) use a transformation of the response variable or 2) consider a different model with different distributional assumptions.
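A quick visual check of this assumption is a normal QQ plot of the residuals; for example, for the age and ulcer model (mreduced2) fit earlier:

#QQ plot of the residuals from mreduced2
qqnorm(resid(mreduced2))
qqline(resid(mreduced2))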

Non-Constant Variance

Non-constant variance is also called heteroskedasticity. If there is evidence of non-constant variance, then the estimates of \(\hat{\beta}\) will no longer be efficient and the estimate of the variance of \(\hat{\beta}\) will be both biased and inconsistent. The consequences of this are that the p-values associated with the coefficient estimates will likely be estimated to be smaller than they really are. This means we run the risk of identifying a variable as “significant” when really it is not.

There are a few options when dealing with non-constant variance. The first is to use a variance-stabilizing transformation of the response variable. Three commonly-used variance-stabilizing transformations are using a square root transformation for count data, using an arcsine transformation for proportions or percentages, and using log transformations for log-normal data. Another useful transformation is the Box-Cox transformation, which lets the data determine the transformation.
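For example, the Box-Cox transformation can be explored with the boxcox() function from the MASS package (loaded earlier). It plots the profile log-likelihood over a range of candidate transformation parameters \(\lambda\), and the value of \(\lambda\) near the maximum suggests a transformation of the response:

#profile log-likelihood for the Box-Cox transformation parameter
boxcox(mreduced2, lambda = seq(-2, 2, by = 0.1))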

The second method is to calculate weights from the data and use those weights in weighted least squares.

The third option is to use the \(\hat{\beta}\) coefficients which will be unbiased, and then to correct the incorrect variances using the observed data.

Lastly, generalized linear models can be used to account for variance as a function of the mean.

Weighted Least Squares

The assumption of constant variance can formally be written as \(Var(Y|x)=\sigma^2\). However, this assumption can be relaxed as:

\[Var(Y|x_i) = Var(\varepsilon_i) = \frac{\sigma^2}{w_i},\] where \(w_1, w_2, \dots, w_n\) are known positive values, which we will call weights.

Recall that in ordinary least squares (OLS) the residual sum of squares is:

\[SS_{residual} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]

In weighted least squares (WLS), the residual sum of squares becomes:

\[SS_{residual}^{WLS} = \sum_{i=1}^n w_i(y_i - \hat{y}_i)^2\]

Intuitively, we can see that larger values of \(w_i\) will up-weight the squared difference, \((y_i - \hat{y}_i)^2\), and smaller values of \(w_i\) will down-weight it. In other words, observations with smaller variances are considered “more important”, which makes sense because their responses are measured more precisely.

The variance parameter, \(\sigma^2\), is estimated by \(\hat{\sigma}^2 = \frac{SS_{residual}^{WLS}}{n-k},\)

where \(n\) is the number of observations and \(k\) is the number of \(\beta\) coefficients being estimated including the intercept.

Residuals in WLS are calculated as: \[r_i = \sqrt{w_i}(y_i - \hat{y_i})\].

Example 1

In this example, we will continue using the model mreduced2 (mreduced2 <- lm(thickness ~ age + ulcer, data=Melanoma)). We will explore how our inference and results change when using weighted least squares instead of ordinary least squares.

We can formally test for heteroskedasticity using the Studentized Breusch-Pagan Test. To perform this test, we will need the bptest() function from the lmtest package:

library(lmtest)
bptest(mreduced2)
## 
##  studentized Breusch-Pagan test
## 
## data:  mreduced2
## BP = 6.8356, df = 2, p-value = 0.03279

The null hypothesis is homoskedasticity (constant variance). The alternative hypothesis is heteroskedasticity (non-constant variance). From the output above, we can see that the p-value is 0.03279. There is moderate evidence to reject the null in favor of the alternative.

When using WLS, determining the weights is an important step. If the responses \(Y_i\) are averages of \(n_i\) observations, then \(Var(Y_i) = Var(\varepsilon_i)= \sigma^2/n_i\), and the appropriate weights are \(w_i = n_i\). When the variances of the observed responses are known, weights can be assigned as \(w_i = 1/Var(Y_i)\).

For our example above, we want weights that are inversely proportional to the variance, and we will estimate these weights empirically. To do so, we will model the standard deviation using the relationship between the absolute residuals and the predicted (i.e., fitted) values. Alternative approaches are to model the squared residuals as a function of the fitted values, or to use grouped data to estimate the variance directly.

First, we want to visualize this relationship. To do so, we will extract the absolute residuals and the predicted (i.e., fitted) values and combine them into a data frame to make them easier to use with ggplot():

absresids <- abs(mreduced2$residuals)
fitvals <- fitted(mreduced2)
out <- cbind.data.frame(absresids=absresids, fitvals=fitvals)

ggplot(data=out, aes(x=fitvals, y=absresids)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, aes(color="Regression Model"))+
  geom_smooth(method="loess", se=FALSE, aes(color="LOESS"))+
  scale_color_manual(values = c("blue", "red"), labels=c("Regression Model", "LOESS"))+
  xlab("Predicted (Fitted) Values") +
  ylab("Abs(Residuals)") +
  labs(color="") +
  ggtitle("Fitted vs. Abs(Residuals) from mreduced2 Model")+
  theme_bw()

From the figure above, we can see that the regression model and the LOESS curve (local polynomial regression fitting, a smooth curve) fit the data similarly. The absolute values of the residuals are approximately linearly related to the fitted values. We can use this relationship to estimate \(Var(Y_i)\):

wt <- 1 / lm(abs(mreduced2$residuals) ~ mreduced2$fitted.values)$fitted.values^2 #recall, we want to square this term because we want the variance, not the standard deviation
head(wt)
##         1         2         3         4         5         6 
## 0.1537195 0.5910220 0.7968916 0.4557282 0.1954961 0.2569419

Lastly, we perform WLS by specifying the argument weights = wt in the lm() function:

#perform WLS
mwls <- lm(thickness ~ age + ulcer, data= Melanoma, weights = wt)
summary(mwls)
## 
## Call:
## lm(formula = thickness ~ age + ulcer, data = Melanoma, weights = wt)
## 
## Weighted Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7905 -0.8407 -0.4055  0.1760  8.6663 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.875388   0.432411   2.024   0.0442 *  
## age         0.018529   0.008869   2.089   0.0379 *  
## ulcer       2.406919   0.403197   5.970 1.05e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.516 on 202 degrees of freedom
## Multiple R-squared:  0.1805, Adjusted R-squared:  0.1724 
## F-statistic: 22.25 on 2 and 202 DF,  p-value: 1.856e-09

Compare these results to those obtained from OLS in mreduced2:

summary(mreduced2)
## 
## Call:
## lm(formula = thickness ~ age + ulcer, data = Melanoma)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4392 -1.3459 -0.5537  0.3555 12.7921 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.36010    0.61927   0.581   0.5616    
## age          0.02868    0.01122   2.556   0.0113 *  
## ulcer        2.40389    0.37600   6.393  1.1e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 202 degrees of freedom
## Multiple R-squared:  0.2058, Adjusted R-squared:  0.198 
## F-statistic: 26.18 on 2 and 202 DF,  p-value: 7.77e-11

We observe that the coefficient estimate of age has decreased from 0.028675 to 0.0185294. The standard error has decreased from 0.0112198 to 0.0088688, and the p-value has increased from 0.0113308 to 0.0379348. This underscores the importance of the constant variance assumption for your final inference and your findings. The coefficient estimate, standard error, and p-value of ulcer have remained more or less the same.

Robust Standard Errors

Least squares regression is optimal when the errors are normally distributed. However, if the errors are not, other methods such as robust regression may be considered. In particular, distributions with long tails can be problematic, as only a few extreme cases can have a big impact on the fit. Sometimes these extreme cases really are outliers and can be excluded if there is scientific reasoning and enough evidence that those points are indeed outlying. At other times, when there are many points that may look like outliers but are not, you want to keep them in your data set. For these cases, robust standard errors are useful.

Example 1

We can calculate robust standard errors for a model that we already have, mreduced2. We use the coeftest() function from the lmtest package, which recalculates the summary table using a different variance-covariance matrix. The vcovHC() function from the sandwich package computes the robust (heteroskedasticity-consistent) variance-covariance matrix, and the argument type specifies the type of standard errors. Below we explore several types:

library(sandwich)
coeftest(mreduced2, vcov = vcovHC(mreduced2, type="HC1")) # default in Stata
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 0.360098   0.543536  0.6625   0.50840    
## age         0.028675   0.011651  2.4611   0.01469 *  
## ulcer       2.403887   0.396302  6.0658 6.368e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mreduced2, vcov = vcovHC(mreduced2, type="HC3")) #default in R
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 0.360098   0.553659  0.6504   0.51617    
## age         0.028675   0.011845  2.4209   0.01637 *  
## ulcer       2.403887   0.399651  6.0150 8.315e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mreduced2, vcov = vcovHC(mreduced2, type="HC"))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 0.360098   0.539545  0.6674   0.50527    
## age         0.028675   0.011566  2.4793   0.01398 *  
## ulcer       2.403887   0.393392  6.1107 5.026e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mreduced2, vcov = vcovHC(mreduced2, type="HC2"))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 0.360098   0.546539  0.6589   0.51073    
## age         0.028675   0.011704  2.4500   0.01514 *  
## ulcer       2.403887   0.396504  6.0627 6.472e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mreduced2, vcov = vcovHC(mreduced2, type="HC4"))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.360098   0.554248  0.6497  0.51662    
## age         0.028675   0.011835  2.4228  0.01628 *  
## ulcer       2.403887   0.397373  6.0494 6.94e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(mreduced2, vcov = vcovHC(mreduced2, type="HC0"))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 0.360098   0.539545  0.6674   0.50527    
## age         0.028675   0.011566  2.4793   0.01398 *  
## ulcer       2.403887   0.393392  6.1107 5.026e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To calculate 95% confidence intervals, we use the coefci() function from the lmtest package:

#getting confidence intervals
coefci(mreduced2, vcov. = vcovHC(mreduced2, type = 'HC1'))
##                   2.5 %    97.5 %
## (Intercept) -0.71163421 1.4318309
## age          0.00570127 0.0516488
## ulcer        1.62246756 3.1853072

Bootstrap

Bootstrapping is a computationally-intensive method that can be used to compute confidence intervals or to perform hypothesis tests when the assumptions of normally-distributed errors or asymptotic normality of the sampling distribution do not hold. Recall from introductory statistics that when you collect data, what you have is a sample (or sometimes many samples) from a population. We use this sample data to compute summary statistics such as the mean, median, regression coefficients, etc. The distribution of any of these statistics over repeated samples is its sampling distribution. When the sample size is large, the sampling distribution of, for example, the mean is approximately normal. In some cases, however, we are limited to only one sample from the population. In such cases, the bootstrap can be used to draw many (in theory, infinitely many) resamples from that one sample, treating it as a stand-in for the population. The statistic of interest is calculated on each resample, and the distribution of the statistic across all of these bootstrap samples is called the bootstrap distribution.

Previously, when we wanted to know the value of an unknown parameter, we used its estimate. For example, since the variance \(\sigma^2\) is unknown, we use its estimate \(\hat{\sigma}^2\). The bootstrap is similar in spirit: in place of the unknown population distribution \(F\), we use the distribution of the observed sample as the “population”. This stand-in is referred to as the empirical distribution function, or EDF (often written \(\hat{F}\)).

There are three types of bootstrap procedures:

  1. Non-parametric bootstrap: uses re-sampling of the data to estimate the empirical distribution function.
  2. Parametric bootstrap: uses simulation from a fitted parametric model.
  3. Semi-parametric bootstrap: adds noise to the simulation.

In this lesson, we will only discuss the non-parametric bootstrap.

Suppose that instead of being interested in the mean tumor thickness, we are instead interested in the median tumor thickness. In order to calculate a bootstrap confidence interval for the population median, consider the following steps:

  1. Draw a random sample of observed values \(y_1^*, y_2^*, \dots, y_n^*\) from the “population” (\(\hat{F}\)). We do this by sampling with replacement from the observed values \(y_1, y_2, \dots, y_n\) - in our example, the observed tumor thickness values. Because we are sampling with replacement, some \(y_i\) may be sampled multiple times and some will not be sampled at all.
  2. Compute and save the median of the sampled \(y_1^*, y_2^*, \dots, y_n^*\) from step 1.
  3. Repeat steps 1-2 \(B\) times, where \(B\) is the number of bootstrap samples. As \(B\) increases, the precision increases.
  4. We can then calculate a \(100(1-\alpha)\%\) percentile bootstrap confidence interval, which is the interval between the sample \(\alpha/2\) and \(1-\alpha/2\) quantiles (most often the 2.5% and 97.5% quantiles). A bare-bones implementation of steps 1-4 is sketched after this list.
  • There is also an approach called the bias-corrected and accelerated method (BCa), which is considered to be more accurate2. It produces an interval based on different sample quantiles that depend on the data. The BCa method generally produces narrower intervals and is often the default in many software packages.
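Before turning to the boot package, here is a bare-bones version of steps 1-4 in base R (the choice of B = 2000 bootstrap samples is arbitrary):

set.seed(1234)
B <- 2000 #number of bootstrap samples

#steps 1-3: resample with replacement B times and save the median each time
boot_medians <- replicate(B, {
  ystar <- sample(Melanoma$thickness, replace = TRUE) #step 1
  median(ystar) #step 2
})

#step 4: 95% percentile bootstrap confidence interval
quantile(boot_medians, probs = c(0.025, 0.975))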

Example 1

First, we will compute the sample mean and the sample median using the mean() and median() functions in R:

mean(Melanoma$thickness)
## [1] 2.919854
median(Melanoma$thickness)
## [1] 1.94

Clearly, the mean and median tumor thickness are different from each other.

In order to perform the non-parametric bootstrap, we will load the boot R package using library(boot). Because the bootstrap uses random sampling with replacement, we first set the seed so that the analysis is reproducible, using set.seed(1234).

library(boot)
set.seed(1234)

To perform the bootstrap, we need to write a small function.

The function myboot() takes two arguments, dat and index; dat is the data that will be used as input, and index represents the row indices that will be randomly sampled with replacement to create each bootstrap sample. The median of tumor thickness is returned for each bootstrap sample.

myboot<- function(dat,index){
    dat2 <- dat[index,] #randomly sample with replacement
    return(median(dat2$thickness)) #return median of tumor thickness of bootstrap sample
}

We use the function boot(), which takes three arguments: the data set name (Melanoma), the function myboot, and the number of bootstrap replicates (R=2000):

(bootmedian<- boot(Melanoma, myboot, R=2000))
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = Melanoma, statistic = myboot, R = 2000)
## 
## 
## Bootstrap Statistics :
##     original    bias    std. error
## t1*     1.94 -0.032075   0.1941525

The output of bootmedian tells us that an ordinary non-parametric bootstrap was performed and returns the original function call. In the bootstrap statistics, t1* is the original median (see above), which is 1.94. The bias is the difference between the mean of the bootstrap medians and t1* \((bias = E(\hat{\theta}^*) - \hat{\theta})\), which in this case is -0.032. The standard error is the standard deviation of the bootstrap distribution of the median in our case3, which in this case is 0.194.
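These quantities can be reproduced directly from the bootstrap replicates stored in the bootmedian object (whose structure we explore next):

#bias: mean of the bootstrap medians minus the original sample median (t0)
mean(bootmedian$t) - bootmedian$t0

#standard error: standard deviation of the bootstrap medians
sd(bootmedian$t)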

We can explore the structure of the bootmedian object using the str() function:

str(bootmedian)
## List of 11
##  $ t0       : num 1.94
##  $ t        : num [1:2000, 1] 1.76 1.78 1.94 1.94 1.94 1.78 1.78 1.94 1.94 1.94 ...
##  $ R        : num 2000
##  $ data     :'data.frame':   205 obs. of  11 variables:
##   ..$ time          : int [1:205] 10 30 35 99 185 204 210 232 232 279 ...
##   ..$ status        : int [1:205] 3 3 2 3 1 1 1 3 1 1 ...
##   ..$ sex           : int [1:205] 1 1 1 0 1 1 1 0 1 0 ...
##   ..$ age           : int [1:205] 76 56 41 71 52 28 77 60 49 68 ...
##   ..$ year          : int [1:205] 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
##   ..$ thickness     : num [1:205] 6.76 0.65 1.34 2.9 12.08 ...
##   ..$ ulcer         : int [1:205] 1 0 0 0 1 1 1 1 1 1 ...
##   ..$ lspline_fitted: num [1:205] 3.72 2.47 2.8 3.65 2.37 ...
##   ..$ cspline_fitted: num [1:205] 4 2.45 2.84 3.73 2.39 ...
##   ..$ agec          : num [1:205] 23.537 3.537 -11.463 18.537 -0.463 ...
##   ..$ agec_scaled   : num [1:205] 1.4118 0.2121 -0.6876 1.1119 -0.0278 ...
##  $ seed     : int [1:626] 10403 624 -1394370482 -1723143049 2071488076 1659356893 -1081051142 885114163 -614367016 561456377 ...
##  $ statistic:function (dat, index)  
##   ..- attr(*, "srcref")= 'srcref' int [1:8] 2 10 5 1 10 1 2 5
##   .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x00000000212cfe20> 
##  $ sim      : chr "ordinary"
##  $ call     : language boot(data = Melanoma, statistic = myboot, R = 2000)
##  $ stype    : chr "i"
##  $ strata   : num [1:205] 1 1 1 1 1 1 1 1 1 1 ...
##  $ weights  : num [1:205] 0.00488 0.00488 0.00488 0.00488 0.00488 ...
##  - attr(*, "class")= chr "boot"
##  - attr(*, "boot_type")= chr "boot"

We see that bootmedian is a list of size 11. The most important elements to use are t0 (the median from the original sample) and t, which contains the median of each of the 2000 bootstrap samples.

bootmedian$t0
## [1] 1.94
head(bootmedian$t) #first 6 medians from the 2000 bootstrap samples
##      [,1]
## [1,] 1.76
## [2,] 1.78
## [3,] 1.94
## [4,] 1.94
## [5,] 1.94
## [6,] 1.78

To better visualize, we can apply the plot() function to the bootmedian object:

plot(bootmedian)

From the histogram on the left, we see that the median of most bootstrap samples was 1.94, with the smallest median just below 1.4 and the largest just above 2.6. The QQ plot on the right shows the medians of the bootstrap samples plotted against normal quantiles.

Next, we want to calculate 95% confidence intervals for our median. We will use the function boot.ci(), also from the boot package. We specify the arguments boot.out = bootmedian and type=c("perc", "bca"), which will give us confidence intervals based on the percentile method (perc) as well as the adjusted bootstrap percentile method (bca):

boot.ci(boot.out = bootmedian, type=c("perc", "bca"))
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 2000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = bootmedian, type = c("perc", "bca"))
## 
## Intervals : 
## Level     Percentile            BCa          
## 95%   ( 1.62,  2.26 )   ( 1.53,  2.10 )  
## Calculations and Intervals on Original Scale
## Some BCa intervals may be unstable

First, note that the 95% confidence interval using the percentile method is wider ((1.620, 2.260)) than the 95% confidence interval using the adjusted bootstrap percentile method ((1.527, 2.100)). The latter method, however, comes with a warning that some intervals may be unstable. This is because the accuracy of the adjustment requires a large number of bootstrap samples, which can be computationally intensive. Even though both methods are close to each other, the adjusted bootstrap percentile method is preferred.

Prediction

Regression models can also be used for prediction. If we have new observations, we can use the model developed to predict the outcome.

Example 1

Suppose we wish to use mreduced to predict the tumor thickness for three new individuals ages 47, 68, and 72. For this, we will create a new data frame called new that contains the ages of the three new individuals we would like to predict for. We will then use the predict() function to obtain the predictions:

new <- data.frame(age = c(47,68,72))
str(new)
## 'data.frame':    3 obs. of  1 variable:
##  $ age: num  47 68 72
predict(mreduced, newdata=new)
##        1        2        3 
## 2.713786 3.505859 3.656730

The predicted tumor thicknesses for individuals ages 47, 68, and 72 years old are 2.71, 3.51, and 3.66 \(mm\), respectively.

We can also obtain 95% confidence intervals that reflect the uncertainty about the mean predictions. We do this by specifying the argument interval="confidence":

predict(mreduced, newdata=new, interval = "confidence")
##        fit      lwr      upr
## 1 2.713786 2.293577 3.133995
## 2 3.505859 2.959538 4.052180
## 3 3.656730 3.040852 4.272608

For example, for the first row (associated with age 47), the 95% confidence interval is (2.29, 3.13). This means that, according to our model mreduced, the mean tumor thickness among individuals aged 47 years is estimated to be between 2.29 and 3.13 \(mm\).

We can also obtain prediction intervals about each individual value by instead specifying the argument interval="prediction":

predict(mreduced, newdata=new, interval = "prediction")
##        fit       lwr      upr
## 1 2.713786 -3.017592 8.445164
## 2 3.505859 -2.236143 9.247860
## 3 3.656730 -2.092307 9.405766

For example, for the first row (associated with age 47), the 95% prediction interval is (-3.02, 8.45). This means that, according to our model mreduced, a new individual aged 47 years is predicted to have a tumor thickness between -3.02 and 8.45 \(mm\). Obviously, a negative tumor thickness does not make sense scientifically, and indicates there are additional variables (and model assumptions) we should consider.

References

Copyright (c) 2021 Maria Kamenetsky


  1. restricted means the change-points (also called knots) will be connected, giving you a continuous piece-wise function↩︎

  2. Efron, Bradley and Tibshirani, Robert, An Introduction to the Bootstrap (1993), Chapter 14, https://www.google.com/books/edition/An_Introduction_to_the_Bootstrap/MWC1DwAAQBAJ?hl=en&gbpv=0↩︎

  3. Note: here we are interested in the median, but you can bootstrap many other statistics such as correlations, \(\beta\) coefficient estimates, etc.↩︎