PoD in Public Health: Linear Regression Part I
We have previously seen methods for comparing the means of a normally-distributed outcome variable between two populations using \(t\) tests. We had a continuous outcome variable \(y\) and a categorical predictor variable \(x\). For example, from the Melanoma data set we looked at thickness (\(y\)) and sex (\(x\)). In this section, we will see that this was a special case of linear regression, and we will see how to relate an outcome variable \(y\) (often called the dependent variable) to one or more predictor variables \(x_1,\dots, x_k\) (often called the independent variables), where the \(x\)'s can be continuous or categorical.
The simplest relationship between a mean function \(\mu(x)\) and continuous (or quantitative) predictor variable \(x\) is that of a simple linear regression.
\[\mu(x) = \beta_0 + \beta_1 x\]
Here, the intercept (\(\beta_0\)) is the mean value of the outcome when the predictor is equal to zero (\(x=0\)), or \(\beta_0 = \mu(0)\). The slope (\(\beta_1\)) is the difference in mean outcomes between populations differing by one unit in the predictor: \(\beta_1 = \mu(x+1) - \mu(x)\). Here, we assume a linear relationship between \(x\) and \(y\): for every one unit increase in \(x\), we can expect a \(\beta_1\) increase in \(\mu(x)\). If this relationship is in fact linear, then the simple regression model will be efficient because it allows us to pool information from all values of \(x\).
We model the relationship above using a simple linear regression model, which takes the form:
\[y_i = \beta_0 + \beta_1x_i + \varepsilon_i\]
where \(\varepsilon_i\) is the error for each individual observation \(i\).
In the figures above, the line of best fit (or regression line) is the line that minimizes the vertical distances between each of the points and the line.
Continuing from the figure above, let the vertical distance from each point \((x_i, y_i)\) to the red line of best fit be denoted \(d_i\). Let \((x_i, \hat{y}_i) = (x_i, \beta_0 + \beta_1x_i)\) be the corresponding point on the estimated regression line (the line of best fit). The distance between the point and the line is then \(d_i = y_i - \hat{y}_i = y_i - (\beta_0 + \beta_1x_i)\), and we want these distances to be small overall. Let the sum of the absolute distances (or deviations) of the sample points from the line be \(S = \sum_{i=1}^n |d_i|\). In order to avoid negative numbers, and for some theoretical reasons, we prefer to square the deviations, so that we finally obtain:
\[S = \sum_{i=1}^n d_i^2 = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\]
This is known as the method of least squares.
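To see the method of least squares in action, here is a minimal sketch using a small made-up data set of our own (x_demo and y_demo are not from these notes). It minimizes \(S\) numerically with optim() and compares the result to the closed-form estimates from lm():
set.seed(1)
x_demo <- runif(50, 0, 10)
y_demo <- 2 + 0.5 * x_demo + rnorm(50, sd = 1)
# the least squares criterion S as a function of beta = (beta0, beta1)
S <- function(beta) sum((y_demo - beta[1] - beta[2] * x_demo)^2)
# minimize S numerically, starting from (0, 0)
optim(par = c(0, 0), fn = S)$par
# closed-form least squares estimates for comparison
coef(lm(y_demo ~ x_demo))
The two sets of estimates should agree (up to the tolerance of the numerical optimizer).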
Let the sample covariance between predictor \(x\) and outcome \(y\) be \(s_{xy} = \frac{1}{n-1}\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})\). Let the sample variance of predictor \(x\) be \(s_x^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\). Using this information, we can obtain the least squares estimate of \(\beta_1\) (the slope):
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_x^2}\]
It is important to note that there are no assumptions about the distributional form of the predictor \(x\). \(\hat{\beta}_1\) is an unbiased estimator of \(\beta_1\). If certain assumptions hold (namely independence and constant variance; see below), then this estimator will be the most efficient as well.
The least squares estimate of the intercept \(\beta_0\) is \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
The least squares regression line must pass through the point \((\bar{x}, \bar{y})\).
The estimated error variance is given by:
\[\hat{\sigma}^2 = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-2} = \frac{\sum_{i=1}^n \big\{y_i - \big(\hat{\beta}_0 + \hat{\beta}_1x_i \big) \big\}^2}{n-2}\]
Let us go back to the Melanoma data set from the R package MASS:
library(MASS)
data("Melanoma")
Recall, the Melanoma data set contains data collected on 205 patients in Denmark with malignant melanoma. The variables are defined as below:

- time: survival time in days (possibly censored)
- status: 1 = died from melanoma; 2 = alive; 3 = dead from other causes
- sex: 1 = male; 0 = female
- age: age in years
- year: year of operation
- thickness: tumor thickness in \(mm\)
- ulcer: 1 = presence; 0 = absence

Suppose our outcome variable of interest is thickness and our predictor of interest is age. To fit a linear regression model, we will use the lm() function in R and use summary() to obtain the model summary output:
model1 <- lm(thickness ~ age, data=Melanoma)
summary(model1)
##
## Call:
## lm(formula = thickness ~ age, data = Melanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6853 -1.7727 -0.9155 0.9558 14.0273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.94105 0.67004 1.404 0.16170
## age 0.03772 0.01217 3.098 0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.899 on 203 degrees of freedom
## Multiple R-squared: 0.04515, Adjusted R-squared: 0.04044
## F-statistic: 9.598 on 1 and 203 DF, p-value: 0.002223
The estimate for \(\beta_0\) is 0.9410511 and the estimate for \(\beta_1\) is 0.0377178. The general interpretation of a regression coefficient is: for every one unit increase in \(x\), we can expect a ____ increase in \(y\). From our model above, this is interpreted as: for every one year increase in age, we can expect a 0.038 \(mm\) increase in tumor thickness.
We can also obtain the coefficient estimates with the coef() function:
coef(model1)
## (Intercept) age
## 0.94105107 0.03771776
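As a quick check (not part of the original output, and with variable names of our own), these estimates can be reproduced directly from the formulas above using the sample covariance and sample variance:
# slope: sample covariance divided by sample variance of the predictor
(b1 <- cov(Melanoma$age, Melanoma$thickness) / var(Melanoma$age))
# intercept: ybar - b1 * xbar
(b0 <- mean(Melanoma$thickness) - b1 * mean(Melanoma$age))
Both values should agree with the coefficients reported by coef(model1).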
The standard error of the estimated regression slope (\(SE(\hat{\beta}_1)\)) is:
\[SE(\hat{\beta}_1) = \sqrt{\frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}} = \sqrt{\frac{\sigma^2}{s_x^2(n-1)}}\]
From the equation above, we can observe that the standard error of the slope depends on the variance around the regression line (\(\sigma^2\)), the variance of predictor \(x\) (\(s_x^2\)), and the sample size (\(n\)).
The standard error of the estimated regression intercept (\(SE(\hat{\beta}_0)\)) is:
\[SE(\hat{\beta}_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}\]
When estimating the standard errors, \(\sigma^2\) is replaced with its estimator, \(\hat{\sigma}^2\).
We can extract the standard error estimates from model1 using both the coef() function and the summary() function, selecting the column called Std. Error:
coef(summary(model1))[, "Std. Error"]
## (Intercept) age
## 0.67003580 0.01217442
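As another sanity check of our own, the standard errors can be reconstructed from the formulas above by plugging \(\hat{\sigma}\) in for \(\sigma\); the results should agree with the Std. Error column shown above (variable names here are ours):
n <- nrow(Melanoma)
# estimated error standard deviation (the "residual standard error")
sigma_hat <- sqrt(sum(residuals(model1)^2) / (n - 2))
# plug sigma_hat into the SE formulas above
sxx <- sum((Melanoma$age - mean(Melanoma$age))^2)
se_b1 <- sqrt(sigma_hat^2 / sxx)
se_b0 <- sigma_hat * sqrt(1 / n + mean(Melanoma$age)^2 / sxx)
c(intercept = se_b0, age = se_b1)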
For more about indexing data frames in R, check out this helpful link.
There are 4 key assumptions that must hold:

1. Linearity: the model structure is correct, i.e., the mean of \(y\) is a linear function of the predictor(s).
2. Independence: the observations (and hence the errors) are independent of one another.
3. Constant variance: the errors have the same variance \(\sigma^2\) at all values of the predictor(s).
4. Normality: the errors are normally distributed.
As stated above, if the independence and constant variance assumptions hold, the estimator \(\hat{\beta}_1\) will be the most efficient estimator of \(\beta_1\). If these assumptions do not hold, \(\hat{\beta}_1\) will still be an unbiased estimator of \(\beta_1\), but it is possible that there are better estimators that may address the violation of the assumptions. If the errors are also normally distributed, then \(\hat{\beta}_1\) is also the maximum likelihood estimator of \(\beta_1\).
In order to build understanding of what to look for, we will first look at an example using a simulated data set. There are 4 main diagnostic plots we want to focus on: 1) the residuals vs. fitted plot, 2) the standardized residuals vs. fitted plot, 3) the scale-location plot, and 4) the QQ-plot. The first three plots will help us diagnose assumptions (1) and (3), and the QQ-plot will help us diagnose assumption (4). To diagnose assumption (2), we first start by learning about the study design. For example, was each patient tested multiple times? If a patient was tested multiple times, then the observations will not all be independent. Were patients identified from multiple clinics? If so, then there is some dependence among patients from Clinic A that makes them different from the patients in Clinic B. Were patients residents of a common geographic area? In this case there may be spatial dependence. Were blood samples taken on each patient sequentially every \(x\) days? In this case there may be temporal dependence. For each of these scenarios, there are ways to diagnose and address the dependence, whether it be in the model structure or in the error structure.
set.seed(123)
x <- rnorm(n=1000, mean=400, sd=25)
y <- rnorm(n=1000, mean=25, sd=5)
simdat <- cbind.data.frame(x, y)
For this example, we will use simdat, which is created above. First, let's explore the relationship between \(x\) and \(y\) using summary statistics and visualization with the ggplot2 package:
library(ggplot2)
summary(simdat)
## x y
## Min. :329.8 Min. : 9.761
## 1st Qu.:384.3 1st Qu.:21.734
## Median :400.2 Median :25.274
## Mean :400.4 Mean :25.212
## 3rd Qu.:416.6 3rd Qu.:28.767
## Max. :481.0 Max. :41.952
ggplot(data=simdat, aes(x=x, y=y)) +
  geom_point() +
  geom_smooth(method="lm") +
  theme_bw()
From our exploratory data analysis above, we can see that there appears to be a weak positive relationship between \(x\) and \(y\).
Next, we fit our simple linear regression model:
model0 <- lm(y ~ x, data=simdat)
summary(model0)
##
## Call:
## lm(formula = y ~ x, data = simdat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.1394 -3.4572 0.0213 3.5436 16.4557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.161443 2.576116 7.050 3.34e-12 ***
## x 0.017609 0.006422 2.742 0.00621 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.032 on 998 degrees of freedom
## Multiple R-squared: 0.007479, Adjusted R-squared: 0.006484
## F-statistic: 7.52 on 1 and 998 DF, p-value: 0.006211
Next, we can obtain the residuals vs. fitted, QQ-plot, and scale-location diagnostic plots using the plot() function and specifying which=c(1:3). To get the standardized residuals vs. fitted plot, we will use the stdres() function, also from the MASS package, and will create the plot manually using the base R plot() function:
# standardized residuals from the fitted model
stdresids <- stdres(model0)
plot(x=fitted(model0),
     y=stdresids,
     main="Standardized Residuals vs. Fitted",
     ylim=c(-3.5,3.5),
     xlab="Fitted Values",
     ylab="Standardized Residuals")
# reference lines at 0 and +/- 3 standard deviation units
abline(h=c(-3,0,3), col=c("red","black","red"), lty=2)
# built-in diagnostics: residuals vs. fitted, QQ-plot, scale-location
plot(model0, which=c(1:3))
From the first plot (standardized residuals vs. fitted), we are looking for a nice random pattern around the 0 line on the y-axis. One of our assumptions is that the residuals follow \(N(0, \sigma^2)\). The residuals should not have any information in them and should be random noise. Hence, we expect these residuals to be randomly distributed along the zero line. We draw red horizontal lines at +/- 3. We do this because one rule of thumb that may indicate a potential outlier is if a standardized residual falls above 3 or below -3 standard deviation units from 0. From our plot, there may be a couple of points worth further investigation, but nothing notable.
From the second plot (residuals vs. fitted), we have almost an identical plot, but we are looking at un-standardized residuals. From this plot (and the previous one), we want to look for any patterns that may indicate the model structure is not correct (assumption (1)) or evidence of non-constant variance (assumption (3)). This may look like a funnel shape or a cone shape.
From the third plot (QQ-plot), we check the normality assumption of the residuals. We expect the residuals to line up along the dashed 45-degree reference line. Curvature in the QQ-plot may indicate skewness or heavy/light tails. Check out this interactive link that demonstrates how QQ-plots connect to the data.
From the fourth plot (scale-location plot), we can determine whether or not we have non-constant variance. This plot should also not exhibit any pattern.
Overall, this model seems to satisfy all of our assumptions.
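For contrast, here is a small simulated example of our own (not from the original notes) of what a violation of the constant variance assumption can look like. The residuals vs. fitted plot for this model should show the funnel/cone shape described above:
# simulate data whose error standard deviation grows with x (non-constant variance)
set.seed(456)
x_h <- runif(500, 1, 10)
y_h <- 2 + 3 * x_h + rnorm(500, mean = 0, sd = x_h)
bad_model <- lm(y_h ~ x_h)
# residuals vs. fitted: the spread of residuals widens as fitted values increase
plot(bad_model, which = 1)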
Next, we explore the diagnostic plots for model1 (lm(thickness ~ age, data=Melanoma)).
stdresids <- stdres(model1)
plot(x=fitted(model1),
     y=stdresids,
     main="Standardized Residuals vs. Fitted",
     ylim=c(-3.5,3.5),
     xlab="Fitted Values",
     ylab="Standardized Residuals")
abline(h=c(-3,0,3), col=c("red","black","red"), lty=2)
plot(model1, which=c(1:3))
From the first plot (standardized residuals vs. fitted), we already see that the residuals are not evenly distributed around the 0 line. There are more observations below the zero line than above it, and there are several points above our cutoff of +3.
From the second plot (residuals vs. fitted), we see that there may be evidence of a funnel shape. As the fitted values are increasing, the y-values (residuals) are spreading out further.
From the third plot (QQ-Plot), we see that the residuals do not nicely line up along the dashed 45 degree line. There is a curve that indicates right skew.
From the fourth plot (scale-location plot), while not too bad, we do see that as the fitted values increase, the y-values (the square root of the absolute standardized residuals) also increase slightly, mostly between 2.5 and 3.5 on the x-axis.
For any point \((x_i, y_i)\), we can define the residual to be \(r_i = y_i - \hat{y}_i\). We can also consider the total variation in the outcome by squaring the deviations of each observation from the mean (\(y_i - \bar{y}\)) and summing them over all of the points. We can then decompose this total sum of squares into regression and residual components.
The total sum of squares (\(SS_{total}\)) is the sum of squares of the deviations of the individual sample points from the sample mean:
\[SS_{total} = \sum_{i=1}^n (y_i - \bar{y})^2\]
The regression sum of squares (\(SS_{regression}\)) is the sum of squares of the regression components:
\[SS_{regression} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\]
The residual sum of squares (\(SS_{residual}\)) is the sum of squares of the residual components:
\[SS_{residual} = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
It can be shown that \(SS_{total}\) can be decomposed into \(SS_{regression}\) and \(SS_{residual}\), or
\[\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
or
\[SS_{Total} = SS_{regression} + SS_{residual}\].
From these relationships above, we can define the regression mean square (or \(MS_{regression}\)), where
\[MS_{regression} = SS_{regression}/k,\]
where \(k\) is the regression degrees of freedom, i.e., the number of predictor variables in the model (not including the intercept). For our example above with only a single predictor \(x\), \(k=1\).
We also have the residual mean square (or \(MS_{residual}\)), where
\[MS_{residual} = SS_{residual}/(n-k-1)\]
In our simple linear regression, we can test the hypothesis:
\[H_0: \beta_1=0\]
\[H_A: \beta_1 \neq 0\]
We can compute the test statistic:
\[F = MS_{regression}/MS_{residual}\]
which will follow an \(F\) distribution with \(1\) and \(n-2\) degrees of freedom (\(F _{1, n-2}\)).
If \(F\) > \(F_{1,n-2, 1-\alpha}\) then we reject \(H_0\). If \(F \leq F_{1,n-2, 1-\alpha}\), then we fail to reject \(H_0\).
From model1, we call again the summary():
summary(model1)
##
## Call:
## lm(formula = thickness ~ age, data = Melanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6853 -1.7727 -0.9155 0.9558 14.0273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.94105 0.67004 1.404 0.16170
## age 0.03772 0.01217 3.098 0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.899 on 203 degrees of freedom
## Multiple R-squared: 0.04515, Adjusted R-squared: 0.04044
## F-statistic: 9.598 on 1 and 203 DF, p-value: 0.002223
From the summary, we are given that the \(F\)-statistic is 9.598 on 1 and 203 degrees of freedom, where 203 comes from the number of observations (205) minus 2. The p-value is 0.002223. We reject the null hypothesis that \(\beta_1 = 0\); there is strong evidence against the null. Notice: we have one predictor, age. The p-value associated with age (0.00222) is the same as the p-value from the \(F\)-test. This will not be true once we have multiple predictors.
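As a check of our own (the variable names below are ours), the sums of squares, mean squares, and \(F\)-statistic for model1 can be computed directly from the definitions above; the results should reproduce the \(F\)-statistic and p-value reported in the summary:
# sums of squares for model1, following the definitions above
ss_total      <- sum((Melanoma$thickness - mean(Melanoma$thickness))^2)
ss_residual   <- sum(residuals(model1)^2)
ss_regression <- ss_total - ss_residual
# mean squares and the F statistic (k = 1 predictor)
ms_regression <- ss_regression / 1
ms_residual   <- ss_residual / (nrow(Melanoma) - 2)
(Fstat <- ms_regression / ms_residual)
# p-value from the F distribution with 1 and n - 2 degrees of freedom
pf(Fstat, df1 = 1, df2 = nrow(Melanoma) - 2, lower.tail = FALSE)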
We can also test the hypothesis: \[H_0: \beta_1=0\]
\[H_A: \beta_1 \neq 0\]
using a method based on the \(t\)-test. The test statistic is:
\[t = \hat{\beta}_1/SE(\hat{\beta}_1)\]
For a two-sided test with significance level \(\alpha\), if \(t > t_{n-2, 1-\alpha/2}\) or \(t < t_{n-2, \alpha/2}\), then we reject \(H_0\). If \(t_{n-2, \alpha/2} \leq t \leq t_{n-2, 1-\alpha/2}\), then we fail to reject \(H_0\).
We continue with model1 and start by exploring the summary() of the model.
summary(model1)
##
## Call:
## lm(formula = thickness ~ age, data = Melanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6853 -1.7727 -0.9155 0.9558 14.0273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.94105 0.67004 1.404 0.16170
## age 0.03772 0.01217 3.098 0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.899 on 203 degrees of freedom
## Multiple R-squared: 0.04515, Adjusted R-squared: 0.04044
## F-statistic: 9.598 on 1 and 203 DF, p-value: 0.002223
From the summary output above, the estimate of \(\beta_0\) is 0.9410511 and the estimate of \(\beta_1\) is 0.0377178. The estimated standard error of \(\beta_1\) is 0.0121744. We can confirm the reported t-value (3.0981151) and p-value (0.0022234) manually:
(tval <- coef(summary(model1))[, "Estimate"][2]/coef(summary(model1))[, "Std. Error"][2])
## age
## 3.098115
(degfreedom <- nrow(Melanoma)-2)
## [1] 203
(pval <- 2*(1-pt(tval, df=degfreedom)))
## age
## 0.002223382
The p-value we have manually calculated here is identical to the p-value reported in the model summary.
The coefficient of determination (\(R^2\)) is defined as:
\[R^2 = \frac{SS_{regression}}{SS_{total}} = 1 - \frac{SS_{residual}}{SS_{total}}\]
\(R^2\) can be interpreted as the proportion of total variation in \(y\) that is explained by the predictor \(x\). If \(R^2=1\), then all of the variation in \(y\) can be explained by the variation in \(x\): all of the data points will fall along the regression line. If \(R^2 = 0\), then \(x\) gives no information on \(y\).
However, \(SS_{regression}\) depends on not only the relationship between \(x\) and \(y\), but also on the variation in predictor \(x\). For a simple linear regression model, \(SS_{regression} = \hat{\beta_1}^2(n-1)s^2_x\). Therefore if you have a lot of variation in \(x\), you can have a large \(R^2\) value that really reflects the variation in \(x\) more so than the proportion of total variation in \(y\) explained by \(x\).
Most statistical software will also report the adjusted \(R^2\):
\[R^2_{adjusted} = 1- \frac{MS_{residual}}{MS_{total}} = 1- \frac{ SS_{residual}/(n-k-1)}{SS_{total}/(n-1)}\]
The adjusted \(R^2\) replaces the sum of squares with the mean sum of squares.
For simple linear regression, the adjusted \(R^2\) is \(R^2_{adjusted} = 1 - \frac{SS_{residual}/(n-2)}{SS_{total}/(n-1)}\).
If you are to report any \(R^2\) values, you should use the adjusted \(R^2\).
We continue with model1 and start by exploring the summary() of the model.
summary(model1)
##
## Call:
## lm(formula = thickness ~ age, data = Melanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6853 -1.7727 -0.9155 0.9558 14.0273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.94105 0.67004 1.404 0.16170
## age 0.03772 0.01217 3.098 0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.899 on 203 degrees of freedom
## Multiple R-squared: 0.04515, Adjusted R-squared: 0.04044
## F-statistic: 9.598 on 1 and 203 DF, p-value: 0.002223
From the model summary output, we find that the \(R^2\) value is 0.04515 and the adjusted \(R^2\) is 0.04044.
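These values can also be reproduced from the sums of squares (a quick check, with variable names of our own):
n <- nrow(Melanoma)
ss_total    <- sum((Melanoma$thickness - mean(Melanoma$thickness))^2)
ss_residual <- sum(residuals(model1)^2)
# R-squared and adjusted R-squared (k = 1 predictor)
c(r_squared     = 1 - ss_residual / ss_total,
  adj_r_squared = 1 - (ss_residual / (n - 1 - 1)) / (ss_total / (n - 1)))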
When both \(x\) and \(y\) are expressed in standard deviation units (standardized), then the regression coefficient (\(\beta_1\)) becomes unitless and is a correlation coefficient. The correlation coefficient is a relative measure of association that depends on the population being studied. Let the population correlation coefficient \(\rho\) be defined as:
\[\rho = \beta_1 \frac{\sigma_x}{\sigma_y},\]
where \(\sigma_y\) is the marginal standard deviation of \(y\) and \(\sigma_x\) is the marginal standard deviation of \(x\). We can see here that if the standard deviation of \(x\) (\(\sigma_x\)) increases, so will the correlation coefficient \(\rho\). Therefore, correlation not only reflects the strength of the relationship between \(x\) and \(y\), but also reflects the variability of \(x\).
An estimator of \(\rho\) is the Pearson correlation coefficient:
\[r = \hat{\beta}_1 \frac{s_x}{s_y} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2} \sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}} = \frac{s_{xy}}{s_x s_y}\]
It can be shown that \(r^2 = R^2\). The correlation coefficient quantifies only the linear portion of the relationship between two variables. You can have a very high correlation coefficient that is nonetheless a poor summary of the relationship between two variables if the relationship is in fact not linear. See Anscombe's quartet.
Because the correlation coefficient is unitless, it can be useful in population-based studies. The table below is based on Cohen (1988):

| Correlation coefficient | Interpretation |
|---|---|
| \(\lvert\rho\rvert < 0.1\) | trivial |
| \(0.1 < \lvert\rho\rvert < 0.3\) | small |
| \(0.3 < \lvert\rho\rvert < 0.5\) | medium |
| \(0.5 < \lvert\rho\rvert < 0.7\) | large |
| \(0.7 < \lvert\rho\rvert < 0.9\) | very large |
| \(\lvert\rho\rvert > 0.9\) | nearly perfect |
It is important to remember, however, that the interpretation of \(\rho\) depends on context. For example in the social sciences, a correlation of 0.2 may be considered large depending on the study, whereas in a physics experiment a correlation of 0.5 may be considered small.
We can calculate the Pearson correlation coefficient between thickness and age using the cor() function:
cor(Melanoma$thickness, Melanoma$age)
## [1] 0.2124798
The Pearson correlation coefficient between \(x\) and \(y\) is 0.21. Based on Cohen’s table (above), we would interpret this as a small correlation.
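Since \(r^2 = R^2\) in simple linear regression, squaring this correlation should recover the Multiple R-squared reported for model1:
# squared Pearson correlation equals R-squared from the simple linear regression
cor(Melanoma$thickness, Melanoma$age)^2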
We can calculate a \((1-\alpha)\times 100\%\) confidence interval for \(\beta_1\) by:
\[\hat{\beta}_1 \pm t_{n-2, 1-\alpha/2}\hat{SE}(\hat{\beta}_1)\] For a 95% confidence interval with a sample size of 100, \(t_{100-2, 1-\alpha/2}\) is equal to 1.98:
qt(0.975, 98)
## [1] 1.984467
For large samples, we can use the standard normal distribution \(z_{1-\alpha/2}\) instead of the \(t\) distribution:
\[\hat{\beta}_1 \pm z_{1-\alpha/2}\hat{SE}(\hat{\beta}_1)\]
For a 95% confidence interval \(z_{1-\alpha/2}\) is equal to 1.96.
qnorm(0.975)
## [1] 1.959964
We continue with model1. Confidence intervals can be calculated using the confint() function:
confint(model1)
## 2.5 % 97.5 %
## (Intercept) -0.38007114 2.2621733
## age 0.01371322 0.0617223
However, we can also verify these bounds manually. Let's focus on using the \(t\) distribution to calculate 95% confidence bounds for \(\hat{\beta}_1\), the estimate for age.
summary(model1)
##
## Call:
## lm(formula = thickness ~ age, data = Melanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6853 -1.7727 -0.9155 0.9558 14.0273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.94105 0.67004 1.404 0.16170
## age 0.03772 0.01217 3.098 0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.899 on 203 degrees of freedom
## Multiple R-squared: 0.04515, Adjusted R-squared: 0.04044
## F-statistic: 9.598 on 1 and 203 DF, p-value: 0.002223
(degfreedom <- nrow(Melanoma)-2)
## [1] 203
(est_age <- coef(model1)[2])
## age
## 0.03771776
(se_age <- coef(summary(model1))[, "Std. Error"][2])
## age
## 0.01217442
(t_quantile <- qt(0.975, degfreedom))
## [1] 1.971719
(lb <- est_age - t_quantile*se_age)
## age
## 0.01371322
(ub <- est_age + t_quantile*se_age)
## age
## 0.0617223
cbind(lb, est_age, ub)
## lb est_age ub
## age 0.01371322 0.03771776 0.0617223
We obtain the same lower and upper confidence bounds as the output from confint()!
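For comparison, the large-sample (normal-based) interval from the formula above can be sketched the same way (our own check, not part of the original output); it will be slightly narrower than the exact \(t\)-based interval because 1.96 < 1.97:
# approximate large-sample 95% CI for the age slope in model1
est <- coef(model1)["age"]
se  <- coef(summary(model1))["age", "Std. Error"]
est + c(-1, 1) * qnorm(0.975) * se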
When we have more than one predictor variable in a model, we have multiple regression analysis. The general model form is:
\[y = \beta_0 + \sum_{j=1}^k \beta_j x_j + \varepsilon\]
where \(\varepsilon \sim N(0, \sigma^2)\). Here, \(\beta_j\) represents the average increase in \(y\) per unit increase in \(x_j\) with all other variables held equal (held constant).
In model1, we previously had thickness as the outcome and age as the predictor variable. Suppose now we also want to consider sex. First, we want to explore the relationship between thickness and sex using both descriptive statistics and visualization (using the ggplot2 package). Our first step is to check the data type of sex. Because it is categorical, we will convert it to a factor. For more about working with factors in R, check out this helpful link. We will also recode it to male and female so it is less confusing. We will do this using the recode() function from the dplyr package.
#load package
library(dplyr)
#check data type
str(Melanoma$sex)
## int [1:205] 1 1 1 0 1 1 1 0 1 0 ...
#convert to a factor because it's categorical
Melanoma <- Melanoma %>%
  mutate(sex = as.factor(recode(sex, `0` = "male", `1` = "female")))
str(Melanoma$sex)
## Factor w/ 2 levels "female","male": 1 1 1 2 1 1 1 2 1 2 ...
#descriptive stats
table(Melanoma$sex)
##
## female male
## 79 126
tapply(Melanoma$thickness, Melanoma$sex, summary)
## $female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.160 1.050 2.580 3.611 4.840 14.660
##
## $male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.970 1.620 2.486 3.060 17.420
tapply(Melanoma$age, Melanoma$sex, summary)
## $female
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.0 43.5 55.0 53.9 66.5 95.0
##
## $male
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.00 42.00 54.00 51.56 64.75 89.00
#viz
ggplot(data=Melanoma, aes(x=sex, y=thickness)) +
  geom_boxplot(aes(fill=sex)) +
  geom_jitter() +
  theme_bw() +
  ggtitle("Boxplots of Tumor Thickness by Sex") +
  scale_fill_manual(values=c("gold", "purple"))
Next, we create model2, which will now include sex in addition to age as a predictor. First, we can write out the model in its mathematical form:
\[y_i = \beta_0 + \beta_1 \text{age}_i + \beta_2 \text{sex}_{male,i} + \varepsilon_i\]
where \(\text{sex}_{male,i}\) is an indicator that equals 1 if patient \(i\) is coded as male and 0 otherwise (female is the reference level).
#run model
model2 <- lm(thickness ~ age + sex, data=Melanoma)
#explore diagnostic plots
stdresids <- stdres(model2)
plot(x=fitted(model2),
     y=stdresids,
     main="Standardized Residuals vs. Fitted",
     ylim=c(-3.5,3.5),
     xlab="Fitted Values",
     ylab="Standardized Residuals")
abline(h=c(-3,0,3), col=c("red","black","red"), lty=2)
plot(model2, which=c(1:3))
From the first plot (standardized residuals vs. fitted), we identify several residuals above the +3 line, indicating potential outliers. We also continue to observe a mass of residuals below the 0 line as compared to above the 0 line. There is also still some indication of a funnel shape.
From the second plot (residuals vs. fitted), similar to the previous plot, we see that the dispersion of residuals above the 0 line is greater than that below the zero line.
From the third plot (QQ-Plot), we still find curvature in the residuals, particularly in the upper tail. Again, indication of right skew.
From the fourth plot (scale-location plot), we see that the smooth red line is increasing as the fitted values are increasing. Again, indication of non-constant variance.
Overall, this model does not seem to satisfy all of our assumptions.
In multiple linear regression, we can perform an \(F\) test, sometimes called the global \(F\)-test. The hypothesis is:
\(H_0: \beta_1 = \beta_2 = \dots = \beta_k = 0\)
\(H_A: \text{At least one of } \beta_j \neq 0\)
The \(F\)-statistic is given by:
\[F = MS_{regression}/MS_{residual}\]
which follows \(F_{k, n-k-1}\) under \(H_0\).
Similar to before, if \(F > F_{k, n-k-1, 1-\alpha}\), then we reject \(H_0\). If \(F \leq F_{k, n-k-1, 1-\alpha}\), then we fail to reject \(H_0\).
We look at the model summary for model2:
summary(model2)
##
## Call:
## lm(formula = thickness ~ age + sex, data = Melanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0249 -1.7633 -0.7117 0.7900 14.4548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69046 0.72457 2.333 0.02063 *
## age 0.03563 0.01204 2.959 0.00346 **
## sexmale -1.04149 0.41156 -2.531 0.01215 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.861 on 202 degrees of freedom
## Multiple R-squared: 0.07449, Adjusted R-squared: 0.06533
## F-statistic: 8.129 on 2 and 202 DF, p-value: 0.0004023
The global \(F\)-statistic is 8.129 on 2 and 202 degrees of freedom. The p-value is 0.0004023. There is very strong evidence against the null hypothesis. At least one of \(\beta_1\) and \(\beta_2\) is not equal to zero.
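As a sketch of our own, the p-value for the global \(F\)-test can also be recovered from the \(F\)-statistic stored in the model summary object:
# the F statistic and its degrees of freedom are stored in the model summary
(fstat <- summary(model2)$fstatistic)
# p-value for the global F test
pf(fstat["value"], df1 = fstat["numdf"], df2 = fstat["dendf"], lower.tail = FALSE)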
In multiple linear regression, the \(t\) test can still be performed as well. The hypothesis is:
\[H_0: \beta_\ell = 0, \text{ given the other predictors in the model}\] \[H_A: \beta_\ell \neq 0, \text{ given the other predictors in the model}\]
Recall that in multiple linear regression, each coefficient is interpreted while holding the other predictors constant. In this sense, the \(t\) test in multiple linear regression is an extension of the \(t\) test from simple linear regression, but now adjusting for all of the other predictors in the model.
The \(t\) statistic is given by:
\[t = \hat{\beta}_\ell/SE(\hat{\beta}_\ell)\]
Similar to above, if \(t < t_{n-k-1, \alpha/2}\) or \(t> t_{n-k-1, 1-\alpha/2}\), then we reject \(H_0\). Otherwise if \(t_{n-k-1, \alpha/2} \leq t \leq t_{n-k-1, 1-\alpha/2}\), we fail to reject \(H_0\).
We again explore the model summary for model2:
summary(model2)
##
## Call:
## lm(formula = thickness ~ age + sex, data = Melanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0249 -1.7633 -0.7117 0.7900 14.4548
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.69046 0.72457 2.333 0.02063 *
## age 0.03563 0.01204 2.959 0.00346 **
## sexmale -1.04149 0.41156 -2.531 0.01215 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.861 on 202 degrees of freedom
## Multiple R-squared: 0.07449, Adjusted R-squared: 0.06533
## F-statistic: 8.129 on 2 and 202 DF, p-value: 0.0004023
Let us focus on age. From the summary output above, the estimate of \(\beta_1\) is 0.035635. The estimated standard error of \(\beta_1\) is 0.0120437. We can confirm the reported t-value (2.9588077) and p-value (0.0034571) manually:
(tval <- coef(summary(model2))[, "Estimate"][2]/coef(summary(model2))[, "Std. Error"][2])
## age
## 2.958808
(degfreedom <- nrow(Melanoma)-length(coef(model2))-1)
## [1] 201
(pval <- 2*(1-pt(tval, df=degfreedom)))
## age
## 0.003458948
The p-value we have manually calculated here (0.0034589) matches the p-value reported in the model summary (0.00346) to the precision shown there. There is strong evidence against \(H_0\).
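To make the "holding all other variables constant" interpretation of the age coefficient in model2 concrete, here is a small sketch of our own using predict(), with two hypothetical patients of the same sex whose ages differ by one year; for a linear model, the difference in their predicted mean thicknesses equals the estimated age coefficient:
# two hypothetical patients of the same sex whose ages differ by one year
newpatients <- data.frame(age = c(50, 51),
                          sex = factor(c("male", "male"), levels = levels(Melanoma$sex)))
# the difference in predicted mean thickness equals the estimated age coefficient
diff(predict(model2, newdata = newpatients))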
Below is a flowchart from Rosner about basic biostatistics methods and how to decide which analysis path may be most appropriate for you given your data and research question:
Copyright (c) 2021 Maria Kamenetsky