PoD in Public Health: Descriptive Statistics and Distributions
Biostatistics is the branch of applied statistics that uses statistical methods to answer medical and biological questions. Descriptive statistics quantitatively summarizes the data collected, with the goal being to summarize the sample and not to make inferences about the population the sample is meant to represent. Exploratory data analysis is the approach of exploring your data both visually and through descriptive statistics prior to any formal modeling or hypothesis testing.
The code in this lesson assumes the reader has a basic understanding of R
. R
is an open-source and free statistical computing language. RStudio
is an integrated development environment (IDE) that makes R
more user-friendly to use. If you would like to follow along, please download both:
To better illustrate the concepts below, we will be using the Melanoma
data set from the R
package MASS
.
library(MASS) #load the package the data set exists in
data("Melanoma") #load in the data set
This Melanoma
dataset contains data collected on 205 patients in Denmark on malignant melanoma. The variables are defined as below:
time
: survival time in days (possibly censored)status
: 1=died from melanoma; 2=alive; 3=dead from other causessex
: 1=male; 0=femaleage
: age in yearsyear
: year of operationthickness
: tumor thickness in \(mm\)ulcer
: 1=presence; 0=absenceWhen a study or survey is conducted, data is collected from various measures. For example in the medical field, we often collect data on sex, age, race and ethnicity, weight, height, etc. These measures are referred to as variables.
The way that data is measured affects the information that can be displayed and summarized. In medical research, the most common scales of measurement are nominal, ordinal, and numerical data.
Nominal Scales: used when the data are categorized or labeled and does not require an order. Nominal data are also known as qualitative or categorical data. When there are only two categories (for example: dead vs. alive), then these variables are called binary.
Ordinal Scales: are used when the categories have a natural order. For example, consider a patient survey where the patient uses a Likert scale to rate their health from 0 to 10, where 0 is the worst health and 10 is the best health. This data is inherently ordered and is therefore measured on the ordinal scale.
Numerical Scales: also known as quantitative data, numerical scales are used to measure the quantity of something measured. There are two types of numerical scales: 1) continuous and 2) discrete. A continuous scale has values that could theoretically span from \(-\infty\) to \(+\infty\). A discrete scale has values that are integers, such as the number of hospital visits (\(0,1,2,...,n\)).
To explore the structure of the Melanoma
data set, we will use the str()
function and apply it to Melanoma
str(Melanoma) #check out the structure of the data set
## 'data.frame': 205 obs. of 7 variables:
## $ time : int 10 30 35 99 185 204 210 232 232 279 ...
## $ status : int 3 3 2 3 1 1 1 3 1 1 ...
## $ sex : int 1 1 1 0 1 1 1 0 1 0 ...
## $ age : int 76 56 41 71 52 28 77 60 49 68 ...
## $ year : int 1972 1968 1977 1968 1965 1971 1972 1974 1968 1971 ...
## $ thickness: num 6.76 0.65 1.34 2.9 12.08 ...
## $ ulcer : int 1 0 0 0 1 1 1 1 1 1 ...
#create some tables to better see these
table(Melanoma$time)
##
## 10 30 35 99 185 204 210 232 279 295 355 386 426 469 493 529
## 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 621 629 659 667 718 752 779 793 817 826 833 858 869 872 967 977
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 982 1041 1055 1062 1075 1156 1228 1252 1271 1312 1427 1435 1499 1506 1508 1510
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1512 1516 1525 1542 1548 1557 1560 1563 1584 1605 1621 1627 1634 1641 1648 1652
## 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1
## 1654 1667 1678 1685 1690 1710 1726 1745 1762 1779 1787 1793 1804 1812 1836 1839
## 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 2
## 1854 1856 1860 1864 1899 1914 1919 1920 1927 1933 1942 1955 1956 1958 1963 1970
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2005 2007 2011 2024 2028 2038 2056 2059 2061 2062 2075 2085 2102 2103 2104 2108
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2112 2150 2156 2165 2209 2227 2256 2264 2339 2361 2387 2388 2403 2426 2431 2460
## 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1
## 2467 2492 2493 2521 2542 2559 2565 2570 2660 2666 2676 2738 2782 2787 2984 3032
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3040 3042 3067 3079 3101 3144 3152 3154 3180 3182 3185 3199 3228 3229 3278 3297
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 3328 3330 3338 3383 3384 3385 3388 3402 3441 3458 3459 3476 3523 3667 3695 3776
## 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 2
## 3830 3856 3872 3909 3968 4001 4103 4119 4124 4207 4310 4390 4479 4492 4668 4688
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 4926 5565
## 1 1
table(Melanoma$status)
##
## 1 2 3
## 57 134 14
table(Melanoma$sex)
##
## 0 1
## 126 79
table(Melanoma$age)
##
## 4 12 14 15 16 19 20 21 22 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 2 2 1 1 1 2 2 1 1 3 1 3 2 1 3 4 3 2 1 1 4
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
## 3 6 4 4 3 6 4 2 6 3 1 6 2 8 6 8 3 8 1 8 4 3 3 4 5 3
## 67 68 69 70 71 72 73 74 75 76 77 78 80 83 84 86 89 95
## 4 7 5 2 3 5 1 2 2 2 4 1 1 2 1 1 1 1
table(Melanoma$year)
##
## 1962 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1977
## 1 1 11 10 20 21 21 19 27 41 31 1 1
table(Melanoma$thickness)
##
## 0.1 0.16 0.24 0.32 0.48 0.58 0.64 0.65 0.81 0.97 1.03 1.13 1.29
## 1 7 1 6 4 1 4 10 11 11 1 4 16
## 1.34 1.37 1.45 1.53 1.62 1.76 1.78 1.94 2.1 2.24 2.26 2.34 2.42
## 2 1 3 1 12 1 2 10 3 1 5 1 1
## 2.58 2.74 2.9 3.06 3.22 3.54 3.56 3.87 4.04 4.09 4.19 4.51 4.82
## 9 1 3 2 10 8 1 6 1 1 2 1 1
## 4.83 4.84 5.16 5.48 5.64 5.8 6.12 6.44 6.76 7.06 7.09 7.41 7.73
## 2 5 3 2 1 2 2 1 1 2 2 1 2
## 7.89 8.06 8.38 8.54 9.66 12.08 12.24 12.56 12.88 13.85 14.66 17.42
## 1 1 1 1 1 1 1 1 2 1 1 1
table(Melanoma$ulcer)
##
## 0 1
## 115 90
We can see that we have 7 variables and 205 observations. Our nominal variables are status
, sex
, and ulcer
. Our ordinal variables are time
and year
(though time can be treated in many different ways, see the discussion here). Our continuous numerical variable is thickness
and our discrete numerical variables is age
.
Exploratory data analysis (EDA) is the process of exploring your data descriptively and visually, prior to performing any testing or making any inferences. The purpose of EDA is to get to know your data and to understand its nuances. For example, how was missingness coded (NA, ., " ")? Are there any potential outliers? Are there any points of leverage? Are there any observations that may have been entered incorrectly? By exploring your data descriptively and visually, you will develop an understanding of the strengths and limitations of your data and this may help to focus your research questions. EDA is one of the most important parts of your data analysis and you should be prepared to spend at least 30% of your data analysis time on EDA prior to moving on to the modeling part of your analysis.
The three measures of central tendency are the mean, median, and mode. We will focus on the mean and median.
The arithmetic mean is the average of the observations. Let \(x_i\) be a single observation and \(x_i,\dots, x_n\) be a series of observations. For example, you are measuring the weight (in kilograms) of every individuals that comes into a clinic on a given day. Then the arithmetic mean, \(\bar{x}\) can be calculated as:
\[\bar{x} = \frac{\sum_{i=1}^n x_i}{n},\] where \(n\) is the total number of observations or individuals that came through the clinic. The mean has the same units as the original units. In our example, this would be kilograms.
We calculate the arithmetic mean for thickness
in the Melanoma
dataset using the mean()
function:
mean(Melanoma$thickness)
## [1] 2.919854
The median is the observation which falls in the middle of an ordered series of observations, where half of the observations are below the median and half are above the median.
The median can be calculated in the following way:
We calculate the median for thickness
in the Melanoma
dataset using the median()
function:
median(Melanoma$thickness)
## [1] 1.94
The geometric mean is the \(n\)th root of the product of the \(n\) observations:
\[GM = \sqrt[\leftroot{-3}\uproot{3}n]{x_1x_2\dots x_n}\]
Consider the following example from the Rosner textbook: In many types of laboratory data, concentrations can be expressed as multiples of 2 or a constant multiplied to the power 2 (outcomes follow the form \(2^kc, k=0,1,...,K\) for a constant \(c\)). Consider values of a minimum inhibitory concentration (MIC) of penicillin G in the urine for N. gonorrhaoeae in a sample of patients. The arithmetic mean would not be appropriate to use here because the distribution is very skewed. However, there is a pattern that we can exploit because the only possible values are of the form \(2^k(x)\) for \(k=0,1,2,\dots,K\). One solution is to work on the log scale: \(\log(2^{k+1}c) - \log(2^kc) = \log(2^{k+1}) + \log c - \log(2^k) -\log c = (k+1)\log 2 - k \log2 = \log 2\). The log concentrations are equally spaced from each other and the resulting distribution will not be as skewed.
The arithmetic mean can be computed on the log scale:
\[\bar{\log x} = \frac{1}{n} \sum_{i=1}^n \log x_i\]
Though any base can be used to compute the logarithms for the geometric mean, the most commonly used ones are base 10 and base \(e\) (natural log).
We calculate the geometric mean for thickness
in the Melanoma
dataset using the base \(e\) log, or natural log, using the mean()
function. In order to report it on the original scale, we will exponentiate the result using the exp()
function.
exp(mean(log(Melanoma$thickness)))
## [1] 1.85553
Note: All three measures of central tendency for thickness
give a slightly different answer. Deciding on which is most appropriate for your data depends on the descriptive statistics and exploratory data analysis you perform first.
The variation or spread of observations can be summarized by the range, variance, standard deviation, coefficient of variation, percentiles, and the interquartile range.
The range is the difference between the largest and smallest observations.
We calculate the geometric mean for thickness
in the Melanoma
dataset using the range()
function:
range(Melanoma$thickness)
## [1] 0.10 17.42
R
actually returns the minimum and maximum values. We can check this manually using the min()
and max()
functions:
min(Melanoma$thickness)
## [1] 0.1
max(Melanoma$thickness)
## [1] 17.42
We can manually calculate the range in two ways. First, we can set the minimum and maximum values as objects:
thickness_min <- min(Melanoma$thickness)
thickness_max <- max(Melanoma$thickness)
thickness_max-thickness_min
## [1] 17.32
So the range is 17.32.
The second way is we can subset the vector of 2 elements returned by range(Melanoma$thickness)
:
range(Melanoma$thickness)[2]-range(Melanoma$thickness)[1]
## [1] 17.32
The standard deviation (SD) is how spread the data are about their mean. The standard deviation is defined as:
\[SD = s = \sqrt{\frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}}\]
The standard deviation is in the same units as the original measurements. If the standard deviation is larger than the mean, this may be indicative of a positively skewed distribution.
The variance is the average of squared deviation from the mean and can be calculated by squaring the standard deviation:
\[Var = s^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1}\] The units associated with the variance are the square of the measurement units
Notice that both the standard deviation and variance have \(n-1\) in the denominator instead of \(n\). This is known as Bessel’s correction and corrects the bias in the estimation of the population variance. While the standard deviation is easier to interpret, most statistical theory is built on the variance so it is therefore important to be familiar with both.
We calculate the standard deviation for thickness
in the Melanoma
dataset using the sd()
function:
sd(Melanoma$thickness)
## [1] 2.959433
We calculate the variance for thickness
in the Melanoma
dataset in two ways. First, we can use the var()
function:
var(Melanoma$thickness)
## [1] 8.758242
Second, we can square the standard deviation:
sd(Melanoma$thickness)^2
## [1] 8.758242
Note that we get the same result both ways.
The coefficient of variation (CV) is unitless and is frequently used in many applied fields. It can be calculated as:
\[CV = \frac{SD}{\bar{x}},\]
where \(SD\) is the standard deviation and \(\bar{x}\) is the arithmetic mean.
We calculate the coefficient of variation for thickness
in the Melanoma
dataset by using both the sd()
and mean()
functions:
sd(Melanoma$thickness)/mean(Melanoma$thickness)
## [1] 1.013555
Quantiles or percentiles are the percentage of the distribution that is equal to or below a particular number. They can be very useful in comparing an individual measurement with a norm.
The \(p\)th quantile is defined by:
For example, the median is the 50th percentile.
We first calculate the 50th percentile for thickness
in the Melanoma
dataset using the quantile()
function and show that it is the same as the median. We can then use the same quantile()
function to calculate the 10th, 25th, 50th, 63rd, 75th, and 99th percentiles.
median(Melanoma$thickness)
## [1] 1.94
quantile(Melanoma$thickness, probs=0.50)
## 50%
## 1.94
quantile(Melanoma$thickness, probs=c(0.10,0.25, 0.50,0.63, 0.75, 0.99 ))
## 10% 25% 50% 63% 75% 99%
## 0.6400 0.9700 1.9400 2.6632 3.5600 13.8112
The interquartile range uses quantiles and is defined as the difference between the 75th (third quartile) and 25th percentile (first quartile). The interquartile range is particularly useful because it contains 50% of the observations.
We can calculate the interquartile range for thickness
in the Melanoma
dataset in two ways. First, we can use the IQR()
function:
IQR(Melanoma$thickness)
## [1] 2.59
Second we can use the quantile()
function:
quantile(Melanoma$thickness, probs = 0.75)-quantile(Melanoma$thickness, probs = 0.25)
## 75%
## 2.59
Note that we get the same answer both ways.
Boxplots can be a very useful way to visualize the spread and interquartile range. We use the boxplot()
function, where the first argument is the dataframe Melanoma and the variable thickness, the second argument we create a title for a plot using main="Boxplot of Thickness"
. In the last argument, we make the boxplot purple with the argument col="purple"
.
boxplot(Melanoma$thickness, main="Boxplot of Thickness",col="purple")
Nominal and ordinal data cannot be treated the same as numerical data. To summarize nominal and ordinal data, we often use proportions, percentages, ratios, and rates.
A proportion is the number of observations, \(a\), with a certain characteristic divided by the total number of observations, \(a+b\) in a group:
\[Proportion = \frac{a}{a+b}\]
For example, we can calculate the proportion of individuals in the Melanoma
dataset that were male (sex == 1).
To calculate the proportion of males in the Melanoma
dataset, we will use the sum()
function and nrow()
functions. Since males are coded as sex=1
, if we sum all of the observations in sex
, we will get the total count of observations that are male. We then divide by the total number of individuals or the total number of rows in the dataframe:
sum(Melanoma$sex)/nrow(Melanoma)
## [1] 0.3853659
We can also calculate the proportion of individuals who were alive. We first use the table()
function to see how many observations there are in each of the three categories (1 = died from melanoma, 2=alive, 3=dead from other causes):
table(Melanoma$status)
##
## 1 2 3
## 57 134 14
In this example, because there are 3 groups and the 1,2,3 encodings have meaning, we cannot use the sum()
function. Instead we will use the length()
function to count the number of observations where a certain condition holds. That condition is that Melanoma$status==2
.
length(Melanoma$status[Melanoma$status==2])
## [1] 134
We see that there are 134 observations for status == 2, which corresponds with the results from the table()
function. We can then divide this by the total number of observations or total number of rows:
length(Melanoma$status[Melanoma$status==2])/nrow(Melanoma)
## [1] 0.6536585
A ratio is the number of observations in a group with a give characteristic divided by the number of observations without that characteristic, and is defined as:
\[Ratio = \frac{a}{b}\]
We can calculate the ratio of those alive to those dead by using the length()
function and comparing the number of observations where status is equal to 2 (Melanoma$status[Melanoma$status==2]
) to the number of observations where status is not equal to 2 (Melanoma$status[Melanoma$status!=2]
)
length(Melanoma$status[Melanoma$status==2])/length(Melanoma$status[Melanoma$status!=2])
## [1] 1.887324
Rates are similar to proportions except that they are calculated over a specific period of time. The multiplier is known as a base:
\[Rate = \frac{a}{a+b} \times Base\]
A random variable is a function that assigns numeric values to different events in a sample space (Rosner). A probability mass function or probability distribution is a mathematical relationship that assigns to any possible value \(r\) of a discrete random variable \(X\) to the probability \(Pr(X=r)\).
The expected value of a discrete random variable is defined as:
\[E(X) = \mu = \sum_{i=1}^R x_i Pr(X=x),\]
where \(x_i\)’s are the values of the random variable assumes.
The variance of a discrete random variable is defined as:
\[Var(X) = \sigma^2 = \sum_{i=1}^R (x_i - \mu)^2Pr(X = x_i)\\ =E(X-\mu)^2 = \sum_{i=1}^R x_i^2 Pr(X=x_i) - \mu^2\]
Consider a sample of \(n\) independent trials where there are only 2 possible outcomes: success or failure. For example, consider each member of a basketball team shooting a basketball. Whether or not the ball goes through the hoop can be considered a success (they make it) or a failure (they do not make the basket). The probability of a success at each trial (or for each player) is assumed to be constant \(p\) and the probability of a failure is therefore \(1-p=q\).
The probability distribution of the binomial distribution is:
\[Pr(X=k) = {n \choose k} p^k(1-p)^{n-k}\]
Consider a nest of 10 eggs. What is the probability of randomly selecting 2 females out of the 10 eggs if the probability of a female is 0.55?
We can use the binomial distribution with \(n=10\), \(p=0.55\), and \(k=2\):
\(Pr(X=2) = {10 \choose 2}(0.55)^2 (0.45)^8 = 0.023\)
We can use R
to calculate this using the dbinom()
function:
dbinom(2,10, 0.55)
## [1] 0.02288959
We can plot the binomial distribution with \(n=10\) and with probability 0.55:
plot(0:10, dbinom(0:10,10, 0.55), type="h",
main="Binomial Distribution (n=10, p=0.55)",
ylab = "Probability", xlab="# Successes")
The Poisson distribution is usually associated with rare events. For example, breast cancer cases in a county. The probability of \(k\) events occurring in time period \(t\) with rate of the expected number of events per unit time,\(\lambda\), give rise to the Poisson distribution:
\[Pr(X=k) = \frac{e^{-\lambda}\lambda^k}{k!}, k=0,1,2,\dots\] With the Poisson distribution, both the mean and variance are equal to \(\lambda\).
Consider a food-borne disease outbreak where during the span of 1 month in Dane County, where 10 individuals contracted the rare disease. Based on previous records of about a ten percent of the Dane County population (50,000 people), 5.2 cases were expected. Was the observed number of cases excessive?
In this case we know that \(\lambda = 5.2\) and want to calculate \(Pr(X \geq 10)\). So we want to calculate \(Pr(X \geq 10) = 1 - Pr(X \leq 10)\).
We can calculate the probability for \(Pr(X = 1)\) as \(\frac{e^{-5.2}(5.2)^1}{1!}\). we can use the dpois()
function:
dpois(1, 5.2)
## [1] 0.02868613
We can plot the Poisson distribution with \(\lambda=5.2\):
plot(0:10, dpois(0:10, lambda = 5.2), type="h",
main="Poisson Distribution (lambda=5.2)",
ylab = "Probability", xlab="# Cases")
However, we want \(Pr(X \leq 10)\). We need to calculate \(Pr(X = 0) + Pr(X =1) + \dots + Pr(X = 10)\).
dpois(1:10, 5.2)
## [1] 0.02868613 0.07458395 0.12927885 0.16806250 0.17478500 0.15148034
## [7] 0.11252825 0.07314336 0.04226061 0.02197552
sum(dpois(1:10, 5.2))
## [1] 0.9767845
Finally, we take 1 minus \(Pr(X \leq 10)\):
1-sum(dpois(1:10, 5.2))
## [1] 0.02321549
The observed number of cases were excessive for this group.
The probability density function of random variable \(X\) is a function where the area under the density function curve between any two points \(a\) and \(b\) is equal to the probability that the random variable \(X\) falls between \(a\) and \(b\). The total area under the density function curve over the range of possible values sums to 1.
The probability density function of the normal distributions is:
\[f(x) = \frac{1}{\sqrt{2\pi \sigma}} \exp \bigg[- \frac{1}{2 \sigma^2} (x-\mu)^2 \bigg], -\infty < x < + \infty\]
Suppose that in a population-based case-control study (\(n= 1000\)), a blood measurement is assumed to follow a standard normal distribution (\(N(0,1)\)), with mean zero and variance one. A blood measurement is considered to indicate a serious medical condition if it is less than -1.5. What percentage of participants are indicated to be in the serious medical condition?
We want to calculate \(Pr(X < -1.5)\) and will use the pnorm()
function with arguments mean=0
and sd=1
:
pnorm(-1.5, mean=0, sd=1)
## [1] 0.0668072
Approximately 7% of participants are indicated to have the serious medical condition
Now consider that the measurement is in the normal range if it is within 1.5 standard deviations of the mean. What proportion of the participants are within the normal range?
We want to calculate \(Pr(-1.5 \leq X \leq 1.5)\)
pnorm(1.5)-pnorm(-1.5)
## [1] 0.8663856
Approximately 87% of participants are within the normal range.
We can plot the standard normal distribution with mean = 0 and standard deviation = 1:
plot(seq(-10,10, length.out=100), y=dnorm(seq(-10,10, length.out=100), mean=0, sd=1),
type="l",
main="Standard Normal Distribution (mean = 0, sd = 1)",
ylab = "Probability", xlab="Mean")
Copyright (c) 2021 Maria Kamenetsky