Introduction

Epidemiology is defined as the study of the distribution and determinants of health-related states or events in specified populations and the application of this study to control health problems (Szklo & Nieto).

Epidemiology can be classified into descriptive and analytical epidemiology. Descriptive epidemiology uses available data to examine how rates of disease or mortality vary across demographic variables. When the rates are not uniform across demographic groups of interest (for example, by race, sex, or class), epidemiologists can use that information to identify high-risk groups.

Analytical epidemiology uses study designs to assess hypotheses of associations between suspected risk factors and health outcomes. Study designs are key to both observational epidemiologic studies and randomized clinical trials.

In both descriptive and analytical epidemiology, we focus on two distinct measures of disease: 1) incidence and 2) prevalence. The incidence of a disease is the number of new disease cases that occur during a specified period of time in a population at risk for developing the disease.

The cumulative incidence is then the proportion of the population at risk that develops the disease during the specified period of time, during which all individuals in the population were considered to be at risk of developing the disease. The cumulative incidence is a measure of risk. The two key points about incidence are that 1) the focus is on new cases and 2) the time period must be clearly defined. When not all individuals have been observed for the same amount of time, you can instead calculate the incidence density (or incidence rate) by using person-time as the denominator. You may have come across person-time as person-years or person-months in some studies already. For example, if we want to calculate the incidence rate per 1,000 person-years:

\[\text{Incidence rate per 1,000 person-years} = \frac{\text{# new cases occurring in the population during specified time period}}{\text{Total person-time}} \times 1{,}000,\]

where \(\text{Total person-time}\) is the sum of the time periods of observation of each person who has been observed for all or part of the time period.
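As a quick illustration of the person-time bookkeeping, here is a minimal Python sketch using entirely hypothetical follow-up times and case counts:

```python
# Minimal sketch: incidence rate per 1,000 person-years (hypothetical data)

# Hypothetical follow-up time (in years) contributed by each of 8 individuals;
# people observed for only part of the period contribute only that part
person_years = [5.0, 5.0, 2.5, 5.0, 1.0, 4.0, 5.0, 3.5]

# Hypothetical number of new cases observed during follow-up
new_cases = 2

total_person_time = sum(person_years)                  # sum of each person's observed time
incidence_rate = new_cases / total_person_time * 1000  # new cases per 1,000 person-years

print(f"Total person-time: {total_person_time} person-years")
print(f"Incidence rate: {incidence_rate:.1f} per 1,000 person-years")
```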

Prevalence is the number of affected individuals present in the population at a specific time divided by the number of individuals in the population at that specific time. For example, if we want to calculate the prevalence per 1,000 individuals:

\[\text{Prevalence per 1,000 ppl} = \frac{\text{# cases present in the population at the specified time}}{\text{# individuals in the population at the specified time}} \times 1{,}000\]

Prevalence can be further broken down into point prevalence and period prevalence. Point prevalence is the prevalence of disease at a certain point in time. Unless otherwise specified, when we discuss prevalence we are talking about point prevalence. Period prevalence is the number of people who have had the disease at any point during a specified time period. For example, suppose you are conducting a study and calculate the prevalence of asthma between 2015-2020. If during this time period you consider all individuals with asthma (both new cases and previously diagnosed cases) as the number of cases present during this 5-year period, then you are calculating the period prevalence of asthma.

Prevalence can be thought of as a “snapshot” of the population at a point in time. There is no information about when the disease developed and the prevalent cases will include individuals both recently diagnosed with the disease and those diagnosed many years ago. Because of these different durations of disease in the prevalent cases, we do not have a measure of risk. In order to estimate risk, we would need to calculate the incidence, which will include only new cases identified during a specified time period.
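The same bookkeeping works for prevalence. Below is a minimal sketch with made-up counts for both point and period prevalence:

```python
# Minimal sketch: point and period prevalence per 1,000 individuals (hypothetical numbers)

# Point prevalence: a snapshot on a single survey date
cases_at_time_t = 120        # hypothetical: people with the disease on the survey date
population_at_time_t = 8000  # hypothetical: people in the population on the survey date
point_prevalence = cases_at_time_t / population_at_time_t * 1000

# Period prevalence over, say, 2015-2020: existing cases at the start plus new cases during the period
existing_cases_at_start = 120  # hypothetical
new_cases_during_period = 45   # hypothetical
period_prevalence = (existing_cases_at_start + new_cases_during_period) / population_at_time_t * 1000

print(f"Point prevalence:  {point_prevalence:.1f} per 1,000 individuals")
print(f"Period prevalence: {period_prevalence:.1f} per 1,000 individuals")
```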

Defining Exposures

In order to conduct an epidemiological study, an exposure of interest and disease outcome of interest must be defined. An etiologically-relevant time period for exposure must first be defined, which should take into consideration:

  • Threshold level
  • Duration
  • Dose
  • Mechanism (biological plausibility)

The exposure can then be assessed using:

  • Interviews
    • Protocols must be established. Will multiple observers or interviewers be used?
  • Records and bio-samples

Epidemiologic Measures

In epidemiologic studies, there are relative and absolute measures of effect.

Relative measures quantify the strength of association on a scale that is independent of the magnitude of incidence in the unexposed population. This is because the incidence in the unexposed group forms the denominator of the ratio, so its magnitude cancels out.

Absolute measures are sensitive to the incidence of disease in the unexposed population.

For the measures below, consider the following two-by-two table:

      D+   D-
E+     a    b
E-     c    d

where \(E+\) are those exposed, \(E-\) are those unexposed, \(D+\) are those who develop the disease, and \(D-\) are those who do not develop the disease.

Relative Measures

Relative measures range from 0 to \(\infty\) and are dimensionless. In other words, there are no units.

Point Prevalence Ratio (PPR)

The point prevalence ratio is used with cross-sectional studies and can be calculated as:

\[PPR = \big[\frac{a/(a+b)}{c/(c+d)} \big]\]

We assume that the exposure does not affect survival (there are no spillover effects) and that the disease is rare.

Risk Ratios

Risk ratios are the risk of disease in the exposed divided by the risk of disease in the unexposed and can be calculated as:

\[RR = \big[\frac{a/(a+b)}{c/(c+d)} \big]\]

or

\[RR = \frac{CumulativeIncidence_{exposed}}{CumulativeIncidence_{unexposed}}\]

Notice that though the formula is the same as that for the point prevalence ratio, the risk ratio is different from the PPR based on the study design and who is in the study base. For the risk ratio, we assume we are able to calculate the cumulative incidence or risk, meaning we have the whole study base population.

Conversely for PPR, we are taking a sample in time of the study base, and therefore cannot calculate incidence, but only prevalence.

Odds Ratios

Odds ratios are the odds of disease in the exposed divided by the odds of disease in the unexposed.

\[OR = \big[\frac{a/b}{c/d} \big]\]

The odds ratio will approximate the risk ratio when the disease is rare, also known as the rare disease assumption. Otherwise the odds ratio will be further from the null than the risk ratio (larger when the risk ratio is above 1, smaller when it is below 1), biasing away from the null.

In cross-sectional studies, the equivalent measure is the point prevalence odds ratio (PPOR).
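Because the PPR, risk ratio, and odds ratio all come from the same two-by-two layout, they are easy to compute side by side. This is a minimal sketch with arbitrary, made-up cell counts:

```python
# Minimal sketch: relative measures from a two-by-two table (hypothetical counts)
#         D+   D-
#   E+     a    b
#   E-     c    d
a, b, c, d = 40, 160, 20, 180  # hypothetical cell counts

risk_exposed = a / (a + b)
risk_unexposed = c / (c + d)

ratio = risk_exposed / risk_unexposed  # PPR in a cross-sectional study, RR in a cohort study
odds_ratio = (a / b) / (c / d)         # approximates the ratio above only when the disease is rare

print(f"Risk ratio / PPR: {ratio:.2f}")       # 2.00
print(f"Odds ratio:       {odds_ratio:.2f}")  # 2.25 -- further from the null, since D+ is not rare here
```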

Incidence

Cumulative Incidence

Cumulative incidence is the proportion of people in the risk set that develop the disease outcome of interest. Without any censoring (you know whether each individual developed or did not develop the disease and had no losses to follow-up), the cumulative incidence can be calculated as:

\[CI = d+/N,\]

where \(d+\) is the number of individuals who developed the disease outcome of interest and \(N\) are the total number of individuals in the risk set. This is called the simple method of calculating cumulative incidence.

When you do have censoring (ex: when you have losses to follow-up), the actuarial method can be used to calculate the cumulative incidence under censoring as:

\[CI = \frac{d+}{n_i - \frac{1}{2} w_i},\]

where \(n_i\) is the number of individuals at risk in time period \(i\) and \(w_i\) is the number of withdrawals (or losses to follow-up) in time period \(i\).
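A minimal sketch of both calculations, using hypothetical counts for a single follow-up period:

```python
# Minimal sketch: cumulative incidence with and without censoring (hypothetical counts)

new_cases = 100    # hypothetical: new cases during the period
n_at_risk = 2000   # hypothetical: disease-free individuals at the start of the period

# Simple method: everyone is followed for the full period
ci_simple = new_cases / n_at_risk

# Actuarial method: withdrawals are assumed to be at risk, on average, for half the period
withdrawals = 150  # hypothetical: losses to follow-up during the period
ci_actuarial = new_cases / (n_at_risk - 0.5 * withdrawals)

print(f"Simple cumulative incidence:    {ci_simple:.3f}")
print(f"Actuarial cumulative incidence: {ci_actuarial:.3f}")
```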

Example

Consider a fictional cohort study where you have a cohort of 2,000 individuals, all disease-free at the beginning of the study. Over the span of 10 years, 100 people develop macular eye degeneration. Then the cumulative incidence is \(CI=100/2000=0.05\), or there is a 5% risk of developing macular eye degeneration over 10 years. The incidence proportion (sometimes referred to as incidence rate, see a detailed explanation here) is 50 cases per 1,000 individuals over 10 years.

Absolute Measures

Absolute measures can range from \(-\infty\) to \(\infty\) and are not dimensionless. In other words, the units for interpretation remain the same as the original measure.

Attributable Risk

Attributable risk (AR) is the amount of disease incidence (or disease risk) that can be attributed to a specific exposure (Gordis) and can be calculated as:

\[AR=Risk_{exposed} - Risk_{unexposed}\]

A variation is the percent attributable risk in the exposed, which expresses the excess risk as a proportion of the risk in the exposed group and can be calculated as:

\[\%AR = \big[\frac{Risk_{exposed} - Risk_{unexposed}}{Risk_{exposed}} \big] \times 100\]
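A minimal sketch of both quantities, using made-up risks:

```python
# Minimal sketch: attributable risk and percent attributable risk in the exposed (hypothetical risks)

risk_exposed = 0.08    # hypothetical risk in the exposed group
risk_unexposed = 0.02  # hypothetical risk in the unexposed group

attributable_risk = risk_exposed - risk_unexposed
percent_ar_exposed = attributable_risk / risk_exposed * 100

print(f"Attributable risk:             {attributable_risk:.3f}")
print(f"% attributable risk (exposed): {percent_ar_exposed:.0f}%")
```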

Population Attributable Risk

Population attributable risk is defined as the proportion of disease incidence in a total population that can be attributed to a specific exposure.

This measure is particularly useful in public health, as it can provide an estimate of the total impact of a public health initiative in a community. PAR can be calculated as:

\[PAR=(\text{Incidence in the total population}) - (\text{Incidence in non-exposed group [background risk]})\]

or, expressed as a proportion of the incidence in the total population (Levin's formula; multiply by 100 for the percent population attributable risk),

\[PAR = \frac{P_{pop}(RR-1)}{[P_{pop}(RR-1)+1]},\]

where \(P_{pop}= n_{exposed}/N\) and \(RR\) is the relative risk of disease.
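The two forms describe the same quantity on different scales, which the following sketch (with hypothetical numbers) makes explicit:

```python
# Minimal sketch: population attributable risk, absolute and as a fraction (hypothetical numbers)

p_pop = 0.30                # hypothetical proportion of the population that is exposed
incidence_exposed = 0.08    # hypothetical incidence in the exposed
incidence_unexposed = 0.02  # hypothetical incidence in the unexposed (background risk)

incidence_total = p_pop * incidence_exposed + (1 - p_pop) * incidence_unexposed

# PAR as an absolute difference
par_absolute = incidence_total - incidence_unexposed

# Levin's formula gives the same excess as a proportion of the total incidence
rr = incidence_exposed / incidence_unexposed
par_fraction = p_pop * (rr - 1) / (p_pop * (rr - 1) + 1)

print(f"Incidence in total population: {incidence_total:.3f}")
print(f"PAR (absolute):                {par_absolute:.3f}")
print(f"PAR fraction:                  {par_fraction:.3f}")  # equals par_absolute / incidence_total
```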

Epidemiological Study Designs

Cohort Studies

Cohort studies are considered to be the “gold standard” in observational studies. If unbiased, cohort studies can reflect the “real life” sequence of events in time as they occur.

Cohort studies consider the entire study base, instead of sampling from the study base as case-control or cross-sectional studies do. In this way, case-control and cross-sectional study designs are variants of a cohort study that are less costly.

In cohort studies, a group of healthy people or a cohort is identified and then followed for a certain period of time to discern the occurrence of disease outcomes and other health events. After the study base is defined, individuals are classified by exposure status (exposed vs. unexposed) and the incidence of disease can be calculated and compared across exposure categories. In cohort studies, you want to minimize losses to follow-up and ideally have a high frequency of exposure. You want the exposed and unexposed groups to be exchangeable.

The objective of cohort studies is usually to identify whether the incidence of a disease outcome is related to a suspected exposure. For this reason, prevalent cases are excluded from the cohort at baseline. This is also why standard cohort studies are prospective studies.

The study populations in cohort studies can be diverse and include a sample of the general population in a certain study region (for example, the Framingham Heart Study). Cohorts can also be defined more narrowly based on the research question. For example, in occupational epidemiologic studies, an occupational cohort may be defined as a group of workers in a specific occupation who are then classified according to their exposure status to suspected hazards.

Cohorts can also be formed from convenience samples, such as the Nurses Health Study Cohort. One advantage of cohorts based on convenience samples is that they are logistically easier to manage, for example with respect to follow-up.

Cohort studies, though the “gold standard” in observational studies, are still subject to threats to validity. Losses to follow-up can be a threat to validity if the losses are differential. For example, if individuals with the disease outcome in the exposed group were more likely to drop out of the study than individuals with the disease outcome in the unexposed group, then the losses to follow-up would be affected by both the exposure and disease status.

Cohort studies can also be retrospective, where a cohort is identified in the past based on existing records and then individuals are followed through the records to present time. Retrospective studies can be particularly useful in occupational studies, where occupational records can be linked to mortality records or cancer registries. Although retrospective cohort studies are useful and may be more efficient than a prospective cohort study, their biggest limitation is that you are restricted to the data already available.

With cohort studies, there are also several time-related aspects of the exposure to consider. Accrual of exposure is the accumulation of exposure that has not yet reached the threshold needed to cause disease. The induction period is the period of time required for the exposure to produce the disease. The latent period is the period during which the disease is present, but is not yet detectable. Finally, the time at risk of exposure effects runs from the induction period until the disease is detected; during this window, the disease outcome can be etiologically attributed to the exposure.

Example

Suppose you have a fictional cohort study where you followed 4,765 individuals for 5 years to see how many individuals develop Type II diabetes. Your main exposure of interest is ever smoking and its association with the development of Type II diabetes. For this example, assume there are no losses to follow-up. Consider the table below:

             Diabetes+   Diabetes-
Smoking+        103         1508
Smoking-         82         3072

The cumulative incidence (risk) of Type II diabetes per 1,000 individuals over the 5 years can be calculated among ever-smokers as \([103/(103+1508)]*1000= 63.94\), or ~64 cases per 1,000 individuals, and among never-smokers as \([82/(82+3072)]*1000 = 25.99\), or ~26 cases per 1,000 individuals. The risk ratio for ever smoking and the development of Type II diabetes can then be calculated as \(RR = \frac{103/(103+1508)}{82/(82+3072)}=2.46\). The risk of developing Type II diabetes over 5 years among ever-smokers was about 2.5 times the risk among never-smokers.
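As a quick check of the arithmetic above:

```python
# Checking the fictional cohort example above (counts of individuals)

a, b = 103, 1508  # ever-smokers: developed diabetes / did not
c, d = 82, 3072   # never-smokers: developed diabetes / did not

risk_smokers = a / (a + b)
risk_nonsmokers = c / (c + d)
rr = risk_smokers / risk_nonsmokers

print(f"Risk (ever-smokers):  {risk_smokers * 1000:.1f} per 1,000 over 5 years")
print(f"Risk (never-smokers): {risk_nonsmokers * 1000:.1f} per 1,000 over 5 years")
print(f"Risk ratio: {rr:.2f}")
```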

Cohort Study Strengths:

  • The temporal relationship between the exposure and disease outcome can be established
  • Multiple effects of an exposure can be examined
  • Greater control of the measurement of exposure and criteria for disease outcome (better data quality)
  • Special cohorts can be used to study the effects of a rare exposure

Cohort Study Limitations:

  • Prospective cohort studies can be costly both financially and time-wise
  • Prospective cohort studies can be affected by losses to follow-up
  • Retrospective cohort studies are limited by data availability and data quality

How to Minimize Selection Bias in Cohort Studies:

    1. Minimize losses to follow-up
    2. Select exchangeable cohorts, where the exposed and unexposed belong to the same study base
    3. Minimize non-participation (self-selection)

Case-Control Studies

Case-control studies are another study design by which the exposure-disease relationship can be investigated. Where in a cohort study the exposed and unexposed individuals are compared with respect to disease incidence, in a case-control study, cases and controls are compared with respect to their exposure status. In other words, what are the odds that a case was exposed?

If the exposure of interest is binary (ex: exposed to air pollution vs. unexposed to air pollution), then odds ratios are used as the measure of association. If the exposure of interest is continuous (ex: weight), then the mean levels can be compared in cases and controls.

One advantage that case-control studies have over cohort studies is that they do not include follow-up time, making case-control studies faster to undertake and more efficient.

There are also several variants of case-control studies. Case-control studies can be based on a primary study base, where the study base is defined by the investigator to target a specific research question and cases are those who develop the disease within the base. The same eligibility criteria are used for selecting both cases and controls. Enrollment occurs either randomly or blinded and is independent of exposure status. If cases and controls are from the same study base, then exchangeability is more likely to hold, where compared groups are considered exchangeable if their outcomes would be the same if they had had the same exposure.

There are several variations of case-control studies, distinguished by how controls are sampled:

    1. Case-Control Studies with Cumulative Incidence Sampling (Survivor Sampling)
    • Controls are the non-cases at the end of the follow-up time (i.e., the survivors).
    • Assumes all non-cases have been followed for the same length of time. This may not be true for dynamic cohorts, where individuals are added to or withdrawn from the pool of the population at risk as they move in and out of an area.
    • Losses to follow-up may lead to bias (selection bias, where your exposure of interest and disease outcome of interest both independently affect another factor. The bias comes from this other factor inducing or altering the true association between the exposure and outcome).
    2. Case-Cohort Studies
    • A sub-cohort is created that is free of the disease of interest at the start of follow-up. This sub-cohort is followed and all new cases are identified from the full cohort as well as the sub-cohort.
    • Controls are selected from the sub-cohort.
    • Advantages of this study design are that it allows for estimation of incidence rates, the natural history of disease can still be studied, and the sub-cohort can be used as a comparison group to study multiple diseases.
    3. Nested Case-Control Studies with Incidence Density Sampling
    • Controls are selected from the set of individuals at risk at the time of each new case. In this way, cases are matched to controls on follow-up time. An individual can be a control twice, or a control and then a case.
    • Advantages of this study design are that there are no losses to follow-up (which minimizes the potential for selection bias), there is no need for the rare disease assumption, and there is no survival bias.

Case-control studies can also be set within a secondary study base, where cases are defined before the study base is identified and the study base cannot be measured or enumerated. An example of a secondary study base design is a hospital-based case-control study, where controls are selected from patients with multiple other diseases (none of which are known to be associated with or to interact with the disease of interest). A big drawback of secondary study base designs is that you cannot be certain that cases and controls are from the same study base, and therefore the exposed and unexposed may not be exchangeable.

Case-Control Study Strengths:

  • They are fairly quick and inexpensive
  • They are good for rare diseases
  • They can be used to assess multiple risk factors

Case-Control Study Limitations:

  • Inefficient for rare exposures
  • Unless using a primary study base, you cannot estimate incidence
  • Lower data quality

How to Minimize Selection Bias in Case-Control Studies:

    1. Select cases and controls independent of their exposure status
    2. Apply the same eligibility criteria to cases and controls
    3. Select cases at random (or take all cases)
    4. Select cases and controls from the same study base
    5. Use incident cases, when possible
    6. In hospital-based case-control studies, select controls from patients with multiple diseases not known to be associated with the disease of interest.

Example

Suppose that in a fictional case-control study, you are studying asthma-related hospitalizations and exposure to wildfires. Consider the table below:

             Asthma Hospitalization+   Asthma Hospitalization-
Wildfire+              64                        110
Wildfire-              32                        214

We can calculate the odds ratio as \(OR=\frac{64/32}{110/214} = 3.89\), or the odds of wildfire exposure were nearly four times higher among those hospitalized for an asthma-related exacerbation than among controls.
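The same calculation as a quick script:

```python
# Checking the fictional case-control example above

exposed_cases, unexposed_cases = 64, 32          # wildfire+ / wildfire- among hospitalized cases
exposed_controls, unexposed_controls = 110, 214  # wildfire+ / wildfire- among controls

odds_ratio = (exposed_cases / unexposed_cases) / (exposed_controls / unexposed_controls)
print(f"Odds ratio: {odds_ratio:.2f}")  # ~3.89
```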

Cross-Sectional Studies

Cross-sectional studies consist of a sample (or in some cases the total) of the reference population at a given point in time. Cross-sectional studies can be thought of as a “snapshot” of a cohort study, characterized by analyzing the disease outcome and exposure at a given point in time.

When cross-sectional studies do come from a cohort study, the study can be analyzed by comparing point prevalence rates of the disease outcome across exposed and unexposed individuals. Alternatively, we can adapt the case-control analysis and compare prevalent cases and noncases with respect to their odds of exposure.

Cross-sectional studies may be particularly useful when analyzing baseline data (at time = 0) of a cohort study. This is because subclinical outcomes are less likely to be subject to survival bias at this point. Establishing a baseline from a cohort study may also be useful as a way to check that the cross-sectional analysis results are consistent with the rest of the cohort study.

With cross-sectional studies, there are two measures of association. The point prevalence odds ratio (PPOR) can be used when the disease is rare:

\[PPOR = \frac{a/b}{c/d}\]

When the disease is not rare, use the point prevalence ratio (PPR):

\[PPR = \frac{a/(a+b)}{c/(c+d)}\]

The point prevalence ratio (PPR) will approximate the risk ratio when:

    1. The population is in a steady state (the case inflow is equal to the case outflow)
    2. The disease outcome is rare
    3. Survival is the same in the exposed and unexposed
    4. The exposure occurs before the disease

How to Minimize Selection Bias in Cross-Sectional Studies:

    1. Select a random sample of the population
    2. Use data on the disease to correct prevalence
    3. Check that the disease does not change with exposure
    4. Use both the PPR and PPOR

Example

Consider the following fictional study. You are interested in the association between running and arthritis in the knee. You perform a cross-sectional study of patients across 2 hospitals in Madison and identify all patients with arthritis in the knee and those who self-report as runners. Consider the table below:

           Arthritis+   Arthritis-
Runner+       202          523
Runner-       178          485

Since the disease (arthritis in the knee) is not rare in this sample, we calculate the point prevalence ratio as \(PPR= \frac{202/(202+523)}{178/(178+485)} = 1.04\), or the prevalence of knee arthritis among runners is about 4% higher than among non-runners. (The point prevalence odds ratio, \(\frac{202/523}{178/485} = 1.05\), is similar here.)
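Both measures in a quick script:

```python
# Checking the fictional cross-sectional example above

a, b = 202, 523  # runners: with / without knee arthritis
c, d = 178, 485  # non-runners: with / without knee arthritis

ppr = (a / (a + b)) / (c / (c + d))  # preferred here, since the outcome is not rare
ppor = (a / b) / (c / d)

print(f"PPR:  {ppr:.2f}")   # ~1.04
print(f"PPOR: {ppor:.2f}")  # ~1.05
```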

Case-Crossover Design

Case-crossover designs compare the exposure status of a case immediately before the disease event to the exposure status of that same case at some other prior time. Case-crossover designs are particularly useful for studying acute exposures that may vary over time and whose associated change in risk lasts only for a short time.

For example, Szklo and Nieto discuss how case-crossover designs have been used to study acute triggers of intracranial aneurysms, such as vigorous exercise, as well as the association between traffic-related air pollution and asthma exacerbations.

In case-crossover studies, individuals serve as their own controls, so all time-invariant individual characteristics that may confound the association between the exposure and disease outcome are controlled for. Case-crossover studies, however, assume that the disease does not have an undiagnosed stage that could inadvertently affect the exposure of interest. Additionally, the design assumes that the exposure does not have a spillover effect across time and does not have a cumulative effect.

Data on exposures are often obtained either from other studies (ex: environmental exposure studies of particulate matter) or by relying on participants’ recall. In the latter case, the information may be susceptible to recall bias.

Case-crossover studies can be analyzed by calculating the ratio of discordant pairs to estimate the odds ratio, or by using conditional logistic regression. An example of a case-crossover design can be found here.
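As a rough sketch of the matched-pair (discordant-pair) calculation, with entirely hypothetical counts (a conditional logistic regression would normally be fit with a statistics package and is not shown):

```python
# Minimal sketch: case-crossover odds ratio from discordant pairs (hypothetical counts)
# Each case contributes a pair of observations: exposure in the hazard period (just before
# the event) versus exposure in an earlier control period for the same person.

exposed_hazard_only = 30   # hypothetical: exposed just before the event, unexposed in the control period
exposed_control_only = 12  # hypothetical: unexposed just before the event, exposed in the control period

# Concordant pairs (exposed in both periods, or in neither) do not contribute to the estimate
odds_ratio = exposed_hazard_only / exposed_control_only
print(f"Case-crossover OR: {odds_ratio:.2f}")  # 2.50
```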

Randomized Clinical Trials

Randomized clinical trials (RCTs) are useful for studying drug treatments or new medical technologies, or even for assessing screening and early detection programs. In an RCT, you start with your study population or study base. Study participants are then randomized either to the new treatment arm or to the current treatment (or placebo) arm. In both arms, the outcome is assessed.

Before beginning an RCT, clear inclusion and exclusion criteria must be defined. Interventions, or the treatments being assessed, must be 1) specific and well-defined, and 2) consistent. There must always be a placebo group or “standard of care” group to evaluate the treatment beyond the placebo effect.

Randomization is the process of making something random; in the case of RCTs, it is the assignment of each individual to different treatment arms. The main purpose of randomization is to prevent any potential biases on the part of the investigator from influencing the assignment of participants to different treatment groups. It increases the likelihood that prognostic factors are distributed equally between treatment groups, and therefore increases exchangeability between treatment and control. The problem with simple randomization, such as using a random numbers table, is that it may lead to imbalances in prognostic factors, and large imbalances, particularly in smaller trials, may lessen credibility. It is important to note that with randomization, there are equal probabilities across individuals, not across treatments. Therefore the random assignment does not need to result in an equal number of individuals across the treatment arms.
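The sketch below illustrates simple randomization using Python's random module and shows why, by chance, the arms need not end up the same size; it is illustrative only and not a substitute for a formal randomization scheme (e.g., blocked or stratified randomization):

```python
# Minimal sketch: simple randomization of participants to two arms (illustrative only)
import random

random.seed(42)  # fixed seed so the example is reproducible

participants = [f"id_{i:03d}" for i in range(40)]  # hypothetical participant IDs
assignments = {pid: random.choice(["treatment", "control"]) for pid in participants}

n_treatment = sum(1 for arm in assignments.values() if arm == "treatment")
n_control = len(participants) - n_treatment

# Each person has probability 1/2 of each arm, but the resulting group sizes
# are usually not exactly equal.
print(f"Treatment arm: {n_treatment}, Control arm: {n_control}")
```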

The three main goals of randomization are:

    1. Produce exchangeable groups
    2. Remove investigator bias
    3. Provide the proper background for estimating statistical errors.

There are two types of analysis that can be performed with RCTs: 1) intention-to-treat and 2) treatment received.

Intention-to-treat (ITT) analyzes the study according to how participants were originally randomized, regardless of whether they switched treatments. Intention-to-treat analysis:

    1. preserves randomization
    2. provides a more conservative estimate of the treatment effect
    3. demonstrates the “effectiveness” of the treatment, or how it could be expected to work in the “real world”

Treatment received (TR) analyzes the study based on the treatment that participants ultimately received. Treatment received analysis:

    1. demonstrates the efficacy of the treatment or how it would be expected to work in an “ideal world”
    2. provides a more liberal estimate of the treatment effect, and
    3. undoes randomization.

Because treatment received analysis undoes the randomization, the results may be biased. Generally, intention-to-treat is preferred. See here for a more detailed discussion.
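A minimal sketch of the two analyses, with entirely hypothetical participant-level data, is below; it simply tabulates the same outcomes two different ways:

```python
# Minimal sketch: intention-to-treat vs. treatment-received analysis (hypothetical data)
# Each tuple: (randomized arm, treatment actually received, outcome where 1 = event occurred)
participants = [
    ("new", "new", 1), ("new", "new", 0), ("new", "new", 0), ("new", "standard", 1),
    ("standard", "standard", 1), ("standard", "standard", 1),
    ("standard", "standard", 1), ("standard", "new", 0),
]

def risk(records, group_index, group_label):
    """Proportion with the event among records whose grouping column matches group_label."""
    group = [r for r in records if r[group_index] == group_label]
    return sum(r[2] for r in group) / len(group)

# Intention-to-treat: group by randomized arm (index 0)
itt_rr = risk(participants, 0, "new") / risk(participants, 0, "standard")

# Treatment received: group by treatment actually received (index 1)
tr_rr = risk(participants, 1, "new") / risk(participants, 1, "standard")

print(f"ITT risk ratio:                {itt_rr:.2f}")  # closer to the null (more conservative)
print(f"Treatment-received risk ratio: {tr_rr:.2f}")   # more extreme; randomization is undone
```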

Example

Suppose you are interested in Treatment A and how it compares to a placebo treatment. In an RCT, the outcomes of the trial are in the table below. Calculate the relative risk of being alive under Treatment A compared to the placebo.

         Trt A   Placebo
Alive     900     2090
Dead       75      313

Solution: \(RR = \frac{900/(900+75)}{2090/(2090+313)} = \frac{0.923}{0.870}= 1.06\). Individuals receiving Treatment A were 6% more likely to be alive at the end of the trial compared to individuals who received the placebo.
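The same check in a short script:

```python
# Checking the fictional RCT example above

alive_a, dead_a = 900, 75    # Treatment A arm
alive_p, dead_p = 2090, 313  # placebo arm

rr = (alive_a / (alive_a + dead_a)) / (alive_p / (alive_p + dead_p))
print(f"Risk ratio (alive, Treatment A vs. placebo): {rr:.2f}")  # ~1.06
```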

References

Additional Resources

Copyright (c) 2021 Maria Kamenetsky