Example of Simulating Educational Inequalities in Depression Using the MicSim Package

Introduction

This vignette demonstrates a simulation study of educational inequalities in major depressive disorder (MDD) using the MicSim package. The simulation uses sample data derived from the study by Lepe et al. (2024), which used data from the Lifelines Cohort Study.

In this vignette we will take you through a standard MicSim workflow. First, we define the simulation framework and the initial population, then we specify transition rates, build the transition matrix and run the simulation. Lastly, we process and analyze the output.

# loading libraries
library(MicSim)
library(knitr)
library(dplyr)
library(tidyr)
library(ggplot2)

# Clearing the environment
rm(list = ls())
invisible(gc())

Definition of Basic Simulation Frame

We start with defining the simulation horizon (in a yyyymmdd format). Next, we define desired sample size and minimum and maximum ages. In this example, sample size is kept intentionally small to reduce necessary computation time. A larger population would lead to better, more reliable results, but cannot be implemented in this example.

startDate <- 20180101
endDate   <- 20651231
simHorizon <- c(startDate = startDate, endDate = endDate)

N <- 1000

minage <- 18
maxage <- 65

The state space tells us which attributes the simulated individuals can consist of. In this case these are sex, education and health status.

Usually, for a simulation of a life course, death is a risk an individual will always be exposed to. Thus, MicSim operates under the assumption that an absorbing state will be defined. If this is definitely not wanted by the user, such as in this case, best practice is to still provide the state, but set the mortality rate to zero (Zinn 2011, 9).

health    <- c("NoMDD", "MDD")
education <- c("Low", "High")
sex   <- c("Male", "Female")
stateSpace <- expand.grid(health = health, education = education, sex = sex)

absStates <- "dead"

Definition of initial population

For the user, the definition of the initial population is largely left up to needs and preference. Here we create birthdates for a cohort born in the year 2000 and then assign a baseline prevalence of MDD as estimated from empirical data (see Lepe et al. 2024).

Next, we split the population equally among sex and education and assign the initial states.

set.seed(1234)

birthDates <- runif(N, min = getInDays(20000101), max = getInDays(20001231))

prev_lo_m <- 0.0122
prev_lo_f <- 0.0191
prev_hi_m <- 0.0014
prev_hi_f <- 0.0022

m_lo_mdd <- sample(
  x = c("MDD/Low/Male", "NoMDD/Low/Male"),
  prob = c(prev_lo_m, 1 - prev_lo_m),
  size = N / 4, replace = TRUE
)

f_lo_mdd <- sample(
  x = c("MDD/Low/Female", "NoMDD/Low/Female"),
  prob = c(prev_lo_f, 1 - prev_lo_f),
  size = N / 4, replace = TRUE
)

m_hi_mdd <- sample(
  x = c("MDD/High/Male", "NoMDD/High/Male"),
  prob = c(prev_hi_m, 1 - prev_hi_m),
  size = N / 4, replace = TRUE
)

f_hi_mdd <- sample(
  x = c("MDD/High/Female", "NoMDD/High/Female"),
  prob = c(prev_hi_f, 1 - prev_hi_f),
  size = N / 4, replace = TRUE
)

initStates <- c(m_lo_mdd, f_lo_mdd, m_hi_mdd, f_hi_mdd)

Based on these initial states, we can then create an initial population with birthDates reformatted to yyyymmdd format. The initial population should then look like the one in table below.

initPop <- data.frame(ID = 1:N, birthDate = birthDates, initState = initStates)

initPop$birthDate <- getInDateFormat(initPop$birthDate)

initPop %>% 
  head() %>% 
  kable("pipe", caption = "**Example of what initPop should look like**")
Example of what initPop should look like
ID birthDate initState
1 20000212 NoMDD/Low/Male
2 20000814 NoMDD/Low/Male
3 20000810 NoMDD/Low/Male
4 20000815 NoMDD/Low/Male
5 20001110 NoMDD/Low/Male
6 20000821 NoMDD/Low/Male

Definition of Transition Rates

We only define two sets of rates: 1. Incidence rate ??? how often people move from no depression to depression. 2. Remission rate ??? how often people move from depression to no depression.

We begin by estimating the transition rates using the observed data. We then generate counterfactual rates under the assumption that individuals with low education have the same quality of social contacts as individuals with high education. This allows us to assess the potential impact of addressing modifiable factors that contribute to educational inequalities in MDD.

In the original study we estimated these rates separately by sex, education level, and age group, then fitted smooth curves so a rate could be retrieved for any exact age (see Lepe et al. 2024). In this example we load a simplified dataset: the rates are constant within each single-year age band, and we examine only one of the counterfactual scenarios from the paper.

Here we set the mortality rate to 0 as mentioned above and load the rates data in.

mortRates <- function(age, calTime) {
  return(0)
}

rates <- rates_mdd

rates %>%
  head(n = c(6,8)) %>%
  kable("html", caption = "**Subset of transition rates per age**")
Subset of transition rates per age
age incidence_lo_m incidence_lo_f incidence_hi_m incidence_hi_f remission_lo_m remission_lo_f remission_hi_m
17 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
18 0.0108544 0.0138612 0.0040147 0.0051590 0.1763242 0.1942780 0.1949342
19 0.0106856 0.0136467 0.0039518 0.0050782 0.1755351 0.1936338 0.1942947
20 0.0105168 0.0134322 0.0038889 0.0049974 0.1747961 0.1930314 0.1936966
21 0.0103481 0.0132176 0.0038259 0.0049166 0.1741070 0.1924708 0.1931400
22 0.0101793 0.0130031 0.0037630 0.0048358 0.1734680 0.1919519 0.1926249

The definition of transition rates has to follow the specific format required by MicSim, as age and calTime have to always be supplied as arguments. Duration is variable and only needs to be specified as a function argument if used in the transition rate function.

In this example the rate function simply looks up the correct value in the rate table we loaded earlier. We subtract 16 from each person???s age to select the right row because the first data row in the table represents age 17. All rates at age 17 are set to zero because individuals are not allowed to transition until they turn 18.

For a more detailed explanation and examples of rate definition, refer to the documentation of MicSim() and the vignette on Migration Flows in this package.

create_rate_function <- function(vec) {
  # Ensures the correct column of transition rates called on by the function
  force(vec)
  
  function(age, calTime) {
    rate <- vec[as.integer(age) - 16]
    return(rate)
  }
}

# Loop over relevant columns to create rate functions.
for (col in names(rates)[-1]) {
  # Extracting relevant transition rates
  vec <- as.numeric(rates[[col]])

  # Creating a function for each transition rate
  f <- create_rate_function(vec)

  # Assign the function to the global environment
  assign(col, f, envir = .GlobalEnv)
}

Definition of transition matrix

Lastly, we need to assign these defined functions to the corresponding state transitions. This means constructing a transition matrix. We do this also for the absorbing state, “dead”, which no individual can transition into as the mortality rate is 0.

Then all other possible transitions between states are defined. Note that we define transitions to states not only one one attribute, such as NoMDD -> MDD, but supply transitions that only trigger for a specific combination of attributes. As long as transition rate functions are assigned to these transitions, it is possible to supply highly specific transitions that depend on more than one attribute from the state space to MicSim.

Any transition not defined here is impossible and will not occur.

To construct the transition matrix, we combine all the information into one transition matrix.

absTransitions <- c("dead", "mortRates")

allTransitions <- cbind(
  c("NoMDD/Low/Male->MDD/Low/Male", "NoMDD/Low/Female->MDD/Low/Female",
    "NoMDD/High/Male->MDD/High/Male", "NoMDD/High/Female->MDD/High/Female",
    "MDD/Low/Male->NoMDD/Low/Male", "MDD/Low/Female->NoMDD/Low/Female",
    "MDD/High/Male->NoMDD/High/Male", "MDD/High/Female->NoMDD/High/Female"),
  c("incidence_lo_m", "incidence_lo_f",
    "incidence_hi_m", "incidence_hi_f",
    "remission_lo_m", "remission_lo_f",
    "remission_hi_m", "remission_hi_f")
)

transitionMatrix <- buildTransitionMatrix(
  allTransitions = allTransitions,
  stateSpace = stateSpace,
  absTransitions = absTransitions
)

In this example, we consider a counterfactual case, for which we also define transitions. These are the same, the only difference is between the transition rates. For this reason, we just adjust the column that assigns the rate functions.

Same as above, these transitions are then combined into a transition matrix.

cf_allTransitions <- allTransitions
cf_allTransitions[, 2] <- paste0("cf_", cf_allTransitions[, 2])

cf_transitionMatrix <- buildTransitionMatrix(
  allTransitions = cf_allTransitions,
  stateSpace = stateSpace,
  absTransitions = absTransitions
)

Running the simulations

We set a seed here to ensure reproducibility and then simulate educational inequalities in depression under observed conditions.

We first run the main simulation and then the counterfactual (cf) simulation.

set.seed(1234)

simpop <- micSim(initPop=initPop, transitionMatrix=transitionMatrix,
                 absStates=absStates, maxAge=maxage, simHorizon=simHorizon)
## Initialization ... 
## [1] "Starting at:  2025-08-29 10:08:48.8835"
## [1] "Ending at:  2025-08-29 10:08:53.742976"
## Simulation is running ... 
## Year:  2018 
## Year:  2019 
## Year:  2020 
## Year:  2021 
## Year:  2022 
## Year:  2023 
## Year:  2024 
## Year:  2025 
## Year:  2026 
## Year:  2027 
## Year:  2028 
## Year:  2029 
## Year:  2030 
## Year:  2031 
## Year:  2032 
## Year:  2033 
## Year:  2034 
## Year:  2035 
## Year:  2036 
## Year:  2037 
## Year:  2038 
## Year:  2039 
## Year:  2040 
## Year:  2041 
## Year:  2042 
## Year:  2043 
## Year:  2044 
## Year:  2045 
## Year:  2046 
## Year:  2047 
## Year:  2048 
## Year:  2049 
## Year:  2050 
## Year:  2051 
## Year:  2052 
## Year:  2053 
## Year:  2054 
## Year:  2055 
## Year:  2056 
## Year:  2057 
## Year:  2058 
## Year:  2059 
## Year:  2060 
## Year:  2061 
## Year:  2062 
## Year:  2063 
## Year:  2064 
## Year:  2065 
## Simulation has finished.
## ------------------
cf_simpop <- micSim(initPop=initPop, transitionMatrix=cf_transitionMatrix,
                    absStates=absStates, maxAge=maxage, simHorizon=simHorizon)
## Initialization ... 
## [1] "Starting at:  2025-08-29 10:08:56.204804"
## [1] "Ending at:  2025-08-29 10:09:00.299825"
## Simulation is running ... 
## Year:  2018 
## Year:  2019 
## Year:  2020 
## Year:  2021 
## Year:  2022 
## Year:  2023 
## Year:  2024 
## Year:  2025 
## Year:  2026 
## Year:  2027 
## Year:  2028 
## Year:  2029 
## Year:  2030 
## Year:  2031 
## Year:  2032 
## Year:  2033 
## Year:  2034 
## Year:  2035 
## Year:  2036 
## Year:  2037 
## Year:  2038 
## Year:  2039 
## Year:  2040 
## Year:  2041 
## Year:  2042 
## Year:  2043 
## Year:  2044 
## Year:  2045 
## Year:  2046 
## Year:  2047 
## Year:  2048 
## Year:  2049 
## Year:  2050 
## Year:  2051 
## Year:  2052 
## Year:  2053 
## Year:  2054 
## Year:  2055 
## Year:  2056 
## Year:  2057 
## Year:  2058 
## Year:  2059 
## Year:  2060 
## Year:  2061 
## Year:  2062 
## Year:  2063 
## Year:  2064 
## Year:  2065 
## Simulation has finished.
## ------------------

Processing the output

After the standard workflow required to run MicSim, the data can be processed in different ways depending on what one is interested in.

In this example, we first convert the output to long format, where each row represents a specific episode that an individual experienced. We additionally create variables that indicate the age at the start and end of each episode, as well as a variable indicating whether the data is from the observed or counterfactual simulation. For convenience, the output from both simulations is merged into one data frame, which is then used for further analysis.

simpop_long <- simpop %>%
  convertToLongFormat() %>%
  mutate(
    agestart = getAgeInDays(Tstart, birthDate) / 365.25,
    agestart = case_when(
      agestart < minage ~ minage,
      agestart > maxage ~ maxage,
      TRUE ~ agestart
    ),
    agestop = getAgeInDays(Tstop, birthDate) / 365.25,
    agestop = case_when(
      agestop < minage ~ minage,
      agestop > maxage ~ maxage,
      TRUE ~ agestop
    ),
    sim = "Observed"
  )

cf_simpop_long <- cf_simpop %>%
  convertToLongFormat() %>%
  mutate(
    agestart = getAgeInDays(Tstart, birthDate) / 365.25,
    agestart = case_when(
      agestart < minage ~ minage,
      agestart > maxage ~ maxage,
      TRUE ~ agestart
    ),
    agestop = getAgeInDays(Tstop, birthDate) / 365.25,
    agestop = case_when(
      agestop < minage ~ minage,
      agestop > maxage ~ maxage,
      TRUE ~ agestop
    ),
    sim = "Counterfactual"
  )

mydata <- simpop_long %>%
  bind_rows(cf_simpop_long) %>%
  mutate(
    sim = factor(sim, levels = c("Observed", "Counterfactual")),
    table_order = case_when(
      sim == "Observed" & education == "High" ~ 0,
      sim == "Observed" & education == "Low" ~ 1,
      sim == "Counterfactual" & education == "Low" ~ 2
    )
  )

Analyzing output

In this study, we aimed to simulate the development of educational inequalities in MDD across the life course, and to estimate the potential impact of intervening on modifiable factors. In this example, we focus on the educational inequalities in both the age-specific prevalence of MDD and the life course prevalence of MDD. It should be noted that we are not limited to only assessing the prevalence of MDD. For example, in the publication by Lepe et al. (2024) we also look at the age of onset of MDD and the duration of MDD episodes.

Age-specific prevalence

We calculate the age-specific prevalence by dividing the number of people with MDD in each one-year age band by the total size of that subgroup (N = 250). These values are then plotted for easier interpretation.

We clearly show that individuals with low education are more likely to experience MDD than individuals with high education, which agrees with the literature. We would also expect the female curves to lie above the male curves throughout, but here the lines cross within each education level. This crossover is simply Monte-Carlo noise due to the number of synthethic individuals in the simulation (N = 1000). In the published study we avoided this problem by simulating 500000 individuals, but that size is too computationally heavy for this example.

prev <- list()
for (i in minage:maxage){
  prev[[i]] <- simpop_long %>%
    filter((health == "MDD" & agestart <= i & agestop >= i)) %>%
    group_by(sex, education) %>%
    summarise(prev = n()/(N*.25) * 100, .groups = "keep") %>%
    mutate(age = i) %>%
    ungroup()
}

prev <- bind_rows(prev)

prev %>%
  ggplot() +
  geom_line(
    aes(age, prev, color = sex, linetype = education), 
    linewidth = 1
  ) +
  scale_y_continuous(breaks = c(2, 4, 6)) +
  labs(
    y = "Prevalence of MDD %",
    x = "Age (years)",
    colour = "Sex",
    linetype = "Education"
  ) +
  theme_bw() +
  theme(
    legend.position = "right",
    legend.box.margin = margin(0, 0, 0, 0),
    plot.title.position = "plot",
    plot.caption.position = "plot",
    plot.margin = margin(0, 0, 0, 0)
  )

Life course prevalence

Lastly, we calculate life-course prevalence, which is the proportion of people who experience MDD at any point in their life, using both the observed rates and the counterfactual rates. The code here filters by Episode <= 2 to avoid duplicate individuals. This works as people can only transition between the two states, so anyone with depression will have had it in either their first or second episode.

Consistent with earlier findings, those with low education are more likely than those with high education to experience MDD at least once. In the counterfactual scenario, life-course prevalence falls for the low-education group, but it still remains well above the level observed for the high-education group.

lc_prev <- mydata %>%
  filter(Episode <= 2, health == "MDD", !is.na(table_order)) %>%
  group_by(sim, education, sex) %>%
  summarise(proportion_MDD = round(n() / (N * .25) * 100, 1), .groups = "drop") %>%
  pivot_wider(
    names_from = c(sex), names_glue = "{sex} (n%)",
    values_from = proportion_MDD
  ) %>%
  rename(Scenario = sim, Education = education)


kable(lc_prev)
Scenario Education Female (n%) Male (n%)
Observed High 14.0 9.6
Observed Low 38.0 28.0
Counterfactual Low 26.8 28.0
Lepe, Alexander, Liza A Hoveling, Michaël Boissonneault, Joop A A de Beer, Sijmen A Reijneveld, Marlou L A de Kroon, and Aart C Liefbroer. 2024. “Educational Inequalities in Major Depressive Disorder Prevalence, Timing and Duration Among Adults over the Life Course: A Microsimulation Analysis Based on the Lifelines Cohort Study.” European Journal of Public Health 34 (4): 723–29. https://doi.org/10.1093/eurpub/ckae066.
Zinn, Sabine. 2011. “A Continuous-Time Microsimulation and First Steps Towards a Multi-Level Approach in Demography.” PhD thesis, Universität Rostock. https://doi.org/10.18453/rosdok_id00000951.