---
title: "Introduction to misl"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to misl}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse  = TRUE,
  comment   = "#>",
  eval      = TRUE,
  warning   = FALSE,
  message   = FALSE
)
```

## Overview

`misl` implements **Multiple Imputation by Super Learning**, a flexible approach
to handling missing data described in:

> Carpenito T, Manjourides J. (2022) MISL: Multiple imputation by super
> learning. *Statistical Methods in Medical Research*. 31(10):1904--1915.
> doi: [10.1177/09622802221104238](https://doi.org/10.1177/09622802221104238)

Rather than relying on a single parametric imputation model, `misl` builds a
stacked ensemble of machine learning algorithms for each incomplete column,
producing well-calibrated imputations for continuous, binary, and categorical
variables.

## Installation

```{r install, eval = FALSE}
# Install from GitHub
remotes::install_github("JustinManjourides/misl")

# Optional backend packages for additional learners
install.packages(c("ranger", "xgboost", "earth"))
```

## Simulated data

We simulate a small dataset with three types of incomplete variables to
demonstrate `misl` across all supported outcome types.

```{r simulate}
library(misl)

set.seed(42)
n <- 300

sim_data <- data.frame(
  # Continuous predictors (always observed)
  age    = round(rnorm(n, mean = 50, sd = 12)),
  bmi    = round(rnorm(n, mean = 26, sd = 4), 1),
  # Continuous outcome with missingness
  sbp    = round(120 + 0.4 * rnorm(n, mean = 50, sd = 12) +
                       0.3 * rnorm(n, mean = 26, sd = 4) + rnorm(n, sd = 10)),
  # Binary outcome with missingness (0 = no, 1 = yes)
  smoker = rbinom(n, 1, prob = 0.3),
  # Categorical outcome with missingness
  group  = factor(sample(c("A", "B", "C"), n, replace = TRUE,
                         prob = c(0.4, 0.35, 0.25)))
)

# Introduce missing values
sim_data[sample(n, 40), "sbp"]    <- NA
sim_data[sample(n, 30), "smoker"] <- NA
sim_data[sample(n, 30), "group"]  <- NA

# Summarise missingness
sapply(sim_data, function(x) sum(is.na(x)))
```

## Basic usage

The primary function is `misl()`. Supply a dataset and specify:

- `m` -- the number of multiply imputed datasets to create
- `maxit` -- the number of Gibbs sampling iterations per dataset
- `con_method`, `bin_method`, `cat_method` -- learners for each outcome type

Use `list_learners()` to see available options:

```{r list-learners}
knitr::kable(list_learners())
```

Running `misl()` with a single learner per outcome type is the fastest option
and is well suited for exploratory work:

```{r basic-usage}
misl_imp <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = "glm",
  bin_method = "glm",
  cat_method = "multinom_reg",
  seed       = 42,
  quiet      = TRUE
)
```

`misl()` returns a list of length `m`. Each element contains:

- `$datasets` -- the fully imputed tibble
- `$trace`    -- a long-format tibble of convergence statistics

```{r inspect-output}
# Number of imputed datasets
length(misl_imp)

# Confirm no missing values remain
anyNA(misl_imp[[1]]$datasets)

# Compare observed vs imputed distribution for sbp
obs_mean <- mean(sim_data$sbp, na.rm = TRUE)
imp_mean <- mean(misl_imp[[1]]$datasets$sbp)

cat("Observed mean sbp (non-missing):", round(obs_mean, 2), "\n")
cat("Imputed mean sbp  (all values): ", round(imp_mean, 2), "\n")

# Check binary imputations are valid
table(misl_imp[[1]]$datasets$smoker)

# Check categorical imputations stay within observed levels
levels(misl_imp[[1]]$datasets$group)
```

### Multiple learners and stacking

When multiple learners are supplied, `misl` uses cross-validation to learn
optimal ensemble weights via the `stacks` package. Use `cv_folds` to reduce
the number of folds and speed up computation:

```{r multi-learner, eval = FALSE}
misl_stack <- misl(
  sim_data,
  m          = 5,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  cv_folds   = 3,
  seed       = 42
)
```

### Analysing the imputed datasets

After imputation, fit your analysis model to each of the `m` datasets and pool
the results using Rubin's rules. Here we implement pooling manually using base R:

```{r pool-results}
# Fit a linear model to each imputed dataset
models <- lapply(misl_imp, function(imp) {
  lm(sbp ~ age + bmi + smoker + group, data = imp$datasets)
})

# Pool point estimates and standard errors using Rubin's rules
m       <- length(models)
ests    <- sapply(models, function(fit) coef(fit))
vars    <- sapply(models, function(fit) diag(vcov(fit)))

Q_bar   <- rowMeans(ests)                          # pooled estimate
U_bar   <- rowMeans(vars)                          # within-imputation variance
B       <- apply(ests, 1, var)                     # between-imputation variance
T_total <- U_bar + (1 + 1 / m) * B                # total variance

pooled <- data.frame(
  term     = names(Q_bar),
  estimate = round(Q_bar, 4),
  std.error = round(sqrt(T_total), 4),
  conf.low  = round(Q_bar - 1.96 * sqrt(T_total), 4),
  conf.high = round(Q_bar + 1.96 * sqrt(T_total), 4)
)
print(pooled)
```

For a full implementation of Rubin's rules including degrees of freedom and
p-values, the `mice` package provides `pool()` and can be used directly with
`misl` output:

```{r pool-mice, eval = FALSE}
library(mice)
pooled_mice <- summary(pool(models), conf.int = TRUE)
```

### Convergence: trace plots

The `$trace` element records the mean and standard deviation of imputed values
at each iteration, which can be used to assess convergence:

```{r trace-plot}
# Extract trace data across all m datasets
trace <- do.call(rbind, lapply(misl_imp, function(r) r$trace))

# Plot mean of imputed sbp values across iterations for each dataset
sbp_trace <- subset(trace, variable == "sbp" & statistic == "mean")

plot(
  sbp_trace$iteration,
  sbp_trace$value,
  col  = sbp_trace$m,
  pch  = 16,
  xlab = "Iteration",
  ylab = "Mean imputed sbp",
  main = "Trace plot: mean of imputed sbp values",
  xaxt = "n"
)
axis(1, at = 1:5)
legend("topright", legend = paste("m =", 1:5),
       col = 1:5, pch = 16, cex = 0.8)
```

Stable traces that mix well across datasets indicate the algorithm has
converged.

## Parallelism

The `m` imputed datasets are generated independently and can be run in parallel
using the `future` framework. Set a parallel plan before calling `misl()`:

```{r parallel, eval = FALSE}
library(future)

# Use all available cores
plan(multisession)

misl_parallel <- misl(
  sim_data,
  m          = 10,
  maxit      = 5,
  con_method = c("glm", "rand_forest"),
  bin_method = c("glm", "rand_forest"),
  cat_method = c("multinom_reg", "rand_forest"),
  seed       = 42
)

# Always reset the plan when done
plan(sequential)
```

To limit the number of cores:

```{r parallel-limited, eval = FALSE}
plan(multisession, workers = 4)
```

The largest speedup comes from running the `m` datasets simultaneously. On a
machine with 10 cores running `m = 10`, all 10 datasets can be imputed in
parallel. Check how many cores are available with:

```{r detect-cores, eval = FALSE}
parallel::detectCores()
```

## Session info

```{r session-info}
sessionInfo()
```