--- title: "Fitting Hazard Models" subtitle: "From intercept-only to multiphase additive hazard" format: html: code-fold: false toc: true toc-depth: 3 number-sections: true vignette: > %\VignetteIndexEntry{Fitting Hazard Models} %\VignetteEngine{quarto::html} %\VignetteEncoding{UTF-8} --- ```{r} #| include: false has_pkg <- requireNamespace("TemporalHazard", quietly = TRUE) && requireNamespace("ggplot2", quietly = TRUE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = has_pkg ) ``` ```{r} #| label: setup library(TemporalHazard) library(survival) library(ggplot2) ``` This vignette walks the core model-fitting workflow from the inside out: intercept-only fits to establish the baseline hazard shape, then covariates on top, then the multiphase decomposition, then multi-endpoint analyses on the same cohort. Every example uses a clinical dataset shipped with the package. If you haven't seen the basics — what a parametric hazard model is, why we use the `Surv(time, status)` formula — start with `vignette("getting-started")` first; this vignette assumes that context. The progression matters. Single-distribution intercept-only fits tell you whether the baseline hazard shape is monotone (Weibull territory) or has structure that demands a multiphase decomposition. Multivariable fits add covariate effects on top of a shape you already trust. Multiphase fits split the baseline shape into clinically interpretable phases. Multi-endpoint analyses reuse all the above for separate clinical outcomes — death, reoperation, infection — on the same patient cohort. ## Intercept-only model: CABG survival (KU Leuven) The `cabgkul` dataset contains 5,880 patients who underwent primary isolated coronary artery bypass grafting at KU Leuven between 1971 and 1987. With only two columns --- follow-up time and death indicator --- it is the simplest starting point. ```{r} #| label: kul-data data(cabgkul) str(cabgkul) ``` Fit an intercept-only Weibull. With no covariates on the right-hand side of the formula the model estimates only the baseline hazard shape — the scale `mu` and exponent `nu` of a Weibull curve fit to all 5,880 patients pooled. This is the right starting point for any new dataset: before asking which covariates matter, ask whether a single monotone hazard even fits the population-level pattern. ```{r} #| label: kul-weibull fit_kul <- hazard( Surv(int_dead, dead) ~ 1, data = cabgkul, dist = "weibull", theta = c(mu = 0.10, nu = 1.0), fit = TRUE ) fit_kul ``` The summary tells us where the optimizer landed; the picture tells us whether that landing point matches the data. Plot the fitted survival curve on a fine time grid and overlay the Kaplan-Meier step function from the raw cohort. ```{r} #| label: fig-kul-km #| fig-cap: "Weibull parametric survival vs. Kaplan-Meier (CABG, KU Leuven)" #| fig-width: 7 #| fig-height: 4.5 t_grid <- seq(0.01, max(cabgkul$int_dead) * 0.9, length.out = 200) nd <- data.frame(time = t_grid) surv <- predict(fit_kul, newdata = nd, type = "survival") * 100 km <- survfit(Surv(int_dead, dead) ~ 1, data = cabgkul) km_df <- data.frame(time = km$time, survival = km$surv * 100) ggplot() + geom_step(data = km_df, aes(time, survival, colour = "Kaplan-Meier"), linewidth = 0.5) + geom_line(data = data.frame(time = t_grid, survival = surv), aes(time, survival, colour = "Weibull"), linewidth = 1) + scale_colour_manual(values = c("Weibull" = "#0072B2", "Kaplan-Meier" = "#D55E00")) + scale_y_continuous(limits = c(0, 100)) + labs(x = "Months after CABG", y = "Freedom from death (%)", colour = NULL) + theme_minimal() + theme(legend.position = "bottom") ``` A single Weibull captures the broad trend but misses the distinct early operative risk and late attrition that the KM curve reveals. This motivates the multiphase approach below. ## Multivariable model: AVC repair The `avc` dataset has 310 patients who underwent atrioventricular canal repair, with 9 candidate covariates spanning patient demographics (age, NYHA status), anatomical features (malalignment, orifice morphology), intra-operative grading (`inc_surg` = surgical grade of AV valve incompetence), and post-operative complications (`com_iv` = grade IV complications). We drop incomplete rows so the design matrix is rectangular, then look at the column types and ranges. ```{r} #| label: avc-data data(avc) avc <- na.omit(avc) str(avc) ``` Now we put covariates on the right-hand side of the formula and refit. The `theta` vector grows: two Weibull shape parameters (`mu`, `nu`) plus six covariate coefficients (`beta1`..`beta6`), each starting at zero. The optimizer estimates a log-hazard-ratio for every covariate jointly with the Weibull shape — so the shape and the covariate effects are identified from the same likelihood, not sequentially. ```{r} #| label: avc-mv fit_avc <- hazard( Surv(int_dead, dead) ~ age + status + mal + com_iv + inc_surg + orifice, data = avc, dist = "weibull", theta = c(mu = 0.20, nu = 1.0, rep(0, 6)), fit = TRUE, control = list(maxit = 500) ) fit_avc ``` Each coefficient is a log-hazard-ratio: positive means higher risk, negative means lower, zero means no effect. The large positive coefficients on `mal` (anatomical malalignment) and `com_iv` (grade IV post-operative complications) flag these as the dominant risk markers in this cohort. The standard errors and Wald z-statistics in the summary tell you which effects are well identified and which are noise — a coefficient with a z-statistic near zero contributes essentially nothing the data can defend. ## Multiphase model: additive hazard decomposition The single-Weibull fits above gave us a curve that's mediocre everywhere instead of right anywhere. That's a structural limitation of a monotone parametric shape, not something more iterations will fix. The Blackstone–Naftel–Turner framework's key idea is to split the hazard into a *sum* of phase-specific contributions, each with its own temporal shape and its own scale: $$H(t \mid x) = \sum_{j=1}^{J} \mu_j(x) \cdot \Phi_j(t)$$ Each $\Phi_j(t)$ is a phase-specific unit-scaled curve (early-peaking saturating, flat constant, late-rising polynomial) and each $\mu_j(x)$ is the phase-specific scale, possibly modulated by covariates. The phases overlap and add — no switching, no thresholds — so the total instantaneous hazard at any $t$ is the sum of the per-phase rates. See `vignette("getting-started")` for the longer-form motivation; what follows here is the practical workflow for *fitting* one. For AVC we'll use two phases — an early phase to absorb the operative-window mortality, and a constant phase for the background rate. AVC patients don't have a clear late-deterioration regime over this follow-up window, so a third (g3) phase would be unidentified. We fix the shape parameters and estimate only the scales, matching the workflow you'd run against a SAS HAZARD reference fit. ```{r} #| label: avc-mp fit_mp <- hazard( Surv(int_dead, dead) ~ 1, data = avc, dist = "multiphase", phases = list( early = hzr_phase("cdf", t_half = 0.5, nu = 1, m = 1, fixed = "shapes"), constant = hzr_phase("constant") ), fit = TRUE, control = list(n_starts = 5, maxit = 1000) ) summary(fit_mp) ``` The diagnostic that matters is whether the multiphase fit actually out-performs the single Weibull against the data. Plot both parametric curves against the same Kaplan-Meier reference so we can see, by eye, where each model is honest and where each is reaching. ```{r} #| label: fig-avc-compare #| fig-cap: "Single-phase Weibull vs. multiphase model against Kaplan-Meier (AVC)" #| fig-width: 7 #| fig-height: 4.5 t_grid <- seq(0.01, max(avc$int_dead) * 0.95, length.out = 200) nd <- data.frame(time = t_grid) km_avc <- survfit(Surv(int_dead, dead) ~ 1, data = avc) km_df <- data.frame(time = km_avc$time, survival = km_avc$surv * 100) fit_wb <- hazard( Surv(int_dead, dead) ~ 1, data = avc, dist = "weibull", theta = c(mu = 0.20, nu = 1.0), fit = TRUE ) surv_wb <- predict(fit_wb, newdata = nd, type = "survival") * 100 surv_mp <- predict(fit_mp, newdata = nd, type = "survival") * 100 plot_df <- rbind( data.frame(time = t_grid, survival = surv_wb, Model = "Single Weibull"), data.frame(time = t_grid, survival = surv_mp, Model = "Multiphase (2-phase)") ) ggplot() + geom_step(data = km_df, aes(time, survival), colour = "grey50", linewidth = 0.5) + geom_line(data = plot_df, aes(time, survival, colour = Model), linewidth = 1) + scale_colour_manual(values = c("Single Weibull" = "#E69F00", "Multiphase (2-phase)" = "#0072B2")) + scale_y_continuous(limits = c(0, 100)) + annotate("text", x = max(t_grid) * 0.6, y = 95, label = "KM (grey)", size = 3, colour = "grey50") + labs(x = "Months after AVC repair", y = "Freedom from death (%)", colour = NULL) + theme_minimal() + theme(legend.position = "bottom") ``` The multiphase model tracks the KM curve much more closely than the single Weibull, especially across the steep early-mortality window — which is exactly where the single Weibull was forced to compromise. The constant phase then carries the slow post-recovery attrition. The point isn't that multiphase always wins; it's that *when the data has phase structure*, fitting that structure explicitly is strictly more honest than averaging it away into one monotone curve. ## Multi-endpoint models: heart valve replacement The `valves` dataset (1,533 patients) has multiple time-to-event endpoints --- death, prosthetic valve endocarditis (PVE), and reoperation --- each with its own follow-up time and event indicator. The same `hazard()` call fits each endpoint independently: Start with the death endpoint. We use age at operation, NYHA class, and mechanical-valve indicator as covariates — the clinically canonical set for survival after valve replacement. ```{r} #| label: valves-death data(valves) valves <- na.omit(valves) fit_death <- hazard( Surv(int_dead, dead) ~ age_cop + nyha + mechvalv, data = valves, dist = "weibull", theta = c(mu = 0.10, nu = 1.0, rep(0, 3)), fit = TRUE, control = list(maxit = 500) ) fit_death ``` Switch endpoints. Same data, same package, but now we model time to prosthetic valve endocarditis instead of death. The covariate list shifts to match the clinical question: `nve` (native-valve endocarditis history) replaces `nyha` because functional class is less informative for infection risk than prior endocarditis exposure. The fit returns its own MLE, coefficients, and standard errors completely independent of the death model. ```{r} #| label: valves-pve fit_pve <- hazard( Surv(int_pve, pve) ~ age_cop + nve + mechvalv, data = valves, dist = "weibull", theta = c(mu = 0.02, nu = 1.0, rep(0, 3)), fit = TRUE, control = list(maxit = 500) ) fit_pve ``` Each endpoint gets its own model with its own covariates, but the hazard model structure — temporal shape plus covariate effects — stays the same whatever the clinical endpoint. Repeating the workflow for a third endpoint (reoperation, for example) is mechanical: swap the `Surv(...)` columns, swap the covariates, refit. The advantage over running three separate analyses in different tools is that the predictions, diagnostics, and uncertainty quantification all come from the same package — there's no risk of subtle differences in censoring handling or estimator choice between endpoints. ## Interval and left censoring Right censoring is by far the most common censoring type in clinical survival data — a patient is still alive at last follow-up, so all we know is that their event time exceeds the observed window. Two other types arise frequently enough to warrant explicit handling. **Interval censoring** occurs when the event is known to have happened *between* two observation times, but not at a specific time. The canonical example is clinic-visit data: a patient is confirmed alive at month 12 and found dead at month 24, so the event occurred somewhere in $(12, 24)$. A naive analysis treats the event as right-censored at 12 or exactly observed at 24; both introduce bias. **Left censoring** is the mirror: the event occurred *before* the first observation time (time_upper), so it was already established at the moment of first contact — $T \leq t_{\text{upper}}$. ### Status codes The package encodes censoring type in the `status` vector: | Status | Type | Interpretation | |-------:|------|----------------| | `1` | Exact event | Observed at `time` | | `0` | Right-censored | Event did not occur by `time` | | `2` | Interval-censored | Event occurred in `(time_lower, time_upper)` | | `-1` | Left-censored | Event occurred before `time` (upper bound) | For interval-censored rows, supply both `time_lower` and `time_upper` directly to `hazard()`. The formula interface handles `Surv(time, status)` for right-censored data and `Surv(start, stop, event)` for counting-process (repeating-event) data; interval censoring is passed via the direct arguments. ### Mixed censoring example We simulate 400 cardiac surgery patients observed at semi-annual clinic visits (every 6 months, 4--8 scheduled visits per patient). If the event falls between two consecutive visits the observation is interval-censored; if it falls after the last visit the patient is right-censored. ```{r} #| label: interval-simulate set.seed(101) n <- 400 visit_gap <- 6 # months between visits # True event times: Weibull mu = 0.025, nu = 1.0 (median ~28 months) true_time <- rexp(n, rate = 0.025) # nu=1 → exponential n_visits <- sample(4:8, n, replace = TRUE) status <- integer(n) time <- numeric(n) time_lower <- numeric(n) time_upper <- numeric(n) for (i in seq_len(n)) { visits <- seq(visit_gap, n_visits[i] * visit_gap, by = visit_gap) t_ev <- true_time[i] brk <- findInterval(t_ev, c(0, visits)) # which interval contains t_ev if (brk > length(visits)) { # Event after last visit: right-censored. # time_lower = 0: with H(time_lower) = H(0) = 0, the right-censored # contribution reduces to -(H(time) - H(0)) = -H(time), which is correct. status[i] <- 0L time[i] <- max(visits) time_lower[i] <- 0 time_upper[i] <- max(visits) } else { # Event within the visit window: interval-censored in (lower, upper) lower_bound <- if (brk == 1L) 0 else visits[brk - 1L] upper_bound <- visits[brk] status[i] <- 2L time[i] <- lower_bound time_lower[i] <- lower_bound time_upper[i] <- upper_bound } } table(status) ``` ::: {.callout-note} **`time_lower` as a counting-process entry time.** In the Weibull and multiphase likelihoods, `time_lower` doubles as the *entry* (left-truncation) time for right-censored and exact-event rows (`status %in% c(0, 1)`): when `0 < time_lower < time`, the row contributes H(stop) − H(start), the counting-process form used for epoch-decomposed repeated events. For an ordinary right-censored or event row with no entry time, leave `time_lower` at `0` or omit it. Setting `time_lower = time` is treated as *no entry* — `time_lower` acts as an entry time only when strictly less than `time`, so it no longer zeroes the row's contribution. The exponential, log-logistic, and log-normal likelihoods do not use `time_lower` as an entry time and are unaffected. ::: Fit a Weibull model using both censoring types: ```{r} #| label: interval-fit fit_ic <- hazard( time = time, status = status, time_lower = time_lower, time_upper = time_upper, dist = "weibull", theta = c(mu = 0.025, nu = 1.0), fit = TRUE ) fit_ic ``` ### Interval-censored vs naive right-censored A naive analyst who doesn't have visit-bracket data would record the death at the *discovery* visit (`time_upper`) rather than as an interval. This overstates the event time — the patient appears to have survived longer than they did — biasing the estimated hazard shape. ```{r} #| label: interval-vs-naive # Naive: treat each interval-censored row as an exact event at time_upper # (the visit when the death was first recorded) fit_naive <- hazard( time = ifelse(status == 2L, time_upper, time), status = ifelse(status == 2L, 1L, status), dist = "weibull", theta = c(mu = 0.025, nu = 1.0), fit = TRUE ) # Compare MLEs; true parameters are mu = 0.025, nu = 1.0 rbind( interval_censored = round(coef(fit_ic)[1:2], 4), naive_exact_upper = round(coef(fit_naive)[1:2], 4), truth = c(mu = 0.025, nu = 1.0) ) ``` The interval-censored fit recovers both parameters accurately. The naive fit estimates $\mu$ comparably but introduces a substantial positive bias in $\nu$ — it sees events consistently appearing at visit times (every 6 months) and infers a sharper, more periodic hazard shape that doesn't match the underlying exponential structure. ## Convergence troubleshooting Multiphase models have more parameters than single-distribution fits, and the likelihood surface can be flat or multimodal when the data doesn't strongly identify all of them. Most convergence problems trace back to one of three causes: poor starting values, too many free parameters for the data at hand, or a model that asks for a phase the data doesn't contain. This section shows how to diagnose each and what to do about it. ### Reading the KM cumulative hazard for starting values For a Weibull model, the relationship between the starting values and the data is direct: because $H(t) = (\mu t)^\nu$, taking logs gives $\log H(t) = \nu \log t + \nu \log \mu$. Plot $\log H(t)$ (the Nelson-Aalen cumulative hazard on the log scale) against $\log t$ and fit a straight line; the slope is $\hat\nu$ and the intercept is $\hat\nu \log \hat\mu$. ```{r} #| label: starting-values nel <- hzr_nelson(cabgkul$int_dead, cabgkul$dead) # Drop the zero-time boundary point before log-transforming nel_clean <- nel[nel$time > 0 & nel$cumhaz > 0, ] log_t <- log(nel_clean$time) log_H <- log(nel_clean$cumhaz) lm_fit <- lm(log_H ~ log_t) nu_hat <- unname(coef(lm_fit)[2]) mu_hat <- exp(unname(coef(lm_fit)[1]) / nu_hat) c(mu = round(mu_hat, 4), nu = round(nu_hat, 4)) ``` Those values are the Weibull starting point the data itself suggests. They won't be the MLEs --- the log-log line uses every event time equally, whereas the MLE weights by the likelihood --- but they land the optimizer in a sensible neighbourhood and prevent false-convergence to a degenerate solution. For multiphase models the reading is qualitative. Plot the KM cumulative hazard on the natural scale and look for kinks: a steep-then-flattening shape in the first few months signals an early phase whose `t_half` should sit in that early window; a steady linear rise afterwards signals a constant phase; a late upward curve beyond the flat region signals a late phase. Set `t_half` to roughly the time at which the early kink levels off, and set the constant phase scale mu to approximate the slope of the linear mid-section. ```{r} #| label: fig-cumhaz-shape #| fig-cap: "Nelson-Aalen cumulative hazard for CABGKUL — three-phase shape visible as two kinks" #| fig-width: 7 #| fig-height: 4 ggplot(data.frame(time = nel$time, cumhaz = nel$cumhaz), aes(time, cumhaz)) + geom_step(colour = "#D55E00", linewidth = 0.7) + labs(x = "Months after CABG", y = "Cumulative hazard") + theme_minimal() ``` The early steep rise, the mid-range roughly-linear section, and the late upward acceleration map directly onto the three phases in the CABGKUL model. ### When to fix shape parameters `fixed = "shapes"` locks the temporal shape parameters (e.g., `t_half`, `nu`, `m` for a CDF phase; `tau`, `gamma`, `alpha`, `eta` for a g3 phase) and lets the optimizer estimate only the scale (`log_mu`) for each phase. This cuts the parameter count substantially and is almost always what you want in two situations: - **You have a reference fit.** If a SAS HAZARD run already produced shape estimates, replicate those shapes exactly and re-estimate only the scale. This is the standard parity workflow. - **Your sample is small relative to the number of phases.** A rough guide is 50 or more events per free shape parameter. Below that threshold the shapes are weakly identified and the optimizer wanders. Estimate shapes freely (`fixed = "none"`, the default) only when you are exploring a new dataset for the first time and have no prior on the temporal structure — and even then, start with `fixed = "shapes"` and release the shapes one at a time if the fixed-shapes fit shows systematic misfit. ### Signs of overparameterization When a model has more phases than the data supports, the symptoms are recognizable: | Symptom | What it means | |---------|--------------| | `fit$fit$converged == FALSE` | Optimizer hit `maxit` without satisfying the gradient tolerance | | `vcov()` returns `NA` | Hessian is singular — parameters are not jointly identified | | A phase scale (`log_mu`) at a boundary value | That phase is contributing essentially zero hazard; it isn't needed | | Enormous standard errors on one or more parameters | Flat likelihood in that direction — weak identification | | Two phases with nearly identical shapes | One can be collapsed into the other | The AVC dataset has a clear two-phase structure (early CDF plus constant). Adding a third late-rising phase asks the data for a pattern it doesn't contain over this follow-up window: ```{r} #| label: overparameterized set.seed(42) fit_3ph <- hazard( Surv(int_dead, dead) ~ 1, data = avc, dist = "multiphase", phases = list( early = hzr_phase("cdf", t_half = 0.5, nu = 1, m = 1, fixed = "shapes"), constant = hzr_phase("constant"), late = hzr_phase("g3", tau = 5, gamma = 3, alpha = 1, eta = 1, fixed = "shapes") ), fit = TRUE, control = list(n_starts = 5, maxit = 1000) ) # Scale magnitudes: exp(log_mu) ≈ 0 for any phase the data doesn't support log_mu_idx <- grep("log_mu", names(coef(fit_3ph))) round(exp(coef(fit_3ph)[log_mu_idx]), 6) ``` The constant and late phase scales are both near zero — the optimizer found no evidence in the AVC data for either of those temporal shapes over this follow-up window. The early phase is absorbing all the identifiable structure. Any phase whose fitted scale is $\mu \lesssim 10^{-4}$ is not contributing meaningful hazard and is a candidate for removal. The right diagnostic question is not "did the AIC improve?" but "does this phase represent real clinical biology?" — late deterioration after AVC repair requires long follow-up to observe; this dataset doesn't have it. When shapes are also free (`fixed = "none"`), a redundant phase can make the Hessian rank-deficient and `vcov()` will return `NA` (the Hessian inversion failed). Fix by adding `fixed = "shapes"` to each phase to reduce the free-parameter count, then release shapes one at a time if the fixed-shapes fit shows systematic misfit. ### Optimizer options Three `control` arguments address most remaining convergence problems: **`n_starts`** (default 5 for multiphase; single-distribution models run one optimizer call and ignore this parameter) runs the optimizer from `n_starts` randomly jittered copies of the initial `theta`, then keeps the best solution. The default of 5 is sufficient for two-phase models; increase to 8–10 for three or more phases where the likelihood surface has more local minima. ```{r} #| label: nstarts set.seed(42) fit_robust <- hazard( Surv(int_dead, dead) ~ 1, data = avc, dist = "multiphase", phases = list( early = hzr_phase("cdf", t_half = 0.5, nu = 1, m = 1, fixed = "shapes"), constant = hzr_phase("constant") ), fit = TRUE, control = list(n_starts = 10, maxit = 1000) ) fit_robust$fit$converged ``` **Optimizer strategy (automatic).** The package always uses BFGS as the primary optimizer. For fixed-shape multiphase models with 2–10 free parameters, a Nelder-Mead pass runs first on each start to find a good basin, then BFGS polishes the solution. This warm-up is transparent — it happens automatically when the conditions are met and there is no user-facing control to switch it on or off. **`maxit`** (default 1000) sets the BFGS iteration cap. Hitting `maxit` without converging usually means either the starting values are far from the optimum (fix with better starts or `n_starts`) or the model is overparameterized (fix by fixing shapes or dropping a phase). Raising `maxit` beyond 2000 rarely helps if the optimizer is genuinely stuck. ## Phase types reference You've now seen each phase type in use: a `"cdf"` early phase for AVC operative mortality, a `"constant"` phase for AVC background rate, and the implicit single shape of every Weibull fit. The package supports three phase types in total, summarized here for quick reference: | Type | Description | Typical use | |------|-------------|-------------| | `"cdf"` | Sigmoidal CDF shape (parameterized by `t_half`, `nu`, `m`) | Early or late phases with transient risk | | `"constant"` | Flat hazard (no temporal shape parameters) | Ongoing background risk | | `"g3"` | Late-phase G3 parameterization (4 parameters: `tau`, `gamma`, `alpha`, `eta`) | Late-rising risk matching C/SAS G3 output | The `"cdf"` type covers the widest range of shapes: setting `t_half` small (e.g., 0.5) creates an early-peaking phase; setting it large (e.g., 10) creates a late-rising phase. The `"constant"` phase needs no shape parameters. The `"g3"` shape is the explicit late-rising parameterization that matches the SAS HAZARD "late" library; use it when you need parity against a C/SAS reference fit, or when the late rise has a clear lag-then-accelerate pattern that a delayed `"cdf"` doesn't capture cleanly. See `vignette("mf-mathematical-foundations")` for the full mathematical treatment of each, including the parameter identifiability constraints.