--- title: "Complete Workflow: From Data to Decision" author: "Deniz Akdemir" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Complete Workflow: From Data to Decision} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} eval_surv <- getRversion() >= "4.0.0" && requireNamespace("survival", quietly = TRUE) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5 ) if (!exists("deparse1", envir = baseenv())) { deparse1 <- function(expr, collapse = " ", width.cutoff = 500L, ...) { paste(deparse(expr, width.cutoff, ...), collapse = collapse) } } ``` ## Overview This vignette demonstrates the complete **causaldef** workflow, from data specification through to policy decision-making. We show how Le Cam deficiency theory translates abstract statistical concepts into actionable clinical insights. The workflow consists of four stages: 1. **Specify** → Define the causal problem 2. **Estimate** → Compute deficiency for different adjustment strategies 3. **Diagnose** → Validate assumptions using negative controls and sensitivity analysis 4. **Decide** → Compute policy regret bounds and make informed decisions --- ## Part 1: Gene Perturbation Study (Continuous Outcome) We begin with the `gene_perturbation` dataset, which simulates a CRISPR knockout experiment. This illustrates the core workflow for continuous outcomes. ### 1.1 Data Description ```{r load-data-gene} library(causaldef) library(ggplot2) data(gene_perturbation) str(gene_perturbation) ``` **Variables:** - `knockout_status`: Treatment (Control vs. Knockout) - `target_expression`: Primary outcome (gene expression level) - `housekeeping_gene`: Negative control outcome (shouldn't be affected by knockout) - `batch`, `library_size`: Technical confounders **Causal Structure:** ``` [Batch, Library Size] | v [Knockout] -----> [Target Expression] \ \---X--> [Housekeeping Gene] (no causal effect) ``` The housekeeping gene is affected by the same technical variations but NOT by the knockout, making it an ideal negative control. ### 1.2 Step 1: Specification ```{r spec-gene} spec_gene <- causal_spec( data = gene_perturbation, treatment = "knockout_status", outcome = "target_expression", covariates = c("batch", "library_size"), negative_control = "housekeeping_gene", estimand = "ATE", outcome_type = "continuous" ) print(spec_gene) ``` ### 1.3 Step 2: Deficiency Estimation We compare three adjustment strategies: 1. **Unadjusted**: Ignores technical confounders 2. **IPTW**: Reweights samples to balance batch and library size 3. **AIPW**: Augmented IPTW (doubly robust) ```{r estim-gene} deficiency_gene <- estimate_deficiency( spec_gene, methods = c("unadjusted", "iptw", "aipw"), n_boot = 100 # Use more for production (e.g., 1000) ) print(deficiency_gene) ``` **Interpretation:** - **δ ≈ 0**: The adjustment creates a "synthetic RCT" (observational data approximates what we'd see if knockout were randomized) - **δ > 0.1**: Substantial distributional mismatch remains; causal conclusions are uncertain ```{r plot-gene, fig.height=4} plot(deficiency_gene, type = "bar") ``` ### 1.4 Step 3: Diagnose with Negative Control The negative control diagnostic tests whether our adjustment removes ALL confounding, not just the measured confounders. ```{r nc-gene} nc_test <- nc_diagnostic( spec_gene, method = "iptw", alpha = 0.05, n_boot = 100 ) print(nc_test) ``` **Decision Logic:** | Result | Interpretation | Action | |--------|----------------|--------| | `falsified = FALSE` | No evidence of residual confounding | Proceed with analysis | | `falsified = TRUE` | Residual confounding detected | Add covariates or acknowledge limitations | ### 1.5 Step 4: Policy Decision Suppose we're deciding whether to pursue this gene target for drug development. The utility is measured on a scale where: - 0 = no promise (no effect on expression) - 10 = maximum promise (strong effect) ```{r policy-gene} bounds_gene <- policy_regret_bound( deficiency_gene, utility_range = c(0, 10) ) print(bounds_gene) ``` **Regret Bounds:** `policy_regret_bound()` reports: - **Transfer penalty** \(M\cdot\delta\): additive worst-case regret inflation term, and - **Minimax safety floor** \((M/2)\cdot\delta\): irreducible worst-case regret when \(\delta>0\). **Decision Rule:** - If `transfer_penalty` < acceptable risk threshold → **Proceed with confidence** - If `transfer_penalty` > acceptable risk threshold → **Seek more evidence** ### 1.6 Effect Estimation Finally, we estimate the causal effect using the best-performing method: ```{r effect-gene} effect_gene <- estimate_effect( deficiency_gene, target_method = "iptw" ) print(effect_gene) ``` **Complete Report:** ```{r report-gene, results='asis'} cat(sprintf( " ## Gene Perturbation Analysis Report **Treatment Effect (IPTW-adjusted):** %.2f log2 expression units **Deficiency (δ):** %.3f **Negative Control Test:** %s (p = %.3f) **Transfer Penalty:** %.3f on [0, 10] scale **Minimax Safety Floor:** %.3f on [0, 10] scale **Conclusion:** %s ", effect_gene$estimate, deficiency_gene$estimates["iptw"], ifelse(nc_test$falsified, "FAILED", "PASSED"), nc_test$p_value, bounds_gene$transfer_penalty, bounds_gene$minimax_floor, ifelse(nc_test$falsified, "Residual confounding detected. Interpret with caution.", ifelse(bounds_gene$transfer_penalty < 0.5, "Strong evidence supporting treatment effect.", "Moderate evidence; consider additional validation." ) ) )) ``` --- ## Part 2: Hematopoietic Cell Transplantation (Survival Outcome) Next, we analyze the `hct_outcomes` dataset, which mimics a retrospective registry study comparing conditioning regimens in HCT. ### 2.1 Data Description ```{r load-data-hct} data(hct_outcomes) str(hct_outcomes) # Summarize key variables summary(hct_outcomes[, c("age", "kps", "time_to_event")]) table(hct_outcomes$conditioning_intensity, hct_outcomes$event_status) ``` **Clinical Context:** - **Myeloablative conditioning**: High-intensity chemotherapy (younger, healthier patients) - **Reduced-intensity conditioning**: Lower dose (older, sicker patients) - **Outcome**: Time to death or relapse The key challenge is **confounding by indication**: doctors assign treatment based on patient status, making naive comparisons biased. ### 2.2 Step 1: Survival Specification Executable survival-effect code in this section requires a compatible `survival` runtime. In the current support matrix, that means `R >= 4.0`. ```{r spec-hct, eval = eval_surv} # Create binary event indicator hct_outcomes$event <- ifelse(hct_outcomes$event_status != "Censored", 1, 0) spec_hct <- causal_spec_survival( data = hct_outcomes, treatment = "conditioning_intensity", time = "time_to_event", event = "event", covariates = c("age", "disease_status", "kps", "donor_type"), estimand = "RMST", horizon = 24 # 24-month restricted mean survival time ) print(spec_hct) ``` ### 2.3 Step 2: Deficiency Estimation ```{r estim-hct, eval = eval_surv} deficiency_hct <- estimate_deficiency( spec_hct, methods = c("unadjusted", "iptw"), n_boot = 50 # Use more for production ) print(deficiency_hct) ``` **Clinical Interpretation:** The deficiency tells us how much our observational evidence differs from what an RCT would provide: - **Unadjusted δ**: High, reflecting strong selection bias (young patients → myeloablative) - **IPTW δ**: Lower, IPTW reweights to balance baseline covariates ### 2.4 Step 3: Confounding Frontier Beyond point estimates, we can map a *sensitivity analysis* showing how deficiency varies with hypothetical unmeasured confounding: ```{r frontier-hct, fig.height=5, eval = eval_surv} frontier <- confounding_frontier( spec_hct, alpha_range = c(-2, 2), # Confounding path: U → Treatment gamma_range = c(-2, 2), # Confounding path: U → Outcome grid_size = 30 ) print(frontier) plot(frontier) ``` **Reading the Frontier Map:** - **Center** (α = 0 or γ = 0): No unmeasured confounding → δ = 0 - **Corners**: Strong confounding on both paths → high δ - **Observed covariates** (dots): Benchmark strengths of measured confounders If an unmeasured confounder would need extreme strength (beyond observed benchmarks) to substantially increase δ, conclusions are robust. ### 2.5 Step 4: Policy Regret and RMST Effect ```{r policy-hct, eval = eval_surv} # Utility = months of survival (horizon = 24) bounds_hct <- policy_regret_bound( deficiency_hct, utility_range = c(0, 24) ) print(bounds_hct) ``` **Clinical Regret Bounds:** If δ = 0.05 with M = 24 months: - Transfer penalty = 24 × 0.05 = **1.2 months** - Minimax safety floor = 0.5 × 24 × 0.05 = **0.6 months** This means even with perfect decision-making, the observational evidence can inflate regret by up to about 1.2 months on the 0--24 month utility scale, and there is an unavoidable worst-case floor of about 0.6 months when $\delta>0$. ```{r effect-hct, eval = eval_surv} # Estimate RMST difference effect_hct <- estimate_effect( deficiency_hct, target_method = "iptw", contrast = c("Myeloablative", "Reduced") ) print(effect_hct) ``` ### 2.6 Complete Decision Framework ```{r decision-hct, results='asis', eval = eval_surv} delta_iptw <- deficiency_hct$estimates["iptw"] transfer_penalty <- bounds_hct$transfer_penalty minimax_floor <- bounds_hct$minimax_floor rmst_diff <- effect_hct$estimate # Decision logic if (delta_iptw < 0.05) { evidence_quality <- "HIGH (approximates RCT)" } else if (delta_iptw < 0.15) { evidence_quality <- "MODERATE (acceptable with caveats)" } else { evidence_quality <- "LOW (substantial bias risk)" } # Benefit-to-risk ratio if (!is.na(rmst_diff) && !is.na(transfer_penalty) && transfer_penalty > 0) { benefit_to_risk <- abs(rmst_diff) / transfer_penalty recommendation <- ifelse(benefit_to_risk > 2, "Recommend treatment with stronger effect", "Evidence inconclusive; consider shared decision-making" ) } else { benefit_to_risk <- NA recommendation <- "Unable to calculate benefit-to-risk ratio" } cat(sprintf( " ## HCT Treatment Decision Report **RMST Difference (IPTW):** %.2f months (%s favored) **Deficiency:** %.3f **Evidence Quality:** %s **Transfer Penalty:** %.2f months **Minimax Safety Floor:** %.2f months **Benefit-to-Risk Ratio:** %.1f:1 **Recommendation:** %s ### Clinical Translation The observational evidence suggests %s conditioning provides approximately %.1f additional months of relapse-free survival within the first 2 years. However, the transfer penalty is %.1f months and the minimax safety floor is %.1f months on the 0--24 month utility scale. Clinicians should weigh this against individual patient factors. ", abs(rmst_diff), ifelse(rmst_diff > 0, "Myeloablative", "Reduced"), delta_iptw, evidence_quality, transfer_penalty, minimax_floor, benefit_to_risk, recommendation, ifelse(rmst_diff > 0, "myeloablative", "reduced-intensity"), abs(rmst_diff), transfer_penalty, minimax_floor )) ``` --- ## Part 3: Comparative Analysis Across Studies ### 3.1 When Is Observational Evidence Sufficient? | Study | δ (IPTW) | Transfer Penalty | Evidence Quality | |-------|----------|------------------|------------------| | Gene Perturbation | Low (~0.01) | Very Low | Excellent | | HCT Outcomes | Moderate (~0.05-0.15) | 1-4 months | Acceptable with caveats | ### 3.2 General Workflow Summary ``` ┌─────────────────────────────────────────────────────────────────┐ │ SPECIFY: causal_spec() / causal_spec_survival() │ │ ↓ Define treatment, outcome, covariates, NC │ ├─────────────────────────────────────────────────────────────────┤ │ ESTIMATE: estimate_deficiency() │ │ ↓ Compare unadjusted, IPTW, AIPW, TMLE, etc. │ │ ↓ Select method with lowest δ │ ├─────────────────────────────────────────────────────────────────┤ │ DIAGNOSE: nc_diagnostic() + confounding_frontier() │ │ ↓ Test whether assumptions are falsified │ │ ↓ Map sensitivity to unmeasured confounding │ ├─────────────────────────────────────────────────────────────────┤ │ DECIDE: policy_regret_bound() + estimate_effect() │ │ ↓ Compute transfer penalty / minimax floor │ │ ↓ Report effect with uncertainty qualification │ └─────────────────────────────────────────────────────────────────┘ ``` --- ## References 1. Akdemir, D. (2026). Constraints on Causal Inference as Experiment Comparison: A Framework for Identification, Transportability, and Policy Learning. DOI: 10.5281/zenodo.18367347 2. Le Cam, L., & Yang, G. L. (2000). Asymptotics in Statistics: Some Basic Concepts. Springer. 3. VanderWeele, T. J., & Ding, P. (2017). Sensitivity Analysis in Observational Research: Introducing the E-value. Annals of Internal Medicine.