--- title: "Quick start with `radEmu`" author: "Sarah Teichman and Amy Willis" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{quickstart} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ## Setup First, we will install `radEmu`, if we haven't already. ```{r, eval = FALSE} # if (!require("remotes", quietly = TRUE)) # install.packages("remotes") # # remotes::install_github("statdivlab/radEmu") ``` Next, we can load `radEmu`, as well as `dplyr`. ```{r setup, message = FALSE} library(radEmu) library(dplyr) ``` In this vignette we'll explore a [dataset published by Wirbel et al. (2019)](https://www.nature.com/articles/s41591-019-0406-6). This is a meta-analysis of case-control studies of colorectal cancer, meaning that Wirbel et al. collected raw sequencing data from studies other researchers conducted and re-analyzed it. Wirbel et al. published two pieces of data we'll focus on: * metadata giving demographics and other information about participants * a mOTU (metagenomic OTU) table ```{r} data("wirbel_sample") # encode control as the baseline level wirbel_sample$Group <- factor(wirbel_sample$Group, levels = c("CTR","CRC")) dim(wirbel_sample) data("wirbel_otu") dim(wirbel_otu) ``` We can see that we have $566$ samples and $845$ mOTUs. While in general we would fit a model to all mOTUs, we are going to subset to some specific genera for the purposes of this tutorial. Let's look at *Eubacterium*, *Porphyromonas*, *Faecalibacteria*, and *Fusobacterium* for now. ```{r} mOTU_names <- colnames(wirbel_otu) chosen_genera <- c("Eubacterium", "Faecalibacterium", "Fusobacterium", "Porphyromonas") mOTU_name_df <- data.frame(name = mOTU_names) %>% mutate(base_name = stringr::str_remove(mOTU_names, "unknown ") %>% stringr::str_remove("uncultured ")) %>% mutate(genus_name = stringr::word(base_name, 1)) restricted_mOTU_names <- mOTU_name_df %>% filter(genus_name %in% chosen_genera) %>% pull(name) ``` Again, while we would generally fit a model using all of our samples, for this tutorial we are only going to consider data from a case-control study from China. ```{r} ch_study_obs <- which(wirbel_sample$Country %in% c("CHI")) ``` Next, we want to confirm that all samples have at least one non-zero count across the categories we've chosen and that all categories have at least one non-zero count across the samples we've chosen. ```{r} small_Y <- wirbel_otu[ch_study_obs, restricted_mOTU_names] sum(rowSums(small_Y) == 0) # no samples have a count sum of 0 sum(colSums(small_Y) == 0) # one category has a count sum of 0 category_to_rm <- which(colSums(small_Y) == 0) small_Y <- small_Y[, -category_to_rm] sum(colSums(small_Y) == 0) ``` ## Differential abundance analysis with radEmu We now are ready to fit a `radEmu` model with the `emuFit()` function! This will take 1-2 minutes to run, so we suggest you start running it and then read below about the arguments! ```{r} mod <- emuFit(data = wirbel_sample[ch_study_obs, ], Y = small_Y, formula = ~ Group + Gender, test_kj = data.frame(k = 2, j = 1)) ``` Some of the important arguments for `emuFit()` are the following: - `formula`: This is a formula telling radEmu what predictors to use in its model. We are using Group, which is an indicator for case (CRC) vs control (CTR), and Gender. - `data`: A data frame containing covariate information about our samples. - `Y`: A matrix containing our observed abundance data (e.g., counts or depth measurements). The rows give the observations (samples), and the columns give the categories (taxa/mOTUs). Note that `Y` doesn't have to be integer-valued (counts)! - `test_kj`: a data frame describing which robust score tests you want to run. `k` corresponds with the predictor(s) that you want to test (if you don't know what order your predictors appear in your design matrix, use the function `make_design_matrix()` with your `formula` and `data` to check!) and `j` corresponds with the taxa that you want to test (check `colnames(Y)` to see the order of the taxa). If you wanted to test the Group covariate for all taxa then you could set `test_kj = data.frame(k = 2, j = 1:ncol(Y))`. - `verbose`: do you want the code to give you updates as it goes along? In the example above we only test the first taxon, because each test for a data set this size takes approximately 30 seconds. Typically you'll want to test all taxa that you are interested in, so you may need to leave this running for a few hours, or check out the "parallel_radEmu" vignette to see how you could parallelize these score tests if you have access to additional computing resources. Now that we've fit the model and run a robust score test for the first taxon, we can look at the results! The parameter estimates and any test results are in the `coef` part of the `emuFit` object. ```{r} head(mod$coef) ``` The first taxon in our model is *Fusobacterium nucleatum s. vincentii*, which has a log fold-difference estimate of $1.49$. We will interpret this to mean that we expect that *Fusobacterium nucleatum s. vincentii* is $exp(1.49) \approx 4.44$ times more abundant in cases of colorectal cancer compared to controls of the same gender, when compared to the typical fold-differences in abundances of the taxa in this analysis. We could similarly interpret the log fold-differences for each taxon in our analysis. For *Fusobacterium nucleatum s. vincentii* we have a robust score test p-value of $0.04$. This means that we do have enough evidence to reject the null hypothesis that the fold-difference in abundance of *Fusobacterium nucleatum s. vincentii* between cases and controls is the same as the typical fold-difference in abundance between cases and controls across all taxa in this analysis. When we say ``typical" here we mean approximately the median fold-difference across taxa. Now you are ready to start using `radEmu`! We recommend our other vignettes for a deeper look using `radEmu`, including for `phyloseq` or `TreeSummarizedExperiment` objects, with clustered data, and in parallel.