%\VignetteIndexEntry{metaRNASeq Vignette} %\VignetteKeyword{RNA-seq data} %\VignetteKeyword{differential analysis} %\VignettePackage{metaRNASeq} \documentclass[12pt]{article} \SweaveOpts{eps=FALSE,echo=TRUE,png=TRUE,pdf=FALSE,figs.only=TRUE} \usepackage{times} \usepackage[numbers,sort&compress]{natbib} \usepackage[colorlinks=TRUE,urlcolor=blue,citecolor=blue]{hyperref} \usepackage{subfigure} \usepackage{amsmath} \newcommand{\Rfunction}[1]{{\texttt{#1}}} \newcommand{\Robject}[1]{{\texttt{#1}}} \newcommand{\Rpackage}[1]{{\textit{#1}}} \newcommand{\Rmethod}[1]{{\texttt{#1}}} \newcommand{\Rfunarg}[1]{{\texttt{#1}}} \newcommand{\Rclass}[1]{{\textit{#1}}} \newcommand{\Rcode}[1]{{\texttt{#1}}} \newcommand{\software}[1]{\textsf{#1}} \newcommand{\R}{\software{R}} \newcommand{\metaRNASeq}{\Rpackage{metaRNASeq}} \textwidth=6.2in \textheight=8.5in \oddsidemargin=0.2in \evensidemargin=0.2in \headheight=0in \headsep=0in \begin{document} \SweaveOpts{concordance=TRUE} \title{metaRNASeq: Differential meta-analysis of \\ RNA-seq data } \author{Guillemette Marot, Florence Jaffr{\'e}zic, Andrea Rau} \date{Modified: January 23, 2015. Compiled: \today} \maketitle <>= options(width=60) @ \begin{abstract} This vignette illustrates the use of the \Rpackage{metaRNASeq} package to combine data from multiple RNA-seq experiments. Based both on simulated and real publicly available data, it also explains the way the $p$-value data provided in the package have been obtained. \end{abstract} \tableofcontents %-------------------------------------------------- \section{Introduction} \label{sec:intro} %-------------------------------------------------- High-throughput sequencing (HTS) data, such as RNA-sequencing (RNA-seq) data, are increasingly used to conduct differential analyses, in which gene-by-gene statistical tests are performed in order to identify genes whose expression levels show systematic covariation with a particular condition, such as a treatment or phenotype of interest. Due to their large cost, however, only few biological replicates are often considered in each experiment leading to a low detection power of differentially expressed genes. For this reason, analyzing data arising from several experiments studying the same question can be a useful way to increase detection power for the identification of differentially expressed genes. The \Rpackage{metaRNASeq} package implements two $p$-value combination techniques (inverse normal and Fisher methods); see \cite{Rau2014} for additional details. There are two fundamental assumptions behind the use of these $p$-value combination procedures: first, that $p$-values have been obtained the same way for each experiment (i.e., using the same model and test); and second, that they follow a uniform distribution under the null hypothesis. In this vignette, we illustrate these $p$-value combination techniques after obtaining $p$-values for differential expression in each individual experiment using the \Rpackage{DESeq2} Bioconductor package \cite{Anders2010}. Count data are simulated using the \Rfunction{sim.function} provided in the \Rpackage{metaRNASeq} package; see section \ref{sec:sim} for additional detail. %-------------------------------------------------- \section{Simulation study} \label{sec:sim} %-------------------------------------------------- To begin, we load the necessary packages and simulation parameters: <>= library(metaRNASeq) data(param) dim(param) data(dispFuncs) @ These simulation parameters include the following information: \begin{itemize} \item \Robject{param}: Matrix of dimension (26408 $\times$ 3) containing mean expression in each of two conditions (here, labeled ``condition 1" and ``condition 2") and a logical vector indicating the presence or absence of differential expression for each of 26,408 genes \item \Robject{dispFuncs}: List of length 2, where each list is a vector containing the two estimated coefficients ($\alpha_0$ and $\alpha_1$) for the gamma-family generalized linear model (GLM) fit by \Rpackage{DESeq} (version 1.8.3) describing the mean-dispersion relationship for each of the two real datasets considered in \cite{Rau2014}. These regressions represent the typical relationship between mean expression values $\mu$ and dispersions $\alpha$ in each dataset, where the coefficients $\alpha_0$ and $\alpha_1$ are found to parameterize the fit as $\alpha = \alpha_0 + \alpha_1 / \mu$. \end{itemize} These parameters were calculated on real data sets from two human melanoma cell lines \cite{Strub2011}, corresponding to two different studies performed for the same cell line comparison, with two biological replicates per cell line in the first and three per cell line in the second. These data are presented in greater detail in \cite{Strub2011} and \cite{Statomique2013}, and are freely available in the Supplementary Materials of the latter. Once parameters are loaded, we simulate data. We use the \Rfunction{set.seed} function to obtain reproducible results. <>= set.seed(123) matsim <- sim.function(param = param, dispFuncs = dispFuncs) sim.conds <- colnames(matsim) rownames(matsim) <- paste("tag", 1:dim(matsim)[1],sep="") dim(matsim) @ The simulated matrix data contains $26,408$ genes and $4$ replicates per condition per study. It is possible to change the number of replicates in each study using either the \Rcode{nrep} argument or the \Rcode{classes} argument. Using \Rcode{nrep} simulates the same number of replicates per condition per study. In order to simulate an unbalanced design, the \Rcode{classes} argument may be used. For example, setting \begin{center} \Rcode{classes = list(c(1,2,1,1,2,1,1,2),c(1,1,1,2,2,2,2))} \end{center} leads to 5 and 3 replicates in each condition for the first study, and 3 and 4 replicates in each condition in the second. %-------------------------------------------------- \section{Individual analyses of the two simulated data sets} \label{sec:DEindivstudies} %-------------------------------------------------- Before performing a combination of $p$-values from each study, it is necessary to perform a differential analysis of the individual studies (using the same method). In the following example, we make use of the \Rpackage{DESeq2} package to obtain $p$-values for differential analyses of each study independently; however, we note that other differential analysis methods (e.g., \Rpackage{edgeR} or \Rpackage{baySeq}) could be used prior to the meta analysis. \subsection{Differential analysis of each individual study with DESeq2} Inputs to DEseq2 methods can be extracted with \Rcode{extractfromsim} for each individual study whose name appears in the column names of \Rcode{matsim}, see the following example for study1 and study2. <>= colnames(matsim) simstudy1 <- extractfromsim(matsim,"study1") head(simstudy1$study) simstudy1$pheno simstudy2 <- extractfromsim(matsim,"study2") @ Differential analyses for each study are then easily performed using the \Rcode{DESeqDataSetFromMatrix} method. <>= if (requireNamespace("DESeq2", quietly = TRUE)) { dds1 <- DESeq2::DESeqDataSetFromMatrix(countData = simstudy1$study, colData = simstudy1$pheno,design = ~ condition) res1 <- DESeq2::results(DESeq2::DESeq(dds1)) dds2 <- DESeq2::DESeqDataSetFromMatrix(countData = simstudy2$study, colData = simstudy2$pheno,design = ~ condition) res2 <- DESeq2::results(DESeq2::DESeq(dds2)) } @ We recommand to store both p-value and Fold Change results in lists in order to perform meta-analysis and keep track of the potential conflicts (see section \ref{sec:check}) <>= if (exists("res1") && exists("res2")) { rawpval <- list("pval1"=res1[["pvalue"]],"pval2"=res2[["pvalue"]]) FC <- list("FC1"=res1[["log2FoldChange"]],"FC2"=res2[["log2FoldChange"]]) } else { data(rawpval) data(FC) } @ Differentially expressed genes in each individual study can also be marked in a matrix \Rcode{DE}: <>= if (exists("res1") && exists("res2")) { adjpval <- list("adjpval1"=res1[["padj"]],"adjpval2"=res2[["padj"]]) } else { data(adjpval) } studies <- c("study1", "study2") DE <- mapply(adjpval, FUN=function(x) ifelse(x <= 0.05, 1, 0)) colnames(DE)=paste("DE",studies,sep=".") @ \Rcode{DE} returns a matrix with 1 for genes identified as differentially expressed and 0 otherwise (one column per study) Since the proposed p-value combination techniques rely on the assumption that p-values follow a uniform distribution under the null hypothesis, it is necesary to check that the histograms of raw-pvalues reflect that assumption: <>= par(mfrow = c(1,2)) hist(rawpval[[1]], breaks=100, col="grey", main="Study 1", xlab="Raw p-values") hist(rawpval[[2]], breaks=100, col="grey", main="Study 2", xlab="Raw p-values") @ \begin{figure}[ht!] \begin{center} \includegraphics[width = .85\textwidth]{metaRNASeq-pvalDESeq2hist.png} \caption{Histograms of raw $p$-values for each of the individual differential analyses performed using the \Rpackage{DESeq2} package.\label{fig:pvalDESeq2hist}} \end{center} \end{figure} The peak near 0 corresponds to differentially expressed genes, no other peak should appear. Sometimes another peak may appear due to genes with very low values of expression which often lead to an enrichment of $p$-values close to 1 as they take on discrete values. As such genes are unlikely to display evidence for differential expression, it is recommended to perform an independent filtering. The application of such a filter typically removes those genes contributing to a peak of $p$-values close to 1, leading to a distribution of $p$-values under the null hypothesis more closely following a uniform distribution. As the proposed $p$-value combination techniques rely on this assumption, it is sometimes necessary to independently filter genes with very low read counts. In this example the \Rfunction{results} function of DESeq2 performs an automatic independent filtering. If a row is filtered by independent filtering, then only the adjusted $p$-value will be set to NA, and the graphic of raw $p$-values does not change. In order to have a distribution of raw $p$-values under the null hypothesis following a uniform distribution, we must manually set the corresponding raw $p$-values to NA. <>= filtered <- lapply(adjpval, FUN=function(pval) which(is.na(pval))) rawpval[[1]][filtered[[1]]]=NA rawpval[[2]][filtered[[2]]]=NA @ To confirm that the raw $p$-values under the null hypothesis are roughly uniformly distributed, we may also inspect histograms of the raw $p$-values from each of the individual differential analyses (see Figure~\ref{fig:pvalDEhist}): <>= par(mfrow = c(1,2)) hist(rawpval[[1]], breaks=100, col="grey", main="Study 1", xlab="Raw p-values") hist(rawpval[[2]], breaks=100, col="grey", main="Study 2", xlab="Raw p-values") @ \begin{figure}[ht!] \begin{center} \includegraphics[width = .85\textwidth]{metaRNASeq-pvalDEhist.png} \caption{Histograms of raw $p$-values for each of the individual differential analyses performed using the independent filtering from \Rpackage{DESeq2} package.\label{fig:pvalDEhist}} \end{center} \end{figure} %-------------------------------------------------- \section{Use of p-value combination techniques} \label{sec:pvalcombi} %-------------------------------------------------- The code in this section may be used independently from the previous section if $p$-values from each study have been obtained using the same differential analysis test between the different studies. Vectors of $p$-values must have the same length; \Rcode{rawpval} is a list (or data.frame) containing the vectors of raw $p$-values obtained from the individual differential analyses of each study. The $p$-value combination using the Fisher method may be performed with the \Rcode{fishercomb} function, and the subsequent $p$-values obtained from the meta-analysis may be examined (Figure~\ref{fig:pvalcomb}, left): <>= fishcomb <- fishercomb(rawpval, BHth = 0.05) hist(fishcomb$rawpval, breaks=100, col="grey", main="Fisher method", xlab = "Raw p-values (meta-analysis)") @ \begin{figure}[t!] \begin{center} \includegraphics[width = .45\textwidth]{metaRNASeq-pvalfishcomb.png} \includegraphics[width = .45\textwidth]{metaRNASeq-pvalinvnorm.png} \caption{(Left) Histogram of raw $p$-values obtained after a meta-analysis of all studies, with $p$-value combination performed using the Fisher method. (Right) Histogram of raw $p$-values obtained after a meta-analysis of all studies, with $p$-value combination performed using the inverse normal method.\label{fig:pvalcomb}} \end{center} \end{figure} The use of the inverse normal combination technique requires the choice of a weight for each study. In this example, we choose \Rcode{nrep=8}, since 8 replicates had been simulated in each study. As before, we may examine a histogram of the subsequent $p$-values obtained from the meta-analysis (Figure~\ref{fig:pvalcomb}, right). <>= invnormcomb <- invnorm(rawpval,nrep=c(8,8), BHth = 0.05) hist(invnormcomb$rawpval, breaks=100, col="grey", main="Inverse normal method", xlab = "Raw p-values (meta-analysis)") @ Finally, we suggest summarizing the results of the individual differential analyses as well as the differential meta-analysis (using the Fisher and inverse normal methods) in a data.frame: <>= DEresults <- data.frame(DE, "DE.fishercomb"=ifelse(fishcomb$adjpval<=0.05,1,0), "DE.invnorm"=ifelse(invnormcomb$adjpval<=0.05,1,0)) head(DEresults) @ %-------------------------------------------------- \section{Treatment of conflicts in differential expression} \label{sec:check} %-------------------------------------------------- As pointed out in \cite{Rau2014}, it is not possible to directly avoid conflicts between over- and under- expressed genes in separate studies that appear in differential meta-analyses of RNA-seq data. We thus advise checking that individual studies identify differential expression in the same direction (i.e., if in one study, a gene is identified as differentially over-expressed in condition 1 as compared to condition 2, it should not be identified as under-expressed in condition 1 as compared to condition 2 in a second study). Genes displaying contradictory differential expression in separate studies should be removed from the list of genes identified as differentially expressed via meta-analysis. We build a matrix \Rcode{signsFC} gathering all signs of fold changes from individual studies. <>= signsFC <- mapply(FC, FUN=function(x) sign(x)) sumsigns <- apply(signsFC,1,sum) commonsgnFC <- ifelse(abs(sumsigns)==dim(signsFC)[2], sign(sumsigns),0) @ The vector \Rcode{commonsgnFC} will return a value of 1 if the gene has a positive $\log_2$ fold change in all studies, -1 if the gene has a negative $\log_2$ fold change in all studies, and 0 if contradictory $\log_2$ fold changes are observed across studies (i.e., positive in one and negative in the other). By examining the elements of \Rcode{commonsgnFC}, it is thus possible to identify genes displaying contradictory differential expression among studies. <>= unionDE <- unique(c(fishcomb$DEindices,invnormcomb$DEindices)) FC.selecDE <- data.frame(DEresults[unionDE,],do.call(cbind,FC)[unionDE,], signFC=commonsgnFC[unionDE], DE=param$DE[unionDE]) keepDE <- FC.selecDE[which(abs(FC.selecDE$signFC)==1),] conflictDE <- FC.selecDE[which(FC.selecDE$signFC == 0),] dim(FC.selecDE) dim(keepDE) dim(conflictDE) head(keepDE) @ <>= nbtrueconflicts=as.vector(table(conflictDE$DE)[2]) @ Note that out of all the conflicts, \Sexpr{nbtrueconflicts} represented genes were simulated to be truly differentially expressed. <>= table(conflictDE$DE) @ %-------------------------------------------------- \section{IDD, IRR and Venn Diagram} %-------------------------------------------------- Different indicators can be used to evaluate the performance of the meta-analysis, some of them are described in \cite{Marot2009} and returned by the function \Rcode{IDD.IRR}. DE corresponds to the number of differentially expressed genes. IDD (Integration Driven discoveries) returns the number of genes that are declared DE in the meta-analysis that were not identified in any of the individual studies alone, Loss the number of genes that are identified DE in individual studies but not in meta-analysis. The Integration-driven Discovery Rate (IDR) and Integration-driven Revision Rate (IRR) are the corresponding proportions of IDD and Loss. <>= fishcomb_de <- rownames(keepDE)[which(keepDE[,"DE.fishercomb"]==1)] invnorm_de <- rownames(keepDE)[which(keepDE[,"DE.invnorm"]==1)] indstudy_de <- list(rownames(keepDE)[which(keepDE[,"DE.study1"]==1)], rownames(keepDE)[which(keepDE[,"DE.study2"]==1)]) IDD.IRR(fishcomb_de,indstudy_de) IDD.IRR(invnorm_de ,indstudy_de) @ <>= x=IDD.IRR(fishcomb_de,indstudy_de) y=IDD.IRR(invnorm_de ,indstudy_de) @ In this example, the p-value combination technique with Fisher's method gives $\Sexpr{x[["IDD"]]}$ ($\Sexpr{x[["IDR"]]}\%$) new genes and $\Sexpr{x[["Loss"]]}$ ($\Sexpr{x[["IRR"]]}\%$), are sidetracked. The inverse normal combination technique gives $\Sexpr{y[["IDD"]]}$ ($\Sexpr{y[["IDR"]]}\%$) new genes and $\Sexpr{y[["Loss"]]}$ ($\Sexpr{y[["IRR"]]}\%$), are sidetracked \\ To compare visually the number of differentially expressed genes in individual studies or in meta-analysis, it is also possible to draw a Venn diagram, for example with the \Rpackage{VennDiagram} package. <>= if (require("VennDiagram", quietly = TRUE)) { venn.plot<-venn.diagram(x = list(study1=which(keepDE[,"DE.study1"]==1), study2=which(keepDE[,"DE.study2"]==1), fisher=which(keepDE[,"DE.fishercomb"]==1), invnorm=which(keepDE[,"DE.invnorm"]==1)), filename = NULL, col = "black", fill = c("blue", "red", "purple","green"), margin=0.05, alpha = 0.6) jpeg("venn_jpeg.jpg"); grid.draw(venn.plot); dev.off(); } @ \begin{figure}[t!] \begin{center} \includegraphics[width = .85\textwidth]{venn_jpeg.jpg} \caption{Venn Diagram comparing the list of DE genes at a 5\% BH threshold obtained by each individual study and p-value combination techniques} \end{center} \end{figure} %-------------------------------------------------- \section{Session Info} %-------------------------------------------------- <>= sessionInfo() @ \bibliographystyle{abbrv} % Style BST file \bibliography{metaRNASeq} % Bibliography file (usually '*.bib' ) \end{document}