%\VignetteIndexEntry{SeqFeatR Tutorial} \documentclass[a4paper,10pt]{article} \usepackage[utf8]{inputenc} \let\chapter\section \usepackage[lined]{algorithm2e} \usepackage[numbers]{natbib} \bibliographystyle{plainnat} \usepackage[english]{babel} \selectlanguage{english} \usepackage{graphicx} \usepackage{placeins} \usepackage{amsmath} \usepackage{amscd} \usepackage{ifthen} \usepackage{float} \usepackage{subfig} \usepackage{lscape} \usepackage{hyperref} \usepackage{parskip} \usepackage{multirow} \usepackage{color} \usepackage{colortbl} \definecolor{hellgrau}{rgb}{0.9,0.9,0.9} \definecolor{dunkelgrau}{rgb}{0.8,0.8,0.8} %add-on renews \renewcommand\topfraction{0.85} \renewcommand\bottomfraction{0.85} \renewcommand\textfraction{0.1} \renewcommand\floatpagefraction{0.85} \setlength\floatsep{1.25\baselineskip plus 3pt minus 2pt} \setlength\textfloatsep{1.25\baselineskip plus 3pt minus 2pt} \setlength\intextsep{1.25\baselineskip plus 3pt minus 2 pt} \begin{document} \title{A little \texttt{SeqFeatR} tutorial} \author{Bettina Budeus and Daniel Hoffmann, \\ Bioinformatics, University of Duisburg-Essen, \\ Essen, Germany} \maketitle \tableofcontents This tutorial gives you some of the technical background underlying \texttt{SeqFeatR} that should enable you to understand what \texttt{SeqFeatR} does and also how to use it and how to interpret the output. If you are solely interested in a HowTo, you may find the following two video tutorials helpful: \begin{itemize} \item For the GUI in the \texttt{SeqFeatR} R-package: \url{https://www.youtube.com/watch?v=-CYidGPE6dw} \item For the \texttt{SeqFeatR} web-server: \url{https://www.youtube.com/watch?v=3z4Smk3mI18} \end{itemize} \section[\texttt{SeqFeatR} discovers feature - sequence associations]{\texttt{SeqFeatR} discovers statistically significant feature - sequence associations} \label{sec:texttts-disc-stat} Imagine the following alignment\footnote{The ``alignment'' shown here does not look like a good alignment. Usually, alignments show much more columns where many sequences have the same amino acids, and there may also be gaps, indicated by ``-'' in some sequences. This bad alignment was chosen for one reason only: it demonstrates that it can be difficult to spot relationships between features and sequence positions.} of amino acid sequences in FASTA format, taken from 14 patients that either have a certain feature (``$f$'') or do not have that feature (``$n$''). The feature may for instance be an HLA type\footnote{Human leukocyte antigen (\url{https://en.wikipedia.org/wiki/Human_leukocyte_antigen}), a classification scheme of human immune systems.}, a genetic disease, etc. In the following FASTA formatted input for \texttt{SeqFeatR} you can see a letter ($f$ or $n$) indicating feature or not-feature at the end of each comment line: \label{fasta} \begin{verbatim} >P01_HLA_A01_00_B01_02;f LPDIQGNENMGYQPSWIFCGMETNGSQCLEEMFHCCWINC >P02_HLA_A01_00_B01_02;f MPDWNQKWGNDHLASINLD-WLKTIQQPGIEKHLRFYENW >P03_HLA_A01_02_B01_02;f VPDASGKHGIIGMDVTSSMERRHGMVQLPWPAMVWGRPHW >P04_HLA_A01_00_B01_02;f MPDVRGVGCARRDCLIVHRFCMPFNNQVYCKVWIVYWTYK >P05_HLA_A01_00_B01_02;f QPDTPKITRKEATAIHKCGIHWQTNCQKLSTVHPFHHQVD >P06_HLA_A01_02_B01_02;f SWDDFSDFTMVHQWYAQGTLGPYKAMQLKMIFQGVSIMEV >P07_HLA_A01_02_B01_02;f IPDEPCYCCVKNKILTVEIGVHHAKSQVRRNIDNIRRKTE >P08_HLA_A04_03_B04_03;n HFST-ICPYIWKMYFTWMGQKLVIQKVNGRTPPHCDECNQ >P09_HLA_A04_03_B04_03;n SNFT-TTKLRDQHNLYPAGLQEIEHKVDHQILGIYGQIWY >P10_HLA_A04_03_B04_03;n ETSTALRTQDQTFMLALRANYMVMLKVLDCISVKLFICWR >P11_HLA_A04_03_B04_00;n DSSTMDAECSTLQRFIWWHAHYAWIRVAKKPYCLDCPYAV >P12_HLA_A04_03_B04_03;n KKSTLGIARGIQRSHGWYWRQTHCVMVLTPSQHKMGEKSW >P13_HLA_A04_03_B04_00;n ICSTELCGCLINWPPMQWIVFAHMDDVNDSQTNTCDMRSQ >P14_HLA_A04_03_B04_03;n GPSTNARTMGGQDCAYMTHTLTKHIWVILAFDPIMIVHKP \end{verbatim} Can you discover statistically significant associations of the feature with the presence or absence of certain amino acids at certain sequence\footnote{Strictly, we are not dealing with \emph{sequence} positions but with \emph{sequence alignment} positions.} positions? It is difficult to spot such associations with the naked eye, but they are there: \begin{itemize} \item There is a strong association of feature $f$ with amino acid P at the second position, though patient 14 is an exception as she is $n$ and still has P at second position. \item There is a strong association of $n$ (i.e.\ not having feature $f$) with amino acid T at fourth position, though patient 5 is an exception as he has $f$ and still has a T at fourth position. \end{itemize} In its basic application, \texttt{SeqFeatR} tests \emph{all} sequence positions and quickly identifies the second and fourth positions as being statistically significantly associated with the feature ($f$) or its absence ($n$). \texttt{SeqFeatR} shows these associations graphically in two ways, as Manhattan plot and as odds ratio (OR) plot. \section{The core of \texttt{SeqFeatR}: Fisher's exact test} \subsection{An example: association of a feature with sequence} \label{sec:an-exampl-assoc} We have mentioned that in the above alignment there is seemingly a strong association of the occurrence of amino acid P at position 2 with the feature $f$. The probability and strength of this association can be quantified, respectively, by a p-value computed with Fisher's exact test, a well-known statistical test for association, and by an odds ratio (OR). At its core, \texttt{SeqFeatR} does exactly this. In the above example of the association of P at position 2 with feature $f$, \texttt{SeqFeatR} internally would first collect occurrences in a frequency table and then compute from that frequency table p-value and OR: \begin{itemize} \item 6 sequences with feature $f$ \emph{and} P at position 2 \item 1 sequence with feature $f$ \emph{and not} P at position 2 \item 1 sequence with feature $n$ (= \emph{not} $f$) \emph{and} P at position 2 \item 6 sequences with feature $n$ \emph{and not} P at position 2 \end{itemize} \texttt{SeqFeatR} collects these data in a frequency table: \begin{tabular}{c c c c} & & \multicolumn{2}{c}{Proline}\\\cline{3-4} & & + & - \\\cline{2-4} \multirow{2}{*}{feature $f$}& + & 6 & 1 \\ & - & 1 & 6 \\\cline{1-4} \label{tab:point} \end{tabular} Submitting this table to Fisher's exact test yields a p-value of 0.0291. At a significance level of 0.05 we therefore \emph{reject} the null hypothesis (= \emph{no} association of $f$ and P at position 2) and rather assume an association of $f$ and P at position 2. The \emph{strength} of the association is quantified by the odds ratio (OR) that is computed from the elements of the above contingency table: \begin{displaymath} OR = \frac{N_{f,P2}/N_{f,not\,P2}}{N_{not\,f,P2}/N_{not\,f, not\,P2}} = \frac{6/1}{1/6} = 36. \end{displaymath} (Note: there are several methods to estimate the odds ratio. The simple one shown here is called Wald's method. The one used by \texttt{SeqFeatR} yields about 23.5.) An OR much greater than 1 ($OR \gg 1$) as we have it here ($OR = 36$) means that we have a \emph{strong positive association} of the feature $f$ with $P2$: $f$ and $P2$ occur much more often together than expected if we had no association. If we have \emph{a weak or no association}, the OR lies around 1. Then $f$ and $P2$ would occur together and not together in the same ratios. If we have a \emph{strong negative association}, $0 < OR \ll 1$. In case of $f$ and $P2$, a negative association means that $f$ and $P2$ occur \emph{less} often together than expected if we had no association. \subsection{Another example: association of HLA type and sequence} \label{sec:anoth-exampl-assoc} In the above set of FASTA formatted sequences, we had ended each sequence header with the name of a feature, either $f$ or $n$, separated from the rest of the header line by a semicolon. A specific type of feature that is often used in \texttt{SeqFeatR} analyses is the \emph{HLA type}. (For the HLA type there is an optional way of telling \texttt{SeqFeatR} about this specific feature by giving the positions of the HLA information in the FASTA header, see section \ref{sec:input:-sequ-feat} of this tutorial and first tab of \texttt{SeqFeatR} graphical user interface.) For instance, \texttt{SeqFeatR} will automatically discover in the sequences above a significant association between HLA*B01 and amino acid D at third sequence position: \begin{itemize} \item 7 sequences with HLA*B01 \emph{and} D3 \item 0 sequences with \emph{not} HLA*B01 \emph{and} with D3 \item 0 sequences with HLA*B01 \emph{and} \emph{not} D3 \item 7 sequences with \emph{not} HLA*B01 \emph{and not} D3 \end{itemize} Thus, we obtain the following contingency table: \begin{tabular}{c c c c} & & \multicolumn{2}{c}{D} \\\cline{3-4} & & + & -\\\cline{2-4} \multirow{2}{*}{HLA B*01}& + & 7 & 0\\ & - & 0 & 7 \\\cline{1-4} \label{tab:pointHLA} \end{tabular} Fisher's exact test yields a p-value $< 0.001$ and we have an OR of infinity. Thus we have a \emph{significant and strongly positive} association of HLA*B01 and D3. \section{Graphical output} \subsection{\texorpdfstring{$-\log_{10}$ p-value plot (``Manhattan plot'')}{Manhattan plot}} The so-called Manhattan plot, i.e.\ a plot of $-\log_{10}$ p-values along the sequence, is a convenient means to discover significant associations of sequence alignment positions with features. \texttt{SeqFeatR} produces Manhattan plots consisting of two separate plots (Figure \ref{fig:log}): The top half of the plot focuses on complete epitopes or putative epitopes comprising ``windows'' of several sequence positions (e.g.\ windows of 9 positions), while the bottom half gives a more detailed picture of the same data at the level of single sequence positions. The x-axis for both plots is the same, namely the positions in the input sequence alignment. Let us start the discussion with the more fundamental bottom part of Figure \ref{fig:log}, the simple Manhattan plot. In this plot, \texttt{SeqFeatR} can mark a significance level $\alpha$ (here: $\alpha=0.01$) with a horizontal line. Associations with $-\log_{10}p$-values above that line (i.e.\ p-values $< \alpha$) are shown with a special symbol (here: red stars) and considered significant. To ease the visual localization of the highly significant positions, they are additionally marked with vertical lines that hit the sequence axis at the corresponding positions. (The resolution of the x-axis is usually to coarse to show single positions, but sufficient to localize significantly associated positions in the fully resolved csv-file which is given as second output file.) \begin{figure} \centering \def\svgwidth{0.80\columnwidth} \input{log.pdf_tex} \caption[Manhattan plot.]{Manhattan plot (p-values along sequence). The y-axis scales with the $-\log_{10}(\mbox{p-value})$, i.e.\ the higher the point, the more significant the association. Top half of figure combines three different ways of showing possible epitopes: (1) possible epitopes from a window-wise statistical analysis of your data (black line), (2) known epitopes, e.g.\ from the literature (red line), (3) pieces of sequence alignment that conform with certain sequence patterns (yellow line). Bottom half of figure: Manhattan plot with the p-values for each sequence position. There are additional annotations in this graphic to explain what you see in \textcolor{red}{red}. Those are \textit{not} in the real output from \texttt{SeqFeatR}.} \label{fig:log} \end{figure} Now the top part of the Manhattan plot of Figure \ref{fig:log}, focusing not on single positions but on complete putative epitopes. If the features in your \texttt{SeqFeatR} input have been HLA types, and if you then see several sequence positions in close proximity showing up with high $-\log_{10}p$ values in the Manhattan plot, you may have found a HLA epitope. The top part of the Manhattan plot highlights such position clusters. It can potentially show three different curves, a black, a red, and a yellow curve, each indicating potential epitopes. The red and yellow curves are optional. The black curve directly relates to the bottom part of the plot: it shows the number of sequence positions with significant feature association in a window of 9 amino acids (or any other window length given by the user), divided by the window width (e.g.\ 9). The default window width of 9 amino acids corresponds to a typical length of a MHC I binding peptide. This window is shifted over the whole sequence and the fraction of significantly associated position computed for each window position and plotted as y-value at that position. An HLA epitope will show up as bump of the black curve to high values, similar to the bump in the top part of Figure \ref{fig:log}. The red curve (optional) allows the user to mark known epitopes, e.g.\ published in the literature or in a database. This can be helpful for comparisons. You can enter data for the red curve in an extra csv file (``Known epitopes'' in \texttt{SeqFeatR} GUI and web interface). Here an example of what you could put into such a csv file\footnote{A similar example is part of the R-package \texttt{SeqFeatR} and there called \texttt{Example\_epitopes\_aa.csv}.}: \begin{verbatim} 4;12;A1 2;9;A3 \end{verbatim} This example marks two known epitopes with two lines of the form \texttt{EpitopeStart; EpitopeStop; HLAtype}. The first line (\texttt{4;12;A1}) corresponds to the epitope shown as bump of the red line in the top of Figure \ref{fig:log}. The level of the bump is always at a value of 1.0 for these known epitopes. The bump is only shown if the HLA type in the csv file matches the HLA type in the sequence alignment. With the yellow curve (optional) the user can mark alignment regions that are conforming with given sequence patterns. Such patterns can be defined in another csv file (``Known binding motifs'' in \texttt{SeqFeatR} GUI and web interface) following this format\footnote{A similar example is part of the R-package \texttt{SeqFeatR} and there called \texttt{Example\_HLA\_binding\_motifs\_aa.csv}}.: \begin{verbatim} Genotype;Motif;Reference A*01;x[PV]xxxx[DENQ]EN;SYFPEITHI \end{verbatim} The header line (\texttt{Genotype; Motif; Reference}) describes the structure of the following lines. All elements in one row are separated by semicolons. The first element is the HLA type (here: \texttt{A*01}), the second is the actual definition of the motif (here: \texttt{x[PV]xxxx[DENQ]EN}), and the third gives a reference to the origin of this motif \footnote{SYFPEITHI is just an example of a possible reference -- the motif was not really taken from SYFPEITHI}. The definition of the motif requires a bit of explanation. The motif shown here covers nine amino acid positions. Here the nine amino acid positions are shown as indices: \begin{displaymath} \mathtt{x_{1}\underbrace{[PV]}_{2}x_{3}x_{4}x_{5}x_{6}\underbrace{[DENQ]}_{7}E_{8}N_{9}} \end{displaymath} The letter \texttt{x} stands for: ``this could be any amino acid''. The two square brackets at positions 2 and 7 show which amino acids are allowed at these two positions, e.g.\ proline (P) or valine (V) at position 2. At position 8 there has to be a glutamate (E), and at position 9 must be an asparagine (N). Many sequences conform with this particular motif description, e.g.\ \texttt{APYEILDEN} or \texttt{SVRKTSQEN}. In general, motifs should be expressed in the same way using the elements \texttt{x}, \texttt{[]}, and capitals \texttt{ACDE...Y}. \texttt{SeqFeatR} shows a bump of the yellow line to 1.0 if several conditions are fulfilled simultaneously: (a) the HLA type in the sequence alignment matches the HLA type in the csv file, (b) the motif occurs in one of the aligned sequences, and (c) in the sequence window covered by the motif there is at least one significant association of an alignment position with the HLA. \subsection{Advanced \texttt{SeqFeatR} plotting, e.g.\ odds ratio plot} The R-package \texttt{SeqFeatR} offers a number of more advanced commands that are not yet available through the web interface of \texttt{SeqFeatR}. A nice example is the odds ratio plot that requires the use of the function \texttt{orPlot} of the R-package \texttt{SeqFeatR}. This is why you could be interested in the odds ratio plot: While the Manhattan plot is a useful means to gain an overview over the distribution of significant sequence--feature associations along the alignment, there is still important information missing: Which amino acid is characteristic for the positions with low p-value? Is an amino acid overrepresented or underrepresented at such a position in sequences with a certain feature? All this information can be extracted from the csv file produced by \texttt{SeqFeatR}, but the program also provides a new plot, that we call \emph{odds ratio plot}, to visualize this information (Figure \ref{fig:OR-plot}). We had introduced the odds ratio (OR) in section \ref{sec:an-exampl-assoc} as a way to quantify the strength of the association. The OR tells us whether a feature is over- or underrepresented at a position ($OR > 1$ or $OR < 1$, respectively), with $OR=1$ indicating vanishing association. For easier visual recognition, we show the logarithms ($\log_{10}OR$) of the OR values along the sequence. To given an example: using this logarithmic representation, a tenfold overrepresentation of an amino acid at a sequence position in sequences with a certain feature (e.g.\ HLA type) shows up in the OR plot as \emph{upwards} pointing bar of length 1, a tenfold underrepresentation as \emph{downwards} pointing bar of length 1. \begin{figure}[ht] \centering \def\svgwidth{\columnwidth} \input{OR_new.pdf_tex} \caption[Odds ratio plot.]{Example odds ratio plot. Here we analyze amino acid sequences of HIV-1 gp120 protein variants, and we have as feature the so-called co-receptor tropism of HIV-1, which can be ``R5'' or ``not R5'' (the latter is often called ``X4''). The odds ratio (OR) plot shows for each sequence alignment position the association strength $\log_{10}(OR)$ as bar height and the p-value as bar color. The plot demonstrates that high values of $\log_{10}OR$ (long bars) and high statistical significance (blue color) are not the same.} \label{fig:OR-plot} \end{figure} There is one important caveat: even if a long OR bar indicates strong association, it still may not be \emph{statistically significant}. This may be confusing at first, but think about a situation in which you have a small set of sequences, say two sequences, with a certain feature, and one without the feature. In the sequences with the feature we have amino acid A at the first alignment position, in the sequences without the feature we have G at that position. Thus A is overrepresented and G underrepresented in the sequences with the feature, right? In fact your OR would be infinity (see example in section \ref{sec:anoth-exampl-assoc}). But would you believe this? Probably not, since your sequence set is so small that the p-value from the Fisher's exact test is 0.33. Therefore the odds ratio plot allows you to combine OR information and p-value information. In Figure \ref{fig:OR-plot} OR bars with low p-values (highly significant) are filled with a blue color, while OR bars with higher p-values (not so highly significant) are filled with a green hue. As you can see, by far not all long bars are blue, i.e.\ only a subset of positions may have strong and significant associations with the feature. \section{Input: sequences and features} \label{sec:input:-sequ-feat} The whole \texttt{SeqFeatR} analysis is critically dependent on the input. Therefore we summarize here how to prepare the input data properly. First, before you submit sequences to \texttt{SeqFeatR}, you have to \emph{align} them, i.e.\ do not submit sequences to \texttt{SeqFeatR}, but a multiple sequence alignment (MSA). There exist several popular tools for this task, for \texttt{mafft} (\url{http://www.ebi.ac.uk/Tools/msa/mafft/}). Make sure that the resulting MSA is in FASTA format (or Pearson/FASTA) (for a description of FASTA format, see \url{https://en.wikipedia.org/wiki/FASTA_format}, an example is shown on page \pageref{fasta}). Usually, this output format can be chosen as option in the input form of \texttt{mafft} and other MSA programs, so you do not have to do this manually. In the FASTA formatted MSA, we have one block for each sequence. Such a block consists of a header line (starting with $>$) that can be used to describe the sequence, and the sequence itself on the following lines. \texttt{SeqFeatR} expects in each FASTA header line a label that tells it which feature this sequence carries. There are two different types of features that can be put into FASTA headers: \begin{enumerate} \item A feature can be given by a letter or word at the end of each FASTA header after a semicolon, as in: \begin{itemize} \item $>$some information;feature \item $>$patient 1;f \item $>$HCVA;n \item etc. \end{itemize} Anything after a semicolon in the FASTA-header is interpreted as name of a feature. \item HLA types are a special case of features accepted by \texttt{SeqFeatR}. Here an example snippet from a FASTA file with the encoded HLA type information in the FASTA headers (see also example on page \pageref{fasta}): \begin{verbatim} >P1 HLA_A0403_B0403 donor 1 SNFT-TTKLRDQHNLYPAGLQEIEHKVDHQILGIYGQIWY ETSTALRTQDQTFMLALRANYMVMLKVLDCISVKLFICWR DSSTMDAECSTLQRFIWWHAHYAWIRVAKKPYCLDCPYAV >P2 HLA_A0403_B0404 donor 2 ... \end{verbatim} Now let us focus on the first header and for orientation write numbers \texttt{123...} beneath the header that give us the positions of the characters in that line: \begin{verbatim} >P1 HLA_A04_03_B04_03 donor 1 123456789.123456789.123456789.1234 \end{verbatim} Here, \texttt{HLA\_A04\_03} corresponds to HLA-A*04:03\footnote{See e.g.\ \url{https://en.wikipedia.org/wiki/Human_leukocyte_antigen\#Nomenclature}} with locus A, group number 04, and variant 03. The group number ``04'' covers positions 10 and 11 of the line, the variant ``03'' is at positions 13 and 14. Analogously, for the B locus we have group ``04'' at positions 17 and 18, and variant ``03'' at positions 20 and 21. We give \texttt{SeqFeatR} these four position intervals of A and B groups and variants, as shown in Figure \ref{fig:hlapositions}. Importantly, the HLA types in \emph{all} FASTA blocks in one MSA file have to take the \emph{same} positions in their respective lines to be properly recognized by \texttt{SeqFeatR}. \end{enumerate} \begin{figure}[h]\centering \includegraphics[width=1\columnwidth] {screenshotHLA.png} \caption{Screenshot of \texttt{SeqFeatR} web interface with filled HLA positions.} \label{fig:hlapositions} \end{figure} \section{Multiple comparison correction} \texttt{SeqFeatR} applies the same statistical test to \emph{many} sequence alignment positions. Under these conditions it is likely that some of the statistical test yield a low p-value just by chance, and not because of a real association. An example may illustrate this so-called \emph{multiple testing problem} (or \emph{multiple comparison problem}): imagine you toss a fair coin (fair = fifty-fifty chance for head and tail) four times. The most probably result is that you have 2 times head and 2 times tail. The probability to have 4 times head is only 1 in 16 (1/2 for the first toss times 1/2 for the second ... = $(1/2)^4 = 1/16$). Now let us do 16 of these 4-toss experiments, i.e.\ we toss the coin $16 \times 4$ times. Then we expect that in one of these 4-toss experiments we will see 4 times head, just by chance. If we only consider the experiment with the outcome of 4 times head, we have the wrong impression that the coin is \emph{not} fair. In the same way, a set of sequence alignment positions corresponds to a set of random experiments, and if we carry out many association tests, some of them seemingly indicate an association, but that association is not real. \texttt{SeqFeatR} offers various types of corrections (graphical user interface/web interface: P-value correction) as available through the standard R-stats package function \texttt{p.adjust} (in R call help page of p.adjust for more information). This includes e.g.\ very conservative Bonferroni correction or the probably more useful False Discovery Rate correction. \section{Hints} \begin{itemize} \item Do \textbf{not} use a word processor such as Word or LibreOffice/OpenOffice to prepare the sequence alignment input. These programs likely destroy the FASTA format by (invisible) extra characters and invalidate the input. Instead, use an editor for raw text such as \texttt{notepad} in Windows systems, or \texttt{gedit}, \texttt{vim} etc.\ on Linux systems. \item \texttt{SeqFeatR} can only understand letters from canonical DNA/RNA (A, C, G, T, U) and amino acid alphabets (A, C, D, E, ...) in the FASTA sequences. Do not use special characters (?!; etc.) or characters for wobbels (R, K, Y, ...) in sequences. You can however use X or B for unidentified nucleic acids or amino acids. \end{itemize} \section{Advanced Feature: Bayes Factor in SeqFeatR} If for any reason you do not want to work with p-values, you can use \texttt{SeqFeatR} with Bayes Factor (BF) instead of the above mentioned Fisher's exact test. The BF for two hypotheses $H_0$ and $H_1$, given sequence and feature data $D$, is the ratio of posterior odds and the corresponding prior odds:\\ $BF = \left(p(H_1|D)/p(H_0|D)\right)/\left(\pi_1/\pi_0\right)$. In other words, the BF equals the posterior odds if the prior probabilities $\pi_0,\pi_1$ are equal and thus the prior odds is 1. Imagine the following table: \begin{tabular}{c c c c} & & \multicolumn{2}{c}{certain amino acid}\\\cline{3-4} & & + & - \\\cline{2-4} \multirow{2}{*}{feature $f$}& + & $p_{11}$ & $p_{12}$ \\ & - & $p_{21}$ & $p_{22}$ \\\cline{1-4} \label{tab:exampleBF} \end{tabular} In our case, $H_1$ is the hypothesis that a feature is associated with an amino acid or nucleotide at an alignment position, and $H_0$ is the hypothesis that there is no such association. The higher the BF, the more likely $H_1$ (association) and the less likely $H_0$ (no association). Here we use a BF for the hypothesis $H_1$ that feature and amino acid at an alignment position are \emph{close} to independence vs.\ $H_0$ that they are independent. Albert and Gupta presented a 'close to independence' model which is now used in \texttt{SeqFeatR} \footnote{Albert J. Bayesian Computation with R. Springer Verlag; 2009.}. The prior belief in the independence is expressed by a user chosen $K$: the higher this hyperparameter, the more dominant the independence structure will be in comparison to the observed counts, and for $K\rightarrow \infty$ complete independence is achieved. While \texttt{SeqFeatR} allows for setting an explicit $K$ value, it may not be easy to specify an appropriate value of $K$ that is applicable to all alignment positions. Therefore, \texttt{SeqFeatR} also offers an empirical Bayes variant of this BF. In this variant, an individual value of $K$ is estimated from each contingency table itself. \section{Advanced Feature: Discovering associations between mutation tuples and features} If you have discovered single positions with feature associations, you can use SeqFeatR to discover if those positions correlate and are as a combination associated with a feature. These associations are far from visible in the sequences itself, but they are there: \begin{itemize} \item There is a strong association of feature $HLA*A01$ with amino acid pair PQ at the 3 position and 27 position, though patient 6 is an exception as she is $HLA*A01$ and has not PQ but WQ. \item There is an association of feature $HLA*A02$ with amino acid pair DH at the 3 and 23 position, though patient 6 is an exception as she is $HLA*A02$ and has not DH but DY. \end{itemize} \texttt{SeqFeatR} tests \emph{all} sequence position pairs below a given p-value and quickly identifies the 3+27 pair as being statistically significantly associated with the feature ($HLA*A01$) or its absence. \texttt{SeqFeatR} shows these associations graphically as a heat map and in combination with another feature as a ``Tartan plot''. \subsection{An example: association of HLA type and pairs} \label{sec:exampl-assoc-pair} We have noted that in the above alignment there is seemingly a strong association of the occurrence of amino acids P at position 3 and amino acid Q at position 27 with the feature $HLA*A01$. The probability and strength of this association can again be quantified, by a p-value computed with Fisher's exact test and by an odds ratio. In the above example of the association of P at position 3 and Q at position 27 with feature $HLA*A01$, \texttt{SeqFeatR} internally would first collect occurrences in a frequency table and then compute from that frequency table p-value and OR like in the case of a single position \ref{sec:an-exampl-assoc}: \begin{itemize} \item 6 sequences with HLA*A01 \emph{and} P3Q27 \item 0 sequences with \emph{not} HLA*A01 \emph{and} with P3Q27 \item 1 sequences with HLA*A01 \emph{and} \emph{not} P3Q27 \item 7 sequences with \emph{not} HLA*A01 \emph{and not} P3Q27 \end{itemize} Thus, we obtain the following contingency table: \begin{tabular}{c c c c} & & \multicolumn{2}{c}{DH} \\\cline{3-4} & & + & -\\\cline{2-4} \multirow{2}{*}{HLA A*01}& + & 6 & 1\\ & - & 0 & 7 \\\cline{1-4} \label{tab:pairHLA} \end{tabular} Fisher's exact test yields a p-value $< 0.005$ and we have an OR of Infinity. Thus we have a \emph{significant and strongly positive} association of HLA*A01 and P3Q27. \section{Tartan plot: visual comparison of different associations of features and sequence position tuples} The Tartan plot is a way to visualize the comparison of two different types of associations between pairs of sequence alignment positions (lower left vs. upper right triangle). Association strengths are color coded (color legend on the right) and depend e.g. on the p-value of the association of this pair and the feature. For orientation, axes can be annotated and sequence substructures can be indicated by lines. The different types of associations can be for example two different HLA-types, or HLA-type and distance in the protein of the sequences. Here we show an example of two HLA-types, HLA*A01 and HLA*B03 \ref{fig:tart}. \begin{figure} \centering \def\svgwidth{0.80\columnwidth} \input{tart.pdf_tex} \caption[Tartan plot.]{Tartan plot. This plot combines the information of two position pair and feature associations. The first feature (upper triangle) is HLA*A01, the second one (lower triangle) is HLA*B03. A color-filled field represents the value of a position pair according to the color legend. There are additional annotations in this graphic to explain what you see in \textcolor{red}{red}. Those are \textit{not} in the real output from \texttt{SeqFeatR}.} \label{fig:tart} \end{figure} \end{document}