\documentclass[nojss]{jss} \usepackage{amsfonts,bm,amsmath,amssymb} %%\usepackage{Sweave} %% already provided by jss.cls %%%\VignetteIndexEntry{Finite Mixture Model Diagnostics Using Resampling Methods} %%\VignetteDepends{flexmix} %%\VignetteKeywords{R, finite mixture model, resampling, bootstrap} %%\VignettePackage{flexmix} \title{Finite Mixture Model Diagnostics Using Resampling Methods} <>= options(useFancyQuotes = FALSE) digits <- 3 Colors <- c("#E495A5", "#39BEB1") critical_values <- function(n, p = "0.95") { data("qDiptab", package = "diptest") if (n %in% rownames(qDiptab)) { return(qDiptab[as.character(n), p]) }else { n.approx <- as.numeric(rownames(qDiptab)[which.min(abs(n - as.numeric(rownames(qDiptab))))]) return(sqrt(n.approx)/sqrt(n) * qDiptab[as.character(n.approx), p]) } } library("graphics") library("flexmix") combine <- function(x, sep, width) { cs <- cumsum(nchar(x)) remaining <- if (any(cs[-1] > width)) combine(x[c(FALSE, cs[-1] > width)], sep, width) c(paste(x[c(TRUE, cs[-1] <= width)], collapse= sep), remaining) } prettyPrint <- function(x, sep = " ", linebreak = "\n\t", width = getOption("width")) { x <- strsplit(x, sep)[[1]] paste(combine(x, sep, width), collapse = paste(sep, linebreak, collapse = "")) } @ \author{Bettina Gr{\"u}n\\ Wirtschaftsuniversit{\"a}t Wien \And Friedrich Leisch\\ Universit\"at f\"ur Bodenkultur Wien} \Plainauthor{Bettina Gr{\"u}n, Friedrich Leisch} \Address{ Bettina Gr\"un\\ Institute for Statistics and Mathematics\\ Wirtschaftsuniversit{\"a}t Wien\\ Welthandelsplatz 1\\ 1020 Wien, Austria\\ E-mail: \email{Bettina.Gruen@R-project.org}\\ Friedrich Leisch\\ Institut f\"ur Angewandte Statistik und EDV\\ Universit\"at f\"ur Bodenkultur Wien\\ Peter Jordan Stra\ss{}e 82\\ 1190 Wien, Austria\\ E-mail: \email{Friedrich.Leisch@boku.ac.at} } \Abstract{ This paper illustrates the implementation of resampling methods in \pkg{flexmix} as well as the application of resampling methods for model diagnostics of fitted finite mixture models. Convenience functions to perform these methods are available in package \pkg{flexmix}. The use of the methods is illustrated with an artificial example and the \code{seizure} data set. } \Keywords{\proglang{R}, finite mixture models, resampling, bootstrap} \Plainkeywords{R, finite mixture models, resampling, bootstrap} %%------------------------------------------------------------------------- %%------------------------------------------------------------------------- \begin{document} \SweaveOpts{engine=R, echo=true, height=5, width=8, eps=FALSE, keep.source=TRUE} \setkeys{Gin}{width=0.95\textwidth} \section{Implementation of resampling methods}\label{sec:implementation} The proposed framework for model diagnostics using resampling \citep{mixtures:gruen+leisch:2004} equally allows to investigate model fit for all kinds of mixture models. The procedure is applicable to mixture models with different component specific models and does not impose any limitation such as for example on the dimension of the parameter space of the component specific model. In addition to the fitting step different component specific models only require different random number generators for the parametric bootstrap. The \code{boot()} function in \pkg{flexmix} is a generic \proglang{S4} function with a method for fitted finite mixtures of class \code{"flexmix"} and is applicable to general finite mixture models. The function with arguments and their defaults is given by: <>= cat(prettyPrint(gsub("boot_flexmix", "boot", prompt(flexmix:::boot_flexmix, filename = NA)$usage[[2]]), sep = ", ", linebreak = paste("\n", paste(rep(" ", 2), collapse = ""), sep= ""), width = 70)) @ The interface is similar to the \code{boot()} function in package \pkg{boot} \citep{mixtures:Davison+Hinkley:1997, mixtures:Canty+Ripley:2010}. The \code{object} is a fitted finite mixture of class \code{"flexmix"} and \code{R} denotes the number of resamples. The possible bootstrapping method are \code{"empirical"} (also available as \code{"ordinary"}) and \code{"parametric"}. For the parametric bootstrap sampling from the fitted mixture is performed using \code{rflexmix()}. For mixture models with different component specific models \code{rflexmix()} requires a sampling method for the component specific model. Argument \code{initialize\_solution} allows to select if the EM algorithm is started in the original finite mixture solution or if random initialization is performed. The fitted mixture model might contain weights and group indicators. The weights are case weights and allow to reduce the amount of data if observations are identical. This is useful for example for latent class analysis of multivariate binary data. The argument \code{keep\_weights} allows to indicate if they should be kept for the bootstrapping. Group indicators allow to specify that the component membership is identical over several observations, e.g., for repeated measurements of the same individual. Argument \code{keep\_groups} allows to indicate if the grouping information should also be used in the bootstrapping. \code{verbose} indicates if information on the progress should be printed. The \code{control} argument allows to control the EM algorithm for fitting the model to each of the bootstrap samples. By default the \code{control} argument is extracted from the fitted model provided by \code{object}. \code{k} allows to specify the number of components and by default this is also taken from the fitted model provided. The \code{model} argument determines if also the model and the weights slot for each sample are stored and returned. The returned object is of class \code{"FLXboot"} and otherwise only contains the fitted parameters, the fitted priors, the log likelihoods, the number of components of the fitted mixtures and the information if the EM algorithm has converged. The likelihood ratio test is implemented based on \code{boot()} in function \code{LR\_test()} and returns an object of class \code{"htest"} containing the number of valid bootstrap replicates, the p-value, the double negative log likelihood ratio test statistics for the original data and the bootstrap replicates. The \code{plot} method for \code{"FLXboot"} objects returns a parallel coordinate plot with the fitted parameters separately for each of the components. \section{Artificial data set} In the following a finite mixture model is used as the underlying data generating process which is theoretically not identifiable. We are assuming a finite mixture of linear regression models with two components of equal size where the coverage condition is not fulfilled \citep{mixtures:Hennig:2000}. Hence, intra-component label switching is possible, i.e., there exist two parameterizations implying the same mixture distribution which differ how the components between the covariate points are combined. We assume that one measurement per object and a single categorical regressor with two levels are given. The usual design matrix for a model with intercept uses the two covariate points $\mathbf{x}_1 = (1, 0)'$ and $\mathbf{x}_2 = (1, 1)'$. The mixture distribution is given by \begin{eqnarray*} H(y|\mathbf{x}, \Theta) &=& \frac{1}{2} N(\mu_1, 0.1) + \frac{1}{2} N(\mu_2, 0.1), \end{eqnarray*} where $\mu_k(\mathbf{x}) = \mathbf{x}'\bm{\alpha}_k$ and $N(\mu, \sigma^2)$ is the normal distribution. Now let $\mu_1(\mathbf{x}_1) = 1$, $\mu_2(\mathbf{x}_1) = 2$, $\mu_1(\mathbf{x}_2) = -1$ and $\mu_2(\mathbf{x}_2) = 4$. As Gaussian mixture distributions are generically identifiable the means, variances and component weights are uniquely determined in each covariate point given the mixture distribution. However, as the coverage condition is not fulfilled, the two possible solutions for $\bm{\alpha}$ are: \begin{description} \item[Solution 1:] $\bm{\alpha}_1^{(1)} = (2,\phantom{-}2)'$, $\bm{\alpha}_2^{(1)} = (1,-2)'$, \item[Solution 2:] $\bm{\alpha}_1^{(2)} = (2,-3)'$, $\bm{\alpha}_2^{(2)} = (1,\phantom{-}3)'$. \end{description} We specify this artificial mixture distribution using \code{FLXdist()}. \code{FLXdist()} returns an unfitted finite mixture of class \code{"FLXdist"}. The class of fitted finite mixture models \code{"flexmix"} extends class \code{"FLXdist"}. Each component follows a normal distribution. The parameters specified in a named list therefore consist of the regression coefficients and the standard deviation. Function \code{FLXdist()} has an argument \code{formula} for specifying the regression in each of the components, an argument \code{k} for the component weights and \code{components} for the parameters of each of the components. <<>>= library("flexmix") Component_1 <- list(Model_1 = list(coef = c(1, -2), sigma = sqrt(0.1))) Component_2 <- list(Model_1 = list(coef = c(2, 2), sigma = sqrt(0.1))) ArtEx.mix <- FLXdist(y ~ x, k = rep(0.5, 2), components = list(Component_1, Component_2)) @ We draw a balanced sample with 50 observations in each covariate point from the mixture model using \code{rflexmix()} after defining the data points for the covariates. \code{rflexmix()} can either have an unfitted or a fitted finite mixture as input. For unfitted mixtures data has to be provided using the \code{newdata} argument. For already fitted mixtures data can be optionally provided, otherwise the data used for fitting the mixture is used. <<>>= ArtEx.data <- data.frame(x = rep(0:1, each = 100/2)) suppressWarnings(RNGversion("3.5.0")) set.seed(123) ArtEx.sim <- rflexmix(ArtEx.mix, newdata = ArtEx.data) ArtEx.data$y <- ArtEx.sim$y[[1]] ArtEx.data$class <- ArtEx.sim$class @ In Figure~\ref{fig:art} the sample is plotted together with the two solutions for combining $x_1$ and $x_2$, i.e., this illustrates intra-component label switching. \begin{figure} \centering <>= par(mar = c(5, 4, 2, 0) + 0.1) plot(y ~ x, data = ArtEx.data, pch = with(ArtEx.data, 2*class + x)) pars <- list(matrix(c(1, -2, 2, 2), ncol = 2), matrix(c(1, 3, 2, -3), ncol = 2)) for (i in 1:2) apply(pars[[i]], 2, abline, col = Colors[i]) @ \caption{Balanced sample from the artificial example with the two theoretical solutions.} \label{fig:art} \end{figure} We fit a finite mixture to the sample using \code{stepFlexmix()}. <<>>= set.seed(123) ArtEx.fit <- stepFlexmix(y ~ x, data = ArtEx.data, k = 2, nrep = 5, control = list(iter = 1000, tol = 1e-8, verbose = 0)) @ The fitted mixture can be inspected using \code{summary()} and \code{parameters()}. <<>>= summary(ArtEx.fit) parameters(ArtEx.fit) @ Obviously the fitted mixture parameters correspond to the parameterization we used to specify the mixture distribution. Using standard asymptotic theory to analyze the fitted mixture model gives the following estimates for the standard deviations. <<>>= ArtEx.refit <- refit(ArtEx.fit) summary(ArtEx.refit) @ The fitted mixture can also be analyzed using resampling techniques. For analyzing the stability of the parameter estimates where the possibility of identifiability problems is also taken into account the parametric bootstrap is used with random initialization. Function \code{boot()} can be used for empirical or parametric bootstrap (specified by the argument \code{sim}). The logical argument \code{initialize_solution} specifies if the initialization is in the original solution or random. By default random initialization is made. The number of bootstrap samples is set by the argument \code{R}. Please note that the arguments are chosen to correspond to those for function \code{boot} in package \pkg{boot} \citep{mixtures:Davison+Hinkley:1997}. <>= set.seed(123) ArtEx.bs <- boot(ArtEx.fit, R = 200, sim = "parametric") ArtEx.bs @ <>= if (file.exists("ArtEx.bs.rda")) { load("ArtEx.bs.rda") } else { set.seed(123) ArtEx.bs <- boot(ArtEx.fit, R = 200, sim = "parametric") save(ArtEx.bs, file = "ArtEx.bs.rda") } ArtEx.bs @ Function \code{boot()} returns an object of class \code{"\Sexpr{class(ArtEx.bs)}"}. The default plot compares the bootstrap parameter estimates to the confidence intervals derived using standard asymptotic theory in a parallel coordinate plot (see Figure~\ref{fig:plot.FLXboot-art}). Clearly two groups of parameter estimates can be distinguished which are about of equal size. One subset of the parameter estimates stays within the confidence intervals induced by standard asymptotic theory, while the second group corresponds to the second solution and clusters around these parameter values. \begin{figure}[h!] \centering <>= print(plot(ArtEx.bs, ordering = "coef.x", col = Colors)) @ \caption{Diagnostic plot of the bootstrap results for the artificial example.} \label{fig:plot.FLXboot-art} \end{figure} In the following the DIP-test is applied to check if the parameter estimates follow a unimodal distribution. This is done for the aggregated parameter esimates where unimodality implies that this parameter is not suitable for imposing an ordering constraint which induces a unique labelling. For the separate component analysis which is made after imposing an ordering constraint on the coefficient of $x$ rejection the null hypothesis of unimodality implies that identifiability problems are present, e.g.~due to intra-component label switching. <<>>= require("diptest") parameters <- parameters(ArtEx.bs) Ordering <- factor(as.vector(apply(matrix(parameters[,"coef.x"], nrow = 2), 2, order))) Comp1 <- parameters[Ordering == 1,] Comp2 <- parameters[Ordering == 2,] dip.values.art <- matrix(nrow = ncol(parameters), ncol = 3, dimnames=list(colnames(parameters), c("Aggregated", "Comp 1", "Comp 2"))) dip.values.art[,"Aggregated"] <- apply(parameters, 2, dip) dip.values.art[,"Comp 1"] <- apply(Comp1, 2, dip) dip.values.art[,"Comp 2"] <- apply(Comp2, 2, dip) dip.values.art @ The critical value for column \code{Aggregated} is \Sexpr{round(critical_values(nrow(parameters)), digits = digits)} and for the columns of the separate components \Sexpr{round(critical_values(nrow(Comp1)), digits = digits)}. The component sizes as well as the standard deviations follow a unimodal distribution for the aggregated data as well as for each of the components. The regression coefficients are multimodal for the aggregate data as well as for each of the components. While from the aggregated case it might be concluded that imposing an ordering constraint on the intercept or the coefficient of $x$ is suitable, the component-specific analyses reveal that a unique labelling was not achieved. \section{Seizure} In \cite{mixtures:Wang+Puterman+Cockburn:1996} a Poisson mixture regression is fitted to data from a clinical trial where the effect of intravenous gammaglobulin on suppression of epileptic seizures is investigated. The data used were 140 observations from one treated patient, where treatment started on the $28^\textrm{th}$ day. In the regression model three independent variables were included: treatment, trend and interaction treatment-trend. Treatment is a dummy variable indicating if the treatment period has already started. Furthermore, the number of parental observation hours per day were available and it is assumed that the number of epileptic seizures per observation hour follows a Poisson mixture distribution. The number of epileptic seizures per parental observation hour for each day are plotted in Figure~\ref{fig:seizure}. The fitted mixture distribution consists of two components which can be interpreted as representing 'good' and 'bad' days of the patients. The mixture model can be formulated by \begin{equation*} H(y|\mathbf{x}, \Theta) = \pi_1 P(\lambda_1) + \pi_2 P(\lambda_2), \end{equation*} where $\lambda_k = e^{\mathbf{x}'\bm{\alpha}_k}$ for $k = 1,2$ and $P(\lambda)$ is the Poisson distribution. The data is loaded and the mixture fitted with two components. <<>>= data("seizure", package = "flexmix") model <- FLXMRglm(family = "poisson", offset = log(seizure$Hours)) control <- list(iter = 1000, tol = 1e-10, verbose = 0) set.seed(123) seizMix <- stepFlexmix(Seizures ~ Treatment * log(Day), data = seizure, k = 2, nrep = 5, model = model, control = control) @ The fitted regression lines for each of the two components are shown in Figure~\ref{fig:seizure}. \begin{figure}[h!] \begin{center} <>= par(mar = c(5, 4, 2, 0) + 0.1) plot(Seizures/Hours~Day, data=seizure, pch = as.integer(seizure$Treatment)) abline(v = 27.5, lty = 2, col = "grey") matplot(seizure$Day, fitted(seizMix)/seizure$Hours, type="l", add = TRUE, col = 1, lty = 1, lwd = 2) @ \caption{Seizure data with the fitted values for the \citeauthor{mixtures:Wang+Puterman+Cockburn:1996} model. The plotting character for the observed values in the base period is a circle and for those in the treatment period a triangle.} \label{fig:seizure} \end{center} \end{figure} The parameteric bootstrap with random initialization is used to investigate identifiability problems and parameter stability. The diagnostic plot is given in Figure~\ref{fig:plot.FLXboot-seiz}. The coloring is according to an ordering constraint on the intercept. Clearly the parameter estimates corresponding to the solution where the bad days from the base period are combined with the good days from the treatement period and vice versa for the good days of the base period can be distinguished and indicate the slight identifiability problems of the fitted mixture. <>= set.seed(123) seizMix.bs <- boot(seizMix, R = 200, sim = "parametric") seizMix.bs @ <>= if (file.exists("seizMix.bs.rda")) { load("seizMix.bs.rda") } else { set.seed(123) seizMix.bs <- boot(seizMix, R = 200, sim = "parametric") save(seizMix.bs, file = "seizMix.bs.rda") } seizMix.bs @ \begin{figure}[h!] \centering <>= print(plot(seizMix.bs, ordering = "coef.(Intercept)", col = Colors)) @ \label{fig:plot.FLXboot-seiz} \caption{Diagnostic plot of the bootstrap results for the \code{seizure} data.} \end{figure} <<>>= parameters <- parameters(seizMix.bs) Ordering <- factor(as.vector(apply(matrix(parameters[,"coef.(Intercept)"], nrow = 2), 2, order))) Comp1 <- parameters[Ordering == 1,] Comp2 <- parameters[Ordering == 2,] @ For applying the DIP test also an ordering constraint on the intercept is used. The critical value for column \code{Aggregated} is \Sexpr{round(critical_values(nrow(parameters)), digits = digits)} and for the columns of the separate components \Sexpr{round(critical_values(nrow(Comp1)), digits = digits)}. <<>>= dip.values.art <- matrix(nrow = ncol(parameters), ncol = 3, dimnames = list(colnames(parameters), c("Aggregated", "Comp 1", "Comp 2"))) dip.values.art[,"Aggregated"] <- apply(parameters, 2, dip) dip.values.art[,"Comp 1"] <- apply(Comp1, 2, dip) dip.values.art[,"Comp 2"] <- apply(Comp2, 2, dip) dip.values.art @ For the aggregate results the hypothesis of unimodality cannot be rejected for the trend. For the component-specific analyses unimodality cannot be rejected only for the intercept (where the ordering condition was imposed on) and again the trend. For all other parameter estimates unimodality is rejected which indicates that the ordering constraint was able to impose a unique labelling only for the own parameter and not for the other parameters. This suggests identifiability problems. %%------------------------------------------------------------------------- %%------------------------------------------------------------------------- \bibliography{mixture} \end{document}