--- title: "MECfda: An R package for bias correction due to measurement error in functional and scalar covariates in scalar-on-function regression models" author: - Ji, Heyang - Beyaztas, Ufuk - Escobar, Nicolas - Luan, Yuanyuan - Chen, Xiwei - Zhang, Mengli - Zoh, Roger - Xue, Lan - Tekwe, Carmen output: rmarkdown::html_vignette bibliography: references.bib csl: "biomed-central.csl" nocite: | '@*' abstract: "Functional data analysis is a statistical approach used to analyze data that appear as functions or images. Functional data analysis methods can be used to analyze device-based measures such as physical activity and sleep. Although device-based measures are more objective than self-reported measures of physical activity patterns or sleep activity, device-based measures can still be prone to measurement error. When assessing associations between scalar-valued outcomes and device-based measures, scalar-on-function regression models that correct for measurement error can be applied. We develop the **MECfda** package for several scalar-on-function regression models including multi-level generalized scalar-on-function regression models and functional quantile linear regression models. The developed package implements several bias-correction methods that account for the presence of multiple functional and scalar covariates prone to measurement error in various scalar-on-function regression settings." keywords: "scalar-on-function regression, functional data, measurement error, function-valued covariates." vignette: > %\VignetteIndexEntry{MECfda: An R package for bias correction due to measurement error in functional and scalar covariates in scalar-on-function regression models} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # Introduction Functional data analysis (FDA) is an essential statistical approach for handling high-dimensional data that appear in the form of functions or images [@wang2016functional; @ramsay2002applied; @ramsay1991some]. In many applications, data are collected continuously or at frequent time intervals, resulting in complex functions over time. When the objective is to study the relationship between a scalar response and both functional and scalar covariates, scalar-on-function regression models become a natural choice. Although self-reported measures (e.g., dietary intake) are known to suffer from measurement error [@carroll2006measurement], recent studies have demonstrated that even device-based measurements (such as those from wearable devices) can be error prone [@crainiceanu2009generalized; @tekwe2019instrumental]. Failing to correct for measurement error in either scalar or function-valued covariates may lead to biased parameter estimates. The goal of the **MECfda** package is to provide solutions for a range of scalar-on-function regression models—including multi-level generalized scalar-on-function regression models and functional quantile regression models—while incorporating several bias-correction methods for measurement error in scalar-on-function regression problem. This package is developed in [R](https://www.R-project.org/). ```{r setup} library(MECfda) ``` # Scalar-on-Function Linear Regression Models The general form of the scalar-on-function regression model is given by $$T(F_{Y_i|X_i,Z_i}) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) dt + (1,Z_i^T)\gamma$$ where - $Y_i$ is the scalar response variable; - $X_{li}(t)$ denotes the $l$th function-valued covariate defined on the domain $\Omega_l$; - $\beta_l(t)$ is the corresponding function-valued regression coefficient, and both $\beta_l$ and $X_{li}$ belong to $L^2(\Omega_l)$; - $Z_i = (Z_{1i}, \dots, Z_{qi})^T$ represents the scalar covariates; - $\gamma = (\gamma_0, \gamma_1, \dots, \gamma_q)^T$ is the parameter vector for the scalar covariates; - $F_{Y_i\mid X_i,Z_i}$ is the cumulative distribution function of $Y_i$ conditional on $X_i$ and $Z_i$; - $T(\cdot)$ is a statistical functional. A statistical functional is defined as a mapping from a probability distribution to a real number. Statistical functionals represent quantifiable properties of probability distributions, (e.g., an expectation, a variance or a quantile). Two common model types supported in this package are: 1. **Multi-level Generalized Scalar-on-Function Regression:** This model is formulated as $$ T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = g\Bigl\{E(Y_i\mid X_i,Z_i)\Bigr\} = g\left\{\int_\mathbb{R}ydF_{Y_i|X_i,Z_i}(y)\right\}, $$ where $g(\cdot)$ is a strictly monotonic link function analogous to that used in generalized linear models. 2. **Scalar-on-Function Quantile Regression:** This model is expressed as $$ T\bigl(F_{Y_i\mid X_i,Z_i}\bigr) = Q_{Y_i\mid X_i,Z_i}(\tau) = F_{Y_i\mid X_i,Z_i}^{-1}(\tau), $$ where $\tau \in (0,1)$ denotes the quantile of interest. ## Representation of Functional Data Function-valued variables (functional variables) are typically recorded as measurements at specific time points, resulting in an $n \times m$ data matrix $(x_{ij})$, where each row corresponds to an observation (subject) and each column to a measurement time. ## Functional Data Function-valued variables (functional variables) are typically recorded as the measurements of the values of the functions at specific (time) points in the domain. The data of a function-valued variable are often presented in the form of a matrix, $(x_{ij})_{n\times m}$, where $x_{ij} = f_i(t_j)$, represents the value of $f_i(t_j)$, each row corresponds to an observation (subject) and each column to a measurement time. ### S4 Class `functional_variable` In the **MECfda** package, the s4 class `functional_variable` represents function-valued data stored in matrix form. The main slots include: - **`X`**: The data matrix $(x_{ij})$; - **`t_0`** and **`period`**: The left endpoint and length of the function’s domain, respectively (The class only support the domain in the format of interval); - **`t_points`**: A vector of the time points at which the measurements are recorded. A dedicated method, `dim`, returns the number of subjects and the number of measurement points for a `functional_variable` object. ```{r fd} fv = functional_variable( X = matrix(rnorm(10*24),10,24), t_0 = 0, period = 1, t_points = (0:9)/10 ) dim(fv) ``` ## Basis Expansion Let $\{\rho_{k}\}_{k=1}^\infty$ be a complete basis for $L^2(\Omega)$. For an arbitrary function $\beta(t) \in L^2(\Omega)$, there exists a sequence of (real) number $\{c_k\}$ such that $$ \beta(t) = \sum_{k=1}^\infty c_k \, \rho_k(t). $$ For the function-valued variable $X_i(t)$, we have $$ \int_\Omega \beta(t) X_i(t) \, dt = \int_\Omega X_i(t) \sum_{k=1}^\infty c_{k}\rho_{k}(t) dt = \sum_{k=1}^\infty c_k \int_\Omega \rho_k(t) X_i(t) \, dt. $$ In practice, for $\int_\Omega\beta(t) X_i(t) dt$ in statistical models, we truncate the infinite series to $p$ terms ($p<\infty$), using $c_{k} \int_\Omega X_i(t) \rho_{k}(t) dt$ to approximate $\int_\Omega\beta(t) X_i(t) dt$ so that the problem reduces to estimating a finite number of parameters $c_1, \dots, c_p$, and treating $\int_\Omega \rho_k(t) X_i(t) \, dt$ as new covariates. The number of basis, $p$, can be related to the sample size, $n$, and the $p$ is in a form of $p(n)$. In practice, we may not necessarily use the truncated complete basis of $L^2$ function space. Instead, we can just use a finite sequence of linearly independent functions as the basis function. Common choices for the basis include the Fourier basis, b-spline basis, and eigenfunction basis. In numerical computing level, performing basis expansion methods on function-valued variable data in matrix form as we mentioned is to compute $(b_{ik})_{n\times p}$, where $b_{ik} = \int_\Omega f(t)\rho_k(t) dt$. ### Fourier Basis On the interval $[t_0, t_0+T]$, the Fourier basis consists of $$ \frac{1}{2}, \quad \cos\left(\frac{2\pi}{T}k(x-t_0)\right), \quad \sin\left(\frac{2\pi}{T}k(x-t_0)\right), \quad k = 1, 2, \dots $$ #### S4 Class `Fourier_series` In the **MECfda** package, s4 class `Fourier_series` represents a Fourier series of the form $$ \frac{a_0}{2} + \sum_{k=1}^{p_a} a_k \cos\left(\frac{2\pi k (x-t_0)}{T}\right) + \sum_{k=1}^{p_b} b_k \sin\left(\frac{2\pi k (x-t_0)}{T}\right),\quad x \in [t_0, t_0+T]. $$ Its main slots include: - **`double_constant`**: The value of $a_0$; - **`cos`** and **`sin`**: The coefficients $a_k$ and $b_k$ for the cosine and sine terms, respectively; - **`k_cos`** and **`k_sin`**: The frequency values corresponding to the cosine and sine coefficients; - **`t_0`** and **`period`**: The left endpoint, $t_0$, and length, $T$, of the domain (interval $[t_0, t_0+T]$), respectively. Additional methods such as `plot`, `FourierSeries2fun`, and `extractCoef` are provided for visualization, function evaluation, and coefficient extraction. ```{r c1} fsc = Fourier_series( double_constant = 3, cos = c(0,2/3), sin = c(1,7/5), k_cos = 1:2, k_sin = 1:2, t_0 = 0, period = 1 ) plot(fsc) FourierSeries2fun(fsc,seq(0,1,0.05)) extractCoef(fsc) ``` The object `fsc` represents the summation $$\frac32 + \frac23 \cos(2\pi\cdot2x) + \sin(2\pi x) + \frac75\sin(2\pi\cdot2x).$$ ### B-spline Basis A b-spline basis $\{B_{i,p}(x)\}_{i=-p}^{k}$ on the interval $[t_0, t_{k+1}]$ is defined recursively as follows: - For $p = 0$, $$ B_{i,0}(x) = \begin{cases} I_{(t_i, t_{i+1}]}(x), & i = 0, 1, \dots, k, \\ 0, & \text{otherwise}. \end{cases} $$ - For $p > 0$, the recursion is given by $$ B_{i,p}(x) = \frac{x-t_i}{t_{i+p}-t_i} B_{i,p-1}(x) + \frac{t_{i+p+1}-x}{t_{i+p+1}-t_{i+1}} B_{i+1,p-1}(x). $$ At discontinuities in the interval $(t_0,t_k)$, the basis function is defined as its limit, $B_{i,p}(x) = \lim_{t\to x} B_{i,p}(t)$, to ensure continuity. #### S4 Classes `bspline_basis` and `bspline_series` The s4 class `bspline_basis` represents a b-spline basis, $\{B_{i,p}(x)\}_{i=-p}^{k}$, on the interval $[t_0, t_{k+1}]$. Its key slots include: - **`Boundary.knots`**: The boundary $[t_0, t_{k+1}]$, (default is $[0,1]$); - **`knots`**: The internal spline knots, $(t_1,\dots,t_k)$, (automatically generated as equally spaced, $t_j = t_0 + j\cdot\frac{t_{k+1}-t_0}{k+1}$, if not assigned); - **`intercept`**: Indicates whether an intercept is included (default is `TRUE`); - **`df`**: Degrees of freedom of the basis, which is the number of the splines, equal to $p + k + 1$ (by default $k =0$ and $\text{df} = p+1$); - **`degree`**: The degree of the spline, which is the degree of piecewise polynomials, $p$, (default is 3). The s4 class `bspline_series` represents the linear combination $$ \sum_{i=0}^{k} b_i B_{i,p}(x), $$ where the slot **`coef`** stores the coefficients $b_i$ ($i = 0,\dots,k$) and **`bspline_basis`** specifies the associated basis (represented by a `bspline_basis` object). Methods such as `plot` and `bsplineSeries2fun` are provided for visualization and function evaluation. ```{r c2} bsb = bspline_basis( Boundary.knots = c(0,24), df = 7, degree = 3 ) bss = bspline_series( coef = c(2,1,3/4,2/3,7/8,5/2,19/10), bspline_basis = bsb ) plot(bss) bsplineSeries2fun(bss,seq(0,24,0.5)) ``` The object `bsb` represents $\{B_{i,3}(x)\}_{i=-3}^{0}$, and the object `bss` represents $$2B_{i,-3}(x)+B_{i,-2}(x)+\frac34B_{i,-1}(x)+\frac23B_{i,0}(x) + \frac78B_{i,1}(x) + \frac52B_{i,2}(x) +\frac{19}{10}B_{i,3}(x),$$ where $x\in[t_0,t_4]$ and $t_0=0$, $t_k = t_{k-1}+6$ ,$k=1,2,3,4$. ### Eigenfunction basis Suppose $K(s,t)\in L^2(\Omega\times \Omega)$, $f(t)\in L^2(\Omega)$. Then $K$ induces an linear operator $\mathcal{K}$ by $$(\mathcal{K}f)(x) = \int_{\Omega} K(t,x)f(t)dt$$ If $\xi \in L^2(\Omega)$ such that $$\mathcal{K}\xi = \lambda \xi$$ where $\lambda\in \mathbb{C}$, we call $\xi$ an eigenfunction/eigenvector of $\mathcal{K}$, and $\lambda$ an eigenvalue associated with $\xi$. For a stochastic process $\{X(t),t\in\Omega\}$ we call the orthogonal basis, $\{\xi_k\}_{k=1}^\infty$ corresponding to eigenvalues $\{\lambda_k\}_{k=1}^\infty$ ($\lambda_1\geq\lambda_2\geq\dots$), induced by $$K(s,t)=\operatorname{Cov}(X(t),X(s))$$ a functional principal component (FPC) basis for $L^2(\Omega)$. The eigenfunction basis relies on the covariance function $K(s,t)$. And we usually cannot analytically express the equation of the basis funcitons in eigenfunction basis. In practice, the covariance function is estimated from the data and the eigenfunctions are computed numerically. ### Numerical Basis Sometimes, analytical expressions for the basis functions are not available. In these cases, they can be represented numerically. Let $\{\rho_k\}_{k=1}^\infty$ denote a basis of the function space. We can numerically approximate the basis by storing the values of a finite subset of these functions at a finite set of points in the domain, i.e., $\rho_k(t_j)$ for $k=1,\dots,p$ and $j=1,\dots,m$. #### S4 Class `numeric_basis` and `numericBasis_series` In this package, the s4 class `numeric_basis` is designed to represent a finite sequence of functions $\{\rho_k\}_{k=1}^p$ by their values at a finite set of points within their domain. In this context, all functions share the same domain, which is assumed to be an interval. The key slots include: - **`basis_function`**: The matrix $(\zeta_{jk})_{m \times p}$, where each element is defined as $\zeta_{jk} = \rho_k(t_j)$ for $j = 1, \dots, m$ and $k = 1, \dots, p$. In this matrix, each row corresponds to a point in the domain, and each column corresponds to a specific basis function. - **`t_points`**: A numeric vector that represents the points in the domain at which the basis functions are evaluated. The $j$-th element of this vector corresponds to the $j$-th row of the `basis_function` matrix. - **`t_0`** and **`period`**: The left endpoint and length of the domain, respectively. Additionally, the package provides an s4 class `numericBasis_series`, which represents a linear combination of the basis functions represented by a `numeric_basis` object. The key slots include: - **`coef`**: The linear coefficients for the summation series. - **`numeric_basis`**: Function basis as represented by a `numeric_basis` object. #### Function `basis2fun` The generic function `basis2fun` is provided for `Fourier_series`, `bspline_series`, and `numericBasis_series` objects. For a `Fourier_series` object, it is equivalent to `FourierSeries2fun`; for a `bspline_series` object, it is equivalent to `bsplineSeries2fun`; For a `numericBasis_series` object, it is equivalent to `numericBasisSeries2fun`. ```{r basis2fun} basis2fun(fsc,seq(0,1,0.05)) basis2fun(bss,seq(0,24,0.5)) ``` ### Basis expansion methods for the `functional_variable` class The **MECfda** pakcage provide the methods `fourier_basis_expansion`, `bspline_basis_expansion`, `FPC_basis_expansion`, and `numeric_basis_expansion` for the class `functional_variable`, which perform basis expansion using Fourier basis, b-spline basis, functional principal component basis, and numerical basis respectively. ```{r be} data(MECfda.data.sim.0.0) fv = MECfda.data.sim.0.0$FC[[1]] BE.fs = fourier_basis_expansion(fv,5L) BE.bs = bspline_basis_expansion(fv,5L,3L) ``` ## Numerical Computation of Integrals The package approximates the integral $$ \int_{\Omega} \rho_{k}(t) X_{i}(t) \, dt \approx \frac{\mu(\Omega)}{|T|} \sum_{t \in T} \rho_{k}(t) X_{i}(t), $$ where $\mu(\cdot)$ denote the Lebesgue measure. $T$ denotes the set of measurement time points for $X_i(t)$ and $|T|$ represents the cardinal number of $T$. # Scalar-on-Function Linear Regression in **MECfda** ## fcRegression ```{r shili1, eval = FALSE} fcRegression(Y, FC, Z, formula.Z, family = gaussian(link = "identity"), basis.type = c("Fourier", "Bspline"), basis.order = 6L, bs_degree = 3) ``` The **MECfda** package provides a function, `fcRegression`, for fitting generalized linear mixed-effect models—including ordinary linear models and generalized linear models with fixed and random effects—by discretizing function-valued covariates using basis expansion. The function can solve a linear model of the form $$ g\bigl(E(Y_i\mid X_i,Z_i)\bigr) = \sum_{l=1}^{L} \int_{\Omega_l} \beta_l(t) X_{li}(t) \, dt + (1,Z_i^T)\gamma. $$ It supports one or multiple function-valued covariates as fixed effects, and zero, one, or multiple scalar covariates, which can be specified as either fixed or random effects. The response variable, function-valued covariates, and scalar covariates are supplied separately via the arguments `Y`, `FC`, and `Z`, respectively, allowing great flexibility in data format. - **Response Variable:** The response can be provided as an atomic vector, a one-column matrix, or a data frame. However, a one-column data frame or matrix with a column name is recommended so that the response variable's name is clearly specified. - **Function-Valued Covariates:** The function-valued covariates can be input as a `functional_variable` object, a matrix, a data frame, or a list of these objects. If a single `functional_variable` object (or matrix or data frame) is provided via `FC`, the model includes one function-valued covariate. If a list is provided, the model accommodates multiple function-valued covariates, with each element in the list representing a distinct covariate. - **Scalar Covariates:** Scalar covariates can be provided as a matrix, data frame, atomic vector, or even `NULL` if no scalar covariates are present. If `Z` is omitted or set to `NULL`, no scalar covariate is included and the argument `formula.Z` should also be `NULL` or omitted. When an atomic vector is used for `Z`, it represents a single scalar covariate (without a name). Hence, even when only one scalar covariate is included, it is recommended to supply it as a matrix or data frame with a column name. The `formula.Z` argument specifies which parts of `Z` to use and whether scalar covariates should be treated as fixed or random effects. Other key arguments include: - **`family`:** Specifies the response variable's distribution and the link function to be used. - **`basis.type`:** Indicates the type of basis function for the expansion. Options include `'Fourier'`, `'Bspline'`, and `'FPC'`, corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively. - **`basis.order`:** Specifies the number of basis functions to use. For the Fourier basis (i.e., $\frac{1}{2},\ \sin(kt),\ \cos(kt)$ for $k = 1, \dots, p_f$), `basis.order` equals $p_f$. For the b-spline basis $\{B_{i,p}(x)\}_{i=-p}^{k}$, `basis.order` is set to $k+p+1$. For the FPC basis, `basis.order` represents the number of functional principal components. - **`bs_degree`:** Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis. The function returns an s3 object of class `fcRegression`, which is a list containing the following elements: 1. **regression_result:** The result of the regression analysis. 2. **FC.BasisCoefficient:** A list of `Fourier_series`, `bspline_series`, or `numericBasis_series` objects representing the function-valued linear coefficients of the covariates. 3. **function.basis.type:** The type of basis used. 4. **basis.order:** The number of basis functions used, as specified in the input. 5. **data:** The original data provided to the model. 6. **bs_degree:** The degree of the b-spline basis, same as specified in the input (included only when the b-spline basis is used). Additionally, the `predict` method can be used to obtain predicted values from the fitted model, and the `fc.beta` method is available to evaluate the estimated function-valued coefficient parameters $\beta_l(t)$. ```{r fcglmm} data(MECfda.data.sim.0.0) res = fcRegression(FC = MECfda.data.sim.0.0$FC, Y=MECfda.data.sim.0.0$Y, Z=MECfda.data.sim.0.0$Z, family = gaussian(link = "identity"), basis.order = 5, basis.type = c('Bspline'), formula.Z = ~ Z_1 + (1|Z_2)) t = (0:100)/100 plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t))) plot(x = t, y = fc.beta(res,2,t), ylab = expression(beta[2](t))) data(MECfda.data.sim.1.0) predict(object = res, newData.FC = MECfda.data.sim.1.0$FC, newData.Z = MECfda.data.sim.1.0$Z) ``` ## fcQR ```{r shili2, eval = FALSE} fcQR(Y, FC, Z, formula.Z, tau = 0.5, basis.type = c("Fourier", "Bspline"), basis.order = 6L, bs_degree = 3) ``` The **MECfda** package provides a function, `fcRegression`, for fitting quantile linear regression models by discretizing function-valued covariates using basis expansion. The function can solve a linear model of the form $$Q_{Y_i|X_i,Z_i}(\tau) = \sum_{l=1}^L\int_{\Omega_l} \beta_l(\tau,t) X_{li}(t) dt + (1,Z_i^T)\gamma.$$ The function supports one or multiple function-valued covariates, as well as zero, one, or multiple scalar-valued covariates. Data input follows the same guidelines as for the `fcRegression` function, and the treatment of scalar covariates is determined by the `formula.Z` argument. The primary difference between `fcQR` and `fcRegression` is that the quantile linear regression model does not incorporate random effects. The quantile $\tau$ is specified by the argument `tau`, and the type and parameters of the basis functions are defined by the `basis.type`, `basis.order`, and `bs_degree` arguments, just as in `fcRegression`. The function returns an s3 object of class `fcQR`, which is a list containing the following elements: 1. **regression_result:** The result of the regression analysis. 2. **FC.BasisCoefficient:** A list of `Fourier_series`, `bspline_series`, or `numericBasis_series` objects representing the function-valued linear coefficients of the covariates. 3. **function.basis.type:** The type of basis used. 4. **basis.order:** The number of basis functions used, as specified in the input. 5. **data:** The original data provided to the model. 6. **bs_degree:** The degree of the b-spline basis, same as specified in the input (included only when the b-spline basis is used). Additionally, the `predict` method can be used to obtain predicted values from the fitted model, and the `fc.beta` method is available to evaluate the estimated function-valued coefficient parameters $\beta_l(t)$. ```{r fcqr} data(MECfda.data.sim.0.0) res = fcQR(FC = MECfda.data.sim.0.0$FC, Y=MECfda.data.sim.0.0$Y, Z=MECfda.data.sim.0.0$Z, tau = 0.5, basis.order = 5, basis.type = c('Bspline'), formula.Z = ~ .) t = (0:100)/100 plot(x = t, y = fc.beta(res,1,t), ylab = expression(beta[1](t))) plot(x = t, y = fc.beta(res,2,t), ylab = expression(beta[2](t))) data(MECfda.data.sim.1.0) predict(object = res, newData.FC = MECfda.data.sim.1.0$FC, newData.Z = MECfda.data.sim.1.0$Z) ``` # Measurement Error Models and Bias Correction Estimation Methods Data collected in real-world settings often include measurement error, especially function-valued data. Measurement error in a data set may result in estimation bias. The *MECfda** package also provides bias correction estimation method functions for certain linear regression models for use with data containing measurement error. ### ME.fcRegression_MEM ```{r shili3, eval = FALSE} ME.fcRegression_MEM( data.Y, data.W, data.Z, method = c("UP_MEM", "MP_MEM", "average"), t_interval = c(0, 1), t_points = NULL, d = 3, family.W = c("gaussian", "poisson"), family.Y = "gaussian", formula.Z, basis.type = c("Fourier", "Bspline"), basis.order = NULL, bs_degree = 3, smooth = FALSE, silent = TRUE ) ``` Wearable monitoring devices permit the continuous monitoring of biological processes, such as blood glucose metabolism, and behaviors, such as sleep quality and physical activity. Continuous monitoring often collect data in 60-second epochs over multiple days, resulting in high-dimensional multi-level longitudinal curves that are best described and analyzed as multi-level functional data. Although researchers have previously addressed measurement error in scalar covariates prone to error, less work has been done for correcting measurement error in multi-level high dimensional curves prone to heteroscedastic measurement error. Therefore, Luan et. al. proposed mixed effects-based-model-based (MEM) methods for bias correction due to measurement error in multi-level functional data from the exponential family of distributions that are prone to complex heteroscedastic measurement error. They first developed MEM methods to adjust for biases due to the presence of measurement error in multi-level generalized functional regression models. They assumed that the distributions of the function-valued covariates prone to measurement error belong to the exponential family. This assumption allows for a more general specification of the distributions of error-prone functional covariates compared to current approaches that often entail normality assumptions for these observed measures. The approach adopted by Luan et al. allows a nonlinear association between the true measurement and the observed measurement prone to measurement error. Second, they treated the random errors in the observed measures as complex heteroscedastic errors from the Gaussian distribution with covariance error functions. Third, these methods can be used to evaluate relationships between multi-level function-valued covariates with complex measurement error structures and scalar outcomes with distributions in the exponential family. Fourth, they treat the function-valued covariate as an observed measure of the true function-valued unobserved latent covariate. Additionally, these methods employ a point-wise method for fitting the multi-level functional MEM approach, avoiding the need to compute complex and intractable integrals that would be required in traditional approaches for reducing biases due to measurement error in multi-level functional data [@luan2023scalable]. Their statistical model is as follows: \begin{align*} &g(E(Y_i|X_i,Z_i)) = \int_{\Omega} \beta(t) X_{i}(t) dt + (1,Z_i^T)\gamma\\ &h(E(W_{ij}(t)|X_i(t))) = X_i(t)\\ &X_i(t) = \mu_x(t) + \varepsilon_{xi}(t) \end{align*} where the response variable $Y_i$ and scalar-valued covariates $Z_i$ are measured without error, function-valued covariate $X_i(t)$ is repeatedly measured with error as $W_{ij}(t)$. The model also includes the following assumptions: 1. $Y_i|X_i,Z_i\sim EF(\cdot)$, $EF$ refers to an exponential family distribution. 2. $g(\cdot)$ and $h(\cdot)$ are known to be strictly monotone, twice continuously differentiable functions. 3. $Cov\{X_i(t),W_{ij}(t)\} \neq 0$, 4. $W_{ij}(t)|X_i(t)\sim EF(\cdot)$ 5. $f_{Y_i|W_{ij}(t),X_i(t)}(\cdot) = f_{Y_i|X_i(t)}(\cdot)$, $f$ refers to probability density function. 6. $X_i(t)\sim GP\{\mu_x(t),\Sigma_{xx}\}$, $GP$ refers to the Gaussian process. They proposed a MEM estimation method to correct for bias caused by the presence of measurement error in the function-valued covariate. This method allows for the investigation of a nonlinear measurement error model, in which the relationship between the true and observed measurements is not constrained to be linear, and the distribution assumption for the observed measurement is relaxed to encompass the exponential family rather than being limited to a Gaussian distribution. The MEM approach is a two stage method that employs functional mixed-effects models. The first stage of the MEM approach involves using a functional mixed-effects model to approximate the true measure $X_i(t)$ with the repeated observed measurement $W_{ij}(t)$. The MEM approach is primarily based on the assumptions that $h[E\{W_{ij}(t)|X_i(t)\}] = X_i(t)$ and $X_i(t) = \mu_x(t) + \varepsilon_{xi}(t)$. That is, in the functional mixed-effects model containing one fixed intercept and one random intercept, the random intercept is assumed to to be the subject-wise deviation of $X_i(t)$ from the mean process $\mu_x(t)$, and the fixed intercept is assumed to represent $\mu_x(t)$. The second stage involves replacing $X_i(t)$ in the regression model with the resulting approximation of $X_i(t)$ from the first stage. The MEM approach employs point-wise (UP_MEM) and multi-point-wise (MP_MEM) estimation procedures to avoid potential computational complexities caused by analyses of multi-level functional data and computations of potentially intractable and complex integrals. The **MECfda** package provides the function `ME.fcRegression_MEM` for applying bias-correction estimation methods. In this function, the response variable, function-valued covariates, and scalar covariates are supplied separately via the arguments `data.Y`, `data.W`, and `data.Z`, respectively. - **`data.Y` (Response Variable):** The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. However, a one-column data frame with a column name is recommended so that the response variable is explicitly identified. - **`data.W` (Measurement Data):** This argument represents $W$, the observed measurement of the true function-valued covariate $X$ in the statistical model. It should be provided as a three-dimensional array where each row corresponds to a subject, each column to a measurement (time) point, and each layer to a separate observation. - **`data.Z` (Scalar Covariates):** The scalar covariates can be input as a matrix, data frame, or atomic vector. If the model does not include any scalar covariates, `data.Z` can be omitted or set to `NULL`. For a single scalar covariate, although an atomic vector is acceptable, a data frame or matrix with a column name is recommended. For multiple scalar covariates, use a matrix or data frame with named columns. Other key arguments include: - **`method`:** Specifies the method for constructing the substitute for $X$; available options are `'UP_MEM'`, `'MP_MEM'`, and `'average'`. - **`t_interval`:** Defines the domain of the function-valued covariate as a two-element vector (default is `c(0,1)`, representing the interval $[0,1]$). - **`t_points`:** Specifies the sequence of measurement time points (default is `NULL`). - **`d`:** When using the `MP_MEM` method, this argument sets the number of measurement points involved (default is 3, which is also the minimum value). - **`family.W`:** Specifies the distribution of $W$ given $X$; available options are `'gaussian'` and `'poisson'`. - **`family.Y`:** Describes the error distribution and link function for the response variable (see `stats::family` for details). - **`formula.Z`:** Indicates which parts of `data.Z` to include in the model and whether to treat the scalar covariates as fixed or random effects. - **`basis.type`:** Indicates the type of basis function for the expansion. Options include `'Fourier'`, `'Bspline'`, and `'FPC'`, corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively. - **`basis.order`:** Specifies the number of basis functions to use. For the Fourier basis (i.e., $\frac{1}{2},\ \sin(kt),\ \cos(kt)$ for $k = 1, \dots, p_f$), `basis.order` equals $p_f$. For the b-spline basis $\{B_{i,p}(x)\}_{i=-p}^{k}$, `basis.order` is set to $k+p+1$. For the FPC basis, `basis.order` represents the number of functional principal components. - **`bs_degree`:** Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis. - **`smooth`:** Specifies whether to smooth the substitution for $X$ (default is `FALSE`). - **`silent`:** Controls whether the function displays its progress during execution (default is `TRUE`). The function `ME.fcRegression_MEM` returns an object of class `fcRegression`. The package also provide a function, `MEM_X_hat`. This function can return the $\hat X(t)$ in this bias-correction method. ```{r MEM, eval = FALSE} data(MECfda.data.sim.0.1) res = ME.fcRegression_MEM(data.Y = MECfda.data.sim.0.1$Y, data.W = MECfda.data.sim.0.1$W, data.Z = MECfda.data.sim.0.1$Z, method = 'UP_MEM', family.W = "gaussian", basis.type = 'Bspline') ``` ### ME.fcQR_IV.SIMEX ```{r shili4, eval = FALSE} ME.fcQR_IV.SIMEX( data.Y, data.W, data.Z, data.M, tau = 0.5, t_interval = c(0, 1), t_points = NULL, formula.Z, basis.type = c("Fourier", "Bspline"), basis.order = NULL, bs_degree = 3 ) ``` Health a major concerns for many people, and as technology advances, wearable devices have become the mainstream method for monitoring and evaluating individual physical activity levels. Despite personal preferences in brands and feature design, the accuracy of the data presented is what makes the device a great product. These functional data collected from devices are generally considered more accurate and subjective compared to other objective methods like questionnaires and self-reports. Because physical activity levels are not directly observable, algorithms are used generate corresponding summary reports of different activity level using complex data (e.g. steps, heart rate). However, these results may differ depending on which device is used. And aside from variation in hardware, data collected could also vary by the combinations between how the device is worn and the activity of interest. Although current devices may be sufficiently accurate to monitor general physical activity levels, more precise data may enable additional functions such as detecting irregular heart rhythms or radiation exposures that would contribute toward improving the health of the general public and elevating the overall well-being of society. Quantile regression is a tool that was developed to meet the need for modeling complex relationships between a set of covariates and quantiles of an outcome. In obesity studies, the effects of interventions or covariates on body mass index (BMI) are most commonly estimated using ordinary least square methods. However, mean regression approaches are unable to address these effects for individuals whose BMI values fall in the upper quantile of the distribution. Compared with traditional linear regression approaches, quantile regression approaches make no assumptions regarding the distribution of residuals and are robust to outlying observations. Tekwe et. al. proposed a simulation extrapolation (SIMEX) estimation method that uses an instrumental variable to correct for bias due to measurement error in scalar-on-function quantile linear regression. They demonstrated the usefulness of this method by evaluating the association between physical activity and obesity using the National Health and Nutrition Examination Survey (NHANES) dataset [@tekwe2022estimation]. Their statistical model is as follows: \begin{align*} &Q_{Y_i|X_i,Z_i}(\tau) = \int_{\Omega} \beta(\tau,t) X_{i}(t) dt + (1,Z_i^T)\gamma(\tau)\\ &W_i(t) = X_i(t) + U_i(t)\\ &M_i(t) = \delta(t) X_i(t) + \eta_i(t) \end{align*} where the response variable $Y_i$ and scalar-valued covariates $Z_i$ are measured without error, the function-valued covariate $X_i(t)$ is measured with error as $W_i(t)$, and $M_i(t)$ is a measured instrumental variable. The model also includes the following assumptions: 1. $Cov\{X_i(t),U_i(s)\} = 0$, 2. $Cov\{M_i(t),U_i(s)\} = 0$, 3. $E(W_{i}(t)|X_i(t)) = X_i(t)$ 4. $U_i(t)\sim GP\{\mathcal{0},\Sigma_{uu}\}$, $GP$ refers to Gaussian process. for $\forall t,s\in[0,1]$ and $i = 1,\dots,n$. Their bias correction estimation method uses a two-stage strategy to correct for measurement error in a function-valued covariate and then fits a linear quantile regression model. In the first stage, an instrumental variable is used to estimate the covariance matrix associated with the measurement error. In the second stage, SIMEX is used to correct for measurement error in the function-valued covariate. The **MECfda** package provides the function `ME.fcQR_IV.SIMEX` for applying its bias-correction estimation method in scalar-on-function quantile regression. The arguments are described as follows: - **`data.Y` (Response Variable):** The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. A one-column data frame with a column name is recommended for clarity. - **`data.W` (Measurement Data):** This argument corresponds to $W$, the observed measurement of $X$ in the statistical model. It should be provided as a data frame or matrix where each row represents a subject and each column corresponds to a measurement (time) point. - **`data.Z` (Scalar Covariates):** The scalar covariate data can be omitted (or set to `NULL`) if no scalar covariates are included in the model. Alternatively, it can be provided as an atomic vector (for a single scalar covariate) or as a matrix/data frame (for multiple scalar covariates). A data frame with column names is recommended. - **`data.M` (Instrumental Variable):** This argument provides the data for $M$, the instrumental variable. It should be supplied as a data frame or matrix where each row represents a subject and each column corresponds to a measurement (time) point. - **`tau` (Quantile Level):** Specifies the quantile level $\tau \in (0,1)$ for the quantile regression model. The default is 0.5. - **`t_interval` (Domain of the Function-Valued Covariate):** A two-element vector defining the interval over which the function-valued covariate is defined (default is `c(0,1)`, representing the interval $[0,1]$). - **`t_points` (Measurement Time Points):** Specifies the sequence of time points at which measurements are taken. The default is `NULL`. - **`formula.Z` (Scalar Covariate Formula):** This argument determines which components of `data.Z` are included in the model and how the scalar covariates are treated (i.e., as fixed or random effects). - **`basis.type`:** Indicates the type of basis function for the expansion. Options include `'Fourier'`, `'Bspline'`, and `'FPC'`, corresponding to the Fourier basis, b-spline basis, and functional principal component basis, respectively. - **`basis.order`:** Specifies the number of basis functions to use. For the Fourier basis (i.e., $\frac{1}{2},\ \sin(kt),\ \cos(kt)$ for $k = 1, \dots, p_f$), `basis.order` equals $p_f$. For the b-spline basis $\{B_{i,p}(x)\}_{i=-p}^{k}$, `basis.order` is set to $k+p+1$. For the FPC basis, `basis.order` represents the number of functional principal components. - **`bs_degree`:** Specifies the degree of the piecewise polynomials for the b-spline basis; this argument is only necessary when using the b-spline basis. The function `ME.fcQR_IV.SIMEX` returns an object of class `ME.fcQR_IV.SIMEX`, which is a list containing the following elements: 1. **`coef.X`:** A `Fourier_series` or `bspline_series` object representing the function-valued coefficient parameter of the function-valued covariate. 2. **`coef.Z`:** The estimated linear coefficients of the scalar covariates. 3. **`coef.all`:** The original estimate of the linear coefficients. 4. **`function.basis.type`:** The type of function basis used. 5. **`basis.order`:** The number of basis functions used (as specified in the input). 6. **`t_interval`:** A two-element vector representing the interval, which defines the domain of the function-valued covariate. 7. **`t_points`:** The sequence of measurement (time) points. 8. **`formula`:** The regression model. 9. **`formula.Z`:** A formula object containing only the scalar covariates. 10. **`zlevels`:** The levels of any categorical (non-continuous) scalar covariates. This comprehensive structure ensures that users can flexibly input their data and clearly specify the modeling framework for bias-correction in scalar-on-function quantile regression. ```{r iv.simex, eval = FALSE} rm(list = ls()) data(MECfda.data.sim.0.2) res = ME.fcQR_IV.SIMEX(data.Y = MECfda.data.sim.0.2$Y, data.W = MECfda.data.sim.0.2$W, data.Z = MECfda.data.sim.0.2$Z, data.M = MECfda.data.sim.0.2$M, tau = 0.5, basis.type = 'Bspline') ``` ### ME.fcQR_CLS ```{r shili5, eval = FALSE} ME.fcQR_CLS( data.Y, data.W, data.Z, tau = 0.5, t_interval = c(0, 1), t_points = NULL, grid_k, grid_h, degree = 45, observed_X = NULL ) ``` Zhang et. al. proposed a corrected loss score estimation method for scalar-on-function quantile linear regression to correct for bias due to measurement error [@zhang2023partially]. Their statistical model is as follows: \begin{align*} &Q_{Y_i|X_i,Z_i}(\tau) = \int_{\Omega} \beta(\tau,t) X_{i}(t) dt + (1,Z_i^T)\gamma(\tau)\\ &W_i(t) = X_i(t) + U_i(t) \end{align*} where the response variable $Y_i$ and the scalar-valued covariates $Z_i$ are measured without error, the function-valued covariate $X_i(t)$ is measured with error as $W_i(t)$. 1. $E[U_i(t)]=0$. 2. $Cov\{U_i(t),U_i(s)\} = \Sigma_U(s,t)$, where $\Sigma_U(s,t)$ is unknown. 3. $U_i(t)$ are i.i.d for different $i$. Zhang et al. proposed a new corrected loss function for a partially functional linear quantile model with functional measurement error in this manuscript. They established a corrected quantile objective function for an observed measurement, which is an unbiased estimator of the quantile objective function that would be obtained if the true measurements were available. The estimators of the regression parameters are obtained by optimizing the resulting corrected loss function. The resulting estimator of the regression parameters is shown to be consistent. The **MECfda** package provides a function, `ME.fcQR_CLS`, for implementing their bias-correction estimation method. Below is a description of its arguments and return values: - **`data.Y` (Response Variable):** The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. The recommended format is a one-column data frame with a column name. - **`data.W` (Measurement Data for $W$):** This argument represents $W$, the observed measurement of $X$ in the statistical model. It should be provided as a three-dimensional array where each row corresponds to a subject, each column to a measurement (time) point, and each layer to an observation. - **`data.Z` (Scalar Covariates):** Scalar covariates can be omitted (or set to `NULL`) if no scalar covariates are included in the model, provided as an atomic vector for a single scalar covariate, or as a matrix or data frame for multiple covariates. The recommended format is a data frame with column names. - **`tau` (Quantile Level):** Specifies the quantile $\tau \in (0,1)$ to be used in the quantile regression model. The default value is 0.5. - **`t_interval` (Domain of the Function-Valued Covariate):** A two-element vector that specifies the domain (interval) of the function-valued covariate. The default is `c(0,1)`, representing the interval $[0,1]$. - **`t_points` (Measurement Time Points):** Specifies the sequence of measurement (time) points. The default is `NULL`. - **`grid_k` (Candidate Basis Numbers):** An atomic vector where each element is a candidate number for the basis functions. - **`grid_h` (Candidate Tuning Parameter Values):** A non-zero atomic vector where each element is a candidate value for the tuning parameter. - **`degree` (Degree for Derivative and Integral Calculations):** This argument is used when computing derivatives and integrals, with a default value of 45—which is sufficient for most scenarios. - **`observed_X` (Observed $X$ Data for Variance Estimation):** Used for estimating parametric variance; the default value is `NULL`. The function returns an object of class `ME.fcQR_CLS` (a list) containing the following elements: 1. **`estimated_beta_hat`:** Estimated coefficients from the corrected loss function (including the functional component). 2. **`estimated_beta_t`:** The estimated functional curve. 3. **`SE_est`:** The estimated parametric variance (returned only if `observed_X` is not `NULL`). 4. **`estimated_Xbasis`:** The basis matrix used in the estimation. 5. **`res_naive`:** Results from the naive (uncorrected) method. ```{r cls, eval = FALSE} rm(list = ls()) data(MECfda.data.sim.0.1) res = ME.fcQR_CLS(data.Y = MECfda.data.sim.0.1$Y, data.W = MECfda.data.sim.0.1$W, data.Z = MECfda.data.sim.0.1$Z, tau = 0.5, grid_k = 4:7, grid_h = 1:2) ``` ### ME.fcLR_IV ```{r shili6, eval = FALSE} ME.fcLR_IV( data.Y, data.W, data.M, t_interval = c(0, 1), t_points = NULL, CI.bootstrap = F ) ``` Tekwe et. al. proposed a bias correction estimation method for scalar-on-function linear regression model with measurement error using an instrumental variable [@tekwe2019instrumental]. Their statistical model is as follows: \begin{align*} &Y_i = \int_0^1 \beta(t)X_i(t)dt + \varepsilon_i\\ &W_i(t) = X_i(t) + U_i(t)\\ &M_i(t) = \delta X_i(t) + \eta_i(t) \end{align*} where the response variable $Y_i$ and are measured without error, the function-valued covariate $X_i(t)$ is measured with error as $W_i(t)$, and $M_i(t)$ is an measured instrumental variable. They included the following additional assumptions: 1. $E\varepsilon_i(t) = 0$, 2. $EU_i(t) = 0$, 3. $E\eta_i(t) = 0$, 4. $Cov\{X_i(t),\varepsilon_i\} = 0$, 5. $Cov\{M_i(t),\varepsilon_i\} = 0$, 6. $Cov\{M_i(t),U_i(s)\} = 0$, for $\forall t,s\in[0,1]$ and $i = 1,\dots,n$. The **MECfda** package provides the function `ME.fcLR_IV` for implementing their bias-correction estimation method. The arguments are as follows: - **`data.Y` (Response Variable):** The response variable can be provided as an atomic vector, a one-column matrix, or a data frame. The recommended format is a one-column data frame with a column name. - **`data.W` (Measurement Data for $W$):** This argument represents $W$, the observed measurement of $X$ in the statistical model. It should be provided as a data frame or matrix, where each row represents a subject and each column corresponds to a measurement (time) point. - **`data.M` (Instrumental Variable):** This argument provides the data for $M$, the instrumental variable. It should be provided as a data frame or matrix, with each row representing a subject and each column representing a measurement (time) point. - **`t_interval` (Domain of the Function-Valued Covariate):** A two-element vector that specifies the domain (interval) of the function-valued covariate. The default is `c(0,1)`, representing the interval $[0,1]$. - **`t_points` (Measurement Time Points):** Specifies the sequence of measurement (time) points. The default is `NULL`. - **`CI.bootstrap` (Bootstrap Confidence Interval):** A logical flag indicating whether to return a confidence interval using the bootstrap method. The default is `FALSE`. The function returns an object of class `ME.fcLR_IV`, which is a list containing the following elements: 1. **`beta_tW`:** Parameter estimates. 2. **`CI`:** Confidence interval, which is returned only if `CI.bootstrap` is `TRUE`. ```{r lriv, eval = FALSE} rm(list = ls()) data(MECfda.data.sim.0.3) res = ME.fcLR_IV(data.Y = MECfda.data.sim.0.3$Y, data.W = MECfda.data.sim.0.3$W, data.M = MECfda.data.sim.0.3$M) ``` # Simulated Data Generation The package **MECfda** includes some functions to generate simulated data to test the functions in the package. # References