---
title: "Simulation Models"
author: Oluwasegun Ojo
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Simulation Models}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.align = 'center',
  fig.width = 6,
  fig.height = 5
)
```

```{r setup}
library(fdaoutlier)
```

The following are simulation models included in the `fdaoutlier` package. Some of these models were curated from research work related to functional depths and outlier detection for functional data. This document presents the model equations as well as their corresponding functions and parameters in `fdaoutlier`. The parameters of the `fdaoutlier` functions have been set to reasonable default values for ease of use.

## Model 1

This is a typical magnitude model in which outliers are shifted from the 'normal' non-outlying observations. The **main model** is of the form:

$$X_i(t) = \mu t + e_i(t),$$

and the **contamination model** is of the form:

$$X_i(t) = \mu t + qk_i + e_i(t)$$

where:

* $t\in [0,1]$,
* $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$$
* $k_i \in \{-1, 1\}$ (usually with $P(k_i = -1) = P(k_i=1) = 0.5$),
* and $q$ is a constant controlling how far the outliers are from the mean function of the data, usually $q = 6$ or $q = 8$.

This model can be accessed with the `simulation_model1()` function in `fdaoutlier`.

```{r model1}
dtss <- simulation_model1(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

The returned object is a list containing a matrix of the data and a vector of the indices of the true outliers:

```{r}
dim(dtss$data)
dtss$true_outliers
```

The simulated data can be tuned using additional parameters to `simulation_model1()`.
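
To make the Model 1 equations concrete, the following is a minimal base-R sketch that builds the covariance matrix $\gamma(s,t)$ on a grid, draws Gaussian process errors through a Cholesky factor, and shifts a subset of curves by $qk_i$. It is an illustration of the equations only and is not the implementation used inside `simulation_model1()`; the parameter values and the way outlying curves are selected (a random subset of fixed size) are assumptions made for this example.

```r
set.seed(50)                       # illustrative seed
n <- 100; p <- 50                  # number of curves and of evaluation points
mu <- 4; q <- 8                    # illustrative values, not necessarily the package defaults
alpha <- 1; beta <- 1; nu <- 1     # illustrative covariance parameters
outlier_rate <- 0.1

tt <- seq(0, 1, length.out = p)    # evaluation grid on [0, 1]

# Covariance matrix: gamma(s, t) = alpha * exp(-beta * |t - s|^nu)
gamma_mat <- alpha * exp(-beta * abs(outer(tt, tt, "-"))^nu)

# Gaussian process errors: if t(R) %*% R = gamma_mat, the rows of Z %*% R
# (Z standard normal) have covariance gamma_mat
R <- chol(gamma_mat)
e <- matrix(rnorm(n * p), n, p) %*% R

# Main model: X_i(t) = mu * t + e_i(t), one curve per row
x <- matrix(mu * tt, n, p, byrow = TRUE) + e

# Contamination model: add q * k_i to a random subset of curves (assumption)
n_out <- round(n * outlier_rate)
out_idx <- sort(sample(n, n_out))
k <- sample(c(-1, 1), n_out, replace = TRUE)   # P(k_i = 1) = 0.5
x[out_idx, ] <- x[out_idx, ] + q * k           # row i of the subset is shifted by q * k[i]
```

Here `x` plays the role of `dtss$data` and `out_idx` the role of `dtss$true_outliers`, although the values will differ from those returned by the package.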
The following parameters modify the data generated by `simulation_model1()`:

* `mu`: the coefficient $\mu$ in the main and contamination models controlling the mean function.
* `q`: the shift parameter $q$ in the contamination model which controls how far the outliers are from the mean function.
* `kprob`: the probability that $k_i = 1$, i.e., $P(k_i=1)$ in the contamination model.
* `cov_alpha`: the coefficient $\alpha$ in the covariance function.
* `cov_beta`: the coefficient $\beta$ in the covariance function.
* `cov_nu`: the coefficient $\nu$ in the covariance function.

Additional plotting parameters set the plot title (`plot_title`), the font size of the title (`title_cex`), whether the legend is displayed (`show_legend`), the y-axis label (`ylabel`), and the x-axis label (`xlabel`).

## Model 2

This model generates non-persistent magnitude outliers, i.e., the outliers are magnitude outliers for only a portion of the domain of the functional data. The **main model** is of the form:

$$X_i(t) = \mu t + e_i(t),$$

with **contamination model** of the form:

$$X_i(t) = \mu t + qk_iI_{T_i \le t\le T_i+l } + e_i(t)$$

where:

* $t\in [0,1]$,
* $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$$
* $k_i \in \{-1, 1\}$ with $P(k_i = -1) = P(k_i=1) = 0.5$,
* $q$ is a constant controlling how far the outliers are from the mass of the data,
* $I$ is an indicator function,
* $T_i$ is a random variable uniformly distributed on an interval $[a, b] \subset [0,1]$,
* and $l$ is a constant specifying the length of the portion of the domain over which the outliers are shifted away from the mean function.

A call to `simulation_model2()` generates data from this model:

```{r model2}
dtss <- simulation_model2(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

Additional parameters of `simulation_model2()` to which arguments can be passed are:

* `mu`: the coefficient $\mu$ in the main and contamination models controlling the mean function.
* `q`: the shift parameter $q$ in the contamination model which controls how far the outliers are from the mean function.
* `kprob`: the probability that $k_i = 1$, i.e., $P(k_i=1)$ in the contamination model.
* `a`, `b`: values specifying the interval $[a,b]$ from which $T_i$ is drawn in the contamination model.
* `l`: the value of $l$ in the contamination model.
* `cov_alpha`: the coefficient $\alpha$ in the covariance function.
* `cov_beta`: the coefficient $\beta$ in the covariance function.
* `cov_nu`: the coefficient $\nu$ in the covariance function.

Additional plotting parameters listed for `simulation_model1()` also apply.

## Model 3

This model generates partial magnitude outliers: each outlier is shifted away from the mean function from a random point $T_i$ to the end of the domain. The **main model** is of the form:

$$X_i(t) = \mu t + e_i(t),$$

with **contamination model** of the form:

$$X_i(t) = \mu t + qk_iI_{T_i \le t } + e_i(t)$$

where:

* $t\in [0,1]$,
* $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$$
* $k_i \in \{-1, 1\}$ with $P(k_i = -1) = P(k_i=1) = 0.5$,
* $q$ is a constant controlling how far the outliers are from the mass of the data,
* $I$ is an indicator function,
* and $T_i$ is a random variable uniformly distributed on an interval $[a, b] \subset [0,1]$.

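
The only difference between the contamination terms of Model 2 and Model 3 is the indicator function. The base-R sketch below evaluates both shift terms, $qk_iI_{T_i \le t \le T_i + l}$ and $qk_iI_{T_i \le t}$, on a grid for a single outlier. All numeric values (`q`, `l`, `a`, `b`) are illustrative choices for this example, not necessarily the package defaults, and the sketch omits the mean function and the error process.

```r
set.seed(1)                          # illustrative seed
p <- 50
tt <- seq(0, 1, length.out = p)      # evaluation grid on [0, 1]

q <- 8; l <- 0.05                    # illustrative shift size and window length
a <- 0.1; b <- 0.9                   # illustrative interval for T_i

k_i <- sample(c(-1, 1), 1)           # direction of the shift, P(k_i = 1) = 0.5
T_i <- runif(1, a, b)                # random start of the shift

# Model 2: shifted only on the window [T_i, T_i + l]
shift2 <- q * k_i * (tt >= T_i & tt <= T_i + l)

# Model 3: shifted from T_i to the end of the domain
shift3 <- q * k_i * (tt >= T_i)
```

Adding `shift2` or `shift3` to a curve from the main model gives an outlier of the corresponding type; `shift2` is nonzero on a short stretch of the grid, while `shift3` stays nonzero for every grid point from `T_i` onward.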
A call to `simulation_model3()` generates data from this model:

```{r model3}
dtss <- simulation_model3(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

Additional parameters of `simulation_model3()` to which arguments can be passed are:

* `mu`: the coefficient $\mu$ in the main and contamination models controlling the mean function.
* `q`: the shift parameter $q$ in the contamination model which controls how far the outliers are from the mean function.
* `kprob`: the probability that $k_i = 1$, i.e., $P(k_i=1)$ in the contamination model.
* `a`, `b`: values specifying the interval $[a,b]$ from which $T_i$ is drawn in the contamination model.
* `cov_alpha`: the coefficient $\alpha$ in the covariance function.
* `cov_beta`: the coefficient $\beta$ in the covariance function.
* `cov_nu`: the coefficient $\nu$ in the covariance function.

Additional plotting parameters listed for `simulation_model1()` also apply.

## Model 4

This model generates outliers defined on the reversed interval of the main model. The **main model** is of the form:

$$X_i(t) = \mu t(1 - t)^m + e_i(t),$$

with **contamination model** of the form:

$$X_i(t) = \mu(1 - t)t^m + e_i(t)$$

where:

* $t\in [0,1]$,
* $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$$
* and $m$ is a constant.

A call to `simulation_model4()` generates data from this model:

```{r model4}
dtss <- simulation_model4(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

Additional parameters of `simulation_model4()` to which arguments can be passed are:

* `mu`: the coefficient $\mu$ in the main and contamination models controlling the mean function.
* `m`: the constant $m$ in the main and contamination models.
* `cov_alpha`: the coefficient $\alpha$ in the covariance function.
* `cov_beta`: the coefficient $\beta$ in the covariance function.
* `cov_nu`: the coefficient $\nu$ in the covariance function.

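
The "reversed interval" in Model 4 means that the mean function of the contamination model is the mean function of the main model reflected about $t = 0.5$: replacing $t$ by $1 - t$ in $\mu t(1-t)^m$ gives $\mu(1-t)t^m$. The short base-R check below illustrates this on a grid. The values of `mu` and `m` are illustrative assumptions for this example, not necessarily the package defaults.

```r
p <- 50
tt <- seq(0, 1, length.out = p)        # evaluation grid on [0, 1], symmetric about 0.5
mu <- 30; m <- 3 / 2                   # illustrative values

main_mean <- mu * tt * (1 - tt)^m      # mean function of the main model
cont_mean <- mu * (1 - tt) * tt^m      # mean function of the contamination model

# Reflection about t = 0.5: reversing the main mean on the grid
# reproduces the contamination mean (up to floating-point error)
is_reflection <- isTRUE(all.equal(cont_mean, rev(main_mean)))

# The two mean functions peak on opposite sides of t = 0.5
peak_main <- tt[which.max(main_mean)]
peak_cont <- tt[which.max(cont_mean)]
```

For $m > 1$ the main mean peaks at $t = 1/(1+m)$, to the left of $0.5$, and the contamination mean peaks at $t = m/(1+m)$, to the right, so the outliers differ from the main curves in shape rather than in overall level.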
Additional plotting parameters listed for `simulation_model1()` also apply.

## Model 5

This model generates shape outliers with a different covariance structure from that of the main model. The **main model** is of the form:

$$X_i(t) = \mu t + e_i(t),$$

with **contamination model** of the form:

$$X_i(t) = \mu t + \tilde{e}_i(t),$$

where:

* $t\in [0,1]$,
* and $e_i(t)$ and $\tilde{e}_i(t)$ are Gaussian processes with zero mean and covariance functions of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$$ each with its own values of $\alpha$, $\beta$ and $\nu$.

A call to `simulation_model5()` generates data from this model:

```{r model5}
dtss <- simulation_model5(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

Additional parameters of `simulation_model5()` to which arguments can be passed are:

* `mu`: the coefficient $\mu$ in the main and contamination models controlling the mean function.
* `cov_alpha`: the coefficient $\alpha$ in the covariance function of $e_i(t)$.
* `cov_beta`: the coefficient $\beta$ in the covariance function of $e_i(t)$.
* `cov_nu`: the coefficient $\nu$ in the covariance function of $e_i(t)$.
* `cov_alpha2`: the coefficient $\alpha$ in the covariance function of $\tilde{e}_i(t)$.
* `cov_beta2`: the coefficient $\beta$ in the covariance function of $\tilde{e}_i(t)$.
* `cov_nu2`: the coefficient $\nu$ in the covariance function of $\tilde{e}_i(t)$.

Additional plotting parameters listed for `simulation_model1()` also apply.

## Model 6

This model generates shape outliers that have a different shape for a portion of the domain.
The **main model** is of the form:

$$X_i(t) = \mu t + e_i(t),$$

with **contamination model** of the form:

$$X_i(t) = \mu t + (-1)^u\cdot q + (-1)^{(1-u)}\left(\frac{1}{\sqrt{r\pi}}\right)\exp{(-z(t-v)^w)} + e_i(t)$$

where:

* $t\in [0,1]$,
* $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$$
* $u$ follows a Bernoulli distribution with probability $P(u = 1) = 0.5$,
* $q$, $r$, $z$ and $w$ are constants,
* and $v$ follows a uniform distribution on an interval $[a, b]$.

A call to `simulation_model6()` generates data from this model:

```{r model6}
dtss <- simulation_model6(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

Additional parameters of `simulation_model6()` to which arguments can be passed are:

* `mu`: the coefficient $\mu$ in the main and contamination models controlling the mean function.
* `q`: the constant term $q$ in the contamination model.
* `kprob`: the probability $P(u = 1)$.
* `a`, `b`: values specifying the interval from which $v$ in the contamination model is drawn.
* `pi_coeff`: the constant $r$ in the contamination model.
* `exp_pow`: the constant $w$ in the contamination model.
* `exp_coeff`: the constant $z$ in the contamination model.
* `cov_alpha`: the coefficient $\alpha$ in the covariance function.
* `cov_beta`: the coefficient $\beta$ in the covariance function.
* `cov_nu`: the coefficient $\nu$ in the covariance function.

Additional plotting parameters listed for `simulation_model1()` also apply.

## Model 7

This model generates pure shape outliers that are periodic. The **main model** is of the form:

$$X_i(t) = \mu t + e_i(t),$$

with **contamination model** of the form:

$$X_i(t) = \mu t + k\sin(r\pi(t + \theta)) + e_i(t),$$

where:

* $t\in [0,1]$,
* $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$$
* $\theta$ is uniformly distributed on an interval $[a, b]$,
* $k$ and $r$ are constants.

A call to `simulation_model7()` generates data from this model:

```{r model7}
dtss <- simulation_model7(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

Additional parameters of `simulation_model7()` to which arguments can be passed are:

* `mu`: the coefficient $\mu$ in the main and contamination models controlling the mean function.
* `cov_alpha`: the coefficient $\alpha$ in the covariance function of $e_i(t)$.
* `cov_beta`: the coefficient $\beta$ in the covariance function of $e_i(t)$.
* `cov_nu`: the coefficient $\nu$ in the covariance function of $e_i(t)$.
* `sin_coeff`: the coefficient $k$ in the contamination model.
* `pi_coeff`: the coefficient $r$ in the contamination model.
* `a`, `b`: values specifying the interval from which $\theta$ is drawn.

Additional plotting parameters listed for `simulation_model1()` also apply.

## Model 8

This model generates periodic functions with shape outliers that are phase-shifted relative to the main model. The **main model** is of the form:

$$X_i(t) = k\sin(r\pi t) + e_i(t),$$

with **contamination model** of the form:

$$X_i(t) = k\sin(r\pi t + v) + e_i(t),$$

where:

* $t\in [0,1]$,
* $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\},$$
* and $k$, $r$ and $v$ are constants.

A call to `simulation_model8()` generates data from this model:

```{r model8}
dtss <- simulation_model8(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

Additional parameters of `simulation_model8()` to which arguments can be passed are:

* `cov_alpha`: the coefficient $\alpha$ in the covariance function of $e_i(t)$.
* `cov_beta`: the coefficient $\beta$ in the covariance function of $e_i(t)$.
* `cov_nu`: the coefficient $\nu$ in the covariance function of $e_i(t)$.
* `sin_coeff`: the coefficient $k$ in the main and contamination models.
* `pi_coeff`: the coefficient $r$ in the main and contamination models.
* `constant`: the value of the constant $v$ in the contamination model.

Additional plotting parameters listed for `simulation_model1()` also apply.

## Model 9

This model generates periodic functions with outliers of different amplitude. The **main model** is of the form:

$$X_i(t) = a_{1i}\sin \pi + a_{2i}\cos\pi + e_i(t),$$

with **contamination model** of the form:

$$X_i(t) = (b_{1i}\sin\pi + b_{2i}\cos\pi)(1-u_i) + (c_{1i}\sin\pi + c_{2i}\cos\pi)u_i + e_i(t),$$

where:

* $t\in [0,1]$,
* $\pi \in [0, 2\pi]$,
* $a_{1i}$ and $a_{2i}$ follow a uniform distribution on an interval $[a_1, a_2]$,
* $b_{1i}$ and $b_{2i}$ follow a uniform distribution on an interval $[b_1, b_2]$,
* $c_{1i}$ and $c_{2i}$ follow a uniform distribution on an interval $[c_1, c_2]$,
* $u_i$ follows a Bernoulli distribution,
* and $e_i(t)$ is a Gaussian process with zero mean and covariance function of the form: $$\gamma(s,t) = \alpha\exp\{-\beta|t-s|^\nu\}$$

A call to `simulation_model9()` generates data from this model:

```{r model9}
dtss <- simulation_model9(n = 100, p = 50, outlier_rate = 0.1, seed = 50, plot = FALSE)
```

Additional parameters of `simulation_model9()` to which arguments can be passed are:

* `kprob`: the probability $P(u_i = 1)$.
* `ai`: a vector of 2 values containing $a_{1}$ and $a_{2}$, the endpoints of the interval from which $a_{1i}$ and $a_{2i}$ are drawn in the main model.
* `bi`: a vector of 2 values containing $b_{1}$ and $b_{2}$, the endpoints of the interval from which $b_{1i}$ and $b_{2i}$ are drawn in the contamination model.
* `ci`: a vector of 2 values containing $c_{1}$ and $c_{2}$, the endpoints of the interval from which $c_{1i}$ and $c_{2i}$ are drawn in the contamination model.
* `cov_alpha`: the coefficient $\alpha$ in the covariance function of $e_i(t)$.
* `cov_beta`: the coefficient $\beta$ in the covariance function of $e_i(t)$.
* `cov_nu`: the coefficient $\nu$ in the covariance function of $e_i(t)$.

Additional plotting parameters listed for `simulation_model1()` also apply.
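
The amplitude mechanism of Model 9 can be sketched in base R as follows. This is a rough illustration only and rests on two assumptions that the text above does not confirm: the argument of the sine and cosine is taken to be $2\pi t$, so that it sweeps $[0, 2\pi]$ as $t$ runs over $[0,1]$, and the intervals `ai`, `bi` and `ci` are illustrative values chosen so that the outliers have visibly smaller or larger amplitude than the main curves. The helper `draw_curve()` is hypothetical and the error process $e_i(t)$ is omitted.

```r
set.seed(50)                          # illustrative seed
p <- 50
tt <- seq(0, 1, length.out = p)       # evaluation grid on [0, 1]
arg <- 2 * pi * tt                    # ASSUMPTION: sine/cosine argument sweeps [0, 2*pi]

ai <- c(3, 8)                         # illustrative interval [a_1, a_2]
bi <- c(1.5, 2.5)                     # illustrative interval [b_1, b_2]
ci <- c(9, 10.5)                      # illustrative interval [c_1, c_2]
kprob <- 0.5                          # P(u_i = 1)

# Hypothetical helper: one noise-free curve whose two coefficients
# are drawn uniformly from the interval `rng`
draw_curve <- function(rng) {
  coefs <- runif(2, rng[1], rng[2])
  coefs[1] * sin(arg) + coefs[2] * cos(arg)
}

# Main model: coefficients drawn from [a_1, a_2]
main_curve <- draw_curve(ai)

# Contamination model: u_i = 0 uses [b_1, b_2], u_i = 1 uses [c_1, c_2]
u_i <- rbinom(1, 1, kprob)
out_curve <- if (u_i == 1) draw_curve(ci) else draw_curve(bi)
```

Because $a\sin x + b\cos x$ has amplitude $\sqrt{a^2 + b^2}$, the illustrative intervals above give main curves with amplitude between roughly 4.2 and 11.3, while outliers have amplitude either below about 3.5 (`bi`) or above about 12.7 (`ci`).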