Title: | Outlier Detection Tools for Functional Data Analysis |
---|---|
Description: | A collection of functions for outlier detection in functional data analysis. Methods implemented include directional outlyingness by Dai and Genton (2019) <doi:10.1016/j.csda.2018.03.017>, MS-plot by Dai and Genton (2018) <doi:10.1080/10618600.2018.1473781>, total variation depth and modified shape similarity index by Huang and Sun (2019) <doi:10.1080/00401706.2019.1574241>, and sequential transformations by Dai et al. (2020) <doi:10.1016/j.csda.2020.106960 among others. Additional outlier detection tools and depths for functional data like functional boxplot, (modified) band depth etc., are also available. |
Authors: | Oluwasegun Taiwo Ojo [aut, cre, cph] , Rosa Elvira Lillo [aut], Antonio Fernandez Anta [aut, fnd] |
Maintainer: | Oluwasegun Taiwo Ojo <[email protected]> |
License: | GPL-3 |
Version: | 0.2.1 |
Built: | 2024-11-04 04:03:55 UTC |
Source: | https://github.com/otsegun/fdaoutlier |
This function computes the band depth of López-Pintado and Romo (2009) doi:10.1198/jasa.2009.0108. Bands of 2 functions are always considered using the fast algorithm of Sun et al. (2012) doi:10.1002/sta4.8.
band_depth(dt)
band_depth(dt)
dt |
A matrix or data frame of size |
A numeric vector of size nrow(dt)
containing the band depth values of each curve.
López-Pintado, S., & Romo, J. (2009). On the Concept of Depth for Functional Data. Journal of the American Statistical Association, 104(486), 718-734.
Sun, Y., Genton, M. G., & Nychka, D. W. (2012). Exact fast computation of band depth for large functional datasets: How quickly can one million curves be ranked?. Stat, 1(1), 68-74.
dt1 <- simulation_model1() bd2 <- band_depth(dt = dt1$data)
dt1 <- simulation_model1() bd2 <- band_depth(dt = dt1$data)
Compute the directional outlyingness of a univariate or multivariate functional data based on Dai and Genton (2019) doi:10.1016/j.csda.2018.03.017 and Dai and Genton (2018) doi:10.1080/10618600.2018.1473781.
dir_out( dts, data_depth = "random_projections", n_projections = 200L, seed = NULL, return_distance = TRUE, return_dir_matrix = FALSE )
dir_out( dts, data_depth = "random_projections", n_projections = 200L, seed = NULL, return_distance = TRUE, return_dir_matrix = FALSE )
dts |
A matrix (or data frame) for univariate functional data (of size |
data_depth |
The method for computing the depth. The random projection depth is always used as suggested in Dai and Genton (2018) doi:10.1080/10618600.2018.1473781. Support for more depth methods will be added. |
n_projections |
The number of directions for computing random projection depth. By default, 200 random directions are generated from a scaled uniform distribution between -1 and 1. |
seed |
An integer indicating the seed to set when generating the random directions. Defaults to NULL in which case a seed is not set. |
return_distance |
A logical value. If TRUE, returns: a matrix whose columns are the mean and variation of directional outlyingness, the mahalanobis distance of the observations of this matrix, and the robust estimate of the mean and covariance of this matrix (computed using the minimum covariance determinant method). |
return_dir_matrix |
A logical value. If TRUE, returns the directional outlyingness
matrix (or array for multivariate functional data). Computed from the chosen |
The directional outlyingness, as defined in Dai and Genton (2019) doi:10.1016/j.csda.2018.03.017 is
where is a depth notion, and
is
the unit vector pointing from the median of
to
. For univariate and
multivariate functional data, the projection depth is always used as suggested by
Dai and Genton (2019) doi:10.1016/j.csda.2018.03.017.
Returns a list containing:
mean_outlyingness |
an |
var_outlyingness |
a vector of length n containing the variation of directional outlyingness. |
ms_matrix |
if |
distance |
if |
mcd_obj |
if |
dirout_matrix |
if |
Oluwasegun Taiwo Ojo.
Dai, W., and Genton, M. G. (2018). Multivariate functional data visualization and outlier detection. Journal of Computational and Graphical Statistics, 27(4), 923-934.
Dai, W., and Genton, M. G. (2019). Directional outlyingness for multivariate functional data. Computational Statistics & Data Analysis, 131, 50-65.
Zuo, Y. (2003). Projection-based depth functions and associated medians. The Annals of Statistics, 31(5), 1460-1490.
msplot
for outlier detection using msplot and projection_depth
for multivariate projection depth.
# univariate magnitude model in Dai and Genton (2018). dt4 <- simulation_model4() dirout_object <- dir_out(dts = dt4$data, return_distance = TRUE)
# univariate magnitude model in Dai and Genton (2018). dt4 <- simulation_model4() dirout_object <- dir_out(dts = dt4$data, return_distance = TRUE)
The directional quantile is a measure of outlyingness based on a scaled pointwise deviation from the mean. These deviations are usually scaled by the deviation of the mean from the 2.5% upper and lower quantiles depending on if the (pointwise) observed value of a function is above or below the (pointwise) mean. Directional quantile was mentioned in Myllymäki et al. (2015) doi:10.1016/j.spasta.2014.11.004, Myllymäki et al. (2017) doi:10.1111/rssb.12172 and Dai et al. (2020) doi:10.1016/j.csda.2020.106960.
directional_quantile(dt, quantiles = c(0.025, 0.975))
directional_quantile(dt, quantiles = c(0.025, 0.975))
dt |
A matrix or dataframe of size |
quantiles |
A numeric vector of length 2 specifying the probabilities of the lower and upper quantiles.
Values must be between 0 and 1. Defaults to |
The method computes the directional quantile of a sample of curves discretely observed on common points.
The directional quantile of a function/curve is the maximum pointwise scaled outlyingness of
. The scaling is done using the pointwise absolute difference between the 2.5% mean and the lower (and upper)
quantiles. See Dai et al. (2020) doi:10.1016/j.csda.2020.106960 and
Myllymäki et al. (2017) doi:10.1111/rssb.12172 for more details.
A numeric vector containing the the directional quantiles of each observation of dt
.
Oluwasegun Taiwo Ojo
Dai, W., Mrkvička, T., Sun, Y., & Genton, M. G. (2020). Functional outlier detection and taxonomy by sequential transformations. Computational Statistics & Data Analysis, 106960.
Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H., & Hahn, U. (2017). Global envelope tests for spatial processes. J. R. Stat. Soc. B, 79:381-404.
dt1 <- simulation_model1() dq <- directional_quantile(dt1$data)
dt1 <- simulation_model1() dq <- directional_quantile(dt1$data)
Compute extremal depth for functional data
extremal_depth(dts)
extremal_depth(dts)
dts |
A numeric matrix or dataframe of size |
This function computes the extremal depth of a univariate functional data. The extremal depth of a function
with respect to a set of function
denoted by
is the proportion
of functions in
that is more extreme than
. The functions are ordered using depths cumulative
distribution functions (d-CDFs). Extremal depth like the name implies is based on extreme outlyingness and it
penalizes functions that are outliers even for a small part of the domain. Proposed/mentioned in
Narisetty and Nair (2016) doi:10.1080/01621459.2015.1110033.
A vector containing the extremal depths of the rows of dts
.
Oluwasegun Ojo
Narisetty, N. N., & Nair, V. N. (2016). Extremal depth for functional data and applications. Journal of the American Statistical Association, 111(516), 1705-1714.
@seealso total_variation_depth
for functional data.
dt3 <- simulation_model3() ex_depths <- extremal_depth(dts = dt3$data) # order functions from deepest to most outlying order(ex_depths, decreasing = TRUE)
dt3 <- simulation_model3() ex_depths <- extremal_depth(dts = dt3$data) # order functions from deepest to most outlying order(ex_depths, decreasing = TRUE)
This function computes the extreme rank length depth (ERLD) of a sample of curves or functions.
Functions have to be discretely observed on common domain points. In principle, the ERLD of a function
is the proportion of functions in the sample that is considered to be more extreme
than
, an idea similar to
extremal_depth
.
To determine which functions are more extreme, pointwise ranks of the functions are computed and compared pairwise.
extreme_rank_length( dts, type = c("two_sided", "one_sided_left", "one_sided_right") )
extreme_rank_length( dts, type = c("two_sided", "one_sided_left", "one_sided_right") )
dts |
A matrix or data frame of size |
type |
A character value. Can be one of |
There are three possibilities in the (pairwise) comparison of the pointwise ranks of the functions.
First possibility is to consider only small values as extreme (when type = "one_sided_left"
) in which case the raw pointwise ranks
are used. The second possibility is to consider only large values as extreme (when
type = "one_sided_right"
) in which
case the pointwise ranks used are computed as where
is the raw pointwise rank of the function
at design point
and
is the number of functions in the sample. Third possibility is to consider both small and
large values as extreme (when
type = "two_sided"
) in which case the pointwise ranks used is computed as
. In the computation of the raw pointwise ranks
, ties are broken using
an average. See Dai et al. (2020) doi:10.1016/j.csda.2020.106960 and Myllymäki et al. (2017) doi:10.1111/rssb.12172 for more details.
A numeric vector containing the depth of each curve
Oluwasegun Ojo
Dai, W., Mrkvička, T., Sun, Y., & Genton, M. G. (2020). Functional outlier detection and taxonomy by sequential transformations. Computational Statistics & Data Analysis, 106960.
Myllymäki, M., Mrkvička, T., Grabarnik, P., Seijo, H., & Hahn, U. (2017). Global envelope tests for spatial processes. J. R. Stat. Soc. B, 79:381-404.
dt3 <- simulation_model3() erld <- extreme_rank_length(dt3$data)
dt3 <- simulation_model3() erld <- extreme_rank_length(dt3$data)
This function finds outliers in a sample of curves using the functional boxplot by Sun and Genton (2011) doi:10.1198/jcgs.2011.09224. Unlike the name suggests, the function does not actually produce a plot but is only used as support in finding outliers in other functions. Different depth and outlyingness methods are supported for ordering functions. Alternatively, the depth values of the functions can be supplied directly.
functional_boxplot( dts, depth_method = c("mbd", "tvd", "extremal", "dirout", "linfinity", "bd", "erld", "dq"), depth_values = NULL, emp_factor = 1.5, central_region = 0.5, erld_type = NULL, dq_quantiles = NULL )
functional_boxplot( dts, depth_method = c("mbd", "tvd", "extremal", "dirout", "linfinity", "bd", "erld", "dq"), depth_values = NULL, emp_factor = 1.5, central_region = 0.5, erld_type = NULL, dq_quantiles = NULL )
dts |
A matrix or data frame of size |
depth_method |
A character value specifying the method to use for computing the depth values (if
The default method is |
depth_values |
A numeric vector containing the depth values of the functions in |
emp_factor |
A numeric value specifying the empirical factor for the boxplot. Defaults to 1.5. |
central_region |
A numeric value between 0 and 1 indicating the probability of central region. Defaults to 0.5. |
erld_type |
If |
dq_quantiles |
If |
A list containing:
outliers |
The indices of the functions/curves flagged as outliers. |
depth_values |
The depths of the functions/curves in |
median_curve |
The index of the median curve, which is the curve with the largest depth value (or smallest outlyingness value). |
Sun, Y., & Genton, M. G. (2011). Functional boxplots. Journal of Computational and Graphical Statistics, 20(2), 316-334.
seq_transform
for functional outlier detection using sequential transformation.
dt1 <- simulation_model1() fbplot_obj <- functional_boxplot(dt1$data, depth_method = "mbd") fbplot_obj$outliers
dt1 <- simulation_model1() fbplot_obj <- functional_boxplot(dt1$data, depth_method = "mbd") fbplot_obj$outliers
Computes asymptotically, the factors for F approximation cutoff for (MCD) robust mahalanobis distances according to Hardin and Rocke (2005) doi:10.1198/106186005X77685.
hardin_factor_numeric(n, dimension)
hardin_factor_numeric(n, dimension)
n |
A numeric value indicating the number of observations of the data. |
dimension |
A numeric value indicating the number of variables of the data. |
This function computes the two factors needed for the determining an appropriate cutoff for robust mahalanobis distances computed using the MCD method.
The F approximation according to Hardin and Rocke (2005) doi:10.1198/106186005X77685 is given by:
where is a parameter for finding the degree of freedom of the
distribution,
is a scaling constant and
is the dimension. The first factor
returned by this function (
factor1
) is and the second factor (
factor2
) is .
Returns a list containing:
factor1 |
then estimated value of
|
factor2 |
the
value of |
Hardin, J., and Rocke, D. M. (2005). The distribution of robust distances. Journal of Computational and Graphical Statistics, 14(4), 928-946.
The L-infinity depth is a simple generalization of the multivariate depth to
functional data proposed in Long and Huang (2015)
<arXiv:1506.01332 and also
used in Dai et al. (2020) doi:10.1016/j.csda.2020.106960.
linfinity_depth(dt)
linfinity_depth(dt)
dt |
A matrix or data frame of size |
A numeric vector of size nrow(dt)
containing the band depth values of each curve.
Long, J. P., & Huang, J. Z. (2015). A study of functional depths. arXiv preprint arXiv:1506.01332.
Dai, W., Mrkvička, T., Sun, Y., & Genton, M. G. (2020). Functional outlier detection and taxonomy by sequential transformations. Computational Statistics & Data Analysis, 106960.
dt1 <- simulation_model1() linf <- linfinity_depth(dt1$data)
dt1 <- simulation_model1() linf <- linfinity_depth(dt1$data)
This function computes the modified band depth of López-Pintado and Romo (2009) doi:10.1198/jasa.2009.0108. Bands of 2 functions are always used and the fast algorithm of Sun et al. (2012) doi:10.1002/sta4.8 is used in computing the depth values.
modified_band_depth(dt)
modified_band_depth(dt)
dt |
A matrix or data frame of size |
A numeric vector of size nrow(dt)
containing the band depth values of each curve.
López-Pintado, S., & Romo, J. (2009). On the Concept of Depth for Functional Data. Journal of the American Statistical Association, 104(486), 718-734.
Sun, Y., Genton, M. G., & Nychka, D. W. (2012). Exact fast computation of band depth for large functional datasets: How quickly can one million curves be ranked?. Stat, 1(1), 68-74.
dt1 <- simulation_model1() mbd2 <- modified_band_depth(dt1$data)
dt1 <- simulation_model1() mbd2 <- modified_band_depth(dt1$data)
This function finds outliers in univariate and multivariate functional data using the MS-Plot
method described in Dai and Genton (2018) doi:10.1080/10618600.2018.1473781.
Indices of observations flagged as outliers are returned. In addition, the
scatter plot of against
(||MO||) can be requested for univariate
(multivariate) functional data.
msplot( dts, data_depth = c("random_projections"), n_projections = 200, seed = NULL, return_mvdir = TRUE, plot = TRUE, plot_title = "Magnitude Shape Plot", title_cex = 1.5, show_legend = T, ylabel = "VO", xlabel )
msplot( dts, data_depth = c("random_projections"), n_projections = 200, seed = NULL, return_mvdir = TRUE, plot = TRUE, plot_title = "Magnitude Shape Plot", title_cex = 1.5, show_legend = T, ylabel = "VO", xlabel )
dts |
A matrix/data frame for univariate functional data (of size |
data_depth |
The depth used in the computation of the directional outlyingness of
|
n_projections |
The number of random directions to generate for computing the random projection depth. By default 200 directions are generated. |
seed |
An integer indicating the seed to set when generating random directions for computing the random projection depth. NULL by default in which case no seed is set. |
return_mvdir |
A logical value indicating whether to return the mean and variation of directional
outlyingness ( |
plot |
A logical indicating whether to make the msplot of |
plot_title |
The title of the plot. Set to "Magnitude Shape Plot" by default. Ignored if
|
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to "VO" by default. |
xlabel |
The label of the x-axis if |
MS-Plot finds outliers by computing
the mean and variation of directional outlyingness ( and
) described
in Dai and Genton (2019) doi:10.1016/j.csda.2018.03.017.
A multivariate data whose columns are the computed
and
is then constructed and
the robust mahalanobis distance(s) of the rows of this matrix are computed
(using the minimum covariate determinant estimate of the location and scatter). The tail
of the distribution of these distances is approximated using the
distribution
according to Hardin and Rocke (2005) doi:10.1198/106186005X77685 to get the cutoff.
The projection depth is always used for computing the directional outlyingness
(as suggested by Dai and Genton (2019) doi:10.1016/j.csda.2018.03.017).
Returns a list containing:
outliers_index |
an integer vector containing the indices of the outliers. |
median_curve |
the index of the median function (which is the
function with the smallest robust mahalanobis distance computed from the matrix whose
columns are made up of |
mean_outlyingness |
if |
var_outlyingness |
if |
Oluwasegun Taiwo Ojo.
Dai, W., and Genton, M. G. (2018). Multivariate functional data visualization and outlier detection. Journal of Computational and Graphical Statistics, 27(4), 923-934.
Dai, W., and Genton, M. G. (2019). Directional outlyingness for multivariate functional data. Computational Statistics & Data Analysis, 131, 50-65.
Hardin, J., and Rocke, D. M. (2005). The distribution of robust distances. Journal of Computational and Graphical Statistics, 14(4), 928-946.
dir_out
for directional outlyingness and projection_depth
for multivariate projection depth.
# Univariate magnitude model in Dai and Genton (2018). dt1 <- simulation_model1() msplot_object <- msplot(dts = dt1$data) msplot_object$outliers_index msplot_object$mean_outlyingness msplot_object$var_outlyingness
# Univariate magnitude model in Dai and Genton (2018). dt1 <- simulation_model1() msplot_object <- msplot(dts = dt1$data) msplot_object$outliers_index msplot_object$mean_outlyingness msplot_object$var_outlyingness
MUOD finds outliers by computing for each functional data, a magnitude, amplitude and shape index. Outliers are then detected in each set of index and outliers found are classified as either a magnitude, shape or amplitude outlier.
muod(dts, cut_method = c("boxplot", "tangent"))
muod(dts, cut_method = c("boxplot", "tangent"))
dts |
a matrix or dataframe of size |
cut_method |
a character value indicating method to use for finding indices cutoff.
Must be either |
MUOD was proposed in Azcorra et al. (2020) doi:10.1038/s41598-018-24874-2 as a support method for finding influential users in a social network data. It was also mentioned in Vinue and Epiphano (2020) doi:10.1007/s11634-020-00412-9 where it was compared with other functional outlier detection methods.
MUOD computes for each curve three indices: amplitude, magnitude and shape indices. Then a cutoff is determined for each set of indices and outliers are identified in each set of index. Outliers identified in the magnitude indices are flagged as magnitude outliers. The same holds true for the amplitude and shape indices. Thus, the outliers are not only identified but also classified.
Returns a list containing the following
outliers |
a vector containing the indices of outliers identified. |
indices |
a dataframe containing the shape, magnitude and amplitude indices |
Azcorra, A., Chiroque, L. F., Cuevas, R., Anta, A. F., Laniado, H., Lillo, R. E., Romo, J., & Sguera, C. (2018). Unsupervised scalable statistical method for identifying influential users in online social networks. Scientific reports, 8(1), 1-7.
dt1 <- simulation_model1() md <- muod(dts = dt1$data) str(md$outliers) dim(md$indices)
dt1 <- simulation_model1() md <- muod(dts = dt1$data) str(md$outliers) dim(md$indices)
Support function for plotting data generated by any of the simulation model
functions simulation_model1() - simulation_model9()
.
plot_dtt( y, grid_points, p, true_outliers, show_legend, plot_title, title_cex, ylabel, xlabel, legend_pos = "bottomright" )
plot_dtt( y, grid_points, p, true_outliers, show_legend, plot_title, title_cex, ylabel, xlabel, legend_pos = "bottomright" )
y |
A matrix of |
grid_points |
A vector of the evaluation/domain points of |
p |
A value indicating the number of evaluation/domain points |
true_outliers |
An integer vector indicating the indices of the true outliers |
show_legend |
A logical indicating whether to add legend to plot if |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
legend_pos |
A character value indicating the position of the
legend. Can be one of |
Helper function to compute the random projection depth of multivariate point(s) with respect to a multivariate data.
projection_depth(dts, dt = dts, n_projections = 500L, seed = NULL)
projection_depth(dts, dt = dts, n_projections = 500L, seed = NULL)
dts |
A matrix or data frame of size |
dt |
A matrix or dataframe of size |
n_projections |
The number of directions for random projections. By default, 500 random directions for projection are generated from a scaled uniform distribution between -1 and 1. |
seed |
The random seed to set when generating the random directions. Defaults to NULL. |
A vector containing the depth values of dts
with respect to dt
.
Oluwasegun Taiwo Ojo
msplot
for outlier detection using msplot and dir_out
for directional outlyingness.
projection_depth(dts = iris[1:5, -5], dt = iris[1:10, -5], n_projection = 7, seed = 20)
projection_depth(dts = iris[1:5, -5], dt = iris[1:10, -5], n_projection = 7, seed = 20)
This method finds and classify outliers using sequential transformations proposed in Algorithm 1 of Dai et al. (2020) doi:10.1016/j.csda.2020.106960. A sequence of transformations are applied to the functional data and after each transformation, a functional boxplot is applied on the transformed data and outliers flagged by the functional data are noted. A number of transformations mentioned in Dai et al. (2020) doi:10.1016/j.csda.2020.106960 are supported including vertical alignment ("T1(X)(t)"), normalization ("T2(X)(t)"), one order of differencing ("D1(X)(t)" and "D2(X)(t)") and point-wise outlyingness data ("O(X)(t)"). The feature alignment transformation based on warping/curve registration is not yet supported.
seq_transform( dts, sequence = c("T0", "T1", "T2"), depth_method = c("mbd", "tvd", "extremal", "dirout", "linfinity", "bd", "erld", "dq"), save_data = FALSE, emp_factor = 1.5, central_region = 0.5, erld_type = NULL, dq_quantiles = NULL, n_projections = 200L, seed = NULL )
seq_transform( dts, sequence = c("T0", "T1", "T2"), depth_method = c("mbd", "tvd", "extremal", "dirout", "linfinity", "bd", "erld", "dq"), save_data = FALSE, emp_factor = 1.5, central_region = 0.5, erld_type = NULL, dq_quantiles = NULL, n_projections = 200L, seed = NULL )
dts |
A matrix for univariate functional data (of size |
sequence |
A character vector usually of length between 1 and 6 containing any of the strings:
Examples of sequences of transformations include: |
depth_method |
A character value specifying depth/outlyingness method to use in the functional boxplot applied after each stage of transformation. Note that the same depth/outlyingness method is used in the functional boxplot applied after each transformation in the sequence. The following methods are currently supported:
|
save_data |
A logical. If TRUE, the intermediate transformed data are returned in a list. |
emp_factor |
The empirical factor for functional boxplot. Defaults to 1.5. |
central_region |
A value between 0 and 1 indicating the central region probability for functional_boxplot. Defaults to 0.5. |
erld_type |
If |
dq_quantiles |
If |
n_projections |
An integer indicating the number of random projections to use in computing the point-wise outlyingness if a 3-d array
is specified in |
seed |
The random seed to set when generating the random directions in the computation of the point-wise outlyingness. Defaults to NULL. in which case a seed is not set. |
This function implements outlier detection using sequential transformations
described in Algorithm 1 of Dai et al. (2020) doi:10.1016/j.csda.2020.106960.
A sequence of transformations are applied consecutively with the functional
boxplot applied on the transformed data after each transformation. The following
example sequences (and their meaning) suggested in Dai et al. (2020)
doi:10.1016/j.csda.2020.106960 can be parsed to argument sequence
.
"T0"
Apply functional boxplot on raw data (no transformation is applied).
c("T0", "T1", "D1")
Apply functional boxplot on raw data, then apply vertical alignment on data followed by applying functional boxplot again. Finally apply one order of differencing on the vertically aligned data and apply functional boxplot again.
c("T0", "T1", "T2")
Apply functional boxplot on raw data, then apply vertical alignment on data followed by applying functional boxplot again. Finally apply normalization using L-2 norm on the vertically aligned data and apply functional boxplot again.
c("T0", "D1", "D2")
Apply functional boxplot on raw data, then apply one order of difference on data followed by applying
functional boxplot again. Finally apply another one order of differencing on the differenced data and apply functional boxplot again.
Note that this sequence of transformation can also be (alternatively) specified by c("T0", "D1", "D1")
, c("T0", "D2", "D2")
, and
c("T0", "D2", "D1")
since "D1"
and "D2"
do the same thing which is to apply one order lag-1 difference on the data.
"O"
Find the pointwise outlyingness of the multivariate or univariate functional data and then apply functional boxplot on the resulting univariate functional data of pointwise outlyingness. Care must be taken to specify a one sided ordering function (i.e. "one_sided_right" extreme rank length depth) in the functional boxplot used on the data of point-wise outlyingness. This is because only large values should be considered extreme in the data of the point-wise outlyingness.
For multivariate functional data (when a 3-d array is supplied to dts
), the sequence of transformation must always begin with "O"
so that the multivariate data can be replaced with the univariate data of point-wise outlyingness which the functional boxplot can subsequently process
because the functional_boxplot
function only supports univariate functional data.
If repeated transformations are used in the sequence (e.g. when sequence = c("T0", "D1", "D1")
), a warning message is thrown
and the labels of the output list are changed (e.g. for sequence = c("T0", "D1", "D1")
, the labels of the output lists
become "T0", "D1_1", "D1_2"
, so that outliers are accessed with output$outlier$D1_1
and output$outlier$D1_2
).
See examples for more.
A list containing two lists are returned. The contents of the returned list are:
outliers: |
A named list of length |
transformed_data |
If |
# same as running a functional boxplot dt1 <- simulation_model1() seqobj <- seq_transform(dt1$data, sequence = "T0", depth_method = "mbd") seqobj$outliers$T0 functional_boxplot(dt1$data, depth_method = "mbd")$outliers # more sequences dt4 <- simulation_model4() seqobj <- seq_transform(dt4$data, sequence = c("T0", "D1", "D2"), depth_method = "mbd") seqobj$outliers$T0 # outliers found in raw data seqobj$outliers$D1 # outliers found after differencing data the first time seqobj$outliers$D2 # outliers found after differencing the data the second time # saving transformed data seqobj <- seq_transform(dt4$data, sequence = c("T0", "D1", "D2"), depth_method = "mbd", save_data = TRUE) seqobj$outliers$T0 # outliers found in raw data head(seqobj$transformed_data$T0) # the raw data head(seqobj$transformed_data$D1) # the first order differenced data head(seqobj$transformed_data$D2) # the 2nd order differenced data # double transforms e.g. c("T0", "D1", "D1") seqobj <- seq_transform(dt4$data, sequence = c("T0", "D1", "D1"), depth_method = "mbd", save_data = TRUE) # throws warning seqobj$outliers$T0 # outliers found in raw data seqobj$outliers$D1_1 #found after differencing data the first time seqobj$outliers$D1_2 #found after differencing data the second time head(seqobj$transformed_data$T0) # the raw data head(seqobj$transformed_data$D1_1) # the first order differenced data head(seqobj$transformed_data$D1_2) # the 2nd order differenced data # multivariate data dtm <- array(0, dim = c(dim(dt1$data), 2)) dtm[,,1] <- dt1$data dtm[,,2] <- dt1$data seqobj <- seq_transform(dtm, sequence = "O", depth_method = "erld", erld_type = "one_sided_right", save_data = TRUE) seqobj$outliers$O # multivariate outliers head(seqobj$transformed_data$O) # univariate outlyingness data
# same as running a functional boxplot dt1 <- simulation_model1() seqobj <- seq_transform(dt1$data, sequence = "T0", depth_method = "mbd") seqobj$outliers$T0 functional_boxplot(dt1$data, depth_method = "mbd")$outliers # more sequences dt4 <- simulation_model4() seqobj <- seq_transform(dt4$data, sequence = c("T0", "D1", "D2"), depth_method = "mbd") seqobj$outliers$T0 # outliers found in raw data seqobj$outliers$D1 # outliers found after differencing data the first time seqobj$outliers$D2 # outliers found after differencing the data the second time # saving transformed data seqobj <- seq_transform(dt4$data, sequence = c("T0", "D1", "D2"), depth_method = "mbd", save_data = TRUE) seqobj$outliers$T0 # outliers found in raw data head(seqobj$transformed_data$T0) # the raw data head(seqobj$transformed_data$D1) # the first order differenced data head(seqobj$transformed_data$D2) # the 2nd order differenced data # double transforms e.g. c("T0", "D1", "D1") seqobj <- seq_transform(dt4$data, sequence = c("T0", "D1", "D1"), depth_method = "mbd", save_data = TRUE) # throws warning seqobj$outliers$T0 # outliers found in raw data seqobj$outliers$D1_1 #found after differencing data the first time seqobj$outliers$D1_2 #found after differencing data the second time head(seqobj$transformed_data$T0) # the raw data head(seqobj$transformed_data$D1_1) # the first order differenced data head(seqobj$transformed_data$D1_2) # the 2nd order differenced data # multivariate data dtm <- array(0, dim = c(dim(dt1$data), 2)) dtm[,,1] <- dt1$data dtm[,,2] <- dt1$data seqobj <- seq_transform(dtm, sequence = "O", depth_method = "erld", erld_type = "one_sided_right", save_data = TRUE) seqobj$outliers$O # multivariate outliers head(seqobj$transformed_data$O) # univariate outlyingness data
This is a typical magnitude model in which outliers are shifted from the normal' non-outlying observations. The main model is of the form:
and the contamination model model is of the form:
where ,
is a Gaussian process with zero mean and
covariance function of the form:
(usually with
),
and
is a constant controlling how far the outliers are from the mean
function of the data, usually,
or
.
The domain of the generated functions is over the interval
.
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model1( n = 100, p = 50, outlier_rate = 0.05, mu = 4, q = 8, kprob = 0.5, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 1", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model1( n = 100, p = 50, outlier_rate = 0.05, mu = 4, q = 8, kprob = 0.5, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 1", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
mu |
The mean value of the functions in the main and contamination model.
Set to |
q |
A value indicating the shift of the outliers from the mean function, i.e., the |
kprob |
A value between |
cov_alpha |
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta |
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu |
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dt <- simulation_model1(n = 50, plot = TRUE) dim(dt$data) dt$true_outliers
dt <- simulation_model1(n = 50, plot = TRUE) dim(dt$data) dt$true_outliers
This model generates non-persistent magnitude outliers, i.e., the outliers are magnitude outliers for only a portion of the domain of the functional data. The main model is of the form:
with contamination model of the form:
where: ,
is a Gaussian process with zero mean and
covariance function of the form:
with
,
is a constant controlling how far the outliers are from the mass of the
data,
is an indicator function,
is a uniform random variable between
an interval
, and
is a constant specifying for how
much of the domain the outliers are away from the mean function.
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model2( n = 100, p = 50, outlier_rate = 0.05, mu = 4, q = 8, kprob = 0.5, a = 0.1, b = 0.9, l = 0.05, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 2", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model2( n = 100, p = 50, outlier_rate = 0.05, mu = 4, q = 8, kprob = 0.5, a = 0.1, b = 0.9, l = 0.05, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 2", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
mu |
The mean value of the functions. Set to |
q |
A value indicating the shift of the outliers from the mean function.
Used to control how far the outliers are from the mean function. Set to |
kprob |
A value between |
a , b
|
values values specifying the interval |
l |
the value of |
cov_alpha |
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta |
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu |
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dtt <- simulation_model2(plot = TRUE) dtt$true_outliers dim(dtt$data)
dtt <- simulation_model2(plot = TRUE) dtt$true_outliers dim(dtt$data)
This model generates outliers that are magnitude outliers for a part of the domain. The main model is of the form:
with contamination model of the form:
where: ,
is a Gaussian process with zero mean and
covariance function of the form:
with
,
is a constant controlling how far the outliers are from the mass of the
data,
is an indicator function, and
is a uniform random variable between
an interval
.
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model3( n = 100, p = 50, outlier_rate = 0.05, mu = 4, q = 6, a = 0.1, b = 0.9, kprob = 0.5, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 3", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model3( n = 100, p = 50, outlier_rate = 0.05, mu = 4, q = 6, a = 0.1, b = 0.9, kprob = 0.5, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 3", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
mu |
The mean value of the functions. Set to |
q |
A value indicating the shift of the outliers from the mean function.
Used to control how far the outliers are from the mean function. Set to |
a , b
|
values values specifying the interval |
kprob |
A value between |
cov_alpha |
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta |
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu |
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dt <- simulation_model3(plot = TRUE) dt$true_outliers dim(dt$data)
dt <- simulation_model3(plot = TRUE) dt$true_outliers dim(dt$data)
This models generates outliers defined on the reversed time interval of the main model. The main model is of the form:
with contamination model of the form:
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model4( n = 100, p = 50, outlier_rate = 0.05, mu = 30, m = 3/2, cov_alpha = 0.3, cov_beta = (1/0.3), cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 4", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model4( n = 100, p = 50, outlier_rate = 0.05, mu = 30, m = 3/2, cov_alpha = 0.3, cov_beta = (1/0.3), cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 4", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
mu |
The mean value of the functions. Set to |
m |
the constant |
cov_alpha |
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta |
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu |
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dt <- simulation_model4(plot = TRUE) dt$true_outliers dim(dt$data)
dt <- simulation_model4(plot = TRUE) dt$true_outliers dim(dt$data)
This models generates shape outliers with a different covariance structure from that of the main model. The main model is of the form:
contamination model of the form:
where , and
and
are Gaussian processes with zero mean and
covariance function of the form:
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model5( n = 100, p = 50, outlier_rate = 0.05, mu = 4, cov_alpha = 1, cov_beta = 1, cov_nu = 1, cov_alpha2 = 5, cov_beta2 = 2, cov_nu2 = 0.5, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 5", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model5( n = 100, p = 50, outlier_rate = 0.05, mu = 4, cov_alpha = 1, cov_beta = 1, cov_nu = 1, cov_alpha2 = 5, cov_beta2 = 2, cov_nu2 = 0.5, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 5", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
mu |
The mean value of the functions. Set to |
cov_alpha , cov_alpha2
|
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta , cov_beta2
|
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu , cov_nu2
|
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dt <- simulation_model5(plot = TRUE) dt$true_outliers dim(dt$data)
dt <- simulation_model5(plot = TRUE) dt$true_outliers dim(dt$data)
This models generates shape outliers that have a different shape for a portion of the domain. The main model is of the form:
with contamination model of the form:
where: ,
is a Gaussian process with zero mean
and covariance function of the form:
follows Bernoulli distribution with probability
;
,
,
and
are constants, and
follows
a Uniform distribution between an interval
and
is a constant.
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model6( n = 100, p = 50, outlier_rate = 0.1, mu = 4, q = 1.8, kprob = 0.5, a = 0.25, b = 0.75, cov_alpha = 1, cov_beta = 1, cov_nu = 1, pi_coeff = 0.02, exp_pow = 2, exp_coeff = 50, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 6", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model6( n = 100, p = 50, outlier_rate = 0.1, mu = 4, q = 1.8, kprob = 0.5, a = 0.25, b = 0.75, cov_alpha = 1, cov_beta = 1, cov_nu = 1, pi_coeff = 0.02, exp_pow = 2, exp_coeff = 50, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 6", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
mu |
The mean value of the functions in the main and contamination model.
Set to |
q |
The constant term |
kprob |
The probability |
a , b
|
Values specifying the interval of from which |
cov_alpha |
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta |
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu |
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
pi_coeff |
The constant |
exp_pow |
The constant |
exp_coeff |
The constant |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dt <- simulation_model6(n = 50, plot = TRUE) dim(dt$data) dt$true_outliers
dt <- simulation_model6(n = 50, plot = TRUE) dim(dt$data) dt$true_outliers
This model generates pure shape outliers that are periodic. The main model is of the form:
with contamination model of the form:
where:
, and
is a Gaussian process with
zero mean and covariance function of the form:
is uniformly distributed in an interval
and
,
are constants.
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model7( n = 100, p = 50, outlier_rate = 0.05, mu = 4, sin_coeff = 2, pi_coeff = 4, a = 0.25, b = 0.75, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 7", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model7( n = 100, p = 50, outlier_rate = 0.05, mu = 4, sin_coeff = 2, pi_coeff = 4, a = 0.25, b = 0.75, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 7", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
mu |
The mean value of the functions in the main and contamination model.
Set to |
sin_coeff |
The coefficient |
pi_coeff |
The coefficient |
a , b
|
Values indicating the interval of the uniform distribution from which
|
cov_alpha |
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta |
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu |
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dt <- simulation_model7(n = 50, plot = TRUE) dim(dt$data) dt$true_outliers
dt <- simulation_model7(n = 50, plot = TRUE) dim(dt$data) dt$true_outliers
This model generates pure shape outliers that are periodic. The main model is of the form:
with contamination model of the form:
where , and
is a Gaussian processes with
zero mean and covariance function of the form:
and are constants.
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model8( n = 100, p = 50, outlier_rate = 0.05, pi_coeff = 15, sin_coeff = 2, constant = 2, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 8", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model8( n = 100, p = 50, outlier_rate = 0.05, pi_coeff = 15, sin_coeff = 2, constant = 2, cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 8", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
pi_coeff |
The coefficient |
sin_coeff |
The coefficient |
constant |
The value of the constant |
cov_alpha |
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta |
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu |
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dt <- simulation_model8(plot = TRUE) dim(dt$data) dt$true_outliers
dt <- simulation_model8(plot = TRUE) dim(dt$data) dt$true_outliers
Periodic functions with outliers of different amplitude. The main model is of the form
with contamination model of the form
where ,
,
,
follows uniform distribution in an interval
,
follows uniform distribution in an interval
;
,
follows uniform distribution
in an interval
;
follows Bernoulli distribution
and
is a Gaussian processes with zero mean and covariance
function of the form
Please see the simulation models vignette with
vignette("simulation_models", package = "fdaoutlier")
for more details.
simulation_model9( n = 100, p = 50, outlier_rate = 0.05, kprob = 0.5, ai = c(3, 8), bi = c(1.5, 2.5), ci = c(9, 10.5), cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 9", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
simulation_model9( n = 100, p = 50, outlier_rate = 0.05, kprob = 0.5, ai = c(3, 8), bi = c(1.5, 2.5), ci = c(9, 10.5), cov_alpha = 1, cov_beta = 1, cov_nu = 1, deterministic = TRUE, seed = NULL, plot = F, plot_title = "Simulation Model 9", title_cex = 1.5, show_legend = T, ylabel = "", xlabel = "gridpoints" )
n |
The number of curves to generate. Set to |
p |
The number of evaluation points of the curves. Curves are usually generated
over the interval |
outlier_rate |
A value between |
kprob |
The probability |
ai |
A vector of two values containing |
bi |
A vector of 2 values containing |
ci |
A vector of 2 values containing $c_1i$ and $c_2i$ in the contamination
model. Set to |
cov_alpha |
A value indicating the coefficient of the exponential function
of the covariance matrix, i.e., the |
cov_beta |
A value indicating the coefficient of the terms inside the exponential
function of the covariance matrix, i.e., the |
cov_nu |
A value indicating the power to which to raise the terms inside the exponential
function of the covariance matrix, i.e., the |
deterministic |
A logical value. If |
seed |
A seed to set for reproducibility. |
plot |
A logical value indicating whether to plot data. |
plot_title |
Title of plot if |
title_cex |
Numerical value indicating the size of the plot title relative to the device default.
Set to 1.5 by default. Ignored if |
show_legend |
A logical indicating whether to add legend to plot if |
ylabel |
The label of the y-axis. Set to |
xlabel |
The label of the x-axis if |
A list containing:
data |
a matrix of size |
true_outliers |
a vector of integers indicating the row index of the outliers in the generated data. |
dt <- simulation_model9(plot = TRUE) dim(dt$data) dt$true_outliers
dt <- simulation_model9(plot = TRUE) dim(dt$data) dt$true_outliers
A dataset containing daily temperature, log precipitation and wind speed of 73 spanish weather stations in Spain between 1980 - 2009.
spanish_weather
spanish_weather
A list containing :
$station_info:
A dataframe containing geographic information from the 73 weather stations with the following variables:
ind:
id of weather station
name:
name of weather station
province:
province of weather station
altitude:
altitude in meters of the station
year.ini:
start year
year.end:
end year
longitude:
longitude of the coordinates of the weather station (in decimal degrees)
longitude:
latitude of the coordinates of the weather station (in decimal degrees)
$temperature:
A matrix of size 73 (stations) by 365 (days) containing average daily temperature for the period 1980-2009 (in degrees Celsius, marked with UTF-8 string). Leap years temperatures for February 28 and 29 were averaged.
$wind_speed:
A matrix of size 73 (stations) by 365 (days) containing average daily wind speed for the period 1980-2009 (in m/s).
$logprec:
A matrix of size 73 (stations) by 365 (days) containing average daily log precipitation
for the period 1980-2009 (in log mm). Negligible precipitation (less than 1 tenth of mm)
is replaced by 0.05 and no precipitation (0.0
mm) is replaced by 0.01
after which
the logarithm was applied.
This is a stripped down version of the popular aemet
spanish weather data available in
the fda.usc
doi:10.18637/jss.v051.i04 package.
See the documentation of fda.usc
for more details about data.
Data obtained from the fda.usc
doi:10.18637/jss.v051.i04 package.
data(spanish_weather) names(spanish_weather) names(spanish_weather$station_info)
data(spanish_weather) names(spanish_weather) names(spanish_weather$station_info)
This function computes the total variation depth (tvd) and the modified shape similarity index (mss) proposed in Huang and Sun (2019) doi:10.1080/00401706.2019.1574241.
total_variation_depth(dts)
total_variation_depth(dts)
dts |
A matrix or dataframe of size |
This function computes the total variation depth (TVD) and modified shape similarity (MSS) index of a univariate functional data. The definition of the estimates of TVD and MSS can be found in Huang and Sun (2019) doi:10.1080/00401706.2019.1574241.
Returns a list containing the following
tvd |
the total variation depths of the observations of |
mss |
the modified shape similarity index of the observations of |
Oluwasegun Ojo
Huang, H., & Sun, Y. (2019). A decomposition of total variation depth for understanding functional outliers. Technometrics, 61(4), 445-458.
tvd_mss
for outlier detection using TVD and MSS.
dt6 <- simulation_model6() tvd_object <- total_variation_depth(dt6$data)
dt6 <- simulation_model6() tvd_object <- total_variation_depth(dt6$data)
Find shape and magnitude outliers using the Total Variation Depth and Modified Shape Similarity Index proposed in Huang and Sun (2019) doi:10.1080/00401706.2019.1574241.
tvd_mss( data, emp_factor_mss = 1.5, emp_factor_tvd = 1.5, central_region_tvd = 0.5 ) tvdmss( dts, emp_factor_mss = 1.5, emp_factor_tvd = 1.5, central_region_tvd = 0.5 )
tvd_mss( data, emp_factor_mss = 1.5, emp_factor_tvd = 1.5, central_region_tvd = 0.5 ) tvdmss( dts, emp_factor_mss = 1.5, emp_factor_tvd = 1.5, central_region_tvd = 0.5 )
emp_factor_mss |
The empirical factor of the classical boxplot used on the modified shape similarity index. Defaults to 1.5. |
emp_factor_tvd |
The empirical factor of the functional boxplot used on the TVD of observations. Defaults to 1.5. |
central_region_tvd |
A number between 0 and 1 indicating the central region probability of the functional boxplot used on the TVD of the observations. Defaults to 0.5. See also details. |
dts , data
|
A matrix or dataframe of size |
This method uses a combination of total variation depth (TVD) and modified shape similarity (MSS) index
defined in Huang and Sun (2019) doi:10.1080/00401706.2019.1574241
to find magnitude and shape outliers. The TVD and MSS of all the observations are
first computed and a classical boxplot is then applied on the MSS. Outliers detected
by the boxplot of MSS are flagged as shape outliers. The shape outliers are then removed
from the data and the TVD of the remaining observations are used in a functional boxplot
to detect magnitude outliers. The central region
of this functional boxplot (central_region_tvd
) is w.r.t. to the original number of curves. Thus if
8 shape outliers are found out of 100 curves, specifying central_region_tvd
= 0.5 will ensure that
50 observations are used as the central region in the functional boxplot on the remaining 92 observations.
Returns a list containing the following
outliers |
the indices of the (shape and magnitude) outliers |
shape_outliers |
the indices of the shape outliers |
magnitude_outliers |
the indices of the magnitude outliers |
tvd |
the total variation depths of the observations of |
mss |
the modified shape similarity index of the observations of |
tvd_mss()
: Deprecated function. Use tvdmss
instead.
Oluwasegun Ojo
Huang, H., & Sun, Y. (2019). A decomposition of total variation depth for understanding functional outliers. Technometrics, 61(4), 445-458.
msplot
for outlier detection using msplot.
dt6 <- simulation_model6() res <- tvdmss(dt6$data) res$outliers
dt6 <- simulation_model6() res <- tvdmss(dt6$data) res$outliers
This is the world population data, revision 2010, by countries used in the paper Nagy et al. (2016) doi:10.1080/10618600.2017.1336445 and Dai et al. (2020) doi:10.1016/j.csda.2020.106960. It contains population (both sexes) of countries as of July 1 in the years 1950 - 2010. The data have been pre-processed as described in Nagy et al. (2016) and hence contains only the 105 countries with population in the range of one million and fifteen million on July 1, 1980.
world_population
world_population
A matrix of size 105 rows by 61 columns.
Data included for illustration and testing purposes.
Data originally available in the depth.fd
package.
data(world_population) str(world_population)
data(world_population) str(world_population)