Title: | Graphical Univariate/Multivariate Assessments for Normality Assumption |
Version: | 1.0.1 |
Maintainer: | Huong Tran <quynhhuong5335@gmail.com> |
Description: | Graphical methods for testing the multivariate normality assumption. Methods include assessment of the score function and the moment generating function, independent transformations, and linear transformations. For more details see Tran (2024), "Contributions to Multivariate Data Science: Assessment and Identification of Multivariate Distributions and Supervised Learning for Groups of Objects", PhD thesis, https://our.oakland.edu/items/c8942577-2562-4d2f-8677-cb8ec0bf6234. |
License: | MIT + file LICENSE |
Encoding: | UTF-8 |
Imports: | Rdpack, ggplot2, stats, Matrix, MatrixExtra, MASS, rlang |
RdMacros: | Rdpack |
Depends: | R (≥ 3.5.0) |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
RoxygenNote: | 7.3.1 |
VignetteBuilder: | knitr |
SysDataCompression: | xz |
NeedsCompilation: | no |
Packaged: | 2025-03-24 14:00:35 UTC; huongtran |
Author: | Huong Tran |
Repository: | CRAN |
Date/Publication: | 2025-03-25 09:10:05 UTC |
Transformation to Independent Univariate Sample
Description
The leave-one-out method gives an approximately independent sample from the standard multivariate normal distribution, which in turn yields a sample from the standard univariate normal distribution.
Usage
Multi.to.Uni(x)
Arguments
x |
multivariate data matrix |
Details
Let \bar{X}_{-k} and S_{-k} be the sample mean and the sample variance-covariance matrix obtained by using all but the k^{th} data point. Then

S_{-k}^{-1/2} (X_k - \bar{X}_{-k}), \quad k = 1, \dots, n

are approximately independently distributed as N_p(0, I). Thus all n \times p entries of the data matrix so constructed can be treated as a univariate sample of size n \times p from N(0, 1).
Value
A data frame containing the univariate data and the corresponding index from the multivariate data.
Examples
set.seed(1)
x <- MASS::mvrnorm(100, mu = rep(0, 5), diag(5))
df <- Multi.to.Uni(x)
qqnorm(df$x.new); abline(0, 1)
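As a hedged illustration of the leave-one-out standardization described in Details (not the package's internal code), a single observation can be standardized by hand:

# Standardize the k-th row using the mean and covariance of all other rows
set.seed(1)
x <- MASS::mvrnorm(100, mu = rep(0, 3), Sigma = diag(3))
k <- 1
x_mk <- x[-k, , drop = FALSE]
xbar_mk <- colMeans(x_mk)
S_mk <- stats::cov(x_mk)
e <- eigen(S_mk, symmetric = TRUE)
S_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
as.vector(S_inv_sqrt %*% (x[k, ] - xbar_mk))   # approximately N_3(0, I)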
Graphical plots to assess the multivariate normality assumption.
Description
The cumulant generating function of a normally distributed random vector has all derivatives of order three and higher equal to 0. Hence, plots of the empirical third/fourth order derivatives with large values or a high slope give an indication of non-normality.
d3hCGF_plot and d4hCGF_plot estimate and provide a confidence region for the average (or any linear combination) of the third/fourth derivatives of the empirical cumulant generating function at the points t = t^* 1_p. Plots for p = 2, 3, \dots, 10 are faster to obtain, as confidence regions and other necessary parameters are available in mt3_lst_param.rda and mt4_lst_param.rda. Higher dimensions incur an expensive computational cost.
Usage
d3hCGF_plot(x, alpha = 0.05)
d4hCGF_plot(x, alpha = 0.05)
Arguments
x |
Data matrix of size n \times p. |
alpha |
Significance level (default is 0.05). |
Value
d3hCGF_plot returns a plot relying on the third derivatives.
d4hCGF_plot returns a plot relying on the fourth derivatives.
Examples
set.seed(1234)
p <- 3
x <- MASS::mvrnorm(500, rep(0, p), diag(p))
d3hCGF_plot(x)
d4hCGF_plot(x)
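As a hedged illustration (not part of the original examples), a clearly skewed sample is expected to push the curves outside the confidence region; the exact appearance depends on the sample:

# Skewed, non-normal data for contrast
set.seed(1234)
x_nn <- matrix(rexp(500 * 3), ncol = 3)
d3hCGF_plot(x_nn)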
Graphical plots to assess the univariate normality assumption of data.
Description
Plots the empirical third/fourth derivatives of the cumulant generating function together with a confidence (probability) band. An indication of non-normality is either a violation of the probability bands or a curve with a high slope.
Usage
dhCGF_plot1D(x, alpha = 0.05, method)
Arguments
x |
Univariate data. |
alpha |
Significance level (default is 0.05). |
method |
character string, either "T3" (third derivatives) or "T4" (fourth derivatives). |
Value
A plot of the chosen empirical derivative together with its probability band.
References
Ghosh S (1996). “A New Graphical Tool to Detect Non-Normality.” Journal of the Royal Statistical Society: Series B (Methodological), 58(4), 691-702. doi:10.1111/j.2517-6161.1996.tb02108.x.
Examples
set.seed(123)
x <- rnorm(100)
dhCGF_plot1D(x, method = "T3")
dhCGF_plot1D(x, method = "T4")
Graphical plots to assess the univariate normality assumption of data.
Description
The score function of a univariate normal distribution is a straight line, so a non-linear graph of the score function estimator is evidence of non-normality.
Outliers are detected using the 2-sigma band method.
Usage
cox(x, P = NULL, lambda = 0.5, x.dist = NULL)
score_plot1D(x, P = NULL, lambda = 0.5, x.dist = NULL, ori.index = NULL)
Arguments
x |
univariate data. |
P |
vector of weights. |
lambda |
smoothing parameter, default is 0.5. |
x.dist |
the minimum distance between two data points in vector x. |
ori.index |
original index of vector x, default is NULL. |
Details
To avoid singularity of the coefficient matrices in the spline method, points with distance less than x.dist are merged, and the weight of the representative point is updated by the sum of the weights of the discarded points.

Under the null hypothesis, an unbiased estimator of the score function at a given data point x_k is

\hat{\psi}(x_k) = \dfrac{n - 4}{n - 2} \dfrac{x_k - \bar{X}_{-k}}{S_{-k}^2},

and if a_{k} is the estimated score from the function cox at the point x_k, then

a_k \in \hat{\psi}(x_k) \pm 2 \sqrt{\widehat{\text{Var}}(\hat{\psi}(x_k))}.

Hence points outside the 2-sigma bands are outliers.
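A minimal numerical sketch of the leave-one-out estimator \hat{\psi}(x_k) above, using only base R (illustrative, not the package's spline-based computation):

set.seed(1)
x <- sort(rnorm(100, 2, 4))
n <- length(x)
k <- 10
(n - 4) / (n - 2) * (x[k] - mean(x[-k])) / var(x[-k])   # hat{psi}(x_k)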
Value
cox returns the estimate of the score function:
x: the updated univariate data if merging happens.
a: score values estimated at x.
P: updated weights (if merging happens).
slt: indices of merged data points (NULL if x.dist = NULL).

score_plot1D returns the score function together with 2-sigma bands for outlier detection:
plot: plot of the estimated score function and its bands.
outlier: indices of outliers.
References
Ng PT (1994). “Smoothing Spline Score Estimation.” SIAM Journal on Scientific Computing, 15(5), 1003-1025. doi:10.1137/0915061.
Examples
set.seed(1)
x <- rnorm(100, 2, 4)
re <- cox(sort(x))
plot(re$x, re$a, xlab = "x", ylab = "Estimated Score",
main = "Estimator of score function")
abline(0, 1)
set.seed(1)
x <- rnorm(100, 2, 4)
score_plot1D(sort(x))
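For contrast, a hedged illustration with skewed data, where the estimated score function is expected to bend away from a straight line:

set.seed(1)
z <- rexp(100)
score_plot1D(sort(z))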
Linear combinations of distinct derivatives of empirical cumulant generating function (CGF).
Description
A linear combination of the third/fourth derivatives of the CGF gives, asymptotically, a univariate Gaussian process with mean 0 whose covariance between two points t \in \mathbb{R}^p and s \in \mathbb{R}^p is defined here.
We consider vectors t and s of the form t = t^* 1_p and s = s^* 1_p.
Usage
mt3_covLtLs(l, p, bigt = seq(-1, 1, 0.05)/sqrt(p), sTtTs = NULL, seed = 1)
mt4_covLtLs(l, p, bigt = seq(-1, 1, 0.05)/sqrt(p), sTtTs = NULL, seed = 1)
Arguments
l |
vector of linear combination coefficients, of size equal to the number of distinct derivatives, see l_dhCGF(). |
p |
dimension of the multivariate random vector from which data are collected. |
bigt |
array of values t^*. |
sTtTs |
covariance matrix of the derivative vector, see mt3_covTtTs() and mt4_covTtTs(). |
seed |
random seed used to obtain the Monte Carlo estimate of the supremum of the univariate Gaussian process obtained from the linear combination. |
Value
sLtLs: covariance matrix of the linear combination of distinct derivatives, which is a zero-mean Gaussian process.
m.supLt: Monte Carlo estimate of the supremum of this Gaussian process.

mt3_covLtLs returns values related to the use of third derivatives.
mt4_covLtLs returns values related to the use of fourth derivatives.
Examples
bigt <- seq(-1, 1, .5)
p <- 2
# Third derivatives
lT3 <- l_dhCGF(p)[[1]]
l3 <- rep(1/sqrt(lT3), lT3)
mt3_covLtLs(l = l3, p = p, bigt = bigt/sqrt(p), seed = 1)
#fourth derivatives
lT4 <- l_dhCGF(p)[[2]]
l4 <- rep(1/sqrt(lT4), lT4)
mt4_covLtLs(l = l4, p = p, bigt = bigt/sqrt(p), seed = 1)
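Assuming, per the Value section above, that the result is a list with components sLtLs and m.supLt, these can be inspected directly (an illustrative addition):

res3 <- mt3_covLtLs(l = l3, p = p, bigt = bigt/sqrt(p), seed = 1)
names(res3)   # expected: sLtLs, m.supLt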
Covariance matrix of derivatives of sample cumulant generating function (CGF).
Description
Stacking the third/fourth derivatives of the sample CGF gives a vector which, under the normality assumption on the data, approaches a normally distributed vector with zero mean and a certain covariance matrix.
More specifically, mt3_covTtTs and mt4_covTtTs compute the covariance between any two points of the form t = t^* 1_p and s = s^* 1_p.
Usage
mt3_covTtTs(bigt, p = 1, pos.matrix = NULL)
mt4_covTtTs(bigt, p = 1, pos.matrix = NULL)
Arguments
bigt |
array containing values of t^*. |
p |
dimension of the multivariate random vector from which data are collected. |
pos.matrix |
matrix containing the position information of the derivatives; see mt3_pos() and mt4_pos(). Default is NULL. |
Details
The number of distinct third derivatives is

l_{T_3} = p + 2\binom{p}{2} + \binom{p}{3}.

The number of distinct fourth derivatives is

l_{T_4} = p + 3\binom{p}{2} + 3\binom{p}{3} + \binom{p}{4}.

For each pair (t^*, s^*), mt3_covTtTs and mt4_covTtTs return a covariance matrix of size l_{T_3} \times l_{T_3} or l_{T_4} \times l_{T_4}, respectively.
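For example, for p = 3 the counting formulas give 10 distinct third and 15 distinct fourth derivatives, which can be cross-checked against l_dhCGF() (an illustrative addition):

p <- 3
p + 2 * choose(p, 2) + choose(p, 3)                      # l_T3 = 10
p + 3 * choose(p, 2) + 3 * choose(p, 3) + choose(p, 4)   # l_T4 = 15
l_dhCGF(p)   # expected to agree with the counts above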
Value
A two-dimensional upper triangular array with size equal to the length of bigt. Each element contains the covariance matrix of the derivative sequences between two points t = t^* 1_p and s = s^* 1_p.
mt3_covTtTs returns the result based on third derivatives.
mt4_covTtTs returns the result based on fourth derivatives.
Examples
bigt <- seq(-1, 1, .5)
p <- 2
# Third derivatives
mt3_pos.matrix <- mt3_pos(p)
sTsTt3 <- mt3_covTtTs(bigt = bigt, p = p, pos.matrix = mt3_pos.matrix)
dim(sTsTt3)
sTsTt3[1:5, 1:5]
# Fourth derivatives
mt4_pos.matrix <- mt4_pos(p)
sTsTt4 <- mt4_covTtTs(bigt = bigt, p = p, pos.matrix = mt4_pos.matrix)
dim(sTsTt4)
sTsTt4[1:5, 1:5]
Covariance matrix of derivatives of sample moment generating function (MGF).
Description
Stacking the derivatives up to the third/fourth order of the sample MGF gives a vector which, under the normality assumption, approaches a multivariate normally distributed vector with zero mean and a certain covariance matrix.
mt3_covZtZs and mt4_covZtZs calculate the covariance between any two points t and s in \mathbb{R}^p.
Usage
mt3_covZtZs(t, s, pos.matrix = NULL)
mt4_covZtZs(t, s, pos.matrix = NULL)
Arguments
t , s |
vectors of length p. |
pos.matrix |
matrix containing the position information of the derivatives. Default is NULL. |
Value
mt3_covZtZs returns the covariance matrix relating to the use of third derivatives.
mt4_covZtZs returns the covariance matrix relating to the use of fourth derivatives; it also contains the information in mt3_covZtZs.
Examples
set.seed(1)
p <- 3
x <- MASS::mvrnorm(100, rep(0, p), diag(p))
t <- rep(0.2, p)
s <- rep(-.3, p)
# Using third derivatives
pos.matrix3 <- mt3_pos(p)
sZtZs3 <- mt3_covZtZs(t, s, pos.matrix = pos.matrix3)
dim(sZtZs3)
sZtZs3[1:5, 1:5]
# Using fourth derivatives
sZtZs4 <- mt4_covZtZs(t, s)
dim(sZtZs4)
sZtZs4[1:5, 1:5]
Calculation of derivatives of empirical cumulant generating function (CGF).
Description
Get the third/fourth derivatives of the sample CGF at a given point.
Usage
d3hCGF(myt, x)
d4hCGF(myt, x)
l_dhCGF(p)
dhCGF1D(t, x)
Arguments
myt , t |
numeric vector of length p (a single value for dhCGF1D). |
x |
data matrix. |
p |
Dimension. |
Details
The estimator of the standardized cumulant generating function is

\log\hat{M}_X(t) = \log \left(\dfrac{1}{n} \sum_{i = 1}^n \exp(t'S^{-1/2}(X_i - \bar{X})) \right)

and its k^{th} order derivatives are defined as

T_k(t) = \dfrac{\partial^k}{\partial t_{j_1} \partial t_{j_2} \dots \partial t_{j_k}} \log(\hat{M}_X(t)), \quad t \in \mathbb{R}^p,

where t_{j_1}, t_{j_2}, \dots, t_{j_k} are the corresponding components of the vector t \in \mathbb{R}^p.
Value
d3hCGF returns the sequence of third derivatives of the empirical CGF, ordered by index j_1 \leq j_2 \leq j_3 \leq p.
d4hCGF returns the sequence of fourth derivatives of the empirical CGF, ordered by index j_1 \leq j_2 \leq j_3 \leq j_4 \leq p.
l_dhCGF returns the numbers of distinct third and fourth derivatives.
dhCGF1D returns the third/fourth derivatives of the univariate empirical CGF, i.e. d3hCGF and d4hCGF when p = 1.
Examples
p <- 3
# Number of distinct derivatives
l_dhCGF(p)
set.seed(1)
x <- MASS::mvrnorm(100, rep(0, p), diag(p))
myt <- rep(.2, p)
d3hCGF(myt = myt, x = x)
d4hCGF(myt = myt, x = x)
#Univariate data
set.seed(1)
x <- rnorm(100)
t <- .3
dhCGF1D(t, x)
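A hedged consistency check (illustrative addition): the ordering described in Value implies the lengths of the returned sequences should match the counts from l_dhCGF(p).

set.seed(1)
p <- 3
x <- MASS::mvrnorm(100, rep(0, p), diag(p))
myt <- rep(.2, p)
length(d3hCGF(myt = myt, x = x)) == l_dhCGF(p)[[1]]
length(d4hCGF(myt = myt, x = x)) == l_dhCGF(p)[[2]]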
Moment generating functions (MGF) of standard normal distribution.
Description
Get the polynomial term in the expression of the derivatives of the moment generating function of N_p(0, I_p), with respect to a given component and its exponent, up to the eighth order.
Usage
dMGF(tab, t, coef = TRUE)
Arguments
tab |
a data frame whose first column contains the indices of components of the multivariate random vector and whose second column contains the corresponding exponents. |
t |
vector in \mathbb{R}^p. |
coef |
logical; default is TRUE. |
Details
For a standard multivariate normal random vector Y \sim N_p(0, I_p),

\mathbb{E}\left(Y_1^{k_1} \cdots Y_p^{k_p} \exp(t'Y)\right) =
\dfrac{\partial^{k_1} \cdots \partial^{k_p}}{\partial t_1^{k_1} \cdots \partial t_p^{k_p}} \exp(t't/2) =
\mu^{(k_1)}(t_1) \cdots \mu^{(k_p)}(t_p) \exp(t't/2).

For example,

\mathbb{E} Y_2^4 \exp(t'Y) = \dfrac{\partial^4}{\partial t_2^4} \exp(t't/2) = \mu^{(4)}(t_2) \exp(t't/2).
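For concreteness, repeatedly differentiating \exp(t_2^2/2) gives the explicit polynomial factor in this example (a standard calculus step, not package output):

\mu^{(4)}(t_2) = \dfrac{d^4}{dt_2^4} \exp(t_2^2/2) \Big/ \exp(t_2^2/2) = t_2^4 + 6t_2^2 + 3,

so \mathbb{E} Y_2^4 \exp(t'Y) = (t_2^4 + 6t_2^2 + 3) \exp(t't/2).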
Value
Value of derivatives.
Examples
#Calculation of above example
t <- rep(.2, 7)
tab <- data.frame(j = 2, exponent = 4)
dMGF(tab, t = t)
dMGF(tab, t = t, coef = FALSE)
Get parameters for plotting derivatives of the multivariate CGF to assess the normality assumption.
Description
Obtain necessary parameters to build a graphical test using the third/fourth derivatives of cumulant generating function.
Usage
mt3_get_param(p, bigt = seq(-1, 1, by = 0.05)/sqrt(p), l = NULL)
mt4_get_param(p, bigt = seq(-1, 1, by = 0.05)/sqrt(p), l = NULL)
Arguments
p |
Dimension. |
bigt |
Array containing values of t^*. |
l |
Linear combination (transformation) of the vector of distinct third/fourth derivatives; default is their average. |
Value
p: Dimension.
lT: number of distinct third/fourth order derivatives.
sTtTs: two-dimensional array; each element contains the covariance matrix of the vector of derivatives, as computed by mt3_covTtTs() or mt4_covTtTs().
l.sTtTs: covariance matrix of the linear combination of distinct derivatives, as computed by mt3_covLtLs() or mt4_covLtLs().
m.supLT: Monte Carlo estimate of the expected value of the supremum of the Gaussian process, see covLtLs().

mt3_get_param returns the necessary parameters for the 2D plot relying on third derivatives.
mt4_get_param returns the necessary parameters for the 2D plot relying on fourth derivatives.
See Also
covZtZs(), covLtLs(), covTtTs()
Examples
p <- 2
mt3 <- mt3_get_param(p, bigt = seq(-1, 1, .5)/sqrt(p))
names(mt3)
mt4 <- mt4_get_param(p, bigt = seq(-1, 1, .5)/sqrt(p))
names(mt4)
Best Linear Transformations
Description
The algorithm uses gradient descent to obtain the maximum of the squared sample skewness, of the sample kurtosis, or of their average, over all univariate linear transformations of the multivariate data.
Usage
linear_transform(
x,
l0 = rep(1, ncol(x)),
method = "both",
epsilon = 1e-10,
iter = 5000,
stepsize = 0.001
)
Arguments
x |
multivariate data matrix. |
l0 |
starting point for the projection algorithm, default is rep(1, ncol(x)). |
method |
character string, one of "skewness", "kurtosis" or "both" (default). |
epsilon |
bound on the error of the optimal solution, default is 1e-10. |
iter |
number of iterations of the projection algorithm, default is 5000. |
stepsize |
gradient descent step size, default is 0.001. |
Value
max_result: the maximum value after linear transformation.
x_uni: univariate data after transformation.
vector_k: vector of the "best" linear transformation.
error: error of the projection algorithm.
iteration: number of iterations.
Examples
set.seed(1)
x <- MASS::mvrnorm(100, mu = rep(0, 2), diag(2))
linear_transform(x, method = "skewness")$max_result
linear_transform(x, method = "kurtosis")$max_result
linear_transform(x, method = "both")$max_result
From derivatives of MGF to derivatives of CGF.
Description
A Taylor expansion implies that the vector of derivatives of \log(\hat{M}_X(t)) can be approximated by a linear combination of the vector of derivatives of \hat{M}_X(t).
mt3_matrix_A and mt4_matrix_A return the corresponding coefficient matrices.
Usage
mt3_matrix_A(t)
mt4_matrix_A(t)
Arguments
t |
vector of length p. |
Value
mt3_matrix_A
returns coefficient matrix relating to the use
of third derivatives.
mt4_matrix_A
returns coefficient matrix relating to the
use of fourth derivatives.
Examples
p <- 3
t <- rep(.2, p)
A3 <- mt3_matrix_A(t)
dim(A3)
A3[1:5, 1:5]
A4 <- mt4_matrix_A(t)
dim(A4)
A4[1:5, 1:5]
Derivatives of empirical moment generating function (MGF).
Description
Given the dimension p, returns a data frame containing the positions of all derivatives of the estimator of the moment generating function \hat{M}_X(t), up to the third/fourth order.
Usage
mt3_rev_pos(j1, j2, j3, p)
mt3_pos(p)
mt4_pos(p)
Arguments
j1 |
Index of the first variable. |
j2 |
Index of the second variable, should be at least j1. |
j3 |
Index of the third variable, should be at least j2. |
p |
Dimension. |
Details
The estimator of the multivariate moment generating function is

\hat{M}_X(t) = \dfrac{1}{n} \sum_{i = 1}^n \exp(t'X_i).

The chain containing all derivatives up to the third order is

Z = \bigg(\hat{M}, \hat{M}^{001}, \dots, \hat{M}^{00p}, \hat{M}^{011}, \hat{M}^{012}, \dots, \hat{M}^{0pp}, \hat{M}^{111}, \hat{M}^{112}, \dots, \hat{M}^{ppp}\bigg)'

where

\hat{M} = \hat{M}^{000}(t) = \hat{M}_X(t),

\hat{M}^{j_1 j_2 j_3}(t) = \dfrac{\partial^k}{\partial t_{j_1} \partial t_{j_2} \partial t_{j_3}} \hat{M}(t),

and k is the number of indices among j_1, j_2, j_3 that are different from 0. Similar notation applies when fourth derivatives are used.
Value
mt3_rev_pos returns the position of the specified derivative in the chain of all derivatives up to the third order.
mt3_pos returns an array containing all positions with respect to the indices j_1, j_2, j_3.
mt4_pos returns an array containing all positions with respect to the indices j_1, j_2, j_3, j_4.
Examples
mt3_rev_pos(1, 2, 2, p = 3)
p <- 3
mt3_pos(p)
mt4_pos(p)
Sample skewness and Sample Kurtosis.
Description
Sample skewness and Sample Kurtosis.
Usage
kurtosis(x)
skewness(x)
Arguments
x |
univariate data sample |
Details
Sample kurtosis is

\hat{\kappa}_4 = \dfrac{1}{n-1} \sum_{i = 1}^n \left(\dfrac{X_i - \bar{X}}{S}\right)^4.

Sample skewness is

\hat{\kappa}_3 = \dfrac{1}{n-1} \sum_{i = 1}^n \left(\dfrac{X_i - \bar{X}}{S}\right)^3.
Value
kurtosis
returns sample kurtosis.
skewness
returns sample skewness.
Examples
set.seed(123)
y <- rnorm(100)
kurtosis(y)
set.seed(123)
x <- rnorm(100)
skewness(x)
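A hedged cross-check of the formulas above, assuming S denotes the usual sample standard deviation sd() (illustrative addition):

set.seed(123)
y <- rnorm(100)
sum(((y - mean(y)) / sd(y))^4) / (length(y) - 1)   # compare with kurtosis(y)
kurtosis(y)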