Type: | Package |
Title: | Statistical Approach to Outlier Detection in RNA-Seq and Related Data |
Version: | 1.0.0 |
Date: | 2024-11-15 |
Description: | An approach to outlier detection in RNA-seq and related data based on five statistics. 'OutSeekR' implements an outlier test by comparing the distributions of these statistics in observed data with those of simulated null data. |
Depends: | R (≥ 2.10) |
Imports: | future.apply, gamlss, gamlss.dist, lsa, truncnorm |
Suggests: | future, knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
License: | GPL-2 |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
VignetteBuilder: | knitr |
NeedsCompilation: | no |
Packaged: | 2024-11-15 23:56:34 UTC; danknight |
Author: | Jee Yun Han [aut], John Sahrmann [aut], Jaron Arbet [ctb], Paul Boutros [aut, cre, cph] |
Maintainer: | Paul Boutros <pboutros@mednet.ucla.edu> |
Repository: | CRAN |
Date/Publication: | 2024-11-19 09:10:05 UTC |
Calculate p-values
Description
Calculate p-values for each sample of a single transcript.
Usage
calculate.p.values(
    x,
    x.distribution,
    x.zrange.mean,
    x.zrange.median,
    x.zrange.trimmean,
    x.fraction.kmeans,
    x.cosine.similarity,
    null.zrange.mean,
    null.zrange.median,
    null.zrange.trimmean,
    null.fraction.kmeans,
    null.cosine.similarity,
    kmeans.nstart = 1
)
Arguments
x | A numeric vector of values for an observed transcript. |
x.distribution | A numeric code corresponding to the optimal distribution of x (1 = normal, 2 = log-normal, 3 = exponential, 4 = gamma). |
x.zrange.mean | A number, the range of the z-scores calculated using the mean and standard deviation of x. |
x.zrange.median | A number, the range of the z-scores calculated using the median and median absolute deviation of x. |
x.zrange.trimmean | A number, the range of the z-scores calculated using the trimmed mean and trimmed standard deviation of x. |
x.fraction.kmeans | A number, the k-means fraction of x. |
x.cosine.similarity | A number, the cosine similarity of x. |
null.zrange.mean | A numeric vector, the ranges of the z-scores calculated using the mean and standard deviation of each transcript in the null data. |
null.zrange.median | A numeric vector, the ranges of the z-scores calculated using the median and median absolute deviation of each transcript in the null data. |
null.zrange.trimmean | A numeric vector, the ranges of the z-scores calculated using the trimmed mean and trimmed standard deviation of each transcript in the null data. |
null.fraction.kmeans | A numeric vector, the k-means fraction of each transcript in the null data. |
null.cosine.similarity | A numeric vector, the cosine similarity of each transcript in the null data. |
kmeans.nstart | The number of random starts when computing k-means fraction; default is 1. See kmeans(). |
Value
A list consisting of the following entries:
- p.values: a vector of p-values for the outlier test run on each sample (up until the p-value exceeds p.value.threshold); and
- outlier.statistics.list: a list of vectors containing the values of the outlier statistics calculated from the remaining samples. The list will be of length equal to one plus the total number of outliers (i.e., the number of samples with an outlier test p-value less than p.value.threshold) and will contain entries outlier.statistics.N, where N is between zero and the total number of outliers. outlier.statistics.N is the vector of outlier statistics after excluding the Nth outlier sample, with outlier.statistics.0 being for the complete transcript.
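To give a feel for what comparing an observed statistic with simulated null values can look like, here is a minimal sketch of a one-sided Monte Carlo p-value for a single statistic. It is an assumption for illustration only: the helper empirical.p.value() is hypothetical, and the package's own calculation combines five statistics and is not reproduced here.

```r
# Hypothetical helper (not part of OutSeekR): one-sided Monte Carlo
# p-value for one statistic.
# `observed` is a single number; `null` is a numeric vector holding the
# same statistic computed on simulated null transcripts.
empirical.p.value <- function(observed, null) {
    null <- null[!is.na(null)];
    # Add one to numerator and denominator so the p-value is never zero.
    (1 + sum(null >= observed)) / (1 + length(null));
}

set.seed(1234);
null.zrange <- rnorm(n = 1000, mean = 4, sd = 0.5);
p <- empirical.p.value(observed = 7, null = null.zrange);
```

With this construction the smallest attainable p-value is 1 / (length(null) + 1), so more null values give finer resolution.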
Examples
data(example.data.for.calculate.p.values);
i <- 1; # row index of transcript to test
calculate.p.values(
    x = example.data.for.calculate.p.values$data[i,],
    x.distribution = example.data.for.calculate.p.values$x.distribution[i],
    x.zrange.mean = example.data.for.calculate.p.values$x.zrange.mean[i],
    x.zrange.median = example.data.for.calculate.p.values$x.zrange.median[i],
    x.zrange.trimmean = example.data.for.calculate.p.values$x.zrange.trimmean[i],
    x.fraction.kmeans = example.data.for.calculate.p.values$x.fraction.kmeans[i],
    x.cosine.similarity = example.data.for.calculate.p.values$x.cosine.similarity[i],
    null.zrange.mean = example.data.for.calculate.p.values$null.zrange.mean,
    null.zrange.median = example.data.for.calculate.p.values$null.zrange.median,
    null.zrange.trimmean = example.data.for.calculate.p.values$null.zrange.trimmean,
    null.fraction.kmeans = example.data.for.calculate.p.values$null.fraction.kmeans,
    null.cosine.similarity = example.data.for.calculate.p.values$null.cosine.similarity,
    kmeans.nstart = example.data.for.calculate.p.values$kmeans.nstart
);
Calculate residuals
Description
Calculate residuals between quantiles of the input and quantiles of one of four distributions: normal, log-normal, exponential, or gamma.
Usage
calculate.residuals(x, distribution)
Arguments
x | A numeric vector. |
distribution | A number corresponding to the optimal distribution of x (1 = normal, 2 = log-normal, 3 = exponential, 4 = gamma). |
Value
A numeric vector of the same length as x. Names are not retained.
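As a rough illustration of quantile residuals, the sketch below compares sorted observations with the quantiles of a normal distribution fitted by moments. The helper sketch.quantile.residuals() is hypothetical, and the fitting method, the plotting positions from ppoints(), and the sign convention (observed minus theoretical) are all assumptions rather than the package's implementation.

```r
# Hypothetical sketch (not part of OutSeekR): residuals between sorted
# observations and quantiles of a normal distribution fitted by moments.
sketch.quantile.residuals <- function(x) {
    x <- sort(unname(x));                 # names are dropped
    p <- ppoints(length(x));              # plotting positions in (0, 1)
    theoretical <- qnorm(p, mean = mean(x), sd = sd(x));
    x - theoretical;                      # assumed sign convention
}

set.seed(1234);
x <- rgamma(n = 20, shape = 2, scale = 2);
r <- sketch.quantile.residuals(x);
```

Because the data here are gamma-distributed, the largest residuals appear in the right tail, where a normal fit underestimates the observed quantiles.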
Examples
# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
);
names(x) <- paste(
    'Sample',
    seq_along(x),
    sep = '.'
);
calculate.residuals(
    x = x,
    distribution = 4
);
Detect outliers
Description
Detect outliers in normalized RNA-seq data.
Usage
detect.outliers(
    data,
    num.null = 1000,
    initial.screen.method = c("fdr", "p.value"),
    p.value.threshold = 0.05,
    fdr.threshold = 0.01,
    kmeans.nstart = 1
)
Arguments
data | A matrix or data frame of normalized RNA-seq data, organized with transcripts on rows and samples on columns. Transcript identifiers should be stored as row names. |
num.null | The number of transcripts to generate when simulating from null distributions; default is 1000. We recommend using at least 10,000 iterations for publication-level results, with 100,000 or even one million iterations providing more robust estimates. |
initial.screen.method | The statistical criterion for initial gene selection; valid options are "fdr" and "p.value". |
p.value.threshold | The p-value threshold for the outlier test; default is 0.05. Once the p-value for a sample exceeds p.value.threshold, no further samples of that transcript are tested. |
fdr.threshold | The false discovery rate (FDR)-adjusted p-value threshold for determining the final count of outliers; default is 0.01. |
kmeans.nstart | The number of random starts when computing k-means fraction; default is 1. See kmeans(). |
Value
A list consisting of the following entries:
- p.values: a matrix of unadjusted p-values for the outlier test run on each transcript in data.
- fdr: a matrix of FDR-adjusted p-values for the outlier test run on each transcript in data.
- num.outliers: a vector giving the number of outliers detected for each transcript based on the threshold.
- outlier.test.results.list: a list of length max(num.outliers) + 1 containing entries roundN, where N is between one and max(num.outliers) + 1. roundN is the data frame of results for the outlier test after excluding the (N-1)th outlier sample, with round1 being for the original data set (i.e., before excluding any outlier samples).
- distributions: a numeric vector indicating the optimal distribution for each transcript. Possible values are 1 (normal), 2 (log-normal), 3 (exponential), and 4 (gamma).
- initial.screen.method: the statistical criterion for initial feature selection. Valid options are "p.value" and "fdr" (p-value used by default).
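The relationship between unadjusted p-values, FDR-adjusted p-values, and a threshold can be sketched with base R's p.adjust(). This is an illustration under assumptions: the four p-values are made up, the Benjamini-Hochberg adjustment is assumed, and how the package groups p-values before adjusting is not shown here.

```r
# Made-up unadjusted p-values for four transcripts in one testing round.
p.values <- c(T1 = 0.001, T2 = 0.02, T3 = 0.04, T4 = 0.5);

# Benjamini-Hochberg adjustment (method = 'fdr' is an alias for 'BH').
fdr <- p.adjust(p.values, method = 'fdr');

# Count how many adjusted values fall below the default fdr.threshold.
fdr.threshold <- 0.01;
num.below <- sum(fdr < fdr.threshold);
```

Here only T1 survives the 0.01 threshold after adjustment, although three of the four unadjusted p-values are below 0.05.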
Examples
data(outliers);
outliers.subset <- outliers[1:10,];
results <- detect.outliers(
    data = outliers.subset,
    num.null = 10
);
example.data.for.calculate.p.values
Description
Example data (list object) for testing calculate.p.values().
Usage
example.data.for.calculate.p.values
Format
An object of class list of length 13.
Identify optimal distribution of data
Description
Identify which of four distributions—normal, log-normal, exponential, or gamma—best fits the given data according to BIC.
Usage
identify.bic.optimal.data.distribution(x)
Arguments
x | A numeric vector. |
Value
A numeric code representing which distribution optimally fits x. Possible values are 1 = normal, 2 = log-normal, 3 = exponential, and 4 = gamma.
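Selection by BIC among these four candidates can be sketched in base R as follows. This is a hedged illustration, not the package's fitting code (the package imports gamlss): the helper sketch.bic.optimal.distribution() is hypothetical, fits are by maximum likelihood with optim() for the gamma case, and strictly positive data are assumed.

```r
# Hypothetical sketch (not part of OutSeekR): choose among normal,
# log-normal, exponential, and gamma by BIC. Assumes all x > 0.
sketch.bic.optimal.distribution <- function(x) {
    x <- x[!is.na(x)];
    n <- length(x);
    log.x <- log(x);
    # Closed-form maximum likelihood fits.
    ll.normal <- sum(dnorm(
        x, mean = mean(x), sd = sqrt(mean((x - mean(x))^2)), log = TRUE
    ));
    ll.lnorm <- sum(dlnorm(
        x, meanlog = mean(log.x),
        sdlog = sqrt(mean((log.x - mean(log.x))^2)), log = TRUE
    ));
    ll.exp <- sum(dexp(x, rate = 1 / mean(x), log = TRUE));
    # Gamma has no closed form: optimise on the log scale, starting
    # from method-of-moments estimates.
    start <- log(c(mean(x)^2 / var(x), mean(x) / var(x)));
    fit <- optim(
        par = start,
        fn = function(par) {
            -sum(dgamma(x, shape = exp(par[1]), rate = exp(par[2]), log = TRUE));
        }
    );
    ll.gamma <- -fit$value;
    # BIC = k * log(n) - 2 * log-likelihood, with k free parameters.
    k <- c(2, 2, 1, 2);
    bic <- k * log(n) - 2 * c(ll.normal, ll.lnorm, ll.exp, ll.gamma);
    names(bic) <- c('normal', 'log-normal', 'exponential', 'gamma');
    list(code = unname(which.min(bic)), bic = bic);
}

set.seed(1234);
x <- rgamma(n = 20, shape = 2, scale = 2);
result <- sketch.bic.optimal.distribution(x);
```

The code element uses the same 1 to 4 numbering as the return value described above; with only 20 observations the BIC values of the candidates can be close, so the selected code may vary from sample to sample.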
Examples
# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
);
identify.bic.optimal.data.distribution(
    x = x
);
Identify optimal distribution of residuals
Description
Identify which of four distributions—normal, log-normal, exponential, or gamma—best fits the given vector of residuals according to BIC.
Usage
identify.bic.optimal.residuals.distribution(x)
Arguments
x | A numeric vector. |
Value
A numeric code representing which distribution optimally fits x. Possible values are 1 = normal, 2 = log-normal, 3 = exponential, and 4 = gamma.
Examples
# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
);
identify.bic.optimal.residuals.distribution(
    x = x
);
k-means fraction
Description
Given a vector of cluster assignments from quantify.outliers() run with method = 'kmeans', compute the fraction of observations belonging to the smaller of the two clusters.
Usage
kmeans.fraction(x)
Arguments
x | A numeric vector. |
Details
This function only considers clusters 1 and 2 even if quantify.outliers() was run with exclude.zero = TRUE. In that case, zeros are effectively excluded from the counts used to define the k-means fraction. See examples.
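The calculation described above can be sketched in a few lines of base R. The helper sketch.kmeans.fraction() is hypothetical, and labelling excluded zeros with cluster 0 is an assumption made for illustration; only the counts of clusters 1 and 2 enter the fraction.

```r
# Hypothetical sketch (not part of OutSeekR): fraction of observations
# in the smaller of clusters 1 and 2; any other label is ignored.
sketch.kmeans.fraction <- function(x) {
    counts <- c(sum(x == 1, na.rm = TRUE), sum(x == 2, na.rm = TRUE));
    min(counts) / sum(counts);
}

# Two of ten observations are in the smaller cluster.
assignments <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2);
fraction <- sketch.kmeans.fraction(assignments);            # 0.2

# Assumed labelling: zeros carry the label 0 and drop out of the counts,
# so the fraction is two of eight.
with.zeros <- c(0, 0, 1, 1, 2, 2, 2, 2, 2, 2);
fraction.zeros <- sketch.kmeans.fraction(with.zeros);       # 0.25
```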
Value
A number.
Examples
x <- c(1, 1, 2, 2, 2, 2, 2, 2, 2, 2);
names(x) <- letters[1:length(x)];
kmeans.fraction(x);
Cosine similarity
Description
Compute cosine similarity for detection of outliers. Generate theoretical quantiles based on the optimal distribution of the data, and compute the cosine similarity between the point formed by the largest observed quantile and the largest theoretical quantile and a point on the line y = x.
Usage
outlier.detection.cosine(x, distribution)
Arguments
x | A numeric vector. |
distribution | A numeric code corresponding to the optimal distribution of x (1 = normal, 2 = log-normal, 3 = exponential, 4 = gamma). |
Value
A number.
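The geometry behind this statistic can be sketched without any extra packages. The helper sketch.cosine.to.identity() is hypothetical, and the gamma fit by the method of moments and the ppoints() plotting positions are assumptions for illustration; the package itself imports lsa and may compute these quantities differently.

```r
# Hypothetical sketch (not part of OutSeekR): cosine similarity between
# the point (largest theoretical quantile, largest observed quantile)
# and the direction of the line y = x.
sketch.cosine.to.identity <- function(observed.max, theoretical.max) {
    point <- c(theoretical.max, observed.max);
    reference <- c(1, 1);   # a point on y = x with positive coordinates
    sum(point * reference) / (sqrt(sum(point^2)) * sqrt(sum(reference^2)));
}

set.seed(1234);
x <- rgamma(n = 20, shape = 2, scale = 2);
# Assumed fit: gamma parameters estimated by the method of moments.
shape.hat <- mean(x)^2 / var(x);
rate.hat <- mean(x) / var(x);
theoretical <- qgamma(ppoints(length(x)), shape = shape.hat, rate = rate.hat);
similarity <- sketch.cosine.to.identity(
    observed.max = max(x),
    theoretical.max = max(theoretical)
);
```

A value of 1 means the largest observed and theoretical quantiles agree exactly; the further the largest observation departs from its theoretical counterpart, the further the similarity falls below 1.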
Examples
# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
);
outlier.detection.cosine(
    x = x,
    distribution = 4
);
Example data set for outlier testing
Description
Example data set for outlier testing
Usage
outliers
Format
A data frame with 500 rows and 50 columns:
- S01: simulated fragments per kilobase of transcript per million fragments mapped (FPKM) values for sample 1
- S02: simulated FPKM values for sample 2
- S03: simulated FPKM values for sample 3
- S04: simulated FPKM values for sample 4
- S05: simulated FPKM values for sample 5
- S06: simulated FPKM values for sample 6
- S07: simulated FPKM values for sample 7
- S08: simulated FPKM values for sample 8
- S09: simulated FPKM values for sample 9
- S10: simulated FPKM values for sample 10
- S11: simulated FPKM values for sample 11
- S12: simulated FPKM values for sample 12
- S13: simulated FPKM values for sample 13
- S14: simulated FPKM values for sample 14
- S15: simulated FPKM values for sample 15
- S16: simulated FPKM values for sample 16
- S17: simulated FPKM values for sample 17
- S18: simulated FPKM values for sample 18
- S19: simulated FPKM values for sample 19
- S20: simulated FPKM values for sample 20
- S21: simulated FPKM values for sample 21
- S22: simulated FPKM values for sample 22
- S23: simulated FPKM values for sample 23
- S24: simulated FPKM values for sample 24
- S25: simulated FPKM values for sample 25
- S26: simulated FPKM values for sample 26
- S27: simulated FPKM values for sample 27
- S28: simulated FPKM values for sample 28
- S29: simulated FPKM values for sample 29
- S30: simulated FPKM values for sample 30
- S31: simulated FPKM values for sample 31
- S32: simulated FPKM values for sample 32
- S33: simulated FPKM values for sample 33
- S34: simulated FPKM values for sample 34
- S35: simulated FPKM values for sample 35
- S36: simulated FPKM values for sample 36
- S37: simulated FPKM values for sample 37
- S38: simulated FPKM values for sample 38
- S39: simulated FPKM values for sample 39
- S40: simulated FPKM values for sample 40
- S41: simulated FPKM values for sample 41
- S42: simulated FPKM values for sample 42
- S43: simulated FPKM values for sample 43
- S44: simulated FPKM values for sample 44
- S45: simulated FPKM values for sample 45
- S46: simulated FPKM values for sample 46
- S47: simulated FPKM values for sample 47
- S48: simulated FPKM values for sample 48
- S49: simulated FPKM values for sample 49
- S50: simulated FPKM values for sample 50
Compute quantities for outlier detection
Description
Compute quantities for use in the detection of outliers. Specifically, compute z-scores based on the mean / standard deviation, the trimmed mean / trimmed standard deviation, or the median / median absolute deviation, or the cluster assignment from k-means with two clusters.
Usage
quantify.outliers(
x,
method = "mean",
trim = 0,
nstart = 1,
exclude.zero = FALSE
)
Arguments
x | A numeric vector. |
method | A string indicating the quantities to be computed. Possible values are 'mean', 'median', and 'kmeans'; default is 'mean'. |
trim | A number, the fraction of observations to be trimmed from each end of x; default is 0. |
nstart | A number, for k-means clustering, the number of random initial centers for the clusters. Default is 1. |
exclude.zero | A logical, whether zeros should be excluded (TRUE) or not excluded (FALSE); default is FALSE. |
Value
A numeric vector the same size as x whose values are the requested quantities computed on the corresponding elements of x.
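For orientation, the four kinds of quantities can be approximated with base R alone. The sketch below is an illustration under assumptions, not the package's code: in particular, the rounding used for the trimmed subset and the use of mad() with its default scaling constant are guesses.

```r
set.seed(1234);
x <- rgamma(n = 20, shape = 2, scale = 2);

# z-scores from the mean and standard deviation.
z.mean <- (x - mean(x)) / sd(x);

# z-scores from the median and median absolute deviation
# (mad() applies its default 1.4826 scaling constant).
z.median <- (x - median(x)) / mad(x);

# z-scores from the 5% trimmed mean and trimmed standard deviation:
# location and scale come from the trimmed subset, but every
# observation still receives a z-score.
x.sorted <- sort(x);
k <- round(length(x) * 0.05);
x.trimmed <- x.sorted[(k + 1):(length(x) - k)];
z.trimmed <- (x - mean(x.trimmed)) / sd(x.trimmed);

# Cluster assignments from k-means with two clusters.
clusters <- kmeans(x, centers = 2, nstart = 1)$cluster;
```

Each result has one entry per element of x, which is what allows the range of the z-scores or the k-means fraction to be computed afterwards.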
Examples
# Generate fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
);
# Add missing values and zeros for demonstration. Missing values are
# ignored, and zeros can be ignored with `exclude.zero = TRUE`.
x[1:5] <- NA;
x[6:10] <- 0;
# Compute z-scores based on mean and standard deviation.
quantify.outliers(
    x = x,
    method = 'mean',
    trim = 0
);
# Exclude zeros from the calculation of the mean and standard
# deviation.
quantify.outliers(
    x = x,
    method = 'mean',
    trim = 0,
    exclude.zero = TRUE
);
# Compute z-scores based on the 5% trimmed mean and 5% trimmed
# standard deviation.
quantify.outliers(
    x = x,
    method = 'mean',
    trim = 0.05
);
# Compute z-scores based on the median and median absolute deviation.
quantify.outliers(
    x = x,
    method = 'median'
);
# Compute cluster assignments using k-means with k = 2.
quantify.outliers(
    x = x,
    method = 'kmeans'
);
# Try different initial cluster assignments.
quantify.outliers(
    x = x,
    method = 'kmeans',
    nstart = 10
);
# Assign zeros to their own cluster.
quantify.outliers(
    x = x,
    method = 'kmeans',
    exclude.zero = TRUE
);
Simulate from a null distribution
Description
Simulate transcripts from a specified null distribution.
Usage
simulate.null(x, x.distribution, r, r.distribution)
Arguments
x | A numeric vector of transcripts. |
x.distribution | A numeric code corresponding to the optimal distribution of x (1 = normal, 2 = log-normal, 3 = exponential, 4 = gamma). |
r | A numeric vector of residuals calculated for this transcript. |
r.distribution | A numeric code corresponding to the optimal distribution of r (1 = normal, 2 = log-normal, 3 = exponential, 4 = gamma). |
Value
A numeric vector of the same length as x. Names are not retained.
Examples
# Prepare fake data.
set.seed(1234);
x <- rgamma(
    n = 20,
    shape = 2,
    scale = 2
);
names(x) <- paste('Sample', seq_along(x), sep = '.');
x.dist <- identify.bic.optimal.data.distribution(
    x = x
);
r <- calculate.residuals(
    x = x,
    distribution = x.dist
);
r.trimmed <- trim.sample(
    x = r
);
r.dist <- identify.bic.optimal.residuals.distribution(
    x = r.trimmed
);
null <- simulate.null(
    x = x,
    x.distribution = x.dist,
    r = r.trimmed,
    r.distribution = r.dist
);
Trim a vector of numbers
Description
Symmetrically trim a vector of numbers after sorting it.
Usage
trim.sample(x, trim = 0.05)
Arguments
x | A numeric vector. |
trim | A number, the fraction of observations to be trimmed from each end of x; default is 0.05. |
Details
If length(x) <= 10, the function returns x[2:(length(x) - 1)].
Value
A sorted, trimmed copy of x.
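The trimming rule can be sketched as below. The helper sketch.trim() is hypothetical; the rounding of the number of trimmed observations, and applying the short-vector rule after sorting, are assumptions. Vectors with fewer than three elements are not handled.

```r
# Hypothetical sketch (not part of OutSeekR): sort, then drop the same
# number of observations from each end.
sketch.trim <- function(x, trim = 0.05) {
    x <- sort(x);
    n <- length(x);
    if (n <= 10) {
        # Short vectors: drop only the smallest and largest value.
        return(x[2:(n - 1)]);
    }
    k <- round(n * trim);   # assumed rounding rule
    x[(k + 1):(n - k)];
}

trimmed <- sketch.trim(x = 1:20, trim = 0.05);   # drops 1 and 20
short <- sketch.trim(x = c(5, 1, 3));            # keeps only the middle value
```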
Examples
trim.sample(
    x = 1:20,
    trim = 0.05
);
Range of z-scores
Description
Compute the range of a vector of z-scores.
Usage
zrange(x)
Arguments
x | A numeric vector. |
Value
A number.
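Read literally, the range of a vector of z-scores is the largest value minus the smallest. A minimal sketch, assuming missing values are ignored (the helper sketch.zrange() is hypothetical):

```r
# Hypothetical sketch (not part of OutSeekR): largest minus smallest
# z-score, ignoring missing values.
sketch.zrange <- function(z) {
    diff(range(z, na.rm = TRUE));
}

z <- c(-1.5, 0, NA, 2);
width <- sketch.zrange(z);   # 2 - (-1.5) = 3.5
```

A single extreme observation widens this range, which is why it is a natural summary to compare against null data.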
Examples
set.seed(1234);
x <- rnorm(
    n = 10
);
zrange(
    x = x
);