Type: | Package |
Title: | Iterative Extrapolation of Species' Haplotype Accumulation Curves for Genetic Diversity Assessment |
Version: | 1.0.6-1 |
Date: | 2022-05-23 |
Author: | Jarrett D. Phillips [aut, cre], Steven H. French [ctb], Navdeep Singh [ctb] |
Maintainer: | Jarrett D. Phillips <phillipsjarrett1@gmail.com> |
Description: | Performs iterative extrapolation of species' haplotype accumulation curves using a nonparametric stochastic (Monte Carlo) optimization method for assessment of specimen sampling completeness based on the approach of Phillips et al. (2015) <doi:10.1515/dna-2015-0008>, Phillips et al. (2019) <doi:10.1002/ece3.4757> and Phillips et al. (2020) <doi:10.7717/peerj-cs.243>. 'HACSim' outputs a number of useful summary statistics of sampling coverage ("Measures of Sampling Closeness"), including an estimate of the likely required sample size (along with desired level confidence intervals) necessary to recover a given number/proportion of observed unique species' haplotypes. Any genomic marker can be targeted to assess likely required specimen sample sizes for genetic diversity assessment. The method is particularly well-suited to assess sampling sufficiency for DNA barcoding initiatives. Users can also simulate their own DNA sequences according to various models of nucleotide substitution. A Shiny app is also available. |
License: | GPL-3 |
URL: | <https://github.com/jphill01/HACSim.R> <https://github.com/jphill01/HACSim-RShiny-App> <https://jphill01.shinyapps.io/HACSim> |
NeedsCompilation: | yes |
Imports: | ape (≥ 5.3), data.table (≥ 1.12.8), graphics (≥ 3.6.1), matrixStats (≥ 0.56.0), pegas (≥ 0.13), Rcpp (≥ 1.0.3), shiny (≥ 1.6.0), stats (≥ 3.6.1), utils (≥ 3.6.1) |
LinkingTo: | Rcpp, RcppArmadillo |
RoxygenNote: | 6.1.1 |
Packaged: | 2022-06-13 00:18:40 UTC; jarrettphillips |
Repository: | CRAN |
Date/Publication: | 2022-06-13 06:50:16 UTC |
Iterative Extrapolation of Species' Haplotype Accumulation Curves for Genetic Diversity Assessment
Description
HACSim (Haplotype Accumulation Curve Simulator) employs a novel nonparametric stochastic (Monte Carlo) optimization method of iteratively generating species' haplotype accumulation curves through extrapolation to assess sampling completeness based on the approach outlined in Phillips et al. (2015) <doi:10.1515/dna-2015-0008>, Phillips et al. (2019) <doi:10.1002/ece3.4757> and Phillips et al. (2020) <doi: 10.7717/peerj-cs.243>. HACSim outputs a number of useful summary statistics of sampling coverage ("Measures of Sampling Closeness"), including an estimate of the likely required sample size (along with desired level confidence intervals) necessary to recover a given number/proportion of observed unique species' haplotypes. Any genomic marker can be targeted to assess likely required specimen sample sizes for genetic diversity assessment. The method is particularly well-suited to assess sampling sufficiency for DNA barcoding initiatives. Users can also simulate their own DNA sequences according to various models of nucleotide substitution. A Shiny app is also available.
Details
The DESCRIPTION file:
Package: | HACSim |
Type: | Package |
Title: | Iterative Extrapolation of Species' Haplotype Accumulation Curves for Genetic Diversity Assessment |
Version: | 1.0.6-1 |
Date: | 2022-05-23 |
Author: | Jarrett D. Phillips [aut, cre], Steven H. French [ctb], Navdeep Singh [ctb] |
Maintainer: | Jarrett D. Phillips <phillipsjarrett1@gmail.com> |
Description: | Performs iterative extrapolation of species' haplotype accumulation curves using a nonparametric stochastic (Monte Carlo) optimization method for assessment of specimen sampling completeness based on the approach of Phillips et al. (2015) <doi:10.1515/dna-2015-0008>, Phillips et al. (2019) <doi:10.1002/ece3.4757> and Phillips et al. (2020) <doi: 10.7717/peerj-cs.243>. 'HACSim' outputs a number of useful summary statistics of sampling coverage ("Measures of Sampling Closeness"), including an estimate of the likely required sample size (along with desired level confidence intervals) necessary to recover a given number/proportion of observed unique species' haplotypes. Any genomic marker can be targeted to assess likely required specimen sample sizes for genetic diversity assessment. The method is particularly well-suited to assess sampling sufficiency for DNA barcoding initiatives. Users can also simulate their own DNA sequences according to various models of nucleotide substitution. A Shiny app is also available. |
License: | GPL-3 |
URL: | <https://github.com/jphill01/HACSim.R> <https://github.com/jphill01/HACSim-RShiny-App> <https://jphill01.shinyapps.io/HACSim> |
NeedsCompilation: | yes |
Imports: | ape (>= 5.3), data.table (>= 1.12.8), graphics (>= 3.6.1), matrixStats (>= 0.56.0), pegas (>= 0.13), Rcpp (>= 1.0.3), shiny (>= 1.6.0), stats (>= 3.6.1), utils (>= 3.6.1) |
LinkingTo: | Rcpp, RcppArmadillo |
RoxygenNote: | 6.1.1 |
Packaged: | 2019-10-23 16:00:58 UTC; jarrettphillips |
Index of help topics:
HAC.sim Internal R code HAC.simrep Run a simulation of haplotype accumulation curves for hypothetical or real species HACClass Internal R code HACHypothetical Set up an object to simulate haplotype accumulation curves for a hypothetical species HACReal Set up an object to simulate haplotype accumulation curves for a real species HACSim-package Iterative Extrapolation of Species' Haplotype Accumulation Curves for Genetic Diversity Assessment accumulate Internal C++ code envr Simulation variable storage environment launchApp Launch HACSim R Shiny web app sim.seqs Simulate DNA sequences according to DNA substitution models
Author(s)
Jarrett D. Phillips [aut, cre], Steven H. French [ctb], Navdeep Singh [ctb]
Maintainer: Jarrett D. Phillips <phillipsjarrett1@gmail.com>
References
Phillips, J.D., Gwiazdowski, R.A., Ashlock, D. and Hanner, R. (2015). An exploration of sufficient sampling effort to describe intraspecific DNA barcode haplotype diversity: examples from the ray-finned fishes (Chordata: Actinopterygii). DNA Barcodes, 3: 66-73.
Phillips, J.D., Gillis, D.J. and Hanner, R.H. (2019). Incomplete estimates of genetic diversity within species: Implications for DNA barcoding. Ecology and Evolution, 9(5): 2996-3010.
Phillips, J.D., Gillis, D.J. and Hanner, R.H. (2020). HACSim: An R package to estimate intraspecific sample sizes for genetic diversity assessment using haplotype accumulation curves. PeerJ Computer Science
Examples
## Simulate hypothetical species ##
N <- 100 # total number of sampled individuals
Hstar <- 10 # total number of haplotypes
probs <- rep(1/Hstar, Hstar) # equal haplotype frequency distribution
HACSObj <- HACHypothetical(N = N, Hstar = Hstar,
probs = probs, filename = "output") # outputs a CSV
# file called "output.csv"
## Simulate hypothetical species - subsampling ##
HACSObj <- HACHypothetical(N = N, Hstar = Hstar,
probs = probs, perms = 1000, p = 0.95,
subsample = TRUE, prop = 0.25, conf.level = 0.95,
filename = "output")
## Simulate hypothetical species and all paramaters changed - subsampling ##
HACSObj <- HACHypothetical(N = N, Hstar = Hstar, probs = probs,
perms = 10000, p = 0.90, subsample = TRUE, prop = 0.15,
conf.level = 0.95, filename = "output")
HAC.simrep(HACSObj) # runs a simulation
## Simulate real species ##
## Not run:
## Simulate real species ##
# outputs file called "output.csv"
HACSObj <- HACReal(filename = "output")
## Simulate real species - subsampling ##
HACSObj <- HACReal(subsample = TRUE, prop = 0.15,
conf.level = 0.95, filename = "output")
## Simulate real species and all parameters changed - subsampling ##
HACSObj <- HACReal(perms = 10000, p = 0.90, subsample = TRUE,
prop = 0.15, conf.level = 0.99, filename = "output")
# user prompted to select appropriate FASTA file
HAC.simrep(HACSObj)
## End(Not run)
## Not run:
## Simulate DNA sequences ##
num.seqs <- 100 # number of DNA sequences
num.haps <- 15 # number of haplotypes
length.seqs <- 658 # length of DNA sequences
count.haps <- c(60, rep(10, 2), rep(5, 2), rep(1, 5)) # haplotype frequency distribution
nucl.freqs <- rep(0.25, 4) # nucleotide frequency distribution
subst.model <- "JC69" # desired nucleotide substitution model
mu.rate <- 1e-3 # mutation rate
transi.rate <- NULL # transition rate
transv.rate <- NULL # transversion rate
sim.seqs(num.seqs = num.seqs, num.haps = num.haps, length.seqs = length.seqs,
nucl.freqs = nucl.freqs, count.haps = count.haps, subst.model = subst.model,
transi.rate = transi.rate, transv.rate = transv.rate)
# outputs file called "output.csv"
HACSObj <- HACReal(filename = "output")
## Simulate DNA sequences - subsampling ##
HACSObj <- HACReal(subsample = TRUE, prop = 0.15,
conf.level = 0.95, filename = "output")
## Simulate DNA sequences and all parameters changed - subsampling ##
HACSObj <- HACReal(perms = 10000, p = 0.90, subsample = TRUE,
prop = 0.15, conf.level = 0.99, filename = "output")
# user prompted to select appropriate FASTA file
HAC.simrep(HACSObj)
## End(Not run)
Internal R code
Description
HAC.sim
comprises internal R code used by HAC.simrep
and is not directly called by the user.
Run a simulation of haplotype accumulation curves for hypothetical or real species
Description
Runs the HACSim
algorithm by successively calling
HAC.sim
to iteratively extrapolate haplotype accumulation curves to
determine likely specimen sample sizes for hypothetical or real species
The algorithm employs the following iterative methods when calculating the "Measures of Sampling Closeness":
Mean number of haplotype sampled:
H_i
Mean number of haplotypes not sampled
H^* - H_i
Proportion of haplotypes sampled:
\frac{H_i}{H^*}
Proportion of haplotypes not sampled:
1 - \frac{H_i}{H^*}
Mean value of
N*
:\frac{N_iH^*}{H_i}
Mean number of specimens not sampled:
\frac{N_iH^*}{H_i} - N_i
where H_i
is stochastically-determined through sampling from
probs
, the observed species' haplotype frequency distribution vector.
As the algorithm proceeds, H_i
will approach H^*
asymptotically
(and hence, N_i
will converge to N^*
), but will likely fluctuate
randomly from one iteration to the next. However, estimates of N^*
found
at each iteration will be monotonically-increasing.
Usage
HAC.simrep(HACSObject)
Arguments
HACSObject |
object containing the desired simulation parameters |
Value
Iteration results are outputted to the console and graphs displayed in
the plot window. Plots depict haplotype accumulation (along with shaded
confidence intervals for the mean number of haplotypes found). Dashed lines
correspond to the endpoint of the curve and reflect haplotype recovery for a
user-defined cutoff (default p
= 0.95, 95% haplotype diversity). Output
from the first iteration is useful for judging levels of haplotype diversity
and recovery found in observed intraspecific sequence datasets, reflecting
current sampling depth. The required sample size is displayed in the second-
last iteration. All other information corresponding to the extrapolated sample
size can be found in the last iteration. Iteration results can optionally be
saved to a CSV file. Subsampled DNA sequences are automatically saved to a
FASTA file.
Note
When simulating real species via HACReal(...)
, a pop-up window will appear prompting the user to select an intraspecific FASTA file of
aligned/trimmed DNA sequences. The alignment must not contain missing or
ambiguous nucleotides (i.e., it should only contain A, C, G or T); otherwise, haplotype diversity may be overestimated. Excluding sequences or alignment
sites with missing/ambiguous data is an option.
Examples
## Simulate hypothetical species ##
N <- 100 # total number of sampled individuals
Hstar <- 10 # total number of haplotypes
probs <- rep(1/Hstar, Hstar) # equal haplotype frequency distribution
HACSObj <- HACHypothetical(N = N, Hstar = Hstar , probs = probs,
filename = "output") # outputs a CSV file called "output.csv"
## Simulate hypothetical species - subsampling ##
HACSObj <- HACHypothetical(N = N, Hstar = Hstar, probs = probs,
perms = 1000, p = 0.95, subsample = TRUE, prop = 0.25,
conf.level = 0.95, filename = "output")
## Simulate hypothetical species and all paramaters changed - subsampling ##
HACSObj <- HACHypothetical(N = N, Hstar = Hstar, probs = probs,
perms = 10000, p = 0.90, subsample = TRUE, prop = 0.15, conf.level = 0.95,
filename = "output")
try(HAC.simrep(HACSObj)) # runs a simulation
## Simulate real species ##
## Not run:
## Simulate real species ##
# outputs file called "output.csv"
HACSObj <- HACReal(filename = "output")
## Simulate real species - subsampling ##
HACSObj <- HACReal(subsample = TRUE, prop = 0.15, conf.level = 0.95,
filename = "output")
## Simulate real species and all parameters changed - subsampling ##
HACSObj <- HACReal(perms = 10000, p = 0.90, subsample = TRUE,
prop = 0.15, conf.level = 0.99, filename = "output")
# user prompted to select appropriate FASTA file
try(HAC.simrep(HACSObj))
## End(Not run)
Internal R code
Description
HACClass
comprises internal R code used to generate an
object used by HAC.simrep
. It is not directly called by the user.
Set up an object to simulate haplotype accumulation curves for a hypothetical species
Description
Helper function which creates an object containing necessary information to run a simulation of haplotype accumulation for a hypothetical species of interest
Usage
HACHypothetical(N, Hstar, probs, perms = 10000, p = 0.95,
conf.level = 0.95, ci.type = "quantile", subsample = FALSE, prop = NULL,
progress = TRUE, num.iters = NULL, filename = NULL)
Arguments
N |
Number of individuals |
Hstar |
Number of unique species' haplotypes |
probs |
Haplotype frequency distribution vector |
perms |
Number of permutations (replications) |
p |
Proportion of haplotypes to recover |
conf.level |
Desired confidence level for graphical output and interval estimation |
ci.type |
Type of confidence interval for graphical output. Choose from "quantile" or "asymptotic" |
subsample |
Is a subsample of haplotype labels desired? |
prop |
If subsample = TRUE, the proportion of haplotype labels to subsample |
num.iters |
Number of iterations to compute |
progress |
Should iteration output be printed to the R console? |
filename |
Name of file where simulation results are to be saved |
Value
A list object of class "HAC" with 13 elements that can be passed to
HAC.simrep
as follows:
input.seqs |
Should a FASTA file of aligned/trimmed DNA sequences be inputted? Default is FALSE |
subset.seqs |
Should a subsample of DNA sequences be taken? Default is FALSE |
prop.seqs |
Proportion of DNA sequences to subsample. Default is NULL |
prop.haps |
Proportion of haplotype labels to subsample. Default is NULL (can be altered by user) |
subset.haps |
Should a subsample of haplotype labels be taken? Default is NULL (can be altered by user) |
N |
Number of individuals. NA by default (provided by user) |
Hstar |
Number of unique species' haplotypes. NA by default (provided by user) |
probs |
Haplotype frequency distribution vector. NA by default (provided by user) |
p |
Proportion of haplotypes to recover. |
perms |
Number of permutations (replications). |
conf.level |
Desired confidence level for graphical output and interval estimation. |
ci.type |
Type of confidence interval for graphical output.
|
num.iters |
Number of iterations to compute. |
progress |
Should iteration output be printed to the R console? Default is TRUE |
filename |
Name of file where simulation results are to be saved. |
Note
N
must be greater than 1 and greater than or equal to Hstar
.
Hstar
must be greater than 1.
probs
must have a length equal to Hstar
and its elements must sum
to 1.
Examples
## Simulate hypothetical species ##
N <- 100 # total number of sampled individuals
Hstar <- 10 # total number of haplotypes
probs <- rep(1/Hstar, Hstar) # equal haplotype frequency distribution
# outputs a CSV file called "output.csv"
HACSObj <- HACHypothetical(N = N, Hstar = Hstar, probs = probs,
filename = "output")
## Simulate hypothetical species - subsampling ##
# subsamples 25% of haplotype labels
HACSObj <- HACHypothetical(N = N, Hstar = Hstar, probs = probs,
perms = 1000, p = 0.95, subsample = TRUE, prop = 0.25,
conf.level = 0.95, filename = "output")
## Simulate hypothetical species and all paramaters changed - subsampling ##
HACSObj <- HACHypothetical(N = N, Hstar = Hstar, probs = probs,
perms = 10000, p = 0.90, subsample = TRUE, prop = 0.15, conf.level = 0.95,
num.iters = 1, filename = "output")
Set up an object to simulate haplotype accumulation curves for a real species
Description
Helper function which creates an object containing necessary information to run a simulation of haplotype accumulation for a real species of interest
Usage
HACReal(perms = 10000, p = 0.95, conf.level = 0.95,
ci.type = "quantile", subsample = FALSE, prop = NULL,
progress = TRUE, num.iters = NULL, filename = NULL)
Arguments
perms |
Number of permutations (replications) |
p |
Proportion of haplotypes to recover |
conf.level |
Desired confidence level for graphical output and interval estimation |
ci.type |
Type of confidence interval for graphical output. Choose from "quantile" or "asymptotic" |
subsample |
Is a subsample of DNA sequences desired? |
prop |
If subsample = TRUE, the proportion of DNA sequences to subsample |
num.iters |
Number of iterations to compute |
progress |
Should iteration output be printed to the R console? |
filename |
Name of file where simulation results are to be saved |
Value
A list object of class "HAC" with 13 elements that can be passed to
HAC.simrep
as follows:
input.seqs |
Should a FASTA file of aligned/trimmed DNA sequences be inputted? Default is TRUE |
subset.seqs |
Should a subsample of DNA sequences be taken? Default is FALSE (can be altered by user) |
prop.seqs |
Proportion of DNA sequences to subsample. Default is NA (can be altered by user) |
prop.haps |
Proportion of haplotype labels to subsample. Default is NULL |
subset.haps |
Should a subsample of haplotype labels be taken? Default is NULL |
N |
Number of individuals. NA by default (computed automatically by algorithm) |
Hstar |
Number of unique species' haplotypes. NA by default (computed automatically by algorithm) |
probs |
Haplotype frequency distribution vector. NA by default (computed automatically by algorithm) |
p |
Proportion of haplotypes to recover. |
perms |
Number of permutations (replications). |
conf.level |
Desired confidence level for graphical output and interval estimation. |
ci.type |
Type of confidence interval for graphical output.
|
num.iters |
Number of iterations to compute. |
progress |
Should iteration output be printed to the R console? Default is TRUE |
filename |
Name of file where simulation results are to be saved. |
Examples
## Simulate real species ##
# outputs file called "output.csv"
HACSObj <- HACReal(filename = "output")
## Simulate real species - subsampling ##
# subsamples 25% of DNA sequences
HACSObj <- HACReal(subsample = TRUE, prop = 0.25, conf.level = 0.95,
filename = "output")
## Simulate real species and all parameters changed - subsampling ##
HACSObj <- HACReal(perms = 10000, p = 0.90, subsample = TRUE,
prop = 0.15, conf.level = 0.99, num.iters = 1, filename = "output")
Internal C++ code
Description
accumulate
comprises internal C++ code employed by
HAC.sim
. It is not directly called by the user.
Simulation variable storage environment
Description
envr
is a new (initially empty) environment that is created
when HACSim
is loaded.
Value
When a simulation is run via HAC.simrep
, envr
will contain
26 elements as follows:
ci.type |
Type of confidence interval to compute and plot. Default is
|
conf.level |
The desired confidence level. Default is
|
d |
A dataframe with |
df.out |
A dataframe with |
filename |
The name of the file where results are to be saved. Default is NULL. |
Hstar |
Number of unique species' haplotypes |
input.seqs |
Should DNA sequences be inputted? Default is FALSE. |
iters |
The number of iterations required to reach convergence |
N |
The starting sample size used to initialize the algorithm |
Nstar |
The final (extrapolated) sample size |
Nstar.high |
The upper endpoint of the desired level confidence interval for the 'true' required sample size |
Nstar.low |
The lower endpoint of the desired level confidence interval for the 'true' required sample size |
num.iters |
Number of iterations to compute. |
p |
The user-specified level of haplotype recovery. Default is
|
perms |
The user-specified number of permutations (replications). Default
is |
probs |
Haplotype frequency distribution vector |
progress |
Should iteration results be outputted to the console? Default is TRUE. |
prop.haps |
If |
prop.seqs |
If |
ptm |
A timer to track progress of the algorithm in seconds |
R |
The proportion of haplotypes recovered by the algorithm |
R.low |
The lower endpoint of the desired level confidence interval for the 'true' fraction of haplotypes captured |
R.up |
The upper endpoint of the desired level confidence interval for the 'true' fraction of haplotypes captured |
subset.haps |
Should a subsample of haplotype labels be taken? Default is FALSE. |
subset.seqs |
Should a subsample of DNA sequences be taken? Default is FALSE. |
X |
Mean number of specimens not sampled |
Examples
# Returns the frequencies of each haplotype in the extrapolated sample
max(envr$d$specs) * envr$probs
# Returns the extrapolated sample size corresponding to the dotted line
# in the last iteration plot
envr$d[which(envr$d$means >= envr$p * envr$Hstar), ][1, 1]
Launch HACSim R Shiny web app
Description
Launch HACSim R Shiny web app locally on a user's R session
Usage
launchApp()
Simulate DNA sequences according to DNA substitution models
Description
Simulates DNA sequences according to various DNA substitution models:
Jukes-Cantor (1969)
Kimura (1980)
Felsenstein (1981)
Hasegawa-Kishino-Yano (1985)
Output can then be passed to HACReal()
.
Usage
sim.seqs(num.seqs, num.haps, length.seqs, count.haps, nucl.freqs,
codon.tbl = c("standard", "vertebrate mitochondrial",
"invertebrate mitochondrial"), subst.model = c("JC69", "K80", "F81", "HKY85"),
mu.rate, transi.rate, transv.rate)
Arguments
num.seqs |
Number of simulated DNA sequences |
num.haps |
Number of simulated unique species' haplotypes |
length.seqs |
Basepair length of DNA sequences |
count.haps |
Haplotype frequency distribution vector |
nucl.freqs |
Nucleotide frequency distribution vector of A, C, G, and T respectively |
codon.tbl |
Codon table |
subst.model |
Model of DNA substitution |
mu.rate |
Overall nucleotide mutation rate/site/generation |
transi.rate |
Nucleotide transition rate/site/generation |
transv.rate |
Nucleotide transversion rate/site/generation |
Value
A FASTA file of DNA sequences
Note
num.seqs
must be greater than or equal to num.haps
.
Both num.seqs
and num.haps
must be greater than 1.
nucl.freqs
must have a length of four and its elements must sum to 1.
count.haps
must have a length of num.haps
and its elements must
sum to num.seqs
.
subst.model
must be one of "JC69" (Jukes Cantor corrected p-distance),
"K80" (Kimura-2-Parameter (K2P), "F81" (Felenstein) or "HKY85"
(Hasegawa-Kishino-Yano)
mu.rate
must be specified for both "JC69" and "F81" models
transi.rate
and transv.rate
must be specified for both "K80"
and "HKY85" models
All elements nucl.freqs
must be equal to 0.25 when subst.model
is either "JC69" or "K80"
All elements nucl.freqs
must differ from 0.25 when subst.model
is either "F81" or "HKY85"
Examples
## Not run:
# Simulate DNA sequences from the 5'-COI DNA barcode region under a Jukes
# Cantor nucleotide substitution model
num.seqs <- 100 # number of DNA sequences
num.haps <- 10 # number of haplotypes
length.seqs <- 658 # length of DNA sequences
count.haps <- c(65, rep(10, 2), rep(5, 2), rep(1, 5)) # haplotype frequency distribution
nucl.freqs <- rep(0.25, 4) # nucleotide frequency distribution
subst.model <- "JC69" # desired nucleotide substitution model
codon.tbl <- "vertebrate mitochondrial"
mu.rate <- 1e-3 # mutation rate
transi.rate <- NULL # transition rate
transv.rate <- NULL # transversion rate
sim.seqs(num.seqs = num.seqs, num.haps = num.haps, length.seqs = length.seqs,
count.haps = count.haps, nucl.freqs = nucl.freqs, codon.tbl = codon.tbl,
subst.model = subst.model, mu.rate = mu.rate, transi.rate = transi.rate,
transv.rate = transv.rate)
## End(Not run)