Type: | Package |
Title: | Training Set Determination for Genomic Selection |
Version: | 2.0 |
Date: | 2022-06-07 |
Description: | We propose an optimality criterion, the r-score, for determining the required training set. The r-score is derived directly from Pearson's correlation between the genomic estimated breeding values and the phenotypic values of the test set <doi:10.1007/s00122-019-03387-0>. This package provides two main functions to determine a good training set and its size. |
License: | GPL (≥ 3) |
Encoding: | UTF-8 |
Imports: | dplyr, ggplot2, latex2exp, lifecycle, parallel, Rcpp (≥ 1.0.8.3) |
LinkingTo: | Rcpp, RcppEigen |
RoxygenNote: | 7.2.0 |
URL: | https://github.com/oumarkme/TSDFGS |
BugReports: | https://github.com/oumarkme/TSDFGS/issues |
Depends: | R (≥ 2.10) |
LazyData: | true |
NeedsCompilation: | yes |
Packaged: | 2022-06-07 13:21:24 UTC; mark |
Author: | Jen-Hsiang Ou |
Maintainer: | Jen-Hsiang Ou <jen-hsiang.ou@imbim.uu.se> |
Repository: | CRAN |
Date/Publication: | 2022-06-07 14:00:11 UTC |
TSDFGS: Training Set Determination for Genomic Selection
Description
We propose an optimality criterion, the r-score, for determining the required training set. The r-score is derived directly from Pearson's correlation between the genomic estimated breeding values and the phenotypic values of the test set doi:10.1007/s00122-019-03387-0. This package provides two main functions to determine a good training set and its size.
Author(s)
Maintainer: Jen-Hsiang Ou jen-hsiang.ou@imbim.uu.se (ORCID)
Authors:
Po-Ya Wu Po-Ya.Wu@hhu.de (ORCID)
Chen-Tuo Liao ctliao@ntu.edu.tw (ORCID) [thesis advisor]
See Also
Useful links:
https://github.com/oumarkme/TSDFGS
Report bugs at https://github.com/oumarkme/TSDFGS/issues
Fit logistic growth curve model
Description
A function for fitting a logistic growth curve model.
Usage
FGCM(geno, nt = NULL, n_iter = NULL, multi.threads = TRUE)
Arguments
geno |
Genotype information stored as a data frame. Columns represent variants (SNPs or PCs). |
nt |
A numeric vector of training set sizes used to estimate the logistic growth curve parameters. |
n_iter |
Number of simulations for each training set size. A suitable number is chosen automatically by default. |
multi.threads |
Default: TRUE. Set to FALSE to run on a single thread. |
Value
Estimates of the parameters.
Examples
data(geno)
## Not run: FGCM(geno)
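As a rough illustration of what fitting a logistic growth curve involves, the sketch below fits one to simulated data with base R's nls() and SSlogis(). It assumes the curve relates training set size to a mean r-score. The data and the "true" parameter values are invented, and this is not necessarily how FGCM estimates its parameters internally.

```r
# Simulated (training set size, mean r-score) pairs; the "true" parameters
# below are invented for illustration only.
set.seed(42)
nt <- seq(20, 300, by = 20)
true_asym <- 0.8
true_xmid <- 100
true_scal <- 40
r_mean <- true_asym / (1 + exp((true_xmid - nt) / true_scal)) +
  rnorm(length(nt), sd = 0.01)

# SSlogis() is the self-starting logistic model from the stats package:
#   Asym / (1 + exp((xmid - x) / scal))
fit <- nls(r_mean ~ SSlogis(nt, Asym, xmid, scal))
est <- coef(fit)
print(round(est, 3))
```

The fitted Asym is the plateau of the curve, xmid is the training set size at which half of the plateau is reached, and scal controls how quickly the curve rises.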
Sample size determination for genomic selection
Description
This function generates an operating curve for sample size determination.
Usage
SSDFGS(geno, nt = NULL, n_iter = NULL, multi.threads = TRUE)
Arguments
geno |
A numeric data frame containing genotype information (columns: PCs; rows: samples). |
nt |
A numeric vector of training set sizes for r-score simulation. |
n_iter |
Number of iterations for estimating parameters. |
multi.threads |
Default: TRUE. When multi.threads = TRUE, 75% of the threads are used if the computer has more than 4 threads. |
Value
An operating curve and its information.
Author(s)
Jen-Hsiang Ou & Po-Ya Wu
Examples
data(geno)
## Not run: SSDFGS(geno)
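The thread rule described for multi.threads can be sketched as a small helper. pick_threads is a hypothetical name and not part of TSDFGS. Rounding down, and falling back to a single thread on machines with 4 or fewer threads, are assumptions that the documentation does not state.

```r
# Hypothetical helper (not part of TSDFGS): choose a worker count from the
# number of available threads.
pick_threads <- function(n_total, multi.threads = TRUE) {
  if (!multi.threads || is.na(n_total) || n_total <= 4) {
    return(1L)  # assumption: fall back to a single thread
  }
  as.integer(floor(0.75 * n_total))  # assumption: round down
}

n_total <- parallel::detectCores()  # may return NA on some platforms
pick_threads(n_total)
```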
CD-score
Description
This function calculates the CD-score doi:10.1186/1297-9686-28-4-359 for a given training set and test set.
Usage
cd_score(X, X0)
Arguments
X |
A numeric matrix. The training set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers). |
X0 |
A numeric matrix. The test set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers). |
Value
A floating-point number, CD score.
Author(s)
Jen-Hsiang Ou
Examples
data(geno)
## Not run: cd_score(geno[1:50, ], geno[51:100, ])
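The score functions accept either a genotype matrix or a principal component matrix. The sketch below shows one way to derive a principal component matrix from a simulated -1/0/1 genotype matrix with base R's prcomp() and split it by rows into training and test sets. The simulated data and the choice to keep all components are assumptions for illustration, and the package authors may have prepared their PCs differently.

```r
set.seed(1)
n <- 100  # samples
p <- 500  # markers
# Simulated genotype matrix coded as -1, 0, 1 (rows: samples; columns: markers)
G <- matrix(sample(c(-1, 0, 1), n * p, replace = TRUE), nrow = n, ncol = p)

# Principal component scores; prcomp() centers the columns by default
pcs <- prcomp(G)$x

X  <- pcs[1:50, ]    # training set rows
X0 <- pcs[51:100, ]  # test set rows; the comma makes this a row selection
dim(X)
dim(X0)
```

X and X0 share the same columns, so they can be passed as the training and test matrices of a score function such as cd_score(X, X0).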
Genotype information
Description
A PCA matrix of rice genotype information. These data were published by Zhao et al. (2011) doi:10.1038/ncomms1467.
Usage
geno
Format
A numeric matrix (PCA) with 404 rows (samples) and 404 columns (PCs).
Source
http://www.ricediversity.org/data/
Examples
data(geno)
Simulate r-scores of each training set size
Description
Calculate untargeted r-scores in parallel.
Usage
nt2r(geno, nt, n_iter = 30, multi.threads = TRUE)
Arguments
geno |
A numeric data frame of genotypes. Columns represent sites (genotypes coded as 1, 0, -1). |
nt |
Numeric. The training set size. |
n_iter |
Number of iterations (default: 30). |
multi.threads |
Default: TRUE |
Value
A vector containing the r-score of each iteration.
Examples
data(geno)
## Not run: nt2r(geno, 50)
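The general shape of such a simulation, one score per randomly drawn training set of size nt, can be sketched in base R. Everything here is hypothetical: toy_score is a placeholder criterion and not the r-score, simulate_scores is not a TSDFGS function, and scoring each training set against the remaining rows is an assumption about what "untargeted" means.

```r
set.seed(7)
n <- 60
pcs <- matrix(rnorm(n * 5), nrow = n)  # simulated stand-in for a PC matrix

# Hypothetical placeholder criterion, NOT the r-score: the negative distance
# between the column means of the training rows and of the remaining rows.
toy_score <- function(X, X0) {
  -sqrt(sum((colMeans(X) - colMeans(X0))^2))
}

# Draw n_iter random training sets of size nt and score each one.
simulate_scores <- function(geno, nt, n_iter = 30, score = toy_score) {
  vapply(seq_len(n_iter), function(i) {
    train <- sample(nrow(geno), nt)
    score(geno[train, , drop = FALSE], geno[-train, , drop = FALSE])
  }, numeric(1))
}

scores <- simulate_scores(pcs, nt = 20)
length(scores)
```

The result is a numeric vector with one entry per iteration, the same shape that nt2r is documented to return.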
Optimal training set determination
Description
This function determines an optimal training set.
Usage
optTrain(
geno,
cand,
n.train,
subpop = NULL,
test = NULL,
method = "rScore",
min.iter = NULL
)
Arguments
geno |
A numeric matrix of principal components (rows: individuals; columns: PCs). |
cand |
An integer vector giving the row indices in the geno matrix of the individuals that are candidates for the training set. |
n.train |
The size of the target training set. This can be determined with the help of the SSDFGS function provided in this package. |
subpop |
A character vector of sub-population group names. The algorithm ignores the population structure if this remains NULL. |
test |
An integer vector giving the row indices in the geno matrix of the individuals in the test set. The algorithm uses an untargeted method if this remains NULL. |
method |
Choices are rScore, PEV and CD. rScore will be used by default. |
min.iter |
The minimum number of iterations, which applies to all methods. One should always check whether the algorithm has converged. If this remains NULL, a minimum number of iterations is set based on the candidate and test set sizes. |
Value
This function returns three pieces of information: OPTtrain (a vector of the chosen optimal training set), TOPscore (the highest scores before each iteration), and ITERscore (the criterion score of each iteration).
Author(s)
Jen-Hsiang Ou
Examples
data(geno)
## Not run: optTrain(geno, cand = 1:404, n.train = 100)
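A targeted search needs disjoint candidate and test index vectors. The sketch below builds them for the 404 samples of geno and then calls optTrain only if the package is installed. The seed and the test set size of 50 are arbitrary choices for illustration, and the call may take a while to run.

```r
n <- 404  # number of samples (rows) in the geno data set
set.seed(1)

# Arbitrary split for illustration: 50 test rows, the rest are candidates
test <- sort(sample(n, 50))
cand <- setdiff(seq_len(n), test)

# Run the search only when TSDFGS is available
if (requireNamespace("TSDFGS", quietly = TRUE)) {
  data("geno", package = "TSDFGS")
  res <- TSDFGS::optTrain(geno, cand = cand, n.train = 100,
                          test = test, method = "rScore")
  str(res)  # inspect OPTtrain, TOPscore and ITERscore
}
```

Because test is supplied, the targeted method is used. Leaving test as NULL would select the untargeted method instead.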
PEV score
Description
This function calculates the prediction error variance (PEV) score doi:10.1186/s12711-015-0116-6 for a given training set and test set.
Usage
pev_score(X, X0)
Arguments
X |
A numeric matrix. The training set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers). |
X0 |
A numeric matrix. The test set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers). |
Value
A floating-point number, PEV score.
Author(s)
Jen-Hsiang Ou
Examples
data(geno)
## Not run: pev_score(geno[1:50, ], geno[51:100, ])
r-score
Description
This function calculates the r-score doi:10.1007/s00122-019-03387-0 for a given training set and test set.
Usage
r_score(X, X0)
Arguments
X |
A numeric matrix. The training set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers). |
X0 |
A numeric matrix. The test set genotypic information matrix can be given as a genotype matrix (coded as -1, 0, 1) or a principal component matrix (rows: samples; columns: markers). |
Value
A floating-point number, r-score.
Author(s)
Jen-Hsiang Ou
Examples
data(geno)
## Not run: r_score(geno[1:50, ], geno[51:100, ])
Sub-population information
Description
Sub-population information for the samples. These data were published by Zhao et al. (2011) doi:10.1038/ncomms1467.
Usage
subpop
Format
A character vector.
Source
http://www.ricediversity.org/data/
Examples
data(subpop)