Title: | Match Cases to Controls Based on Genotype Principal Components |
Version: | 0.3.3 |
Maintainer: | Derek W. Brown <derek9@gwmail.gwu.edu> |
Description: | Matches cases to controls based on genotype principal components (PC). In order to produce better results, matches are based on the weighted distance of PCs where the weights are equal to the % variance explained by that PC. A weighted Mahalanobis distance metric (Kidd et al. (1987) <doi:10.1016/0031-3203(87)90066-5>) is used to determine matches. |
License: | MIT + file LICENSE |
URL: | https://github.com/machiela-lab/PCAmatchR |
BugReports: | https://github.com/machiela-lab/PCAmatchR/issues |
Encoding: | UTF-8 |
LazyData: | true |
Depends: | R (≥ 3.5.0) |
Suggests: | optmatch, testthat, knitr, rmarkdown, R.rsp |
VignetteBuilder: | R.rsp |
RoxygenNote: | 7.2.2 |
NeedsCompilation: | no |
Packaged: | 2022-11-30 12:24:40 UTC; myersta |
Author: | Derek W. Brown |
Repository: | CRAN |
Date/Publication: | 2022-12-01 00:20:06 UTC |
First 20 principal components of 2504 individuals from the 1000 Genome Project
Description
A sample dataset containing information about population, gender, and the first 20 principal components calculated from 2504 individuals in the Phase 3 data release of the 1000 Genomes Project. The principal component analysis was conducted using PLINK.
Usage
PCs_1000G
Format
A data frame with 2504 rows and 24 variables:
- sample
sample ID number
- pop
three letter designation of 1000 Genomes reference population
- super_pop
three letter designation of 1000 Genomes reference super population
- gender
gender of individual
- PC1
principal component 1
- PC2
principal component 2
- PC3
principal component 3
- PC4
principal component 4
- PC5
principal component 5
- PC6
principal component 6
- PC7
principal component 7
- PC8
principal component 8
- PC9
principal component 9
- PC10
principal component 10
- PC11
principal component 11
- PC12
principal component 12
- PC13
principal component 13
- PC14
principal component 14
- PC15
principal component 15
- PC16
principal component 16
- PC17
principal component 17
- PC18
principal component 18
- PC19
principal component 19
- PC20
principal component 20
Source
https://www.internationalgenome.org
Examples
head(PCs_1000G)
genome_PC <- PCs_1000G
# Create PCs
PC <- as.data.frame(genome_PC[,c(1,5:24)])
head(PC)
First 20 eigenvalues of 2504 individuals from the 1000 Genome Project
Description
A sample dataset containing the first 20 eigenvalues calculated from 2504 individuals in the Phase 3 data release of the 1000 Genomes Project. The principal component analysis was conducted using PLINK.
Usage
eigenvalues_1000G
Format
A data frame with 20 rows and 1 variable:
- eigen_values
calculated eigenvalues
Source
Machiela Lab
Examples
eigenvalues_1000G
genome_values <- eigenvalues_1000G
values <- c(genome_values)$eigen_values
All eigenvalues of 2504 individuals from the 1000 Genome Project
Description
A sample dataset containing all the eigenvalues calculated from 2504 individuals in the Phase 3 data release of the 1000 Genomes Project. The principal component analysis was conducted using PLINK.
Usage
eigenvalues_all_1000G
Format
A data frame with 2504 rows and 1 variable:
- eigen_values
calculated eigenvalues
Source
Machiela Lab
Examples
eigenvalues_all_1000G
genome_values <- eigenvalues_all_1000G
values <- c(genome_values)$eigen_values
Weighted matching of controls to cases using PCA results.
Description
Weighted matching of controls to cases using PCA results.
Usage
match_maker(
PC = NULL,
eigen_value = NULL,
data = NULL,
ids = NULL,
case_control = NULL,
num_controls = 1,
num_PCs = NULL,
eigen_sum = NULL,
exact_match = NULL,
weight_dist = TRUE,
weights = NULL
)
Arguments
PC |
Individual level principal component. |
eigen_value |
Computed eigenvalue for each PC. Used as the numerator to calculate the percent variance explained by each PC. |
data |
Dataframe containing id and case/control status. Optionally includes covariate data for exact matching. |
ids |
The unique id variable contained in both "PC" and "data." |
case_control |
The case control status variable. |
num_controls |
The number of controls to match to each case. Default is 1:1 matching. |
num_PCs |
The total number of PCs calculated within the PCA. Can be used as the denomiator to calculate the percent variance explained by each PC. Default is 1000. |
eigen_sum |
The sum of all possible eigenvalues within the PCA. Can be used as the denomiator to calculate the percent variance explained by each PC. |
exact_match |
Optional variables contained in the dataframe on which to perform exact matching (i.e. sex, race, etc.). |
weight_dist |
When set to true, matches are produced based on PC weighted Mahalanobis distance. Default is TRUE. |
weights |
Optional user defined weights used to compute the weighted Mahalanobis distance metric. |
Value
A list of matches and weights.
Examples
# Create PC data frame by subsetting provided example dataset
pcs <- as.data.frame(PCs_1000G[,c(1,5:24)])
# Create eigenvalues vector using example dataset
eigen_vals <- c(eigenvalues_1000G)$eigen_values
# Create full eigenvalues vector using example dataset
all_eigen_vals<- c(eigenvalues_all_1000G)$eigen_values
# Create Covarite data frame
cov_data <- PCs_1000G[,c(1:4)]
# Generate a case status variable using ESN 1000 Genome population
cov_data$case <- ifelse(cov_data$pop=="ESN", c(1), c(0))
# With 1 to 1 matching
if(requireNamespace("optmatch", quietly = TRUE)){
library(optmatch)
match_maker(PC = pcs,
eigen_value = eigen_vals,
data = cov_data,
ids = c("sample"),
case_control = c("case"),
num_controls = 1,
eigen_sum = sum(all_eigen_vals),
weight_dist=TRUE
)
}
Function to plot matches from match_maker output
Description
Function to plot matches from match_maker output
Usage
plot_maker(
data = NULL,
x_var = NULL,
y_var = NULL,
case_control = NULL,
line = T,
...
)
Arguments
data |
match_maker output |
x_var |
Principal component 1 |
y_var |
Principal component 2 |
case_control |
Case or control status |
line |
draw line |
... |
Arguments passed to |
Value
None
Examples
# run match_maker()
# Create PC data frame by subsetting provided example dataset
pcs <- as.data.frame(PCs_1000G[,c(1,5:24)])
# Create eigenvalues vector using example dataset
eigen_vals <- c(eigenvalues_1000G)$eigen_values
# Create full eigenvalues vector using example dataset
all_eigen_vals<- c(eigenvalues_all_1000G)$eigen_values
# Create Covarite data frame
cov_data <- PCs_1000G[,c(1:4)]
# Generate a case status variable using ESN 1000 Genome population
cov_data$case <- ifelse(cov_data$pop=="ESN", c(1), c(0))
# With 1 to 1 matching
if(requireNamespace("optmatch", quietly = TRUE)){
library(optmatch)
match_maker_output<- match_maker(PC = pcs,
eigen_value = eigen_vals,
data = cov_data,
ids = c("sample"),
case_control = c("case"),
num_controls = 1,
eigen_sum = sum(all_eigen_vals),
weight_dist=TRUE
)
# run plot_maker()
plot_maker(data=match_maker_output,
x_var="PC1",
y_var="PC2",
case_control="case",
line=TRUE)
}