Title: | Joint Feature Screening via Sparse MLE |
Version: | 2.2-2 |
Description: | Feature screening is a powerful tool in processing ultrahigh dimensional data. It attempts to screen out most irrelevant features in preparation for a more elaborate analysis. Xu and Chen (2014)<doi:10.1080/01621459.2013.879531> proposed an effective screening method SMLE, which naturally incorporates the joint effects among features in the screening process. This package provides an efficient implementation of SMLE-screening for high-dimensional linear, logistic, and Poisson models. The package also provides a function for conducting accurate post-screening feature selection based on an iterative hard-thresholding procedure and a user-specified selection criterion. |
License: | GPL-3 |
Depends: | R(≥ 4.0.0) |
Imports: | glmnet, matrixcalc, mvnfast |
Encoding: | UTF-8 |
LazyData: | true |
RoxygenNote: | 7.2.3 |
NeedsCompilation: | no |
Author: | Qianxiang Zang [aut, cre], Chen Xu [aut], Kelly Burkett [aut] |
Maintainer: | Qianxiang Zang <SMLEmaintainer@gmail.com> |
Repository: | CRAN |
Packaged: | 2025-01-28 22:31:07 UTC; mac |
Date/Publication: | 2025-01-29 00:10:06 UTC |
Suggests: | knitr, rmarkdown, testthat (≥ 3.0.0) |
Config/testthat/edition: | 3 |
VignetteBuilder: | knitr |
Joint SMLE-screening for generalized linear models
Description
Feature screening is a powerful tool in processing ultrahigh dimensional data. It attempts to screen out most irrelevant features in preparation for a more elaborate analysis. This package provides an efficient implementation of SMLE-screening for linear, logistic, and Poisson models, where the joint effects among features are naturally incorporated in the screening process. The package also provides a function for conducting accurate post-screening feature selection based on an iterative hard-thresholding procedure and a user-specified selection criterion.
Details
Package: | smle |
Type: | Package |
Version: | 2.1-1 |
Date: | 2024-02-12 |
License: | GPL-3 |
Input an n \times 1 response vector Y and an n \times p predictor (feature) matrix X. The package outputs a set of k < n features that seem to be most relevant for joint regression. Moreover, the package provides a data simulator that generates synthetic datasets from high-dimensional GLMs, which accommodate both numerical and categorical features with commonly used correlation structures.
Key functions:
Gen_Data
SMLE
smle_select
Author(s)
Qianxiang Zang, Chen Xu, Kelly Burkett
Maintainer: Qianxiang Zang <qzang023@uottawa.ca>
References
Xu, C. and Chen, J. (2014)
The Sparse MLE for Ultrahigh-Dimensional Feature Screening
Journal of the American Statistical Association, 109(507), 1257–1269.
Friedman, J., Hastie, T. and Tibshirani, R. (2010)
Regularization Paths for Generalized Linear Models via Coordinate Descent
Journal of Statistical Software, 33(1), 1-22.
Examples
set.seed(1)
#Generate correlated data
Data <- Gen_Data(n = 200, p = 5000, correlation = "MA",family = "gaussian")
print(Data)
# joint feature screening via SMLE
fit <- SMLE(Y = Data$Y, X = Data$X, k = 10, family = "gaussian")
print(fit)
summary(fit)
plot(fit)
#Are there any features missed after screening?
setdiff(Data$subset_true, fit$ID_retained)
# Elaborative selection after screening
fit_s <- smle_select(fit, gamma_ebic = 0.5, vote = FALSE)
#Are there any features missed after selection?
setdiff(Data$subset_true, fit_s$ID_selected)
print(fit_s)
summary(fit_s)
plot(fit_s)
Data simulator for high-dimensional GLMs
Description
This function generates synthetic datasets from GLMs with a user-specified correlation structure. It permits both numerical and categorical features, whose quantity can be larger than the sample size.
Usage
Gen_Data(
n = 200,
p = 1000,
sigma = 1,
num_ctgidx = NULL,
pos_ctgidx = NULL,
num_truecoef = NULL,
pos_truecoef = NULL,
level_ctgidx = NULL,
effect_truecoef = NULL,
correlation = c("ID", "AR", "MA", "CS"),
rho = 0.2,
family = c("gaussian", "binomial", "poisson")
)
Arguments
n |
Sample size, number of rows for the feature matrix to be generated. |
p |
Number of columns for the feature matrix to be generated. |
sigma |
Parameter for noise level. |
num_ctgidx |
The number of features that are categorical. Set to |
pos_ctgidx |
Vector of indices denoting which columns are categorical. |
num_truecoef |
The number of features (columns) that affect response. Default is 5. |
pos_truecoef |
Vector of indices denoting which features (columns) affect the response variable. If not specified, positions are randomly sampled. See Details for more information. |
level_ctgidx |
Vector to indicate the number of levels for the categorical features in |
effect_truecoef |
Effect size corresponding to the features in |
correlation |
Correlation structure among features. |
rho |
Parameter controlling the correlation strength, default is |
family |
Model type for the response variable.
|
Details
Simulated data (y_i, x_i), where x_i = (x_{i1}, x_{i2}, ..., x_{ip}), are generated as follows.
First, we generate a p by 1 model coefficient vector beta with all entries being zero, except for the positions specified in pos_truecoef, on which effect_truecoef is used. When pos_truecoef is not specified, we randomly choose num_truecoef positions from the coefficient vector. When effect_truecoef is not specified, we randomly set the strength of the true model coefficients as follows:
(0.5+U) Z,
where U is sampled from a uniform distribution from 0 to 1, and Z takes the values 1 and -1 with equal probability, P(Z=1)=1/2, P(Z=-1)=1/2.
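The coefficient step can be illustrated with a short base-R sketch. It does not call the package: the helper name draw_beta is hypothetical, and the internals of Gen_Data() may differ from what is shown here.

```r
# Hypothetical helper (not part of the package): draw a sparse coefficient vector.
draw_beta <- function(p, num_truecoef = 5, pos_truecoef = NULL, effect_truecoef = NULL) {
  # Choose the non-zero positions at random when none are supplied.
  if (is.null(pos_truecoef)) pos_truecoef <- sort(sample.int(p, num_truecoef))
  # Default effect sizes: (0.5 + U) * Z with U ~ Uniform(0, 1) and Z = +1 or -1.
  if (is.null(effect_truecoef)) {
    U <- runif(length(pos_truecoef))
    Z <- sample(c(-1, 1), length(pos_truecoef), replace = TRUE)
    effect_truecoef <- (0.5 + U) * Z
  }
  beta <- numeric(p)
  beta[pos_truecoef] <- effect_truecoef
  beta
}

set.seed(1)
beta <- draw_beta(p = 1000)
which(beta != 0)   # positions of the non-zero coefficients
```

Every non-zero entry produced this way has absolute value between 0.5 and 1.5, so no true effect is arbitrarily close to zero.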
Next, we generate an n by p feature matrix X according to the structure selected with correlation, specified as follows.
Independent (ID): all features are independently generated from N(0, 1).
Moving average (MA): candidate features x_1,..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_{j-1}) = \rho, cov(x_j, x_{j-2}) = \rho/2 and cov(x_j, x_h) = 0 for |j-h| \geq 3.
Compound symmetry (CS): candidate features x_1,..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = \rho/2 if j, h are both in the set of important features and cov(x_j, x_h) = \rho when only one of j or h is in the set of important features.
Auto-regressive (AR): candidate features x_1,..., x_p are joint normal, marginally N(0, 1), with cov(x_j, x_h) = \rho^{|j-h|} for all j and h.
The correlation strength \rho is controlled by the argument rho.
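As a rough illustration of the AR and MA structures, the base-R sketch below builds the two covariance matrices and draws correlated features through a Cholesky factor. The helper names ar_cov, ma_cov and draw_features are hypothetical, the MA matrix takes all covariances at lags of three or more to be zero (an assumption of this sketch), and the package itself may generate the features differently.

```r
# Hypothetical helpers (not part of the package): covariance matrices for two structures.
ar_cov <- function(p, rho) rho^abs(outer(seq_len(p), seq_len(p), "-"))

ma_cov <- function(p, rho) {
  lag <- abs(outer(seq_len(p), seq_len(p), "-"))
  S <- diag(p)              # unit variances on the diagonal
  S[lag == 1] <- rho        # covariance at lag 1
  S[lag == 2] <- rho / 2    # covariance at lag 2; larger lags stay at zero
  S
}

# Draw n rows from N(0, Sigma): if Sigma = R'R (Cholesky), then Z %*% R has covariance Sigma.
draw_features <- function(n, Sigma) {
  p <- ncol(Sigma)
  matrix(rnorm(n * p), n, p) %*% chol(Sigma)
}

set.seed(1)
X <- draw_features(n = 200, Sigma = ma_cov(p = 50, rho = 0.2))
round(cor(X[, 1:4]), 2)   # sample correlations, to compare with rho and rho / 2
```

The Cholesky approach requires a positive definite matrix, which holds for the moderate value rho = 0.2 used here but is not guaranteed for every rho.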
Then, we generate the response variable Y according to its response type, which is controlled by the argument family.
For the Gaussian model, y_i = x_i\beta + \epsilon_i, where \epsilon_i is N(0, 1) for i from 1 to n.
For the binary model, let \pi_i = P(Y = 1|x_i). We sample y_i from Bernoulli(\pi_i), where logit(\pi_i) = x_i \beta.
Finally, for the Poisson model, y_i is generated from the Poisson distribution with mean \pi_i = exp(x_i\beta).
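The three response types can be sketched in base R as follows. The helper draw_response is hypothetical and uses unit-variance Gaussian noise as stated above; how Gen_Data() applies its sigma argument is not shown here.

```r
# Hypothetical helper (not part of the package): simulate a response from a GLM.
draw_response <- function(X, beta, family = c("gaussian", "binomial", "poisson")) {
  family <- match.arg(family)
  eta <- drop(X %*% beta)                    # linear predictor x_i beta
  n <- length(eta)
  switch(family,
    gaussian = eta + rnorm(n),               # y_i = x_i beta + epsilon_i
    binomial = rbinom(n, 1, plogis(eta)),    # logit(pi_i) = x_i beta
    poisson  = rpois(n, exp(eta))            # Poisson with mean exp(x_i beta)
  )
}

set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
beta <- c(1, -1, rep(0, 8))
y_bin <- draw_response(X, beta, "binomial")
table(y_bin)
```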
For more details see the reference below.
Value
call |
The call that produced this object. |
Y |
Response variable vector of length |
X |
Feature matrix or data.frame (matrix if |
subset_true |
Vector of column indices of X for the features that affect the response variables (relevant features). |
coef_true |
Vector of effects for the features that affect the response variables. |
categorical |
Logical flag whether the model contains categorical features. |
CI |
Indices of categorical features when |
rho, family, and correlation are returned as passed in the function call.
References
Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269
Examples
#Simulating data with binomial response and auto-regressive structure.
set.seed(1)
Data <- Gen_Data(n = 500, p = 2000, family = "binomial", correlation = "AR")
cor(Data$X[,1:5])
print(Data)
Joint feature screening via sparse maximum likelihood estimation for GLMs
Description
Input an n by 1 response Y and an n by p feature matrix X; the function uses SMLE to retain only a set of k<n features that seem to be most related to the response variable. It thus serves as a pre-processing step for a more elaborate analysis. In SMLE, the joint effects between features are naturally accounted for; this makes the screening more reliable. The function uses the efficient iterative hard thresholding (IHT) algorithm with the step parameter adaptively tuned for fast convergence. Users can choose to further conduct an elaborative selection after SMLE-screening. See smle_select() for more details.
Usage
SMLE(formula = NULL, ...)
## Default S3 method:
SMLE(
formula = NULL,
X = NULL,
Y = NULL,
data = NULL,
k = NULL,
family = c("gaussian", "binomial", "poisson"),
keyset = NULL,
intercept = TRUE,
categorical = TRUE,
group = TRUE,
codingtype = NULL,
coef_initial = NULL,
max_iter = 500,
tol = 10^(-3),
selection = FALSE,
standardize = TRUE,
fast = FALSE,
U = 1,
U_rate = 0.5,
penalize_mod = TRUE,
...
)
## S3 method for class 'formula'
SMLE(formula, data, k = NULL, keyset = NULL, categorical = NULL, ...)
Arguments
formula |
An object of class |
... |
Additional arguments to be passed to |
X |
The |
Y |
The response vector |
data |
An optional data frame, list or environment (or object coercible by |
k |
Total number of features (including |
family |
Model assumption between |
keyset |
A numeric vector with column indices for the key features that
do not participate in feature screening and are forced to remain in the model.
The column indices for the key features should be from |
intercept |
A logical flag to indicate whether an intercept is used in the model. An intercept will not participate in screening. |
categorical |
A logical flag for whether the input feature matrix includes
categorical features( either |
group |
Logical flag for whether to treat the dummy covariates of a
categorical feature as a group. (Only for categorical data, see Details).
Default is |
codingtype |
Coding types for categorical features; default is |
coef_initial |
A |
max_iter |
Maximum number of iteration steps. Default is 500. |
tol |
A tolerance level to stop the iterations, when the squared sum of
differences between two successive coefficient updates is below it.
Default is |
selection |
A logical flag to indicate whether an elaborate selection
is to be conducted by |
standardize |
A logical flag for feature standardization, prior to
performing feature screening. The resulting coefficients are
always returned on the original scale.
If features are in the same units already, you might not wish to
standardize. Default is |
fast |
Set to |
U |
A numerical multiplier of initial tuning step parameter in IHT algorithm. Default is 1. For binomial model, a larger initial value is recommended; a smaller one is recommended for poisson model. |
U_rate |
Decreasing rate in tuning step parameter |
penalize_mod |
A logical flag to indicate whether adjustment is used in
ranking groups of features. This argument is applicable only when
|
Details
With the input Y and X, SMLE() conducts joint feature screening by running the iterative hard thresholding (IHT) algorithm, where the default initial value is set to be the Lasso estimate with the sparsity closest to the sample size minus one.
In SMLE(), the initial value for the step size parameter 1/u is determined as follows. When coef_initial = 0, we set 1/u = U / \sqrt{p}. When coef_initial != 0, we generate a sub-matrix X_0 using the columns of X corresponding to the non-zero positions of coef_initial and set 1/u = U/\sqrt{p}||X||^2_{\infty}. The step size is then recursively decreased by the factor U_rate to guarantee that the likelihood increases. This strategy is called u-search.
SMLE() terminates IHT iterations when either tol or max_iter is satisfied. When fast = TRUE, the algorithm also stops when the non-zero members of the coefficient estimates remain the same for 10 successive iterations, or the log-likelihood difference between coefficient estimates is less than 0.01 times the log-likelihood increase of the first step, or tol \sqrt{k} is satisfied.
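To make the iteration concrete, here is a simplified base-R sketch of IHT with a step-size search for the Gaussian model only. It is not the package's implementation: it starts from zero rather than from a Lasso estimate, uses only the tol and max_iter stopping rules, and the function names hard_threshold and iht_gaussian are hypothetical.

```r
# Keep the k largest entries (in absolute value) of a vector; zero the rest.
hard_threshold <- function(b, k) {
  keep <- order(abs(b), decreasing = TRUE)[seq_len(k)]
  out <- numeric(length(b))
  out[keep] <- b[keep]
  out
}

# Illustrative IHT for the Gaussian model (an assumption-laden sketch, not SMLE() itself).
iht_gaussian <- function(X, Y, k, U = 1, U_rate = 0.5, max_iter = 500, tol = 1e-3) {
  p <- ncol(X)
  beta <- numeric(p)
  loglik <- function(b) -0.5 * sum((Y - X %*% b)^2)   # Gaussian log-likelihood up to constants
  for (iter in seq_len(max_iter)) {
    grad <- drop(crossprod(X, Y - X %*% beta))        # gradient of the log-likelihood
    step <- U / sqrt(p)                               # initial step size 1/u
    repeat {                                          # u-search: shrink until no decrease
      beta_new <- hard_threshold(beta + step * grad, k)
      if (loglik(beta_new) >= loglik(beta) || step < 1e-12) break
      step <- step * U_rate
    }
    converged <- sum((beta_new - beta)^2) < tol       # squared change between updates
    beta <- beta_new
    if (converged) break
  }
  list(coef = beta, ID_retained = which(beta != 0), steps = iter)
}

set.seed(1)
X <- matrix(rnorm(100 * 200), 100, 200)
beta_true <- numeric(200)
beta_true[c(5, 50, 150)] <- c(3, -3, 3)
Y <- drop(X %*% beta_true) + rnorm(100)
fit <- iht_gaussian(X, Y, k = 5)
fit$ID_retained
```

Each update is a gradient step followed by truncation to the k largest coefficients, so all features compete jointly for a place in the retained set instead of being ranked one at a time.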
In SMLE(), categorical features are coded by dummy covariates with the method specified in codingtype. Users can use group to specify whether to treat those dummy covariates as a single group feature or as individual features.
When group = TRUE with penalize_mod = TRUE, the effect for a group of J dummy covariates is computed by
\beta_i = \sqrt{(\beta_1)^2+...+(\beta_J)^2}/\sqrt{J},
which will be treated as a single feature in IHT iterations. When group = FALSE, a group of J dummy covariates will be treated as J individual features in the IHT iterations; in this case, a categorical feature is retained after screening when at least one of the corresponding dummy covariates is retained.
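The adjusted group effect is a one-line computation. The sketch below uses a hypothetical helper, group_effect, to show how the Euclidean norm of the J dummy coefficients is divided by \sqrt{J}.

```r
# Hypothetical helper: adjusted effect size for a group of J dummy coefficients.
group_effect <- function(beta_group) {
  sqrt(sum(beta_group^2)) / sqrt(length(beta_group))
}

group_effect(c(3, 4))              # norm 5 divided by sqrt(2)
group_effect(c(0.5, -0.5, 0.5))    # three equally sized dummy effects
```

Dividing by \sqrt{J} keeps a categorical feature with many levels from outranking a numerical feature merely because it contributes more coefficients.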
Since feature screening is usually a preprocessing step, users may wish to further conduct an elaborative feature selection after screening. This can be done by setting selection = TRUE in SMLE() or applying any existing selection method on the output of SMLE().
Value
call |
The call that produced this object. |
ID_retained |
A vector indicating the features retained after SMLE-screening.
The output includes both features retained by |
coef_retained |
The vector of coefficients estimated by IHT for the retained features. When the
retained set contains a categorical feature, the value returns a group effect if
|
path_retained |
IHT iteration path with columns recording the coefficient updates. |
num_retained |
Number of retained features after screening. |
intercept |
The estimated intercept value by IHT, if |
steps |
Number of IHT iterations. |
likelihood_iter |
A list of log-likelihood updates over the IHT iterations. |
Usearch |
A vector giving the number of attempts to find a proper |
modified_data |
A list containing data objects generated by SMLE.
|
iteration_data |
A list containing data objects that track the coefficients over iterations.
|
X, Y, data, family, categorical and codingtype are returned as passed in the function call.
References
UCLA Statistical Consulting Group. Coding systems for categorical variables in regression analysis. https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/. Retrieved May 28, 2020.
Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.
Examples
# Example 1:
set.seed(1)
Data <- Gen_Data( n= 200, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE( Y = Data$Y , X = Data$X, k = 9,family = "gaussian")
summary(fit)
Data$subset_true %in% fit$ID_retained # Sure screening check.
plot(fit)
# Example 2:
set.seed(1)
Data_sim2 <- Gen_Data(n = 420, p = 1000, family = "gaussian", num_ctgidx = 5,
pos_ctgidx = c(1,3,5,7,9), effect_truecoef= c(1,2,3,-4,-5),
pos_truecoef = c(1,3,5,7,8), level_ctgidx = c(3,3,3,4,5))
train_X <- Data_sim2$X[1:400,]; test_X <- Data_sim2$X[401:420,]
train_Y <- Data_sim2$Y[1:400]; test_Y <- Data_sim2$Y[401:420]
fit <- SMLE(Y = train_Y, X = train_X, family = "gaussian", group = TRUE, k = 15)
predict(fit, newdata = test_X)
test_Y
# Example 3:
library(datasets)
data("attitude")
set.seed(1)
noise <- matrix(rnorm(30*100, mean = mean(attitude$rating) , sd = 1), ncol = 100)
colnames(noise) <- paste("Noise", seq(100), sep = ".")
df <- data.frame(cbind(attitude, noise))
fit <- SMLE(rating ~., data = df)
fit
Extract coefficients from fitted model
Description
Extract coefficients from fitted model for either a 'smle' or 'selection' object.
Usage
## S3 method for class 'smle'
coef(object, refit = TRUE, ...)
## S3 method for class 'selection'
coef(object, refit = TRUE, ...)
Arguments
object |
Returned object from either the function |
refit |
A logical flag that controls which coefficients are returned. Default is |
... |
This argument is not used and is listed only for method consistency. |
Value
Fitted coefficients based on the screened or selected model specified in the object. If refit = TRUE, the coefficients are estimated by re-fitting the final screened/selected model with glm(). If refit = FALSE, the coefficients estimated by the IHT algorithm are returned.
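The idea behind refit = TRUE can be sketched with base R alone: fit an ordinary glm() on the retained columns only. The helper refit_coef is hypothetical and only mirrors the description above; the package's own refitting code may handle intercepts and categorical features differently.

```r
# Illustrative refit (not the package's code): re-estimate coefficients on retained columns.
refit_coef <- function(X, Y, ids, family = gaussian()) {
  dat <- data.frame(Y = Y, X[, ids, drop = FALSE])
  coef(glm(Y ~ ., data = dat, family = family))
}

set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20, dimnames = list(NULL, paste0("V", 1:20)))
Y <- 2 * X[, 3] - X[, 7] + rnorm(100)
refit_coef(X, Y, ids = c(3, 7))   # intercept plus one coefficient per retained column
```

Refitting removes the influence of the IHT step size and stopping rule on the reported estimates, at the cost of one extra low-dimensional fit.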
Examples
set.seed(1)
Data<-Gen_Data(n=100, p=5000, family = "gaussian", correlation="ID")
fit<-SMLE(Y = Data$Y, X = Data$X, k=15, family = "gaussian")
coef(fit)
fit_s<-smle_select(fit)
coef(fit_s)
Extract log-likelihood
Description
This is a method written to extract the log-likelihood from 'smle' and 'selection' objects. It refits the model by glm() based on the response and the features selected after screening or selection, and returns an object of class 'logLik' from the generic.
Usage
## S3 method for class 'smle'
logLik(object, ...)
## S3 method for class 'selection'
logLik(object, ...)
Arguments
object |
An object of class |
... |
Forwarded arguments. |
Value
Returns an object of class 'logLik'. This is a number with at least one attribute, "df" (degrees of freedom), giving the number of (estimated) parameters in the model. For more details, see the generic logLik() in stats.
Examples
set.seed(1)
Data<-Gen_Data(n=100, p=5000, family = "gaussian", correlation="ID")
fit<-SMLE(Y=Data$Y, X=Data$X, k=9, family = "gaussian")
logLik(fit)
Plots to visualize the post-screening selection
Description
This function constructs a sparsity vs. selection criterion curve for a 'selection' object. When EBIC is used with voting, it also constructs a histogram showing the voting result.
Usage
## S3 method for class 'selection'
plot(x, ...)
Arguments
x |
A |
... |
Additional arguments to the |
Value
No return value.
Examples
set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_s <- smle_select(fit, vote = TRUE)
plot(fit_s)
Plots to visualize SMLE screening
Description
This function returns two plot windows. By default, the first shows 1) the solution path (estimated coefficient by iteration step) for the retained features. By default, the second contains 4 plots to assess convergence: 2) log-likelihood, 3) Euclidean distance between the current and the previous coefficient estimates, 4) the number of tries in u-search (see Details of SMLE()), and 5) the number of features changed in the current active set.
Usage
## S3 method for class 'smle'
plot(x, num_path = NULL, label = TRUE, which_path = NULL, out_plot = 1, ...)
Arguments
x |
A |
num_path |
The number of top coefficients to be shown. Default is equal to the number of features retained in the model. |
label |
Logical flag for whether to label each curve with the feature index. Default is |
which_path |
A vector to control which features are shown in addition to the paths for the most significant coefficients. |
out_plot |
A number from 1 to 5 indicating which plot is to be shown in the separate window; the default for solution path plot is "1". See Description for plot labels 2-5. |
... |
Additional arguments passed to the second plot. |
Value
No return value.
Examples
set.seed(1)
Data <- Gen_Data(correlation = "CS")
fit <- SMLE(Y = Data$Y,X = Data$X, k = 20, family = "gaussian")
plot(fit)
Prediction based on SMLE screening and selection
Description
For a model object of class 'smle' or 'selection', this function returns the predicted response values after re-fitting the model using glm().
Usage
## S3 method for class 'smle'
predict(object, newdata = NULL, type = c("link", "response", "terms"), ...)
## S3 method for class 'selection'
predict(object, newdata = NULL, type = c("link", "response", "terms"), ...)
Arguments
object |
A |
newdata |
Matrix of new values for the features at which predictions are to be made. If omitted, the fitted linear predictors are used. |
type |
The type of prediction required by |
... |
Further arguments passed to |
Value
A prediction vector with length equal to the number of rows of newdata.
Examples
set.seed(1)
Data_sim <- Gen_Data(n = 420, p = 1000, sigma = 0.5, family = "gaussian")
train_X <- Data_sim$X[1:400,]; test_X <- Data_sim$X[401:420,]
train_Y <- Data_sim$Y[1:400]; test_Y <- Data_sim$Y[401:420]
fit1 <- SMLE(Y = train_Y, X = train_X, family = "gaussian", k = 10)
#Fitted responses vs true responses in training data
predict(fit1)[1:10]
train_Y[1:10]
#Predicted responses vs true responses in testing data
predict(fit1, newdata = test_X)
test_Y
Print an object
Description
This function prints information about the fitted model from a call to SMLE() or smle_select(), or about the simulated data from a call to Gen_Data(). The object passed as an argument to print is returned invisibly.
Usage
## S3 method for class 'smle'
print(x, ...)
## S3 method for class 'selection'
print(x, ...)
## S3 method for class 'summary.smle'
print(x, ...)
## S3 method for class 'summary.selection'
print(x, ...)
## S3 method for class 'sdata'
print(x, ...)
Arguments
x |
Fitted object. |
... |
This argument is not used and is listed only for method consistency. |
Value
Returns the argument invisibly.
Examples
set.seed(1)
Data<-Gen_Data(correlation = "MA", family = "gaussian")
Data
fit<-SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
print(fit)
summary(fit)
p values of synthetic genetic association study data set
Description
The first column is the chromosome number. The second column is the SNP name. The third column is the genomic position of the SNP on the whole data set. The marginal p-value of each SNP is pre-calculated and saved in the fourth column.
Usage
data(pvals)
Format
An object of class data.frame with 10031 rows and 4 columns.
Elaborative post-screening selection with SMLE
Description
The features retained after screening are still likely to contain some that are not related to the response. The function smle_select() is designed to further identify the relevant features using SMLE(). Given a response and a set of K features, this function first runs SMLE(fast = TRUE) to generate a series of sub-models with sparsity k varying from k_min to k_max. It then selects the best model from the series based on a selection criterion.
When criterion EBIC is used, users can choose to repeat the selection with different values of the tuning parameter \gamma, and conduct importance voting for each feature. When vote = TRUE, this function fits all the models with \gamma specified in gamma_seq, and features with frequency higher than vote_threshold will be selected in ID_voted.
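A rough base-R sketch of the two ingredients, the EBIC score and the voting rule, is given below. The EBIC formula follows Chen and Chen (2012); the helper names ebic and vote are hypothetical, and counting k as the number of selected features (without the intercept) is an assumption of this sketch, so values may not match those reported by smle_select().

```r
# EBIC for a fitted glm with k selected features out of p candidates
# (formula from Chen and Chen, 2012; the helper itself is hypothetical).
ebic <- function(fit, k, p, gamma = 0.5) {
  n <- nobs(fit)
  -2 * as.numeric(logLik(fit)) + k * log(n) + 2 * gamma * lchoose(p, k)
}

# Voting: keep features chosen in more than vote_threshold of the gamma runs.
vote <- function(selected_by_gamma, vote_threshold = 0.6) {
  freq <- table(unlist(selected_by_gamma)) / length(selected_by_gamma)
  as.integer(names(freq)[freq > vote_threshold])
}

# EBIC of a small model on a built-in data set, treating it as 1 feature out of p = 10.
fit <- glm(dist ~ speed, data = cars)
ebic(fit, k = 1, p = 10, gamma = 0.5)

# Toy input: feature sets selected under six values of gamma.
sets <- list(c(1, 4, 9), c(1, 4, 9), c(1, 4), c(1, 4), c(1), c(1))
vote(sets)   # features 1 and 4 appear in more than 60% of the runs
```

With gamma = 0 the score reduces to the ordinary BIC; larger values of gamma penalize model size more heavily when p is large, which is why voting across several values can stabilize the selection.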
Usage
smle_select(object, ...)
## S3 method for class 'sdata'
smle_select(
object,
k_min = 1,
k_max = NULL,
subset = NULL,
gamma_ebic = 0.5,
vote = FALSE,
keyset = NULL,
criterion = "ebic",
codingtype = c("DV", "standard", "all"),
gamma_seq = c(seq(0, 1, 0.2)),
vote_threshold = 0.6,
parallel = FALSE,
num_clusters = NULL,
...
)
## Default S3 method:
smle_select(
object = NULL,
Y = NULL,
X = NULL,
family = "gaussian",
keyset = NULL,
...
)
## S3 method for class 'smle'
smle_select(object, ...)
Arguments
object |
Object of class |
... |
Further arguments passed to or from other methods. |
k_min |
The lower bound of candidate model sparsity. Default is 1. |
k_max |
The upper bound of candidate model sparsity. Default is the number of columns in feature matrix. |
subset |
An index vector indicating which features (columns of the
feature matrix) are to be selected. Not applicable if a |
gamma_ebic |
The EBIC tuning parameter, in |
vote |
The logical flag for whether to perform the voting procedure. Only available when |
keyset |
A numeric vector with column indices for the key features that do not participate in feature screening and are forced to remain in the model. See SMLE for details. |
criterion |
Selection criterion. One of " |
codingtype |
Coding types for categorical features; for more details see |
gamma_seq |
The sequence of values for |
vote_threshold |
A relative voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold. Default is 0.6. |
parallel |
A logical flag to use parallel computing to do voting selection.
Default is |
num_clusters |
The number of compute clusters to use when
|
Y |
Input response vector (when |
X |
Input features matrix (when |
family |
Model assumption; see When input is a |
Details
This function accepts three types of input objects:
1) 'smle' object, as the output from SMLE();
2) 'sdata' object, as the output from Gen_Data();
3) other response and feature matrix input by users.
Note that this function is mainly designed to conduct an elaborative selection after feature screening. We do not recommend using it directly for ultra-high-dimensional data without screening.
Value
call |
The call that produced this object. |
ID_selected |
A list of selected features. |
coef_selected |
Fitted model coefficients. |
intercept |
Fitted model intercept. |
criterion_value |
Values of selection criterion for the candidate models with various sparsity. |
categorical |
A logical flag whether the input feature matrix includes categorical features |
ID_pool |
A vector containing all features selected during voting. |
ID_voted |
A vector containing the features selected when |
CI |
Indices of categorical features when |
X, Y, family, gamma_ebic, gamma_seq, criterion, vote, codingtype, and vote_threshold are returned as passed in the function call.
References
Chen, J. and Chen, Z. (2012). "Extended BIC for small-n-large-p sparse GLM." Statistica Sinica, 22(2), 555-574.
Examples
set.seed(1)
Data<-Gen_Data(correlation = "MA", family = "gaussian")
fit<-SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_bic<-smle_select(fit, criterion = "bic")
summary(fit_bic)
fit_ebic<-smle_select(fit, criterion = "ebic", vote = TRUE)
summary(fit_ebic)
plot(fit_ebic)
Summarize SMLE-screening and selection
Description
This function prints a summary of a 'smle' (or a 'selection') object. In particular, it shows the features retained after SMLE-screening (or selection) with the related convergence information.
Usage
## S3 method for class 'smle'
summary(object, ...)
## S3 method for class 'selection'
summary(object, ...)
Arguments
object |
A |
... |
This argument is not used and is listed only for method consistency. |
Value
No return value.
Examples
set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
summary(fit)
fit_s <- smle_select(fit)
summary(fit_s)
Synthetic genetic association study data set
Description
This simulated data set consists of 10,031 genetic variants (SNPs) and a continuous response variable measured on 800 individuals. The genotypes were sampled from genotypic distributions derived from the 1000 Genomes project using the R package sim1000G. The genotype is coded as 0, 1, or 2 by counting the number of minor alleles (the allele that is less common in the sample). The continuous response variable was simulated from a normal distribution with mean that depends additively on the causal SNPs.
Usage
data(synSNP)
Format
An object of class 'data.frame' with 800 rows and 10,032 columns.
References
The 1000 Genomes Project Consortium (2015). Global reference for human genetic variation, Nature, 526(7571), 68-74.
Examples
data(synSNP)
Y_SNP <- synSNP[,1]
X_SNP <- synSNP[,-1]
fit <- SMLE(Y = Y_SNP, X = X_SNP, k = 40)
summary(fit)
plot(fit)
Extract and adjust voting from SMLE selection
Description
When smle_select() is used with criterion = "ebic" and vote = TRUE, users can use vote_update() to adjust the voting threshold without needing to rerun smle_select().
Usage
vote_update(object, ...)
## S3 method for class 'selection'
vote_update(object, vote_threshold = 0.6, ...)
Arguments
object |
A |
... |
This argument is not used and is listed only for method consistency. |
vote_threshold |
A voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold. Default is 0.6. |
Value
The function returns a vector indicating the features selected by EBIC voting with the specified vote_threshold.
Examples
set.seed(1)
Data <- Gen_Data(n = 100, p = 3000, correlation = "MA", rho = 0.7, family = "gaussian")
colnames(Data$X)<- paste("X.",seq(3000) , sep = "")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
fit_s <- smle_select(fit, criterion = "ebic", vote = TRUE)
plot(fit_s)
fit_s
vote_update(fit_s, vote_threshold = 0.4)