Type: | Package |
Title: | Single Cell Feature Identification with Network Topology |
Version: | 1.13 |
Date: | 2025-01-01 |
Author: | Qi Gao [aut, cre] |
Maintainer: | Qi Gao <gqi@med.umich.edu> |
Description: | Cluster-independent method based on topology structure of gene co-expression network for identifying feature gene sets, extracting cellular subpopulations, and elucidating intrinsic relationships among these subpopulations. Without prior cell clustering, SifiNet circumvents potential inaccuracies in clustering that may influence subsequent analyses. This method is introduced in Qi Gao, Zhicheng Ji, Liuyang Wang, Kouros Owzar, Qi-Jing Li, Cliburn Chan, Jichun Xie "SifiNet: a robust and accurate method to identify feature gene sets and annotate cells" (2024) <doi:10.1093/nar/gkae307>. |
License: | GPL-3 |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.1 |
NeedsCompilation: | yes |
BuildVignettes: | yes |
VignetteBuilder: | knitr |
Depends: | R (≥ 3.6.0), methods, utils, stats |
Imports: | Rcpp (≥ 1.0.9), quantreg (≥ 5.94), igraph (≥ 1.3.5), Matrix (≥ 1.5-1), ggraph (≥ 2.0.6), ggplot2 (≥ 3.3.6), |
Suggests: | rmarkdown (≥ 2.20), knitr (≥ 1.42) |
LinkingTo: | Rcpp, RcppArmadillo |
Packaged: | 2025-01-15 22:31:46 UTC; gqi |
Repository: | CRAN |
Date/Publication: | 2025-01-16 15:10:05 UTC |
EstNull This function is a Rcpp version of Wenguang Sun and Tony T. Cai's EstNull.func R function, estimating null distribution from data. Sun, W., & Cai, T. T. (2007). Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control. Journal of the American Statistical Association, 102(479), 901–912.
Description
EstNull This function is a Rcpp version of Wenguang Sun and Tony T. Cai's EstNull.func R function, estimating null distribution from data. Sun, W., & Cai, T. T. (2007). Oracle and Adaptive Compound Decision Rules for False Discovery Rate Control. Journal of the American Statistical Association, 102(479), 901–912.
Usage
EstNull(x, gamma = 0.1)
Arguments
x |
Input vector of all coexpression values |
gamma |
Parameter setting the stopping threshold |
Value
List of mean and std
Author(s)
Qi Gao
The SiFINeT Class
Description
The SiFINeT Class
Slots
data
a list of cell (row) by gene (column) count matrix, either regular or sparse matrix
sparse
whether the count matrix should be analyzed as sparse matrix
meta.data
matrix of meta data, the number of rows should equal to the number of cells
gene.name
a vector of names of genes with length equal to the number of genes
data.name
name of the dataset
n
number of cells in the dataset
p
number of genes in the dataset
data.thres
binarized count matrix
coexp
matrix of genes coexpression
est_ms
estimated mean and sd of coexpression values
thres
lower bound of coexpression (or absolute value of coexpression) for network edge assignment
q5
50% quantile for each gene
kset
index of kept genes after the filtering step
conn
list of connectivities in absolute network
conn2
list of connectivities in positive sub-network
fg_id
index of the candidate feature genes
uni_fg_id
index of the candidate unique feature genes
uni_cluster
cluster result of the candidate unique feature genes
selected_cluster
selected unique feature gene clusters
featureset
detected set of feature genes
assign_shared_feature
Description
The function assigns non-unique candidate feature genes as shared feature genes into unique feature gene sets
Usage
assign_shared_feature(so, min_edge_prop = 0.4)
Arguments
so |
a SiFINeT object |
min_edge_prop |
minimum proportion of edges between a gene and a unique feature gene set for the new gene to be assigned to the set |
Details
Candidate feature genes that are not chosen as unique feature genes would be reconsidered as shared feature genes. A non-unique candidate feature gene would be assigned to a unique feature gene group if it is connected to more than min_edge_prop of the unique genes in the group.
Value
SiFINeT object with shared feature genes in featureset updated.
cal_coexp This function calculates the coexpression patterns between genes and returns the coexpression matrix.
Description
cal_coexp This function calculates the coexpression patterns between genes and returns the coexpression matrix.
Usage
cal_coexp(X)
Arguments
X |
Input binarized cell (row) by gene (column) matrix |
Value
Coexpression matrix
Author(s)
Qi Gao
cal_coexp_sp This function calculates the coexpression patterns between genes in sparse matrix and returns the coexpression matrix.
Description
cal_coexp_sp This function calculates the coexpression patterns between genes in sparse matrix and returns the coexpression matrix.
Usage
cal_coexp_sp(X)
Arguments
X |
Input binarized cell (row) by gene (column) sparse matrix |
Value
Coexpression matrix
Author(s)
Qi Gao
cal_conn This function calculates the first 3 order connectivities for each gene and returns the list of vectors of connectivities.
Description
cal_conn This function calculates the first 3 order connectivities for each gene and returns the list of vectors of connectivities.
Usage
cal_conn(data, thres = 3, m = 10L, abso = 1L, niter = 100L)
Arguments
data |
Input gene by gene coexpression matrix |
thres |
Gene pairs with coexpression exceed thres would be assigned an edge between them in the coexpression network |
m |
Sample size used for the calculation of 3rd order connectivities |
abso |
Whether to calculate connectivities in absolute network (TRUE) or positive network (FALSE) |
niter |
Number of sample used for the calculation of 3rd order connectivities |
Value
List of connectivities C1, C2, and C3 ' @export
Author(s)
Qi Gao
cal_connectivity
Description
The function calculates the 1st, 2nd and 3rd order connectivities for all genes
Usage
cal_connectivity(so, m = 10, niter = 100)
Arguments
so |
a SiFINeT object |
m |
number of neighbors sampled each time for the calculation of 3rd order connectivity |
niter |
number of samples created for the calculation of 3rd order connectivity |
Details
For gene i, First order connectivity is defined as the number of edges connected to gene i (degree of the gene node i in the network). Second order connectivity is defined as the proportion of edges between the neighbors of gene i, calculated as number of observed edges between the neighbors of gene i divided by the number of possible edges between the neighbors. Third order connectivity is defined as a weighted proportion of edges between neighbors and neighbors of neighbors of gene i. Third order connectivity is calculated as the mean of edge proportions across weighted samples. Each gene is weighted by the number of edges it has with the neighbors of gene i. Then SiFINeT repeatedly samples m genes for niter times. For each sample, the edge proportion (number of observed edges / number of possible edges) is calculated. And the mean edge proportion across the sample is the 3rd order connectivity for gene i.
Value
SiFINeT object with conn (absolute network connectivities) updated.
create_SiFINeT_object
Description
The function classifies count data based on thresholds defined by quantile regression
Usage
create_SiFINeT_object(
counts,
gene.name = NULL,
meta.data = NULL,
data.name = NULL,
sparse = FALSE,
rowfeature = TRUE
)
Arguments
counts |
count matrix |
gene.name |
name of the features |
meta.data |
data.frame of meta data |
data.name |
name of dataset |
sparse |
whether the count matrix should be analyzed as sparse matrix |
rowfeature |
whether the count matrix is feature (row) by cell (column) |
Value
a SiFINeT object
create_network
Description
The function estimates the null distribution of coexpression patterns and generates coexpression network
Usage
create_network(so, alpha = 0.05, manual = FALSE, least_edge_prop = 0.01)
Arguments
so |
a SiFINeT object |
alpha |
the Type I error rate used for FDR control procedure |
manual |
whether to manually set threshold for edge assignment |
least_edge_prop |
the minimum proportion of edges. Only used when manual = TRUE |
Details
Theoretically the distribution of coexpression patterns would converge to standard Gaussian if either one of the gene pair is not feature gene. However in genomics analysis, empirical null could be much more variable than theoretical null. SiFINeT uses estimated null mean and standard deviation to find the threshold for network edges. An edge is assigned to a pair of gene if the absolute value of coexpression pattern between the 2 genes is greater than the threshold Assuming the distribution to be Gaussian, with the estimated null mean and standard deviation, SiFINeT uses SQUAC to control the false discovery rate (FDR) for coexpression patterns. In case the signal is not strong enough and the coexpression network is too sparse, SiFINeT also accept user-defined lower bound for the least proportion of edges. Usually a coexpression network with edge proportion between 0.5% - 10% would have better performance for the detection of feature gene sets.
Value
SiFINeT object with est_ms (estimated mean and sd) and thres (network edge threshold) updated.
References
Jiashun Jin and Tony T. Cai. “Estimating the Null and the Proportion of Non-Null Effects in Large-Scale Multiple Comparisons”. In: Journal of the American Statistical Association 102 (478 2004), pp. 495–506. doi: 10.1198/016214507000000167.
Jichun Xie and Ruosha Li. “False discovery rate control for high dimensional networks of quantile associations conditioning on covariates”. In: J R Stat Soc Series B Stat Methodol (2018). doi: 10.1111/rssb.12288.
enrich_feature_set
Description
The function chooses genes that are not found to be feature genes as enriched feature genes and assigns them into unique+shared feature gene sets
Usage
enrich_feature_set(so, min_edge_prop = 0.9)
Arguments
so |
a SiFINeT object |
min_edge_prop |
minimum proportion of edges between a gene and a unique+shared feature gene set for the new feature to be assigned to the set |
Details
Genes that are not selected as feature genes would be added in the enriched section of the feature gene set if they are connected with more than min_edge_prop of the unique and shared feature genes in each of the feature gene group.
Value
SiFINeT object with enriched feature genes in geneset updated.
extract_subnetwork
Description
The function extract a subnetwork from the co-expression network
Usage
extract_subnetwork(
so,
target_gene_name = NULL,
target_gene_id = NULL,
positive = TRUE
)
Arguments
so |
a SiFINeT object |
target_gene_name |
the names of the target genes in the output network |
target_gene_id |
the indices of the target genes in the output network, not used when target_gene_name is not Null |
positive |
whether only positive (default) co-expressions or all co-expressions are considered in assigning edges |
Value
an adjacency matrix of the output subnetwork
feature_coexp
Description
The function calculates coexpression patterns between genes
Usage
feature_coexp(so)
Arguments
so |
a SiFINeT object |
Details
The coexpression pattern of a pair of genes is a normalized co-occurrence of high (or equivalently low) expression level of the 2 genes in the classified count matrix. The normalization is based on the estimated quantiles of the low-high separation instead of the quantiles used for quantile regressions. Theoretically, the distribution of coexpression patterns should asymptotically follow standard Gaussian distribution if at least one of the 2 genes is not differentially expressed feature gene.
Value
SiFINeT object with coexp (gene coexpression matrix) updated.
filter_lowexp
Description
The function filters out genes with low expression rate and high positive coexpression with genes of same expression level
Usage
filter_lowexp(so, t1 = 10, t2 = 0.9, t3 = 0.9)
Arguments
so |
a SiFINeT object |
t1 |
threshold for number of total edges connecting the feature node. Lower t1 leads to stricter filtering. |
t2 |
threshold for the proportion of positive edges. Lower t2 leads to stricter filtering. |
t3 |
threshold for the proportion of edges with features of same expression level. Lower t3 leads to stricter filtering. |
Details
When using only mean expression level as independent variable in quantile regression,
it is observed that genes with low expression level tend to have large positive S_{ij}
with genes that have same median expression level.
To reduce the coexpression noise caused by low expression level, it is preferred to filter out genes which have large amount and high proportion of positive coexpressions with genes sharing same median expression level.
Value
SiFINeT object with kset (kept index set) updated.
find_unique_feature
Description
The function finds the clustered unique feature genes
Usage
find_unique_feature(
so,
t1 = 5,
t2 = 0.4,
t3 = 0.3,
t1p = 5,
t2p = 0.7,
t3p = 0.5,
resolution = 1,
min_set_size = 5
)
Arguments
so |
a SiFINeT object |
t1 |
feature gene selection parameter, lower threshold for 1st order connectivity in absolute network |
t2 |
feature gene selection parameter, lower threshold for 2nd order connectivity in absolute network |
t3 |
feature gene selection parameter, lower threshold for 3rd order connectivity in absolute network |
t1p |
unique feature gene selection parameter, lower threshold for 1st order connectivity in positive sub-network |
t2p |
unique feature gene selection parameter, lower threshold for 2nd order connectivity in positive sub-network |
t3p |
unique feature gene selection parameter, lower threshold for 3rd order connectivity in positive sub-network |
resolution |
resolution for louvain clustering of unique feature genes |
min_set_size |
minimum size for a unique feature gene cluster to be a separate unique feature gene set |
Details
SiFINeT first find genes with high 1st (>= t1), 2nd (>= t2) and 3rd (>= t3) order connectivities in absolute network (conn) to be candidate feature genes. Then a positive sub-network is created where only candidate feature gene nodes (fg_id) and edges with positive coexpression patterns (coexp >= thres) are included. Feature genes genes with high 1st (>= t1p), 2nd (>= t2p) and at least moderate 3rd (>= t3p) order connectivities in positive sub-network (conn2) are chosen to be candidate unique feature genes. Note that when the network is not too sparse, t3p should usually be smaller than t2p for the detection of unique feature genes in transition cell types. The candidate unique feature genes are then separated into groups by louvain clustering (with resolution defined by the resolution parameter), and among them large groups (number of genes greater than min_set_size) are chosen to be unique feature gene sets that represent different cell types.
Value
SiFINeT object with fg_id (candidate feature gene index), uni_fg_id (candidate unique feature gene index), conn2 (connectivities in positive sub-network), uni_cluster (cluster of candidate unique feature genes), selected_cluster (selected unique feature gene clusters), and unique feature genes in featureset updated.
geneset_topology
Description
The function plots the topology network of the feature gene sets found by SiFINeT.
Usage
geneset_topology(
so,
weightthres = 0.3,
edge_method = 2,
node_color = "black",
shiftsize = 0.05,
boundsize = 0.3,
prefix = "",
set_name = NULL
)
Arguments
so |
a SiFINeT object |
weightthres |
edges between nodes (feature gene sets) with weight greater than weightthres would be shown in the plot |
edge_method |
SiFINeT provides 2 methods of calculating edge weight. The number of shared feature genes between feature gene sets would be used when edge_method = 1; while the edge proportion between feature gene sets would be applied if edge_method = 2. |
node_color |
color of nodes. Should have either length 1 or same length as the number of feature gene sets. |
shiftsize |
set the distance between center of label and the corresponding feature gene sets node. |
boundsize |
set the size of the boundary region. |
prefix |
the prefix of the labels |
set_name |
name of the gene sets |
Details
This function visualizes the output feature gene sets of SiFINeT in the form of network. Number of shared feature genes or proportion of edges between feature gene sets could be used to weight the edges. The layout of the nodes is created by create_layout function in ggraph package.
Value
A ggraph (ggplot) object
References
Thomas Lin Pedersen (2022). ggraph: An Implementation of Grammar of Graphics for Graphs and Networks. R package version 2.0.6. https://CRAN.R-project.org/package=ggraph
norm_FDR_SQAUC The function controls the false discovery rate (FDR) of coexpression patterns using SQAUC method Jichun Xie and Ruosha Li. "False discovery rate control for high dimensional networks of quantile associations conditioning on covariates". In: J R Stat Soc Series B Stat Methodol (2018). doi: 10.1111/rssb.12288.
Description
norm_FDR_SQAUC The function controls the false discovery rate (FDR) of coexpression patterns using SQAUC method Jichun Xie and Ruosha Li. "False discovery rate control for high dimensional networks of quantile associations conditioning on covariates". In: J R Stat Soc Series B Stat Methodol (2018). doi: 10.1111/rssb.12288.
Usage
norm_FDR_SQAUC(value, sam_mean, sam_sd, alpha, n, p)
Arguments
value |
the vector of coexpression patterns |
sam_mean |
the estimated sample mean |
sam_sd |
the estimated sample sd |
alpha |
the type I error rate |
n |
the number of cells |
p |
the number of genes |
Value
lower bound threshold for genes to be significantly coexpressed
quantile_thres
Description
The function classifies count data into binary low (0) - high (1) data, based on whether the count number is greater than a threshold.
Usage
quantile_thres(so)
Arguments
so |
a SiFINeT object |
Details
The threshold used for classification is defined by quantile regression on each gene using Frisch–Newton interior point method ("fn" option for method variable in quantreg package, rq function). By default if no meta data is provided, the quantile regression would be applied on the mean expression level of each cell. The quantile to be estimated in the quantile regression is set to be the estimated 50% quantile of the non-zero part of the expression level for each gene. If the expression level of a gene is low (with median 0), then the threshold is set to be 0.
Value
SiFINeT object with data.thres (categorized count matrix) updated.
References
Koenker, R. and S. Portnoy (1997) The Gaussian Hare and the Laplacean Tortoise: Computability of Squared-error vs Absolute Error Estimators, (with discussion). Statistical Science, 12, 279-300.
Roger Koenker (2022). quantreg: Quantile Regression. R package version 5.94. https://CRAN.R-project.org/package=quantreg