BigDataStatMeth provides efficient statistical methods and linear algebra operations for large-scale data analysis using block-wise algorithms and HDF5 storage. Designed for genomic, transcriptomic, and multi-omic data analysis, it processes datasets that exceed available RAM by partitioning them into blocks and running the computation against disk-based storage.
The package offers both R and C++ APIs, allowing flexible integration into existing workflows while maintaining high performance for computationally intensive operations.
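To make the block-wise idea concrete, here is a minimal base-R sketch that computes the cross-product `t(X) %*% X` by visiting `X` one row block at a time. It is an illustration only, not the package's implementation: the function name `blockwise_crossprod` and the `block_size` argument are hypothetical, and the blocks are sliced from an in-memory matrix as a stand-in for reading them from an HDF5 file.

```r
# Hypothetical helper: accumulate t(X) %*% X over row blocks.
# Only one block of rows has to be held in memory at a time.
blockwise_crossprod <- function(X, block_size = 1000L) {
  p <- ncol(X)
  acc <- matrix(0, p, p)
  starts <- seq(1L, nrow(X), by = block_size)
  for (s in starts) {
    e <- min(s + block_size - 1L, nrow(X))
    block <- X[s:e, , drop = FALSE]  # stand-in for reading a block from disk
    acc <- acc + crossprod(block)    # adds t(block) %*% block
  }
  acc
}

set.seed(1)
X <- matrix(rnorm(2500 * 20), 2500, 20)

xtx_blocks <- blockwise_crossprod(X, block_size = 400L)
xtx_full <- crossprod(X)

# Largest element-wise difference between the two results
max_abs_diff <- max(abs(xtx_blocks - xtx_full))
```

The block results can simply be summed because the cross-product is a sum over rows, so the peak memory use depends on `block_size` and the number of columns rather than on the total number of rows.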
install.packages("BigDataStatMeth")

# Install devtools if needed
install.packages("devtools")
# Install BigDataStatMeth
devtools::install_github("isglobal-brge/BigDataStatMeth")

R packages:
- Matrix
- rhdf5 (Bioconductor)
- RcppEigen
- RSpectra

System dependencies:
- HDF5 library (>= 1.8)
- C++11 compatible compiler
- For Windows: Rtools

Install Bioconductor dependencies:
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install(c("rhdf5", "HDF5Array"))

library(BigDataStatMeth)
library(rhdf5)
# Create HDF5 file from matrix
genotype_matrix <- matrix(rnorm(5000 * 10000), 5000, 10000)
bdCreate_hdf5_matrix(
  filename = "genomics.hdf5",
  object = genotype_matrix,
  group = "data",
  dataset = "genotypes"
)

# Perform block-wise PCA
pca_result <- bdPCA_hdf5(
  filename = "genomics.hdf5",
  group = "data",
  dataset = "genotypes",
  k = 4,            # Number of blocks
  bcenter = TRUE,   # Center data
  bscale = FALSE,   # Don't scale
  threads = 4       # Use 4 threads
)

# Access results
components <- pca_result$components
variance_explained <- pca_result$variance_prop

# Matrix operations directly on HDF5
result <- bdblockmult_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A",
  B = "matrix_B"
)

# Cross-product
crossp <- bdCrossprod_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A"
)

# SVD decomposition
svd_result <- bdSVD_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  dataset = "matrix_A",
  k = 8,
  threads = 4
)

| Operation | R Function | Features |
|---|---|---|
| Matrix multiplication | `bdblockmult_hdf5()` | Block-wise, parallel, HDF5 |
| Cross-product | `bdCrossprod_hdf5()` | `t(A) %*% A`, `t(A) %*% B` |
| Transposed cross-product | `bdtCrossprod_hdf5()` | `A %*% t(A)`, `A %*% t(B)` |
| SVD | `bdSVD_hdf5()` | Block-wise, hierarchical |
| QR decomposition | `bdQR_hdf5()` | Block-wise |
| Cholesky | `bdCholesky_hdf5()` | For positive-definite matrices |
| Matrix inversion | `bdInvCholesky_hdf5()` | Via Cholesky decomposition |
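
The last two rows of the table rely on a standard identity: a symmetric positive-definite matrix `S` factors as `t(R) %*% R` with `R` upper triangular, and the inverse of `S` can be obtained from `R` alone. The following sketch shows that relationship with base R's `chol()` and `chol2inv()` on a small in-memory matrix. It illustrates the underlying linear algebra under that assumption and does not call the HDF5-backed functions listed above.

```r
set.seed(42)
Z <- matrix(rnorm(200 * 5), 200, 5)
S <- crossprod(Z)        # symmetric positive-definite (Z has full column rank)

R <- chol(S)             # upper-triangular factor with t(R) %*% R equal to S
S_inv <- chol2inv(R)     # inverse of S computed from the factor

# Numerical checks of the two defining properties
recon_err <- max(abs(crossprod(R) - S))
ident_err <- max(abs(S %*% S_inv - diag(5)))
```

Going through the Cholesky factor avoids a general-purpose inversion and fails early if the input is not positive-definite, which is why it is restricted to that class of matrices.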

| Method | R Function | Description |
|---|---|---|
| Principal Component Analysis | `bdPCA_hdf5()` | Block-wise PCA with centering/scaling |
| Singular Value Decomposition | `bdSVD_hdf5()` | Hierarchical block-wise SVD |
| Canonical Correlation Analysis | `bdCCA_hdf5()` | Multi-omic data integration |
| Linear Regression | `bdlm_hdf5()` | Large-scale regression models |
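
PCA and SVD appear together in this table because PCA can be computed from the SVD of the column-centered data matrix. The sketch below shows that connection in base R on a small in-memory matrix, using centering without scaling. The variable names are illustrative, and the code is not the package's block-wise algorithm.

```r
set.seed(7)
X <- matrix(rnorm(300 * 6), 300, 6)

Xc <- scale(X, center = TRUE, scale = FALSE)  # center columns, do not scale
sv <- svd(Xc)

scores <- sv$u %*% diag(sv$d)                 # principal component scores
loadings <- sv$v                              # principal axes (one per column)
var_explained <- sv$d^2 / sum(sv$d^2)         # proportion of variance per component

# Cross-check the component standard deviations against stats::prcomp
ref <- prcomp(X, center = TRUE, scale. = FALSE)
sdev_diff <- max(abs(sv$d / sqrt(nrow(X) - 1) - ref$sdev))
```

The scores agree with `ref$x` up to the sign of each column, since the direction of a principal axis is arbitrary.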

| Operation | R Function | Purpose |
|---|---|---|
| Create HDF5 dataset | `bdCreate_hdf5_matrix()` | Initialize HDF5 files |
| Normalize data | `bdNormalize_hdf5()` | Center and/or scale |
| Remove low-quality data | `bdRemovelowdata_hdf5()` | Filter by missing values |
| Impute missing values | `bdImputeSNPs_hdf5()` | Mean/median imputation |
| Split datasets | `bdSplit_matrix_hdf5()` | Partition into blocks |
| Merge datasets | `bdBind_hdf5_datasets()` | Combine by rows/columns |
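
As a rough picture of what filtering by missing values and mean imputation do, the following base-R sketch applies both steps to a small in-memory genotype matrix. The 5% threshold and all variable names are assumptions chosen for illustration. The package functions operate on HDF5 datasets and may differ in their details.

```r
set.seed(3)
geno <- matrix(sample(0:2, 100 * 8, replace = TRUE), 100, 8)
storage.mode(geno) <- "double"

geno[1:30, 2] <- NA       # column 2: 30% missing, should be removed
geno[c(5, 50), 4] <- NA   # column 4: 2% missing, should be kept
geno[10, 7] <- NA         # column 7: 1% missing, should be kept

# Filter: drop columns whose missing fraction exceeds 5%
miss_frac <- colMeans(is.na(geno))
kept <- geno[, miss_frac <= 0.05, drop = FALSE]

# Impute: replace each remaining NA with the mean of its column
col_means <- colMeans(kept, na.rm = TRUE)
na_idx <- which(is.na(kept), arr.ind = TRUE)
imputed <- kept
imputed[na_idx] <- col_means[na_idx[, "col"]]
```

Filtering before imputing keeps columns that are mostly missing from being filled with a mean estimated from very few observations.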

| Function | Purpose |
|---|---|
| `bdgetDim_hdf5()` | Get dataset dimensions |
| `bdExists_hdf5_element()` | Check if dataset exists |
| `bdgetDatasetsList_hdf5()` | List all datasets in group |
| `bdRemove_hdf5_element()` | Delete dataset or group |
| `bdImportTextFile_hdf5()` | Import text files to HDF5 |

Comprehensive documentation is available at https://isglobal-brge.github.io/BigDataStatMeth/
# List available vignettes
vignette(package = "BigDataStatMeth")
# View specific vignette
vignette("getting-started", package = "BigDataStatMeth")
vignette("pca-genomics", package = "BigDataStatMeth")

BigDataStatMeth is designed for efficiency:
BigDataStatMeth is particularly suited for:
library(BigDataStatMeth)
# Load genomic data
bdCreate_hdf5_matrix("gwas.hdf5", genotypes, "data", "snps")
# Quality control
bdRemovelowdata_hdf5("gwas.hdf5", "data", "snps",
                     pcent = 0.05, bycols = TRUE)  # Remove SNPs >5% missing

# Impute remaining missing values
bdImputeSNPs_hdf5("gwas.hdf5", "data", "snps_filtered")

# Perform PCA
pca <- bdPCA_hdf5("gwas.hdf5", "data", "snps_filtered",
                  k = 8, bcenter = TRUE, threads = 4)

# Plot results
plot(pca$components[, 1], pca$components[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "Population Structure")

# Prepare data
bdCreate_hdf5_matrix("multi_omic.hdf5", gene_expression, "data", "genes")
bdCreate_hdf5_matrix("multi_omic.hdf5", methylation, "data", "cpgs")
# Normalize
bdNormalize_hdf5("multi_omic.hdf5", "data", "genes",
                 bcenter = TRUE, bscale = TRUE)
bdNormalize_hdf5("multi_omic.hdf5", "data", "cpgs",
                 bcenter = TRUE, bscale = TRUE)

# Canonical Correlation Analysis
cca <- bdCCA_hdf5(
  filename = "multi_omic.hdf5",
  X = "NORMALIZED/data/genes",
  Y = "NORMALIZED/data/cpgs",
  m = 10  # Number of blocks
)

# Extract canonical correlations
correlations <- h5read("multi_omic.hdf5", "Results/cor")

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"
using namespace BigDataStatMeth;
// [[Rcpp::export]]
void custom_analysis(std::string filename, std::string dataset) {
    hdf5Dataset* ds = new hdf5Dataset(filename, dataset, false);
    ds->openDataset();

    // Your custom algorithm using BigDataStatMeth functions
    // Block-wise processing, matrix operations, etc.

    delete ds;
}

See Developing Methods for complete examples.

If you use BigDataStatMeth in your research, please cite:
Pelegri-Siso D, Gonzalez JR (2024). BigDataStatMeth: Statistical Methods
for Big Data Using Block-wise Algorithms and HDF5 Storage.
R package version X.X.X, https://github.com/isglobal-brge/BigDataStatMeth
BibTeX entry:
@Manual{bigdatastatmeth,
  title = {BigDataStatMeth: Statistical Methods for Big Data},
  author = {Dolors Pelegri-Siso and Juan R. Gonzalez},
  year = {2024},
  note = {R package version X.X.X},
  url = {https://github.com/isglobal-brge/BigDataStatMeth},
}

Contributions are welcome! Please:
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push the branch (`git push origin feature/amazing-feature`)
- Run `R CMD check` before submitting

MIT License - see LICENSE file for details.

Dolors Pelegri-Siso
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health
Juan R. Gonzalez
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health
Development of BigDataStatMeth was supported by ISGlobal and the Bioinformatics Research Group in Epidemiology (BRGE).