BigDataStatMeth

Overview

BigDataStatMeth provides efficient statistical methods and linear algebra operations for large-scale data analysis using block-wise algorithms and HDF5 storage. Designed for genomic, transcriptomic, and multi-omic data analysis, it enables processing datasets that exceed available RAM through intelligent data partitioning and disk-based computation.

The package offers both R and C++ APIs, allowing flexible integration into existing workflows while maintaining high performance for computationally intensive operations.

Key Features

- Block-wise algorithms that process datasets larger than available RAM
- HDF5-backed storage for disk-based, out-of-core computation
- Multi-threaded execution for computationally intensive operations
- Both R and C++ APIs for flexible integration into existing workflows
- Methods oriented to genomic, transcriptomic, and multi-omic data analysis

Installation

From CRAN (Stable Release)

install.packages("BigDataStatMeth")

From GitHub (Development Version)

# Install devtools if needed
install.packages("devtools")

# Install BigDataStatMeth
devtools::install_github("isglobal-brge/BigDataStatMeth")

System Requirements

R packages:

- Matrix
- rhdf5 (Bioconductor)
- RcppEigen
- RSpectra

System dependencies:

- HDF5 library (>= 1.8)
- C++11 compatible compiler
- Rtools (Windows only)

Install Bioconductor dependencies:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    
BiocManager::install(c("rhdf5", "HDF5Array"))
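
To verify that the HDF5 stack is in place, check the versions reported by rhdf5:

library(rhdf5)

# Prints the rhdf5 package version and the HDF5 library it links against
h5version()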

Quick Start

Basic Workflow: PCA on Large Genomic Data

library(BigDataStatMeth)
library(rhdf5)

# Create HDF5 file from matrix
genotype_matrix <- matrix(rnorm(5000 * 10000), 5000, 10000)
bdCreate_hdf5_matrix(
  filename = "genomics.hdf5",
  object = genotype_matrix,
  group = "data",
  dataset = "genotypes"
)

# Perform block-wise PCA
pca_result <- bdPCA_hdf5(
  filename = "genomics.hdf5",
  group = "data",
  dataset = "genotypes",
  k = 4,              # Number of blocks
  bcenter = TRUE,     # Center data
  bscale = FALSE,     # Don't scale
  threads = 4         # Use 4 threads
)

# Access results
components <- pca_result$components
variance_explained <- pca_result$variance_prop
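
A quick way to inspect the fit is a scree-style plot of the variance explained per component (variance_prop is taken to be a numeric vector of proportions, as extracted above):

# Scree-style plot of the proportion of variance per component
barplot(variance_explained,
        names.arg = paste0("PC", seq_along(variance_explained)),
        ylab = "Proportion of variance explained",
        main = "PCA scree plot")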

Working with HDF5 Files

# Matrix operations directly on HDF5
result <- bdblockmult_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A",
  B = "matrix_B"
)

# Cross-product
crossp <- bdCrossprod_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A"
)

# SVD decomposition
svd_result <- bdSVD_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  dataset = "matrix_A",
  k = 8,
  threads = 4
)
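
Since inputs and outputs live in the HDF5 file, standard rhdf5 tools can be used to inspect them. The dataset path passed to h5read() below is illustrative; the exact output locations depend on each function:

library(rhdf5)

# List all groups and datasets in the file, including results written
# by the bd*_hdf5() calls above
h5ls("data.hdf5")

# Read a dataset back into memory once you know its path
A <- h5read("data.hdf5", "matrices/matrix_A")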

Core Functionality

Linear Algebra Operations

| Operation | R Function | Features |
|---|---|---|
| Matrix multiplication | bdblockmult_hdf5() | Block-wise, parallel, HDF5 |
| Cross-product | bdCrossprod_hdf5() | t(A) %*% A, t(A) %*% B |
| Transposed cross-product | bdtCrossprod_hdf5() | A %*% t(A), A %*% t(B) |
| SVD | bdSVD_hdf5() | Block-wise, hierarchical |
| QR decomposition | bdQR_hdf5() | Block-wise |
| Cholesky | bdCholesky_hdf5() | For positive-definite matrices |
| Matrix inversion | bdInvCholesky_hdf5() | Via Cholesky decomposition |
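
As a further illustration, the transposed cross-product follows the same calling pattern as bdCrossprod_hdf5() shown in the Quick Start; the sketch below assumes the analogous filename/group/A arguments:

# A %*% t(A), computed block-wise on disk
# (arguments assumed to mirror bdCrossprod_hdf5; see ?bdtCrossprod_hdf5)
tcrossp <- bdtCrossprod_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  A = "matrix_A"
)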

Statistical Methods

| Method | R Function | Description |
|---|---|---|
| Principal Component Analysis | bdPCA_hdf5() | Block-wise PCA with centering/scaling |
| Singular Value Decomposition | bdSVD_hdf5() | Hierarchical block-wise SVD |
| Canonical Correlation Analysis | bdCCA_hdf5() | Multi-omic data integration |
| Linear Regression | bdlm_hdf5() | Large-scale regression models |
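
bdPCA_hdf5(), bdSVD_hdf5(), and bdCCA_hdf5() are demonstrated elsewhere in this README. For bdlm_hdf5(), the sketch below is hypothetical: the X and y argument names are assumptions, so check ?bdlm_hdf5 for the actual interface:

# Hypothetical call; argument names X and y are assumptions
fit <- bdlm_hdf5(
  filename = "data.hdf5",
  group = "matrices",
  X = "design_matrix",   # predictors stored in HDF5
  y = "response"         # response stored in HDF5
)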

Data Management

| Operation | R Function | Purpose |
|---|---|---|
| Create HDF5 dataset | bdCreate_hdf5_matrix() | Initialize HDF5 files |
| Normalize data | bdNormalize_hdf5() | Center and/or scale |
| Remove low-quality data | bdRemovelowdata_hdf5() | Filter by missing values |
| Impute missing values | bdImputeSNPs_hdf5() | Mean/median imputation |
| Split datasets | bdSplit_matrix_hdf5() | Partition into blocks |
| Merge datasets | bdBind_hdf5_datasets() | Combine by rows/columns |
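
These operations compose into a simple preprocessing pipeline; the calls below reuse the signatures demonstrated in the Examples section:

library(BigDataStatMeth)

X <- matrix(rnorm(100 * 50), 100, 50)

# Write the matrix to HDF5, then center and scale it
bdCreate_hdf5_matrix("prep.hdf5", X, "data", "expr")
bdNormalize_hdf5("prep.hdf5", "data", "expr",
                 bcenter = TRUE, bscale = TRUE)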

Utility Functions

| Function | Purpose |
|---|---|
| bdgetDim_hdf5() | Get dataset dimensions |
| bdExists_hdf5_element() | Check if dataset exists |
| bdgetDatasetsList_hdf5() | List all datasets in group |
| bdRemove_hdf5_element() | Delete dataset or group |
| bdImportTextFile_hdf5() | Import text files to HDF5 |
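
A typical defensive pattern before running an analysis looks like the following; the argument forms here (file path plus "group/dataset" element path) are assumptions, so consult each function's help page:

# Operate only if the dataset is present (argument forms are assumed)
if (bdExists_hdf5_element("data.hdf5", "matrices/matrix_A")) {
  print(bdgetDim_hdf5("data.hdf5", "matrices/matrix_A"))
}

# Enumerate the datasets within a group
bdgetDatasetsList_hdf5("data.hdf5", "matrices")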

Documentation

Comprehensive documentation is available at https://isglobal-brge.github.io/BigDataStatMeth/

Vignettes

# List available vignettes
vignette(package = "BigDataStatMeth")

# View specific vignette
vignette("getting-started", package = "BigDataStatMeth")
vignette("pca-genomics", package = "BigDataStatMeth")

Performance

BigDataStatMeth is designed for efficiency:

- Block-wise algorithms keep memory usage bounded, so datasets larger than available RAM can be processed
- Multi-threaded execution (via the threads argument) for computationally intensive operations
- Disk-based computation on HDF5 files avoids loading full matrices into memory
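
As a rough check on your own hardware, base R's system.time() can compare thread settings; this assumes bdblockmult_hdf5() accepts a threads argument like bdPCA_hdf5() and bdSVD_hdf5() (verify against its help page):

# Compare single-threaded vs multi-threaded block multiplication
system.time(bdblockmult_hdf5(filename = "data.hdf5", group = "matrices",
                             A = "matrix_A", B = "matrix_B", threads = 1))
system.time(bdblockmult_hdf5(filename = "data.hdf5", group = "matrices",
                             A = "matrix_A", B = "matrix_B", threads = 4))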

Use Cases

BigDataStatMeth is particularly suited for:

- Genome-wide association studies and other large-scale genomic analyses
- Transcriptomic and methylation data processing
- Multi-omic data integration (e.g., CCA across gene expression and methylation)
- Any workflow whose matrices exceed available RAM

Examples

Example 1: Genomic PCA with Quality Control

library(BigDataStatMeth)

# Load genomic data
bdCreate_hdf5_matrix("gwas.hdf5", genotypes, "data", "snps")

# Quality control
bdRemovelowdata_hdf5("gwas.hdf5", "data", "snps", 
                     pcent = 0.05, bycols = TRUE)  # Remove SNPs >5% missing

# Impute remaining missing values
bdImputeSNPs_hdf5("gwas.hdf5", "data", "snps_filtered")

# Perform PCA
pca <- bdPCA_hdf5("gwas.hdf5", "data", "snps_filtered", 
                  k = 8, bcenter = TRUE, threads = 4)

# Plot results
plot(pca$components[,1], pca$components[,2],
     xlab = "PC1", ylab = "PC2",
     main = "Population Structure")

Example 2: Multi-Omic CCA

library(BigDataStatMeth)
library(rhdf5)  # for h5read() below

# Prepare data
bdCreate_hdf5_matrix("multi_omic.hdf5", gene_expression, "data", "genes")
bdCreate_hdf5_matrix("multi_omic.hdf5", methylation, "data", "cpgs")

# Normalize
bdNormalize_hdf5("multi_omic.hdf5", "data", "genes", 
                 bcenter = TRUE, bscale = TRUE)
bdNormalize_hdf5("multi_omic.hdf5", "data", "cpgs",
                 bcenter = TRUE, bscale = TRUE)

# Canonical Correlation Analysis
cca <- bdCCA_hdf5(
  filename = "multi_omic.hdf5",
  X = "NORMALIZED/data/genes",
  Y = "NORMALIZED/data/cpgs",
  m = 10  # Number of blocks
)

# Extract canonical correlations
correlations <- h5read("multi_omic.hdf5", "Results/cor")

Example 3: Custom Method Development (C++ API)

#include <Rcpp.h>
#include "BigDataStatMeth.hpp"

using namespace BigDataStatMeth;

// [[Rcpp::export]]
void custom_analysis(std::string filename, std::string dataset) {
  
  hdf5Dataset* ds = new hdf5Dataset(filename, dataset, false);
  ds->openDataset();
  
  // Your custom algorithm using BigDataStatMeth functions
  // Block-wise processing, matrix operations, etc.
  
  delete ds;
}
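
One way to compile and call this from R is Rcpp::sourceCpp(). For the header to resolve, the package must be installed and declared with an Rcpp depends attribute; that attribute line, and the dataset path format, are assumptions about how the C++ API is exposed:

# In R: compile the file and call the exported function
# (the .cpp file should also carry // [[Rcpp::depends(BigDataStatMeth)]]
#  near the top; this is an assumption about the package's LinkingTo setup)
Rcpp::sourceCpp("custom_analysis.cpp")
custom_analysis("genomics.hdf5", "data/genotypes")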

See the Developing Methods guide in the package documentation for complete examples.

Citation

If you use BigDataStatMeth in your research, please cite:

Pelegri-Siso D, Gonzalez JR (2024). BigDataStatMeth: Statistical Methods 
for Big Data Using Block-wise Algorithms and HDF5 Storage. 
R package version X.X.X, https://github.com/isglobal-brge/BigDataStatMeth

BibTeX entry:

@Manual{bigdatastatmeth,
  title = {BigDataStatMeth: Statistical Methods for Big Data},
  author = {Dolors Pelegri-Siso and Juan R. Gonzalez},
  year = {2024},
  note = {R package version X.X.X},
  url = {https://github.com/isglobal-brge/BigDataStatMeth},
}

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Getting Help

- Documentation: https://isglobal-brge.github.io/BigDataStatMeth/
- Bug reports and questions: open an issue at https://github.com/isglobal-brge/BigDataStatMeth/issues

License

MIT License - see LICENSE file for details.

Authors

Dolors Pelegri-Siso
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health

Juan R. Gonzalez
Bioinformatics Research Group in Epidemiology (BRGE)
ISGlobal - Barcelona Institute for Global Health

Acknowledgments

Development of BigDataStatMeth was supported by ISGlobal and the Bioinformatics Research Group in Epidemiology (BRGE).