R-CMD-check Rhub DOI Badge CRAN status

Baldur

Baldur is a hierarchical Bayesian model for the analysis of proteomics data. By leveraging empirical Bayes methods, Baldur estimates hyperparameters for variance and measurement-specific uncertainty. It then computes the posterior difference in means between conditions for each peptide, protein, or PTM, and integrates the posterior to estimate error probabilities.

Features

Installation

Install the stable release from CRAN:

install.packages('baldur')

Or, install the development version from GitHub (after installing rstan):

Follow the instructions for installing rstan: https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started

Then:

devtools::install_github('PhilipBerg/baldur', build_vignettes = TRUE)

Note:
- On Ubuntu, pandoc may be needed to build vignettes.
- On Windows, sometimes the development version of rstan is required.

Usage

For detailed examples, see the package vignettes:

vignette('baldur_yeast_tutorial')
vignette('baldur_ups_tutorial')

Main Modeling Work and Equations

Baldur implements a hierarchical Bayesian framework for label-free proteomics quantification, designed to robustly estimate differential abundance while accounting for the mean-variance relationship in mass spectrometry data. For exact details please see the original paper.

1. Observation Model

For each feature (peptide, protein, or PTM) \(i\) in sample \(j\), the observed intensity \(y_{ij}\) is modeled as:

\[y_{ij}\sim\text{Normal}(\mu_{j},\sigma u_{ij})\] \[\mu_{j}\sim\text{Normal}(\mu_{0j}+\eta_j\sigma,\sigma)\]

2. Mean-Variance Modeling

Gamma Regression

The measurement standard deviation \(s_{j}\) is not constant, but depends on the mean intensity. This relationship is modeled with gamma regression: \[s_{j} \sim \Gamma(\alpha, \frac{\alpha}{\beta(\bar{y}_j)})\] where: - \(\alpha\): shape parameter (estimated empirically) - \(\beta(\bar{y}_j)\): rate parameter as a function of peptide/protein mean intensity

Latent Gamma Mixture Regression

Regression Function

For each observation, the expected mean-variance relationship is modeled as:

\[\beta_i=\kappa\cdot\exp(\theta_i\cdot(I_L-S_Lx_i))+\exp(I-S\bar{y}_i)\]

where: - \(S, S_L\): slope parameters (common and latent) - \(I, I_L\): intercepts (common and latent) - \(\bar{y}_i\): mean - \(\theta_i\): feature-specific mixture parameter

Likelihood

Given the expected mean-variance, the observed standard deviation \(\sigma_i\) is modeled as: \[\sigma_i\sim\Gamma(\alpha,\frac{\alpha}{\beta_i})\] where: - \(\alpha\): gamma shape parameter - \(\beta_i\): expected mean-variance for observation \(i\)

Priors
NRMSE (Model Fit Metric)

The normalized root-mean-square error (NRMSE) is calculated for model diagnostics.

3. Hierarchical Modeling of Condition Means

Empirical Bayes Prior

Weakly Informative Prior

4. Differential Abundance

For differential analysis, Baldur estimates the posterior distribution of the difference in means between conditions: \[\boldsymbol{D}\sim\mathcal{N}(\boldsymbol{\mu}^\text{T}\boldsymbol{K},\sigma\boldsymbol{\xi}),\quad \xi_{m}=\sqrt{\sum_{i=1}^{C}\frac{|k_{im}|}{n_i}}\]

where: - \(\boldsymbol{K}\): contrast matrix - \(k_{im}\): contrast coefficient for condition \(i\) in contrast \(m\) - \(n_i\): number of samples in condition \(i\) - \(\boldsymbol{\xi}\): scaling factor for each contrast

The probability of error for contrast \(c\) is then:

\[P(\mathrm{error}) = 2\Phi(-|\mu_{D_c} - \mu_{h_0}| \odot \tau_{D_c})\]

where: - \(\Phi\): cumulative distribution function (CDF) of the standard normal - \(\mu_{h_0}\): null hypothesis mean (often zero) - \(\boldsymbol{\tau}_{\boldsymbol{D}}\): precision (inverse standard deviation) for each contrast - \(\odot\): element-wise multiplication


Summary:
Baldur combines hierarchical modeling, mean-variance trend estimation via gamma regression, and empirical Bayes to robustly quantify differential abundance and propagate uncertainty from individual measurements to protein/PTM level, outputting interpretable error probabilities for each feature.

For full details, see the reference publication.

Reference

Berg, Philip, and George Popescu.
“Baldur: Bayesian Hierarchical Modeling for Label-Free Proteomics with Gamma Regressing Mean-Variance Trends.”
Molecular & Cellular Proteomics (2023): 2023-12.
https://doi.org/10.1016/j.mcpro.2023.100658