Type: Package
Title: Differentially Private Statistical Analysis and Machine Learning
Version: 0.2.2
Maintainer: Spencer Giddens <giddens2spencer@gmail.com>
Description: An implementation of common statistical analysis and models with differential privacy (Dwork et al., 2006a) <doi:10.1007/11681878_14> guarantees. The package contains, for example, functions providing differentially private computations of mean, variance, median, histograms, and contingency tables. It also implements some statistical models and machine learning algorithms such as linear regression (Kifer et al., 2012) <https://proceedings.mlr.press/v23/kifer12.html> and SVM (Chaudhuri et al., 2011) <https://jmlr.org/papers/v12/chaudhuri11a.html>. In addition, it implements some popular randomization mechanisms, including the Laplace mechanism (Dwork et al., 2006a) <doi:10.1007/11681878_14>, Gaussian mechanism (Dwork et al., 2006b) <doi:10.1007/11761679_29>, analytic Gaussian mechanism (Balle & Wang, 2018) <https://proceedings.mlr.press/v80/balle18a.html>, and exponential mechanism (McSherry & Talwar, 2007) <doi:10.1109/FOCS.2007.66>.
License: GPL-3 | file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.2
Imports: rmutil (≥ 1.1.5), Rdpack (≥ 2.1.2), R6 (≥ 2.5.1), dplyr (≥ 1.0.1), MASS (≥ 7.3-51.6), nloptr (≥ 1.2.2.2), e1071 (≥ 1.7-9), stats (≥ 4.0.2), graphics (≥ 4.0.2), ggplot2 (≥ 3.3.2)
RdMacros: Rdpack
Suggests: testthat (≥ 3.0.0)
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2024-10-20 02:44:42 UTC; spencergiddens
Author: Spencer Giddens [aut, cre], Fang Liu [ctb]
Repository: CRAN
Date/Publication: 2024-10-20 04:10:05 UTC
Privacy-preserving Empirical Risk Minimization for Binary Classification
Description
This class implements differentially private empirical risk minimization (Chaudhuri et al. 2011). Either the output or the objective perturbation method can be used. It is intended to be a framework for building more specific models via inheritance. See LogisticRegressionDP for an example of this type of structure.
Details
To use this class for empirical risk minimization, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset and data bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.
Note that in order to guarantee differential privacy for empirical risk minimization, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. These conditions depend on the chosen perturbation method. Specifically, the provided loss function must be convex and differentiable with respect to y.hat, and the absolute value of the first derivative of the loss function must be at most 1. If objective perturbation is chosen, the loss function must also be doubly differentiable, and the absolute value of the second derivative of the loss function must be bounded above by a constant c for all possible values of y.hat and y, where y.hat is the predicted label and y is the true label. The regularizer must be 1-strongly convex and differentiable; it must also be doubly differentiable if objective perturbation is chosen. Finally, it is assumed that if x represents a single row of the dataset X, then the l2-norm of x is at most 1 for all x. Note that because of this, a bias term cannot be included without appropriate scaling/preprocessing of the dataset. To ensure privacy, the add.bias argument in the fit() and predict() methods should only be utilized in subclasses within this package where appropriate preprocessing is implemented, not in this class.
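Because the row-norm condition is a precondition for the privacy guarantee rather than something this class enforces, data should be checked or scaled before fitting. The following is a minimal sketch of such preprocessing (illustrative only, not DPpack internals), projecting any row with l2-norm greater than 1 back onto the unit ball:

# Shrink any row of X whose l2-norm exceeds 1 back onto the unit ball
X <- matrix(stats::rnorm(20), nrow = 5)
row.norms <- sqrt(rowSums(X^2))
X.scaled <- X / pmax(row.norms, 1)  # Rows with norm <= 1 are left unchanged
max(sqrt(rowSums(X.scaled^2)))      # Verifies the constraint: at most 1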
Public fields
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y.

mapXy.gr
Function representing the gradient of the map function with respect to the values in coeff, of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector.

loss
Loss function of the form loss(y.hat, y), where y.hat and y are matrices.

loss.gr
Function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y), where y.hat and y are matrices.

regularizer
Regularization function of the form regularizer(coeff), where coeff is a vector or matrix.

regularizer.gr
Function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff).

gamma
Nonnegative real number representing the regularization constant.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation method (Chaudhuri et al. 2011).

c
Positive real number denoting the upper bound on the absolute value of the second derivative of the loss function, as required to ensure differential privacy for the objective perturbation method.

coeff
Numeric vector of coefficients for the model.

kernel
Value only used in child class svmDP. String indicating which kernel to use for SVM. Must be one of {'linear', 'Gaussian'}. If 'linear' (default), linear SVM is used. If 'Gaussian', uses the sampling function corresponding to the Gaussian (radial) kernel approximation.

D
Value only used in child class svmDP. Nonnegative integer indicating the dimensionality of the transform space approximating the kernel. Higher values of D provide better kernel approximations at a cost of computational efficiency.

sampling
Value only used in child class svmDP. Sampling function of the form sampling(d), where d is the input dimension, returning a (d+1)-dimensional vector of samples corresponding to the Fourier transform of the kernel to be approximated.

phi
Value only used in child class svmDP. Function of the form phi(x, theta), where x is an individual row of the original dataset and theta is a (d+1)-dimensional vector sampled from the Fourier transform of the kernel to be approximated, where d is the dimension of x. The function returns a numeric scalar corresponding to the pre-filtered value at the given row with the given sampled vector. See the sketch after this field list for an illustrative sampling/phi pair.

kernel.param
Value only used in child class svmDP. Positive real number corresponding to the Gaussian kernel parameter.

prefilter
Value only used in child class svmDP. Matrix of pre-filter values used in converting data into transform space.
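For intuition on the sampling and phi fields, the following is a hedged sketch in the style of standard random Fourier features (Rahimi and Recht 2007) for a Gaussian kernel. It matches the signatures described above but is not necessarily the exact construction used by svmDP:

# Hypothetical sampling/phi pair for a Gaussian (RBF) kernel
# k(x, x') = exp(-kernel.param * ||x - x'||^2), via random Fourier features
kernel.param <- 0.1
sampling <- function(d) {
  omega <- stats::rnorm(d, mean = 0, sd = sqrt(2 * kernel.param)) # Frequencies
  b <- stats::runif(1, 0, 2 * pi)                                 # Random phase
  c(omega, b)                               # (d+1)-dimensional sample
}
phi <- function(x, theta) {
  d <- length(x)
  cos(sum(x * theta[1:d]) + theta[d + 1])   # Scalar pre-filter value
}
theta <- sampling(2)
phi(c(0.3, -0.1), theta)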
Methods
Public methods
Method new()
Create a new EmpiricalRiskMinimizationDP.CMS object.
Usage
EmpiricalRiskMinimizationDP.CMS$new(
  mapXy,
  loss,
  regularizer,
  eps,
  gamma,
  perturbation.method = "objective",
  c = NULL,
  mapXy.gr = NULL,
  loss.gr = NULL,
  regularizer.gr = NULL
)
Arguments
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y. Should return a column matrix of predicted labels for each row of X. See mapXy.sigmoid for an example.

loss
Loss function of the form loss(y.hat, y), where y.hat and y are matrices. Should be defined such that it returns a matrix of loss values for each element of y.hat and y. See loss.cross.entropy for an example. It must be convex and differentiable, and the absolute value of the first derivative of the loss function must be at most 1. Additionally, if the objective perturbation method is chosen, it must be doubly differentiable and the absolute value of the second derivative of the loss function must be bounded above by a constant c for all possible values of y.hat and y.

regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be 1-strongly convex and differentiable. If the objective perturbation method is chosen, it must also be doubly differentiable.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

gamma
Nonnegative real number representing the regularization constant.

perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation method (Chaudhuri et al. 2011). Defaults to 'objective'.

c
Positive real number denoting the upper bound on the absolute value of the second derivative of the loss function, as required to ensure differential privacy for the objective perturbation method. This input is unnecessary if perturbation.method is 'output', but is required if perturbation.method is 'objective'. Defaults to NULL.

mapXy.gr
Optional function representing the gradient of the map function with respect to the values in coeff. If given, must be of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector. Should be defined such that the ith row of the output represents the gradient with respect to the ith coefficient. See mapXy.gr.sigmoid for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

loss.gr
Optional function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y), where y.hat and y are matrices. Should be defined such that the ith row of the output represents the gradient of the loss function at the ith set of input values. See loss.gr.cross.entropy for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new EmpiricalRiskMinimizationDP.CMS object.
Method fit()
Fit the differentially private empirical risk minimization model. This method runs either the output perturbation or the objective perturbation algorithm (Chaudhuri et al. 2011), depending on the value of perturbation.method used to construct the object, to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data and hyperparameters. The built-in optim function using the "BFGS" optimization method is used. If mapXy.gr, loss.gr, and regularizer.gr are all given in the construction of the object, the gradient of the objective function is utilized by optim as well. Otherwise, non-gradient based optimization methods are used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
EmpiricalRiskMinimizationDP.CMS$fit(X, y, upper.bounds, lower.bounds, add.bias = FALSE)
Arguments
X
Dataframe of data to be fit.

y
Vector or matrix of true labels for each row of X.

upper.bounds
Numeric vector of length ncol(X) giving upper bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X larger than the corresponding upper bound is clipped at the bound.

lower.bounds
Numeric vector of length ncol(X) giving lower bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X smaller than the corresponding lower bound is clipped at the bound.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
Method predict()
Predict label(s) for given X using the fitted coefficients.
Usage
EmpiricalRiskMinimizationDP.CMS$predict(X, add.bias = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of the same form as the X used to fit the coefficients.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If add.bias was set to TRUE when fitting the coefficients, add.bias should be set to TRUE for predictions.
Returns
Matrix of predicted labels corresponding to each row of X.
Method clone()
The objects of this class are cloneable with this method.
Usage
EmpiricalRiskMinimizationDP.CMS$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in (1:K)){
t <- seq(-.25, .25, length.out = N)
if (j==1) m <- stats::rnorm(N,-.2, .1)
if (j==2) m <- stats::rnorm(N, .2, .1)
Xtemp <- data.frame(x1 = 3*t , x2 = m - t)
ytemp <- data.frame(matrix(j-1, N, 1))
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
# Construct object for logistic regression
mapXy <- function(X, coeff) e1071::sigmoid(X%*%coeff)
# Cross entropy loss
loss <- function(y.hat,y) -(y*log(y.hat) + (1-y)*log(1-y.hat))
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
gamma <- 1
perturbation.method <- 'objective'
c <- 1/4 # Required value for logistic regression
mapXy.gr <- function(X, coeff) as.numeric(e1071::dsigmoid(X%*%coeff))*t(X)
loss.gr <- function(y.hat, y) -y/y.hat + (1-y)/(1-y.hat)
regularizer.gr <- function(coeff) coeff
ermdp <- EmpiricalRiskMinimizationDP.CMS$new(mapXy, loss, regularizer, eps,
gamma, perturbation.method, c,
mapXy.gr, loss.gr,
regularizer.gr)
# Fit with data
# Bounds for X based on construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
ermdp$fit(X, y, upper.bounds, lower.bounds) # No bias term
ermdp$coeff # Gets private coefficients
# Predict new data points
predicted.y <- ermdp$predict(Xtest)
n.errors <- sum(round(predicted.y)!=ytest)
Privacy-preserving Empirical Risk Minimization for Regression
Description
This class implements differentially private empirical risk minimization using the objective perturbation technique (Kifer et al. 2012). It is intended to be a framework for building more specific models via inheritance. See LinearRegressionDP for an example of this type of structure.
Details
To use this class for empirical risk minimization, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset and data bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.
Note that in order to guarantee differential privacy for the empirical risk minimization model, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. Specifically, the following constraints must be met. Let l represent the loss function for an individual dataset row x and output value y, and let L represent the average loss over all rows and output values. First, L must be convex with a continuous Hessian. Second, the l2-norm of the gradient of l must be bounded above by some constant zeta for all possible input values in the domain. Third, for all possible inputs to l, the Hessian of l must be of rank at most one and its eigenvalues must be bounded above by some constant lambda. Fourth, the regularizer must be convex. Finally, the provided domain of l must be a closed convex subset of the set of all real-valued vectors of dimension p, where p is the number of columns of X. Note that because of this, a bias term cannot be included without appropriate scaling/preprocessing of the dataset. To ensure privacy, the add.bias argument in the fit() and predict() methods should only be utilized in subclasses within this package where appropriate preprocessing is implemented, not in this class.
Public fields
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y.

mapXy.gr
Function representing the gradient of the map function with respect to the values in coeff, of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector.

loss
Loss function of the form loss(y.hat, y), where y.hat and y are matrices.

loss.gr
Function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y), where y.hat and y are matrices.

regularizer
Regularization function of the form regularizer(coeff), where coeff is a vector or matrix.

regularizer.gr
Function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff).

gamma
Nonnegative real number representing the regularization constant.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

delta
Nonnegative real number defining the delta privacy parameter. If 0, reduces to pure eps-DP.

domain
List of constraint and jacobian functions representing the constraints on the search space for the objective perturbation algorithm used in Kifer et al. (2012).

zeta
Positive real number denoting the upper bound on the l2-norm of the gradient of the loss function, as required to ensure differential privacy.

lambda
Positive real number corresponding to the upper bound on the eigenvalues of the Hessian of the loss function for all possible inputs.

coeff
Numeric vector of coefficients for the model.
Methods
Public methods
Method new()
Create a new EmpiricalRiskMinimizationDP.KST object.
Usage
EmpiricalRiskMinimizationDP.KST$new(
  mapXy,
  loss,
  regularizer,
  eps,
  delta,
  domain,
  zeta,
  lambda,
  gamma,
  mapXy.gr = NULL,
  loss.gr = NULL,
  regularizer.gr = NULL
)
Arguments
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y. Should return a column matrix of predicted labels for each row of X. See mapXy.linear for an example.

loss
Loss function of the form loss(y.hat, y), where y.hat and y are matrices. Should be defined such that it returns a matrix of loss values for each element of y.hat and y. See loss.squared.error for an example. This function must be convex, and the l2-norm of its gradient must be bounded above by zeta for some constant zeta for all possible inputs within the given domain. Additionally, for all possible inputs within the given domain, the Hessian of the loss function must be of rank at most one and its eigenvalues must be bounded above by some constant lambda.

regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be convex.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

delta
Nonnegative real number defining the delta privacy parameter. If 0, reduces to pure eps-DP.

domain
List of functions representing the constraints on the search space for the objective perturbation algorithm. Must contain two functions, labeled "constraints" and "jacobian", respectively. The "constraints" function accepts a vector of coefficients from the search space and returns a value that is nonpositive if and only if the input coefficient vector is within the constrained search space. The "jacobian" function also accepts a vector of coefficients and returns the Jacobian of the constraint function. For example, in linear regression, the square of the l2-norm of the coefficient vector theta is assumed to be bounded above by p, where p is the length of theta (Kifer et al. 2012). So, domain could be defined as domain <- list("constraints"=function(coeff) coeff%*%coeff-length(coeff), "jacobian"=function(coeff) 2*coeff).

zeta
Positive real number denoting the upper bound on the l2-norm of the gradient of the loss function, as required to ensure differential privacy.

lambda
Positive real number corresponding to the upper bound on the eigenvalues of the Hessian of the loss function for all possible inputs.

gamma
Nonnegative real number representing the regularization constant.

mapXy.gr
Optional function representing the gradient of the map function with respect to the values in coeff. If given, must be of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector. Should be defined such that the ith row of the output represents the gradient with respect to the ith coefficient. See mapXy.gr.linear for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

loss.gr
Optional function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y), where y.hat and y are matrices. Should be defined such that the ith row of the output represents the gradient of the loss function at the ith set of input values. See loss.gr.squared.error for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new EmpiricalRiskMinimizationDP.KST object.
Method fit()
Fit the differentially private empirical risk minimization model. The function runs the objective perturbation algorithm (Kifer et al. 2012) to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data and hyperparameters. The nloptr function is used. If mapXy.gr, loss.gr, and regularizer.gr are all given in the construction of the object, the gradient of the objective function and the Jacobian of the constraint function are utilized for the algorithm, and the NLOPT_LD_MMA method is used. If one or more of these gradient functions are not given, the NLOPT_LN_COBYLA method is used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
EmpiricalRiskMinimizationDP.KST$fit(X, y, upper.bounds, lower.bounds, add.bias = FALSE)
Arguments
X
Dataframe of data to be fit.

y
Vector or matrix of true values for each row of X.

upper.bounds
Numeric vector of length ncol(X)+1 giving upper bounds on the values in each column of X and on the values of y. The last value in the vector is assumed to be the upper bound on y, while the first ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X or in y larger than the corresponding upper bound is clipped at the bound.

lower.bounds
Numeric vector of length ncol(X)+1 giving lower bounds on the values in each column of X and on the values of y. The last value in the vector is assumed to be the lower bound on y, while the first ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X or in y smaller than the corresponding lower bound is clipped at the bound.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
Method predict()
Predict y values for given X using the fitted coefficients.
Usage
EmpiricalRiskMinimizationDP.KST$predict(X, add.bias = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of the same form as the X used to fit the coefficients.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If add.bias was set to TRUE when fitting the coefficients, add.bias should be set to TRUE for predictions.
Returns
Matrix of predicted y values corresponding to each row of X.
Method clone()
The objects of this class are cloneable with this method.
Usage
EmpiricalRiskMinimizationDP.KST$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Kifer D, Smith A, Thakurta A (2012). “Private Convex Empirical Risk Minimization and High-dimensional Regression.” In Mannor S, Srebro N, Williamson RC (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, 25.1–25.40. https://proceedings.mlr.press/v23/kifer12.html.
Examples
# Build example dataset
n <- 500
X <- data.frame(X=seq(-1,1,length.out = n))
true.theta <- c(-.3,.5) # First element is bias term
p <- length(true.theta)
y <- true.theta[1] + as.matrix(X)%*%true.theta[2:p] + stats::rnorm(n=n,sd=.1)
# Construct object for linear regression
mapXy <- function(X, coeff) X%*%coeff
loss <- function(y.hat, y) (y.hat-y)^2/2
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
delta <- 1
domain <- list("constraints"=function(coeff) coeff%*%coeff-length(coeff),
"jacobian"=function(coeff) 2*coeff)
# Set p to be the number of predictors desired including intercept term (length of coeff)
zeta <- 2*p^(3/2) # Proper bound for linear regression
lambda <- p # Proper bound for linear regression
gamma <- 1
mapXy.gr <- function(X, coeff) t(X)
loss.gr <- function(y.hat, y) y.hat-y
regularizer.gr <- function(coeff) coeff
ermdp <- EmpiricalRiskMinimizationDP.KST$new(mapXy, loss, 'l2', eps, delta,
domain, zeta, lambda,
gamma, mapXy.gr, loss.gr,
regularizer.gr)
# Fit with data
# We must assume y is a matrix with values between -p and p (-2 and 2
# for this example)
upper.bounds <- c(1, 2) # Bounds for X and y
lower.bounds <- c(-1,-2) # Bounds for X and y
ermdp$fit(X, y, upper.bounds, lower.bounds, add.bias=TRUE)
ermdp$coeff # Gets private coefficients
# Predict new data points
# Build a test dataset
Xtest <- data.frame(X=c(-.5, -.25, .1, .4))
predicted.y <- ermdp$predict(Xtest, add.bias=TRUE)
Exponential Mechanism
Description
This function implements the exponential mechanism for differential privacy by selecting the index of a vector of candidates to return according to a user-specified vector of utility function values, epsilon, and global sensitivity. Sensitivity calculated based either on bounded or unbounded differential privacy can be used (Kifer and Machanavajjhala 2011). If measure is provided, the probabilities of selecting each value are scaled according to the values in measure. If candidates is provided, the function returns the value of candidates at the selected index, rather than the index itself.
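The selection probabilities implied by this description follow the standard exponential mechanism form: index i is chosen with probability proportional to measure[i] * exp(eps * utility[i] / (2 * sensitivity)). The following sketch computes these probabilities directly as an illustration of the standard mechanism (not the package's internal code):

# Selection probabilities of the (standard) exponential mechanism
utility <- c(0, 1, 2, 3, 2, 1, 0)
eps <- 1
sensitivity <- 1
weights <- exp(eps * utility / (2 * sensitivity))  # Unnormalized weights
probs <- weights / sum(weights)                    # Selection probabilities
sample(seq_along(utility), size = 1, prob = probs) # One private draw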
Usage
ExponentialMechanism(
utility,
eps,
sensitivity,
measure = NULL,
candidates = NULL
)
Arguments
utility
Numeric vector giving the utilities of the possible values.

eps
Positive real number defining the epsilon privacy budget.

sensitivity
Real number corresponding to the l1-global sensitivity of the function generating utility.

measure
Optional numeric vector of scaling measures for the probabilities of selecting each value. Should be the same size as utility. Defaults to uniform scaling.

candidates
Optional vector of candidates of the same size as utility. If given, the function returns the candidate at the selected index rather than the index itself.
Value
Indices (or values if candidates given) selected by the mechanism based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
McSherry F, Talwar K (2007). “Mechanism Design via Differential Privacy.” In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), 94-103. doi:10.1109/FOCS.2007.66.
Examples
candidates <- c('a','b','c','d','e','f','g')
# Release index
idx <- ExponentialMechanism(c(0,1,2,3,2,1,0), 1, 1)
candidates[idx] # Randomly chosen candidate
# Release candidate
ExponentialMechanism(c(0,1,2,3,2,1,0), 1, .5, measure=c(1,1,2,1,2,1,1),
candidates=candidates)
Gaussian Mechanism
Description
This function implements the Gaussian mechanism for differential privacy by adding noise to the true value(s) of a function according to specified values of epsilon, delta, and l2-global sensitivity(-ies). Global sensitivity calculated based either on bounded or unbounded differential privacy can be used (Kifer and Machanavajjhala 2011). If true.values is a vector, the provided epsilon and delta are divided such that (epsilon, delta)-level differential privacy is satisfied across all function values. In the case that each element of true.values comes from its own function with different corresponding sensitivities, a vector of sensitivities may be provided. In this case, if desired, the user can specify how to divide epsilon and delta among the function values using alloc.proportions.
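For reference, the classical (non-analytic) calibration of the Gaussian mechanism for approximate DP with eps < 1 uses noise standard deviation sigma = sqrt(2 * log(1.25 / delta)) * sensitivity / eps. A minimal sketch of this textbook calibration follows; it is not necessarily the exact formula used internally, particularly for 'pDP' or the analytic variant:

# Classical Gaussian mechanism calibration for (eps, delta)-approximate DP
eps <- 0.9
delta <- 0.01
sensitivity <- 0.05                       # Assumed l2-global sensitivity
sigma <- sqrt(2 * log(1.25 / delta)) * sensitivity / eps
true.value <- 7.5                         # Assumed true function value
true.value + stats::rnorm(1, sd = sigma)  # Sanitized output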
Usage
GaussianMechanism(
true.values,
eps,
delta,
sensitivities,
type.DP = "aDP",
alloc.proportions = NULL,
analytic = FALSE,
tol = 1e-12
)
Arguments
true.values
Real number or numeric vector corresponding to the true value(s) of the desired function.

eps
Positive real number defining the epsilon privacy parameter.

delta
Positive real number defining the delta privacy parameter.

sensitivities
Real number or numeric vector corresponding to the l2-global sensitivity(-ies) of the function(s) generating true.values. This value must be of length 1 or of the same length as true.values. If it is of length 1 and true.values is a vector, this indicates that the given sensitivity applies simultaneously to all elements of true.values and that the privacy budget need not be allocated (alloc.proportions is unused in this case). If it is of the same length as true.values, this indicates that each element of true.values comes from its own function with different corresponding sensitivities. In this case, the l2-norm of the provided sensitivities is used to generate the Gaussian noise.

type.DP
String indicating the type of differential privacy desired for the Gaussian mechanism. Can be either 'pDP' for probabilistic DP (Liu 2019) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1.

alloc.proportions
Optional numeric vector giving the allocation proportions of epsilon and delta to the function values in the case of vector-valued sensitivities. For example, if sensitivities is of length two and alloc.proportions = c(.75, .25), then 75% of the privacy budget eps (and 75% of delta) is allocated to the noise computation for the first element of true.values, and the remaining 25% is allocated to the noise computation for the second element of true.values. This ensures (eps, delta)-level privacy across all computations. Input does not need to be normalized, meaning alloc.proportions = c(3, 1) produces the same result as the example above.

analytic
Indicates whether to use the analytic Gaussian mechanism to compute the noise scale (Balle and Wang 2018). Defaults to FALSE.

tol
Error tolerance for the binary search used in determining the noise parameter for the analytic Gaussian mechanism. Unused if analytic is FALSE. Defaults to 1e-12.
Value
Sanitized function values based on the bounded and/or unbounded definitions of differential privacy, sanitized via the Gaussian mechanism.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Balle B, Wang Y (2018). “Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising.” In Dy J, Krause A (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 394–403. https://proceedings.mlr.press/v80/balle18a.html.
Liu F (2019). “Generalized Gaussian Mechanism for Differential Privacy.” IEEE Transactions on Knowledge and Data Engineering, 31(4), 747-756. https://doi.org/10.1109/TKDE.2018.2845388.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Examples
# Simulate dataset
n <- 100
c0 <- 5 # Lower bound
c1 <- 10 # Upper bound
D1 <- stats::runif(n, c0, c1)
# Privacy budget
epsilon <- 0.9 # eps must be in (0, 1) for approximate differential privacy
delta <- 0.01
sensitivity <- (c1-c0)/n
# Approximate differential privacy
private.mean.approx <- GaussianMechanism(mean(D1), epsilon, delta,
sensitivity)
private.mean.approx
# Probabilistic differential privacy
private.mean.prob <- GaussianMechanism(mean(D1), epsilon, delta, sensitivity,
type.DP = 'pDP')
private.mean.prob
# Analytic Gaussian mechanism
epsilon <- 1.1 # epsilon can be > 1 for analytic Gaussian mechanism
private.mean.analytic <- GaussianMechanism(mean(D1), epsilon, delta,
sensitivity, analytic=TRUE)
private.mean.analytic
# Simulate second dataset
d0 <- 3 # Lower bound
d1 <- 6 # Upper bound
D2 <- stats::runif(n, d0, d1)
D <- matrix(c(D1,D2),ncol=2)
sensitivities <- c((c1-c0)/n, (d1-d0)/n)
epsilon <- 0.9 # Total privacy budget for all means
delta <- 0.01
# Here, the l2-norm of the sensitivities is used to generate the Gaussian
# noise. The results satisfy (0.9,0.01)-approximate differential privacy.
private.means <- GaussianMechanism(apply(D, 2, mean), epsilon, delta,
sensitivities)
private.means
# Here, privacy budget is explicitly split so that 75% is given to the first
# vector element and 25% is given to the second.
private.means <- GaussianMechanism(apply(D, 2, mean), epsilon, delta,
sensitivities,
alloc.proportions = c(0.75, 0.25))
private.means
Laplace Mechanism
Description
This function implements the Laplace mechanism for differential privacy by adding noise to the true value(s) of a function according to specified values of epsilon and l1-global sensitivity(-ies). Global sensitivity calculated based either on bounded or unbounded differential privacy can be used (Kifer and Machanavajjhala 2011). If true.values is a vector, the provided epsilon is divided such that epsilon-differential privacy is satisfied across all function values. In the case that each element of true.values comes from its own function with different corresponding sensitivities, a vector of sensitivities may be provided. In this case, if desired, the user can specify how to divide epsilon among the function values using alloc.proportions.
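The Laplace mechanism adds noise with scale b = sensitivity / eps. A minimal sketch of this standard calibration follows (inverse-CDF sampling is shown only for self-containment; the package presumably uses its own Laplace sampler, e.g. via rmutil):

# Standard Laplace mechanism calibration: noise scale b = sensitivity / eps
eps <- 1
sensitivity <- 0.05                         # Assumed l1-global sensitivity
b <- sensitivity / eps
u <- stats::runif(1, -0.5, 0.5)
noise <- -b * sign(u) * log(1 - 2 * abs(u)) # Inverse-CDF Laplace sample
7.5 + noise                                 # Sanitized (assumed) true value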
Usage
LaplaceMechanism(true.values, eps, sensitivities, alloc.proportions = NULL)
Arguments
true.values
Real number or numeric vector corresponding to the true value(s) of the desired function.

eps
Positive real number defining the epsilon privacy parameter.

sensitivities
Real number or numeric vector corresponding to the l1-global sensitivity(-ies) of the function(s) generating true.values. This value must be of length 1 or of the same length as true.values. If it is of length 1 and true.values is a vector, this indicates that the given sensitivity applies simultaneously to all elements of true.values and that the privacy budget need not be allocated (alloc.proportions is unused in this case). If it is of the same length as true.values, this indicates that each element of true.values comes from its own function with different corresponding sensitivities. In this case, the l1-norm of the provided sensitivities is used to generate the Laplace noise.

alloc.proportions
Optional numeric vector giving the allocation proportions of epsilon to the function values in the case of vector-valued sensitivities. For example, if sensitivities is of length two and alloc.proportions = c(.75, .25), then 75% of the privacy budget eps is allocated to the noise computation for the first element of true.values, and the remaining 25% is allocated to the noise computation for the second element of true.values. This ensures eps-level privacy across all computations. Input does not need to be normalized, meaning alloc.proportions = c(3, 1) produces the same result as the example above.
Value
Sanitized function values based on the bounded and/or unbounded definitions of differential privacy, sanitized via the Laplace mechanism.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
# Simulate dataset
n <- 100
c0 <- 5 # Lower bound
c1 <- 10 # Upper bound
D1 <- stats::runif(n, c0, c1)
epsilon <- 1 # Privacy budget
sensitivity <- (c1-c0)/n
private.mean <- LaplaceMechanism(mean(D1), epsilon, sensitivity)
private.mean
# Simulate second dataset
d0 <- 3 # Lower bound
d1 <- 6 # Upper bound
D2 <- stats::runif(n, d0, d1)
D <- matrix(c(D1,D2),ncol=2)
sensitivities <- c((c1-c0)/n, (d1-d0)/n)
epsilon <- 1 # Total privacy budget for all means
# Here, sensitivities are summed and the result is used to generate Laplace
# noise. This is essentially the same as allocating epsilon proportional to
# the corresponding sensitivity. The results satisfy 1-differential privacy.
private.means <- LaplaceMechanism(apply(D, 2, mean), epsilon, sensitivities)
private.means
# Here, privacy budget is explicitly split so that 75% is given to the first
# vector element and 25% is given to the second.
private.means <- LaplaceMechanism(apply(D, 2, mean), epsilon, sensitivities,
alloc.proportions = c(0.75, 0.25))
private.means
Privacy-preserving Linear Regression
Description
This class implements differentially private linear regression using the objective perturbation technique (Kifer et al. 2012).
Details
To use this class for linear regression, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset and data bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.

Note that in order to guarantee differential privacy for linear regression, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. The regularizer must be convex. Additionally, it is assumed that if x represents a single row of the dataset X, then the l2-norm of x is at most p for all x, where p is the number of predictors (including any possible intercept term). In order to ensure this constraint is satisfied, the dataset is preprocessed and scaled, and the resulting coefficients are postprocessed and un-scaled so that the stored coefficients correspond to the original data. Due to this constraint on x, it is best to avoid using an intercept term in the model whenever possible. If an intercept term must be used, the issue can be partially circumvented by adding a constant column to X before fitting the model, which will be scaled along with the rest of X. The fit() method contains functionality to add a column of constant 1s to X before scaling, if desired.
Super class
DPpack::EmpiricalRiskMinimizationDP.KST -> LinearRegressionDP
Methods
Public methods
Inherited methods
Method new()
Create a new LinearRegressionDP object.
Usage
LinearRegressionDP$new(regularizer, eps, delta, gamma, regularizer.gr = NULL)
Arguments
regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be convex.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

delta
Nonnegative real number defining the delta privacy parameter. If 0, reduces to pure eps-DP.

gamma
Nonnegative real number representing the regularization constant.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new LinearRegressionDP object.
Method fit()
Fit the differentially private linear regression model. The function runs the objective perturbation algorithm (Kifer et al. 2012) to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data and hyperparameters. The nloptr function is used. If regularizer is given as 'l2' or if regularizer.gr is given in the construction of the object, the gradient of the objective function and the Jacobian of the constraint function are utilized for the algorithm, and the NLOPT_LD_MMA method is used. If this is not the case, the NLOPT_LN_COBYLA method is used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
LinearRegressionDP$fit(X, y, upper.bounds, lower.bounds, add.bias = FALSE)
Arguments
X
Dataframe of data to be fit.

y
Vector or matrix of true values for each row of X.

upper.bounds
Numeric vector of length ncol(X)+1 giving upper bounds on the values in each column of X and on the values of y. The last value in the vector is assumed to be the upper bound on y, while the first ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X or in y larger than the corresponding upper bound is clipped at the bound.

lower.bounds
Numeric vector of length ncol(X)+1 giving lower bounds on the values in each column of X and on the values of y. The last value in the vector is assumed to be the lower bound on y, while the first ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X or in y smaller than the corresponding lower bound is clipped at the bound.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
Method clone()
The objects of this class are cloneable with this method.
Usage
LinearRegressionDP$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Kifer D, Smith A, Thakurta A (2012). “Private Convex Empirical Risk Minimization and High-dimensional Regression.” In Mannor S, Srebro N, Williamson RC (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, 25.1–25.40. https://proceedings.mlr.press/v23/kifer12.html.
Examples
# Build example dataset
n <- 500
X <- data.frame(X=seq(-1,1,length.out = n))
true.theta <- c(-.3,.5) # First element is bias term
p <- length(true.theta)
y <- true.theta[1] + as.matrix(X)%*%true.theta[2:p] + stats::rnorm(n=n,sd=.1)
# Construct object for linear regression
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
delta <- 0 # Indicates to use pure eps-DP
gamma <- 1
regularizer.gr <- function(coeff) coeff
lrdp <- LinearRegressionDP$new('l2', eps, delta, gamma, regularizer.gr)
# Fit with data
# We must assume y is a matrix with values between -p and p (-2 and 2
# for this example)
upper.bounds <- c(1, 2) # Bounds for X and y
lower.bounds <- c(-1,-2) # Bounds for X and y
lrdp$fit(X, y, upper.bounds, lower.bounds, add.bias=TRUE)
lrdp$coeff # Gets private coefficients
# Predict new data points
# Build a test dataset
Xtest <- data.frame(X=c(-.5, -.25, .1, .4))
predicted.y <- lrdp$predict(Xtest, add.bias=TRUE)
Privacy-preserving Logistic Regression
Description
This class implements differentially private logistic regression (Chaudhuri et al. 2011). Either the output or the objective perturbation method can be used.
Details
To use this class for logistic regression, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset and data bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.

Note that in order to guarantee differential privacy for logistic regression, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. These conditions depend on the chosen perturbation method. The regularizer must be 1-strongly convex and differentiable. It also must be doubly differentiable if objective perturbation is chosen. Additionally, it is assumed that if x represents a single row of the dataset X, then the l2-norm of x is at most 1 for all x. In order to ensure this constraint is satisfied, the dataset is preprocessed and scaled, and the resulting coefficients are postprocessed and un-scaled so that the stored coefficients correspond to the original data. Due to this constraint on x, it is best to avoid using a bias term in the model whenever possible. If a bias term must be used, the issue can be partially circumvented by adding a constant column to X before fitting the model, which will be scaled along with the rest of X. The fit() method contains functionality to add a column of constant 1s to X before scaling, if desired.
Super class
DPpack::EmpiricalRiskMinimizationDP.CMS -> LogisticRegressionDP
Methods
Public methods
Method new()
Create a new LogisticRegressionDP object.
Usage
LogisticRegressionDP$new(
  regularizer,
  eps,
  gamma,
  perturbation.method = "objective",
  regularizer.gr = NULL
)
Arguments
regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be 1-strongly convex and doubly differentiable.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

gamma
Nonnegative real number representing the regularization constant.

perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation method (Chaudhuri et al. 2011). Defaults to 'objective'.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new LogisticRegressionDP object.
Method fit()
Fit the differentially private logistic regression model. This method runs either the output perturbation or the objective perturbation algorithm (Chaudhuri et al. 2011), depending on the value of perturbation.method used to construct the object, to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data and hyperparameters. The built-in optim function using the "BFGS" optimization method is used. If regularizer is given as 'l2' or if regularizer.gr is given in the construction of the object, the gradient of the objective function is utilized by optim as well. Otherwise, non-gradient based optimization methods are used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
LogisticRegressionDP$fit(X, y, upper.bounds, lower.bounds, add.bias = FALSE)
Arguments
X
Dataframe of data to be fit.

y
Vector or matrix of true labels for each row of X.

upper.bounds
Numeric vector of length ncol(X) giving upper bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X larger than the corresponding upper bound is clipped at the bound.

lower.bounds
Numeric vector of length ncol(X) giving lower bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X smaller than the corresponding lower bound is clipped at the bound.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
Method predict()
Predict label(s) for given X using the fitted coefficients.
Usage
LogisticRegressionDP$predict(X, add.bias = FALSE, raw.value = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of the same form as the X used to fit the coefficients.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If add.bias was set to TRUE when fitting the coefficients, add.bias should be set to TRUE for predictions.

raw.value
Boolean indicating whether to return the raw predicted value or the rounded class label. If FALSE (default), outputs the predicted labels 0 or 1. If TRUE, returns the raw score from the logistic regression.
Returns
Matrix of predicted labels or scores corresponding to each row of X.
Method clone()
The objects of this class are cloneable with this method.
Usage
LogisticRegressionDP$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Chaudhuri K, Monteleoni C (2009). “Privacy-preserving logistic regression.” In Koller D, Schuurmans D, Bengio Y, Bottou L (eds.), Advances in Neural Information Processing Systems, volume 21. https://proceedings.neurips.cc/paper/2008/file/8065d07da4a77621450aa84fee5656d9-Paper.pdf.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in (1:K)){
t <- seq(-.25, .25, length.out = N)
if (j==1) m <- stats::rnorm(N,-.2, .1)
if (j==2) m <- stats::rnorm(N, .2, .1)
Xtemp <- data.frame(x1 = 3*t , x2 = m - t)
ytemp <- data.frame(matrix(j-1, N, 1))
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
# Construct object for logistic regression
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
gamma <- 1
lrdp <- LogisticRegressionDP$new(regularizer, eps, gamma)
# Fit with data
# Bounds for X based on construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
lrdp$fit(X, y, upper.bounds, lower.bounds) # No bias term
lrdp$coeff # Gets private coefficients
# Predict new data points
predicted.y <- lrdp$predict(Xtest)
n.errors <- sum(predicted.y!=ytest)
Privacy-preserving Weighted Empirical Risk Minimization
Description
This class implements differentially private empirical risk minimization in the case where weighted observation-level losses are desired (such as weighted SVM (Yang et al. 2005)). Currently, only the output perturbation method is implemented.
Details
To use this class for weighted empirical risk minimization, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset, data bounds, weights, and weight bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.

Note that in order to guarantee differential privacy for weighted empirical risk minimization, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. These conditions depend on the chosen perturbation method, though currently only output perturbation is implemented. Specifically, the provided loss function must be convex and differentiable with respect to y.hat, and the absolute value of the first derivative of the loss function must be at most 1. If objective perturbation is chosen (not currently implemented), the loss function must also be doubly differentiable, and the absolute value of the second derivative of the loss function must be bounded above by a constant c for all possible values of y.hat and y, where y.hat is the predicted label and y is the true label. The regularizer must be 1-strongly convex and differentiable; it must also be doubly differentiable if objective perturbation is chosen. For the data x, it is assumed that if x represents a single row of the dataset X, then the l2-norm of x is at most 1 for all x. Note that because of this, a bias term cannot be included without appropriate scaling/preprocessing of the dataset. To ensure privacy, the add.bias argument in the fit() and predict() methods should only be utilized in subclasses within this package where appropriate preprocessing is implemented, not in this class. Finally, if weights are provided, they should be nonnegative, of the same length as y, and upper bounded by a global or public bound which must also be provided, as illustrated in the sketch below.
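As a concrete illustration of the weight constraints just described (all values here are assumed for the sketch):

# Hypothetical weights satisfying the stated constraints
y <- c(0, 1, 1, 0)                       # Labels for four observations
w <- c(0.5, 1.0, 0.25, 1.0)              # Nonnegative observation weights
weights.upper.bound <- 1                 # Global/public bound on the weights
stopifnot(length(w) == length(y),        # Same length as y
          all(w >= 0),                   # Nonnegative
          all(w <= weights.upper.bound)) # Bounded by the public bound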
Super class
DPpack::EmpiricalRiskMinimizationDP.CMS -> WeightedERMDP.CMS
Methods
Public methods
Method new()
Create a new WeightedERMDP.CMS object.
Usage
WeightedERMDP.CMS$new(
  mapXy,
  loss,
  regularizer,
  eps,
  gamma,
  perturbation.method = "objective",
  c = NULL,
  mapXy.gr = NULL,
  loss.gr = NULL,
  regularizer.gr = NULL
)
Arguments
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y. Should return a column matrix of predicted labels for each row of X. See mapXy.sigmoid for an example.

loss
Loss function of the form loss(y.hat, y, w), where y.hat and y are matrices and w is a matrix or vector of weights of the same length as y. Should be defined such that it returns a matrix of weighted loss values for each element of y.hat and y. If w is not given, the function should operate as if uniform weights were given. See generate.loss.huber for an example. It must be convex and differentiable, and the absolute value of the first derivative of the loss function must be at most 1. Additionally, if the objective perturbation method is chosen, it must be doubly differentiable and the absolute value of the second derivative of the loss function must be bounded above by a constant c for all possible values of y.hat and y.

regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be 1-strongly convex and differentiable. If the objective perturbation method is chosen, it must also be doubly differentiable.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

gamma
Nonnegative real number representing the regularization constant.

perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation method (Chaudhuri et al. 2011). Defaults to 'objective'. Currently, only the output perturbation method is supported.

c
Positive real number denoting the upper bound on the absolute value of the second derivative of the loss function, as required to ensure differential privacy for the objective perturbation method. This input is unnecessary if perturbation.method is 'output', but is required if perturbation.method is 'objective'. Defaults to NULL.

mapXy.gr
Optional function representing the gradient of the map function with respect to the values in coeff. If given, must be of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector. Should be defined such that the ith row of the output represents the gradient with respect to the ith coefficient. See mapXy.gr.sigmoid for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

loss.gr
Optional function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y, w), where y.hat and y are matrices and w is a matrix or vector of weights. Should be defined such that the ith row of the output represents the gradient of the (possibly weighted) loss function at the ith set of input values. See generate.loss.gr.huber for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new WeightedERMDP.CMS object.
Method fit()
Fit the differentially private weighted empirical risk minimization model. This method runs either the output perturbation or the objective perturbation algorithm (Chaudhuri et al. 2011) (only output is currently implemented), depending on the value of perturbation.method used to construct the object, to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data, weights, and hyperparameters. The built-in optim function using the "BFGS" optimization method is used. If mapXy.gr, loss.gr, and regularizer.gr are all given in the construction of the object, the gradient of the objective function is utilized by optim as well. Otherwise, non-gradient based optimization methods are used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
WeightedERMDP.CMS$fit( X, y, upper.bounds, lower.bounds, add.bias = FALSE, weights = NULL, weights.upper.bound = NULL )
Arguments
X
Dataframe of data to be fit.
y
Vector or matrix of true labels for each row of X.
upper.bounds
Numeric vector of length ncol(X) giving upper bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X larger than the corresponding upper bound is clipped at the bound.
lower.bounds
Numeric vector of length ncol(X) giving lower bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X smaller than the corresponding lower bound is clipped at the bound.
add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
weights
Numeric vector of observation weights of the same length as y.
weights.upper.bound
Numeric value representing the global or public upper bound on the weights.
Method predict()
Predict label(s) for given X using the fitted coefficients.
Usage
WeightedERMDP.CMS$predict(X, add.bias = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of same form as X used to fit coefficients.
add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If add.bias was set to TRUE when fitting the coefficients, add.bias should be set to TRUE for predictions.
Returns
Matrix of predicted labels corresponding to each row of X.
Method clone()
The objects of this class are cloneable with this method.
Usage
WeightedERMDP.CMS$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Yang X, Song Q, Cao A (2005). “Weighted support vector machine for data classification.” In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, 859-864. doi:10.1109/IJCNN.2005.1555965.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in (1:K)){
t <- seq(-.25, .25, length.out = N)
if (j==1) m <- stats::rnorm(N,-.2, .1)
if (j==2) m <- stats::rnorm(N, .2, .1)
Xtemp <- data.frame(x1 = 3*t , x2 = m - t)
ytemp <- data.frame(matrix(j-1, N, 1))
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
# Construct object for weighted linear SVM
mapXy <- function(X, coeff) X%*%coeff
# Huber loss from DPpack
huber.h <- 0.5
loss <- generate.loss.huber(huber.h)
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
gamma <- 1
perturbation.method <- 'output'
c <- 1/(2*huber.h) # Required value for SVM
mapXy.gr <- function(X, coeff) t(X)
loss.gr <- generate.loss.gr.huber(huber.h)
regularizer.gr <- function(coeff) coeff
wermdp <- WeightedERMDP.CMS$new(mapXy, loss, regularizer, eps,
gamma, perturbation.method, c,
mapXy.gr, loss.gr,
regularizer.gr)
# Fit with data
# Bounds for X based on construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
weights <- rep(1, nrow(y)) # Uniform weighting
weights[nrow(y)] <- 0.5 # half weight for last observation
wub <- 1 # Public upper bound for weights
wermdp$fit(X, y, upper.bounds, lower.bounds, weights=weights,
weights.upper.bound=wub)
wermdp$coeff # Gets private coefficients
# Predict new data points
predicted.y <- wermdp$predict(Xtest)
n.errors <- sum(round(predicted.y)!=ytest)
Calibrate Analytic Gaussian Mechanism
Description
Calibrate a Gaussian perturbation for differential privacy using the analytic Gaussian mechanism (Balle and Wang 2018).
Usage
calibrateAnalyticGaussianMechanism(epsilon, delta, sensitivity, tol = 1e-12)
Arguments
epsilon |
Positive real number defining the epsilon privacy parameter. |
delta |
Positive real number defining the delta privacy parameter. |
sensitivity |
Real number corresponding to the l2-global sensitivity. |
tol |
Error tolerance for binary search. |
Value
Standard deviation of Gaussian noise needed to achieve
(epsilon, delta)
-DP for given global sensitivity.
References
Balle B, Wang Y (2018). “Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising.” In Dy J, Krause A (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 394–403. https://proceedings.mlr.press/v80/balle18a.html.
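As a brief illustration (the parameter values here are arbitrary), the function returns the noise scale for a statistic with unit sensitivity; for epsilon less than 1, the result can be compared against the classical Gaussian mechanism calibration:
eps <- 0.5
delta <- 1e-5
sigma <- calibrateAnalyticGaussianMechanism(eps, delta, sensitivity = 1)
sigma
# For comparison, the classical calibration (valid only for eps < 1)
sqrt(2 * log(1.25 / delta)) / eps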
Differentially Private Covariance
Description
This function computes the differentially private covariance of a pair of vectors at user-specified privacy levels of epsilon and delta.
Usage
covDP(
x1,
x2,
eps,
lower.bound1,
upper.bound1,
lower.bound2,
upper.bound2,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP"
)
Arguments
x1 , x2 |
Numeric vectors whose covariance is desired. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound1 , lower.bound2 |
Real numbers giving the global or public lower bounds of x1 and x2, respectively. |
upper.bound1 , upper.bound2 |
Real numbers giving the global or public upper bounds of x1 and x2, respectively. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
Value
Sanitized covariance based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
D1 <- sort(stats::rnorm(500, mean=3, sd=2))
D2 <- sort(stats::rnorm(500, mean=-1,sd=0.5))
lb1 <- -3 # 3 std devs below mean
lb2 <- -2.5 # 3 std devs below mean
ub1 <- 9 # 3 std devs above mean
ub2 <- .5 # 3 std devs above mean
covDP(D1, D2, 1, lb1, ub1, lb2, ub2)
covDP(D1, D2, .5, lb1, ub1, lb2, ub2, which.sensitivity='unbounded',
mechanism='Gaussian', delta=0.01)
Differentially Private Covariance Data Access Function
Description
This function performs the data access step in the computation of a differentially private covariance. The true values are computed using cov, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019). For the covariance, the sensitivities based on bounded and unbounded differential privacy are identical, so only one value is returned.
Usage
covDataAccess(x1, x2, lower.bound1, upper.bound1, lower.bound2, upper.bound2)
Arguments
x1 , x2 |
Numeric vectors whose covariance is desired. |
lower.bound1 , lower.bound2 |
Real numbers giving the lower bounds of x1 and x2, respectively. |
upper.bound1 , upper.bound2 |
Real numbers giving the upper bounds of x1 and x2, respectively. |
Value
List of the true covariance and the sensitivity calculated based on theoretical values.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
covDataAccess(c(1,4,3,2), c(-2,-3,-4,-1), 0, 5, -5, 0)
Generator for Huber Loss Function Gradient
Description
This function generates and returns the Huber loss function gradient used for privacy-preserving SVM at the specified value of h in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
generate.loss.gr.huber(h)
Arguments
h |
Positive real number for the Huber loss parameter. Lower values more closely approximate hinge loss. Higher values produce smoother Huber loss functions. |
Value
Huber loss function gradient with parameter h in the form required by EmpiricalRiskMinimizationDP.CMS.
Examples
h <- 1
huber <- generate.loss.gr.huber(h)
y.hat <- c(-.5, 1.2, -0.9)
y <- c(-1, 1, -1)
huber(y.hat,y)
huber(y.hat, y, w=c(0.1, 0.5, 1)) # Weights observation-level loss gradient
Generator for Huber Loss Function
Description
This function generates and returns the Huber loss function used for privacy-preserving SVM at the specified value of h in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
generate.loss.huber(h)
Arguments
h |
Positive real number for the Huber loss parameter. Lower values more closely approximate hinge loss. Higher values produce smoother Huber loss functions. |
Value
Huber loss function with parameter h in the form required by EmpiricalRiskMinimizationDP.CMS.
Examples
h <- 0.5
huber <- generate.loss.huber(h)
y.hat <- c(-.5, 1.2, -0.9)
y <- c(-1, 1, -1)
huber(y.hat,y)
huber(y.hat, y, w=c(0.1, 0.5, 1)) # Weights observation-level loss
Generator for Sampling Distribution Function for Gaussian Kernel
Description
This function generates and returns a sampling function corresponding to the Fourier transform of a Gaussian kernel with given parameter (Chaudhuri et al. 2011) in the form needed for svmDP.
Usage
generate.sampling(kernel.param)
Arguments
kernel.param |
Positive real number for the Gaussian (radial) kernel parameter. |
Value
Sampling function for the Gaussian kernel in the form required by svmDP.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Examples
kernel.param <- 0.1
sample <- generate.sampling(kernel.param)
d <- 5
sample(d)
Differentially Private Histogram
Description
This function computes a differentially private histogram from a vector at user-specified privacy levels of epsilon and delta. A histogram object is returned with sanitized values for the counts for easy plotting.
Usage
histogramDP(
x,
eps,
lower.bound,
upper.bound,
breaks = "Sturges",
normalize = FALSE,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP",
allow.negative = FALSE
)
Arguments
x |
Numeric vector from which the histogram will be formed. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
breaks |
Identical to the argument with the same name from hist. |
normalize |
Logical value. If FALSE (default), returned histogram counts correspond to frequencies. If TRUE, returned histogram counts correspond to densities (i.e. area of histogram is one). |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
allow.negative |
Logical value. If FALSE (default), any negative values in the sanitized histogram due to the added noise will be set to 0. If TRUE, the negative values (if any) will be returned. |
Value
Sanitized histogram based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Examples
x <- stats::rnorm(500)
graphics::hist(x) # Non-private histogram
result <- histogramDP(x, 1, -3, 3)
plot(result) # Private histogram
graphics::hist(x, freq=FALSE) # Normalized non-private histogram
result <- histogramDP(x, .5, -3, 3, normalize=TRUE,
which.sensitivity='unbounded', mechanism='Gaussian', delta=0.01,
allow.negative=TRUE)
plot(result) # Normalized private histogram (note negative values allowed)
Differentially Private Histogram Data Access Function
Description
This function performs the data access step in the computation of a differentially private histogram. The true values are computed using hist, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019).
Usage
histogramDataAccess(x, breaks, mechanism)
Arguments
x |
Numeric vector from which the histogram will be formed. |
breaks |
Identical to the argument with the same name from hist. |
mechanism |
String indicating which mechanism to use for differential privacy. If the 'Laplace' mechanism is chosen, l1 sensitivities are returned. If the 'Gaussian' or 'analytic' mechanisms are chosen, l2 sensitivities are returned. |
Value
List of the true histogram and the sensitivities calculated based on bounded and unbounded differential privacy.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
histogramDataAccess(c(1,4,3,2,3), 'Sturges', 'Laplace')
Cross Entropy Loss Function
Description
This function implements cross entropy loss used for logistic regression in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
loss.cross.entropy(y.hat, y)
Arguments
y.hat |
Vector or matrix of estimated labels. |
y |
Vector or matrix of true labels. |
Value
Vector or matrix of the cross entropy loss for each element of y.hat and y.
Examples
y.hat <- c(0.1, 0.88, 0.02)
y <- c(0, 1, 0)
loss.cross.entropy(y.hat,y)
Cross Entropy Loss Function Gradient
Description
This function implements the cross entropy loss gradient with respect to y.hat used for logistic regression in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
loss.gr.cross.entropy(y.hat, y)
Arguments
y.hat |
Vector or matrix of estimated labels. |
y |
Vector or matrix of true labels. |
Value
Vector or matrix of the cross entropy loss gradient for each element of y.hat and y.
Examples
y.hat <- c(0.1, 0.88, 0.02)
y <- c(0, 1, 0)
loss.gr.cross.entropy(y.hat,y)
Squared Error Loss Function Gradient
Description
This function implements the squared error loss gradient with respect to y.hat used for linear regression in the form required by EmpiricalRiskMinimizationDP.KST.
Usage
loss.gr.squared.error(y.hat, y)
Arguments
y.hat |
Vector or matrix of estimated values. |
y |
Vector or matrix of true values. |
Value
Vector or matrix of the squared error loss gradient for each element of y.hat and y.
Examples
y.hat <- c(0.1, 0.88, 0.02)
y <- c(-0.1, 1, .2)
loss.gr.squared.error(y.hat,y)
Squared Error Loss Function
Description
This function implements squared error loss used for linear regression in the form required by EmpiricalRiskMinimizationDP.KST.
Usage
loss.squared.error(y.hat, y)
Arguments
y.hat |
Vector or matrix of estimated values. |
y |
Vector or matrix of true values. |
Value
Vector or matrix of the squared error loss for each element of y.hat and y.
Examples
y.hat <- c(0.1, 0.88, 0.02)
y <- c(0, 1, 0)
loss.squared.error(y.hat,y)
Linear Map Function Gradient
Description
This function implements the gradient of the linear map function with respect to coeff used for linear SVM and linear regression in the form required by EmpiricalRiskMinimizationDP.CMS and EmpiricalRiskMinimizationDP.KST.
Usage
mapXy.gr.linear(X, coeff)
Arguments
X |
Matrix of data. |
coeff |
Vector or matrix of coefficients or weights. |
Value
Matrix of values of the gradient of the linear map function with respect to each value of coeff.
Examples
X <- matrix(c(1,2,3,4,5,6),nrow=2)
coeff <- c(0.5,-1,2)
mapXy.gr.linear(X,coeff)
Sigmoid Map Function Gradient
Description
This function implements the gradient of the sigmoid map function with respect to coeff used for logistic regression in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
mapXy.gr.sigmoid(X, coeff)
Arguments
X |
Matrix of data. |
coeff |
Vector or matrix of coefficients or weights. |
Value
Matrix of values of the gradient of the sigmoid function with respect to each value of coeff.
Examples
X <- matrix(c(1,2,3,4,5,6),nrow=2)
coeff <- c(0.5,-1,2)
mapXy.gr.sigmoid(X,coeff)
Linear Map Function
Description
This function implements the linear map function from data X to labels y used for linear SVM and linear regression in the form required by EmpiricalRiskMinimizationDP.CMS and EmpiricalRiskMinimizationDP.KST.
Usage
mapXy.linear(X, coeff)
Arguments
X |
Matrix of data. |
coeff |
Vector or matrix of coefficients or weights. |
Value
Matrix of values of the linear map function corresponding to each row of X.
Examples
X <- matrix(c(1,2,3,4,5,6),nrow=2)
coeff <- c(0.5,-1,2)
mapXy.linear(X,coeff)
Sigmoid Map Function
Description
This function implements the sigmoid map function from data X to labels y used for logistic regression in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
mapXy.sigmoid(X, coeff)
Arguments
X |
Matrix of data. |
coeff |
Vector or matrix of coefficients or weights. |
Value
Matrix of values of the sigmoid function corresponding to each row of X.
Examples
X <- matrix(c(1,2,3,4,5,6),nrow=2)
coeff <- c(0.5,-1,2)
mapXy.sigmoid(X,coeff)
Differentially Private Mean
Description
This function computes the differentially private mean of a given dataset at user-specified privacy levels of epsilon and delta.
Usage
meanDP(
x,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP"
)
Arguments
x |
Dataset whose mean is desired. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
Value
Sanitized mean based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Examples
D <- stats::rnorm(500, mean=3, sd=2)
lb <- -3 # 3 std devs below mean
ub <- 9 # 3 std devs above mean
meanDP(D, 1, lb, ub)
meanDP(D, .5, lb, ub, which.sensitivity='unbounded', mechanism='Gaussian',
delta=0.01)
Differentially Private Mean Data Access Function
Description
This function performs the data access step in the computation of a differentially private mean. The true values are computed using mean, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019). For the mean, the sensitivities based on bounded and unbounded differential privacy are identical, so only one value is returned.
Usage
meanDataAccess(x, lower.bound, upper.bound)
Arguments
x |
Dataset whose mean is desired. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
Value
List of the true mean and the sensitivity calculated based on theoretical values.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
meanDataAccess(c(1,4,3,2), 0, 5)
Differentially Private Median
Description
This function computes the differentially private median of an input vector at a user-specified privacy level of epsilon.
Usage
medianDP(
x,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "exponential"
)
Arguments
x |
Numeric vector of which the median will be taken. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Real number giving the global or public lower bound of x. |
upper.bound |
Real number giving the global or public upper bound of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, 0)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'exponential'}. |
Value
Sanitized median based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Smith A (2011). “Privacy-Preserving Statistical Estimation with Optimal Convergence Rates.” In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, 813–822. ISBN 9781450306911, doi:10.1145/1993636.1993743.
Examples
D <- stats::rnorm(500)
lower.bound <- -3 # 3 standard deviations below mean
upper.bound <- 3 # 3 standard deviations above mean
eps <- 1
# Get median satisfying pure 1-differential privacy
private.median <- medianDP(D, eps, lower.bound, upper.bound)
private.median
Transform Function for Gaussian Kernel Approximation
Description
This function maps an input data row x with a given prefilter to an output value in such a way as to approximate the Gaussian kernel (Chaudhuri et al. 2011).
Usage
phi.gaussian(x, theta)
Arguments
x |
Vector or matrix corresponding to one row of the dataset X. |
theta |
Randomly sampled prefilter vector of length n+1, where n is the length of x. |
Value
Mapped value corresponding to one element of the transformed space.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Examples
x <- c(1,2,3)
theta <- c(0.1, 1.1, -0.8, 3)
phi.gaussian(x, theta)
Differentially Private Pooled Covariance
Description
This function computes the differentially private pooled covariance from two or more two-column matrices of data at user-specified privacy levels of epsilon and delta.
Usage
pooledCovDP(
...,
eps = 1,
lower.bound1,
upper.bound1,
lower.bound2,
upper.bound2,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP",
approx.n.max = FALSE
)
Arguments
... |
Two or more matrices, each with two columns from which to compute the pooled covariance. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound1 , lower.bound2 |
Real numbers giving the global or public lower bounds over the first and second columns of all input data, respectively. |
upper.bound1 , upper.bound2 |
Real numbers giving the global or public upper bounds over the first and second columns of all input data, respectively. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
approx.n.max |
Logical indicating whether to approximate n.max (defined to be the length of the largest input vector) in the computation of the global sensitivity based on the upper and lower bounds of the data (Liu 2019). Approximation is best if n.max is very large. |
Value
Sanitized pooled covariance based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
# Build datasets
D1 <- sort(stats::rnorm(500, mean=3, sd=2))
D2 <- sort(stats::rnorm(500, mean=-1, sd=0.5))
D3 <- sort(stats::rnorm(200, mean=3, sd=2))
D4 <- sort(stats::rnorm(200, mean=-1, sd=0.5))
M1 <- matrix(c(D1, D2), ncol=2)
M2 <- matrix(c(D3, D4), ncol=2)
lb1 <- -3 # 3 std devs below mean
lb2 <- -2.5 # 3 std devs below mean
ub1 <- 9 # 3 std devs above mean
ub2 <- .5 # 3 std devs above mean
# Pooled covariance satisfying (1,0)-differential privacy
private.pooled.cov <- pooledCovDP(M1, M2, eps = 1, lower.bound1 = lb1,
lower.bound2 = lb2, upper.bound1 = ub1,
upper.bound2 = ub2)
private.pooled.cov
# Pooled covariance satisfying approximate (0.9, 0.01)-differential privacy
# and approximating n.max in the sensitivity calculation
private.pooled.cov <- pooledCovDP(M1, M2, eps = 0.9, lower.bound1 = lb1,
lower.bound2 = lb2, upper.bound1 = ub1,
upper.bound2 = ub2, mechanism = 'Gaussian',
delta = 0.01, approx.n.max = TRUE)
private.pooled.cov
Differentially Private Pooled Covariance Data Access Function
Description
This function performs the data access step in the computation of a differentially private pooled covariance. The true values are computed using the theoretical formula and cov, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019).
Usage
pooledCovDataAccess(
samples,
lower.bound1,
upper.bound1,
lower.bound2,
upper.bound2,
approx.n.max
)
Arguments
samples |
List of two-column matrices from which to compute the pooled covariance. |
lower.bound1 , lower.bound2 |
Real numbers giving the lower bounds of the first and second columns of samples, respectively. |
upper.bound1 , upper.bound2 |
Real numbers giving the upper bounds of the first and second columns of samples, respectively. |
approx.n.max |
Logical indicating whether to approximate n.max, which is defined to be the length of the largest input vector. Approximation is best if n.max is very large. |
Value
List of the true pooled covariance and the sensitivities calculated based on bounded and unbounded differential privacy.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
x1 <- matrix(c(1,4,-2,8,-6,-3),ncol=2)
x2 <- matrix(c(1,2,-5,7),ncol=2)
pooledCovDataAccess(list(x1,x2),-10,10,-10,10,FALSE)
Differentially Private Pooled Variance
Description
This function computes the differentially private pooled variance from two or more vectors of data at user-specified privacy levels of epsilon and delta.
Usage
pooledVarDP(
...,
eps = 1,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP",
approx.n.max = FALSE
)
Arguments
... |
Two or more vectors from which to compute the pooled variance. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Real number giving the global or public lower bound of the input data. |
upper.bound |
Real number giving the global or public upper bound of the input data. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
approx.n.max |
Logical indicating whether to approximate n.max (defined to be the length of the largest input vector) in the computation of the global sensitivity based on the upper and lower bounds of the data (Liu 2019). Approximation is best if n.max is very large. |
Value
Sanitized pooled variance based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
# Build datasets
D1 <- stats::rnorm(500, mean=3, sd=2)
D2 <- stats::rnorm(200, mean=3, sd=2)
D3 <- stats::rnorm(100, mean=3, sd=2)
lower.bound <- -3 # 3 standard deviations below mean
upper.bound <- 9 # 3 standard deviations above mean
# Get private pooled variance without approximating n.max
private.pooled.var <- pooledVarDP(D1, D2, D3, eps=1, lower.bound=lower.bound,
upper.bound = upper.bound)
private.pooled.var
# If n.max is sensitive, we can also use
private.pooled.var <- pooledVarDP(D1, D2, D3, eps=1, lower.bound=lower.bound,
upper.bound = upper.bound,
approx.n.max = TRUE)
private.pooled.var
Differentially Private Pooled Variance Data Access Function
Description
This function performs the data access step in the computation of a differentially private pooled variance. The true values are computed using the theoretical formula and var, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019).
Usage
pooledVarDataAccess(samples, lower.bound, upper.bound, approx.n.max)
Arguments
samples |
List of vectors from which to compute the pooled variance. |
lower.bound |
Real number giving the lower bound of the input data. |
upper.bound |
Real number giving the upper bound of the input data. |
approx.n.max |
Logical indicating whether to approximate n.max, which is defined to be the length of the largest input vector. Approximation is best if n.max is very large. |
Value
List of the true pooled variance and the sensitivities calculated based on bounded and unbounded differential privacy.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
pooledVarDataAccess(list(c(1,4,-2,8,-6),c(1,2),c(-5,-7)),-10,10,FALSE)
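For reference, a non-private sketch of one common form of the pooled variance formula (assuming this is the "theoretical formula" referenced above) is shown below; it can be compared against the true value returned by the data access function:
# Pooled variance: sum((n_i - 1) * var_i) / (sum(n_i) - k) for k samples
pooled.var <- function(samples) {
  n <- sapply(samples, length)
  v <- sapply(samples, stats::var)
  sum((n - 1) * v) / (sum(n) - length(samples))
}
pooled.var(list(c(1,4,-2,8,-6), c(1,2), c(-5,-7)))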
Differentially Private Quantile
Description
This function computes the differentially private quantile of an input vector at a user-specified privacy level of epsilon.
Usage
quantileDP(
x,
quant,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "exponential"
)
Arguments
x |
Numeric vector of which the quantile will be taken. |
quant |
Real number between 0 and 1 indicating which quantile to return. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Real number giving the global or public lower bound of x. |
upper.bound |
Real number giving the global or public upper bound of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, 0)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'exponential'}. |
Value
Sanitized quantile based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Smith A (2011). “Privacy-Preserving Statistical Estimation with Optimal Convergence Rates.” In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, 813–822. ISBN 9781450306911, doi:10.1145/1993636.1993743.
Examples
D <- stats::rnorm(500)
lower.bound <- -3 # 3 standard deviations below mean
upper.bound <- 3 # 3 standard deviations above mean
quant <- 0.25
eps <- 1
# Get 25th quantile satisfying pure 1-differential privacy
private.quantile <- quantileDP(D, quant, eps, lower.bound, upper.bound)
private.quantile
Differentially Private Quantile Data Access Function
Description
This function performs the data access step in the computation of a differentially private quantile. The utility vector is computed as in Smith (2011), while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Gillenwater et al. 2021).
Usage
quantileDataAccess(x, quant, lower.bound, upper.bound)
Arguments
x |
Numeric vector. |
quant |
Real number between 0 and 1 indicating which quantile to return. |
lower.bound |
Real number giving the lower bound of the input data. |
upper.bound |
Real number giving the upper bound of the input data. |
Value
List of a vector corresponding to the utility function, the sorted and clipped vector of inputs, and the sensitivity calculated based on theoretical values.
References
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Smith A (2011). “Privacy-Preserving Statistical Estimation with Optimal Convergence Rates.” In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, 813–822. ISBN 9781450306911, doi:10.1145/1993636.1993743.
Gillenwater J, Joseph M, Kulesza A (2021). “Differentially Private Quantiles.” In Meila M, Zhang T (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 3713–3722. http://proceedings.mlr.press/v139/gillenwater21a/gillenwater21a.pdf.
Examples
quantileDataAccess(c(1,1,-2,8,-6),.25,-10,10)
l2 Regularizer Gradient
Description
This function implements the l2 regularizer gradient in the form required by EmpiricalRiskMinimizationDP.CMS and EmpiricalRiskMinimizationDP.KST.
Usage
regularizer.gr.l2(coeff)
Arguments
coeff |
Vector or matrix of coefficients or weights. |
Value
Regularizer gradient value at the given coefficient vector.
Examples
coeff <- c(0.5,-1,2)
regularizer.gr.l2(coeff)
l2 Regularizer
Description
This function implements the l2 regularizer in the form required by EmpiricalRiskMinimizationDP.CMS and EmpiricalRiskMinimizationDP.KST.
Usage
regularizer.l2(coeff)
Arguments
coeff |
Vector or matrix of coefficients or weights. |
Value
Regularizer value at the given coefficient vector.
Examples
coeff <- c(0.5,-1,2)
regularizer.l2(coeff)
Differentially Private Standard Deviation
Description
This function computes the differentially private standard deviation of a given dataset at user-specified privacy levels of epsilon and delta.
Usage
sdDP(
x,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP"
)
Arguments
x |
Numeric vector whose standard deviation is desired. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
Value
Sanitized standard deviation based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
D <- stats::rnorm(500, mean=3, sd=2)
lb <- -3 # 3 std devs below mean
ub <- 9 # 3 std devs above mean
sdDP(D, 1, lb, ub)
sdDP(D,.5, lb, ub, which.sensitivity='unbounded', mechanism='Gaussian',
delta=0.01)
Privacy-preserving Support Vector Machine
Description
This class implements differentially private support vector machine (SVM) (Chaudhuri et al. 2011). It can be either weighted (Yang et al. 2005) or unweighted. Either the output or the objective perturbation method can be used for unweighted SVM, though only the output perturbation method is currently supported for weighted SVM.
Details
To use this class for SVM, first use the new method to construct an object of this class with the desired function values and hyperparameters, including a choice of the desired kernel. After constructing the object, the fit method can be applied to fit the model with a provided dataset, data bounds, and optional observation weights and weight upper bound. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. Additionally, if a nonlinear kernel is chosen, the model stores a mapping function from the input data X to a higher dimensional embedding V in the form of a method XtoV as required (Chaudhuri et al. 2011). These can be released directly, or used in conjunction with the predict method to privately predict the label of new datapoints. Note that the mapping function XtoV is based on an approximation method via Fourier transforms (Rahimi and Recht 2007; Rahimi and Recht 2008).
Note that in order to guarantee differential privacy for the SVM model, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. These conditions depend on the chosen perturbation method. First, the loss function is assumed to be differentiable (and doubly differentiable if the objective perturbation method is used). The hinge loss, which is typically used for SVM, is not differentiable at 1. Thus, to satisfy this constraint, this class utilizes the Huber loss, a smooth approximation to the hinge loss (Chapelle 2007). The level of approximation to the hinge loss is determined by a user-specified constant, h, which defaults to 0.5, a typical value. Additionally, the regularizer must be 1-strongly convex and differentiable. It also must be doubly differentiable if objective perturbation is chosen. If weighted SVM is desired, the provided weights must be nonnegative and bounded above by a global or public value, which must also be provided.
Finally, it is assumed that if x represents a single row of the dataset X,
then the l2-norm of x is at most 1 for all x. In order to ensure this
constraint is satisfied, the dataset is preprocessed and scaled, and the
resulting coefficients are postprocessed and un-scaled so that the stored
coefficients correspond to the original data. Due to this constraint on x,
it is best to avoid using a bias term in the model whenever possible. If a
bias term must be used, the issue can be partially circumvented by adding a
constant column to X before fitting the model, which will be scaled along
with the rest of X. The fit
method contains functionality to add a
column of constant 1s to X before scaling, if desired.
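The following minimal sketch (using an arbitrary synthetic dataset) illustrates the workflow described above; see the constructor and method documentation below for the full argument lists:
# Synthetic binary classification data (illustrative only)
N <- 100
X <- data.frame(x1 = stats::rnorm(N), x2 = stats::rnorm(N))
y <- as.numeric(X$x1 + X$x2 > 0)
# Linear SVM with output perturbation and l2 regularization
model <- svmDP$new('l2', eps = 1, gamma = 0.1, perturbation.method = 'output')
upper.bounds <- c(3, 3) # Public bounds on the columns of X
lower.bounds <- c(-3, -3)
model$fit(X, y, upper.bounds, lower.bounds)
model$coeff # Private coefficients
predicted.y <- model$predict(X)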
Super classes
DPpack::EmpiricalRiskMinimizationDP.CMS
-> DPpack::WeightedERMDP.CMS
-> svmDP
Methods
Public methods
Method new()
Create a new svmDP object.
Usage
svmDP$new( regularizer, eps, gamma, perturbation.method = "objective", kernel = "linear", D = NULL, kernel.param = NULL, regularizer.gr = NULL, huber.h = 0.5 )
Arguments
regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be 1-strongly convex and doubly differentiable.
eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs algorithm without differential privacy.
gamma
Nonnegative real number representing the regularization constant.
perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation methods (Chaudhuri et al. 2011). Defaults to 'objective'.
kernel
String indicating which kernel to use for SVM. Must be one of {'linear', 'Gaussian'}. If 'linear' (default), linear SVM is used. If 'Gaussian', uses the sampling function corresponding to the Gaussian (radial) kernel approximation.
D
Nonnegative integer indicating the dimensionality of the transform space approximating the kernel if a nonlinear kernel is used. Higher values of D provide better kernel approximations at a cost of computational efficiency. This value must be specified if a nonlinear kernel is used.
kernel.param
Positive real number corresponding to the Gaussian kernel parameter. Defaults to 1/p, where p is the number of predictors.
regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff and of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
huber.h
Positive real number indicating the degree to which the Huber loss approximates the hinge loss. Defaults to 0.5 (Chapelle 2007).
Returns
A new svmDP object.
Method fit()
Fit the differentially private SVM model. This method runs either the
output perturbation or the objective perturbation algorithm (Chaudhuri et
al. 2011), depending on the value of perturbation.method used to construct
the object, to generate an objective function. A numerical optimization
method is then run to find optimal coefficients for fitting the model given
the training data, weights, and hyperparameters. The built-in optim
function with the "BFGS" optimization method is used. If regularizer is
given as 'l2' or if regularizer.gr is given in the construction of the
object, the gradient of the objective function is supplied to optim as
well; otherwise, non-gradient based optimization methods are used. The
resulting privacy-preserving coefficients are stored in coeff.
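As an illustration of this step only, a non-private sketch of the
optimization might look as follows; fit.erm.sketch and its objective are
assumed names, and the private algorithms additionally apply output or
objective perturbation.

# Non-private sketch of the optimization core of fit(); assumed names.
huber.hinge <- function(z, h = 0.5) {
  ifelse(z > 1 + h, 0, ifelse(z < 1 - h, 1 - z, (1 + h - z)^2 / (4 * h)))
}
fit.erm.sketch <- function(X, y, gamma, h = 0.5) {
  X <- as.matrix(X)
  y2 <- 2 * as.numeric(y) - 1          # recode labels {0,1} -> {-1,1}
  obj <- function(w) {
    mean(huber.hinge(y2 * drop(X %*% w), h)) + gamma * sum(w^2) / 2
  }
  # BFGS with a numerically approximated gradient; supplying the exact
  # gradient (as when regularizer.gr is available) speeds this step up.
  stats::optim(rep(0, ncol(X)), obj, method = "BFGS")$par
}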
Usage
svmDP$fit(
  X,
  y,
  upper.bounds,
  lower.bounds,
  add.bias = FALSE,
  weights = NULL,
  weights.upper.bound = NULL
)
Arguments
X
Dataframe of data to be fit.
y
Vector or matrix of true labels for each row of X.
upper.bounds
Numeric vector of length ncol(X) giving upper bounds on the values in each
column of X. The ncol(X) values are assumed to be in the same order as the
corresponding columns of X. Any value in the columns of X larger than the
corresponding upper bound is clipped at the bound.
lower.bounds
Numeric vector of length ncol(X) giving lower bounds on the values in each
column of X. The ncol(X) values are assumed to be in the same order as the
corresponding columns of X. Any value in the columns of X smaller than the
corresponding lower bound is clipped at the bound.
add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
weights
Numeric vector of observation weights of the same length as y. If not
given, no observation weighting is performed.
weights.upper.bound
Numeric value representing the global or public upper bound on the weights.
Required if weights are given.
Method XtoV()
Convert input data X into transformed data V. Uses sampled pre-filter values and a mapping function based on the chosen kernel to produce D-dimensional data V on which to train the model or predict future values. This method is only used if the kernel is nonlinear. See Chaudhuri et al. (2011) for more details.
Usage
svmDP$XtoV(X)
Arguments
X
Matrix corresponding to the original dataset.
Returns
Matrix V of size n by D representing the transformed dataset, where n is the number of rows of X, and D is the provided transformed space dimension.
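For intuition, below is a sketch of one standard random Fourier feature map
for the Gaussian kernel (Rahimi and Recht 2007); the function name and
sampling details are illustrative, not the package's internal
implementation.

# Sketch: random Fourier features for the Gaussian (radial) kernel
# k(x, x') = exp(-kernel.param * ||x - x'||^2). The sampled W and b
# play the role of the pre-filter values and are reused at predict time.
rff.sketch <- function(X, D, kernel.param) {
  X <- as.matrix(X)
  W <- matrix(stats::rnorm(ncol(X) * D, sd = sqrt(2 * kernel.param)),
              ncol(X), D)
  b <- stats::runif(D, 0, 2 * pi)
  sqrt(2 / D) * cos(sweep(X %*% W, 2, b, "+"))   # n-by-D matrix V
}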
Method predict()
Predict label(s) for given X using the fitted coefficients.
Usage
svmDP$predict(X, add.bias = FALSE, raw.value = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of the same form as
the X used to fit the coefficients.
add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If
add.bias was set to TRUE when fitting the coefficients, add.bias should
also be set to TRUE for predictions.
raw.value
Boolean indicating whether to return the raw predicted value or the rounded
class label. If FALSE (default), outputs the predicted labels 0 or 1. If
TRUE, returns the raw score from the SVM model.
Returns
Matrix of predicted labels or scores corresponding to each row of X.
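The effect of raw.value can be sketched as follows (assumed form; scaling
and any kernel transform are handled internally by the class):

# Sketch of the decision rule: the raw score is the linear score on the
# (possibly transformed) data; labels are 1 for positive scores.
predict.sketch <- function(X, coeff, raw.value = FALSE) {
  score <- as.matrix(X) %*% coeff
  if (raw.value) score else (score > 0) * 1
}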
Method clone()
The objects of this class are cloneable with this method.
Usage
svmDP$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Yang X, Song Q, Cao A (2005). “Weighted support vector machine for data classification.” In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, 859-864 vol. 2. doi:10.1109/IJCNN.2005.1555965.
Chapelle O (2007). “Training a Support Vector Machine in the Primal.” Neural Computation, 19(5), 1155-1178. doi:10.1162/neco.2007.19.5.1155.
Rahimi A, Recht B (2007). “Random Features for Large-Scale Kernel Machines.” In Platt J, Koller D, Singer Y, Roweis S (eds.), Advances in Neural Information Processing Systems, volume 20. https://proceedings.neurips.cc/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf.
Rahimi A, Recht B (2008). “Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning.” In Koller D, Schuurmans D, Bengio Y, Bottou L (eds.), Advances in Neural Information Processing Systems, volume 21. https://proceedings.neurips.cc/paper/2008/file/0efe32849d230d7f53049ddc4a4b0c60-Paper.pdf.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 400
X <- data.frame()
y <- data.frame()
for (i in (1:N)){
Xtemp <- data.frame(x1 = stats::rnorm(1,sd=.28) , x2 = stats::rnorm(1,sd=.28))
if (sum(Xtemp^2)<.15) ytemp <- data.frame(y=0)
else ytemp <- data.frame(y=1)
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,N,10),]
ytest <- y[seq(1,N,10),,drop=FALSE]
X <- X[-seq(1,N,10),]
y <- y[-seq(1,N,10),,drop=FALSE]
# Construct object for SVM
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
gamma <- 1
perturbation.method <- 'output'
kernel <- 'Gaussian'
D <- 20
svmdp <- svmDP$new(regularizer, eps, gamma, perturbation.method,
kernel=kernel, D=D)
# Fit with data
# Bounds for X based on construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
weights <- rep(1, nrow(y)) # Uniform weighting
weights[nrow(y)] <- 0.5 # Half weight for last observation
wub <- 1 # Public upper bound for weights
svmdp$fit(X, y, upper.bounds, lower.bounds, weights=weights,
weights.upper.bound=wub) # No bias term
# Predict new data points
predicted.y <- svmdp$predict(Xtest)
n.errors <- sum(predicted.y!=ytest)
Differentially Private Contingency Table
Description
This function computes a differentially private contingency table from given vectors of data at user-specified privacy levels of epsilon and delta.
Usage
tableDP(
...,
eps = 1,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP",
allow.negative = FALSE
)
Arguments
... |
Vectors of data from which to create the contingency table. |
eps |
Positive real number defining the epsilon privacy budget. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential
privacy. Currently the following mechanisms are supported: {'Laplace',
'Gaussian', 'analytic'}. Default is Laplace. See |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
allow.negative |
Logical value. If FALSE (default), any negative values in the sanitized table due to the added noise will be set to 0. If TRUE, the negative values (if any) will be returned. |
Value
Sanitized contingency table based on the bounded and/or unbounded definitions of differential privacy.
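For intuition, a minimal sketch of the Laplace approach under the bounded
definition follows. Under bounded DP, the l1 sensitivity of a contingency
table is 2, since editing one record moves a unit of count between two
cells; this sketch is not the package's internal code.

# Sketch: Laplace-sanitized contingency table under bounded DP.
# rmutil::rlaplace(n, m, s) draws Laplace noise with location m, scale s.
laplace.table.sketch <- function(..., eps = 1) {
  tab <- table(...)
  sens <- 2                              # bounded-DP l1 sensitivity
  noisy <- tab + rmutil::rlaplace(length(tab), 0, sens / eps)
  # Mirror allow.negative = FALSE: set negative sanitized counts to 0
  pmax(noisy, 0)
}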
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Examples
x <- MASS::Cars93$Type
y <- MASS::Cars93$Origin
z <- MASS::Cars93$AirBags
tableDP(x,y,eps=1,which.sensitivity='bounded',mechanism='Laplace',
type.DP='pDP')
tableDP(x,y,z,eps=.5,which.sensitivity='unbounded',mechanism='Gaussian',
delta=0.01)
Differentially Private Contingency Table Data Access Function
Description
This function performs the data access step in the computation of a
differentially private contingency table. The true values are computed
using table, while the sensitivities are calculated based on bounded and
unbounded differential privacy (Kifer and Machanavajjhala 2011) according
to the theoretical values (Liu 2019).
Usage
tableDataAccess(..., mechanism = "Laplace")
Arguments
... |
Vectors of data from which to create the contingency table. |
mechanism |
String indicating which mechanism to use for differential privacy. If the 'Laplace' mechanism is chosen, l1 sensitivities are returned. If the 'Gaussian' or 'analytic' mechanisms are chosen, l2 sensitivities are returned. |
Value
List of the true contingency table and the sensitivities calculated based on bounded and unbounded differential privacy.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
x <- MASS::Cars93$Type
y <- MASS::Cars93$Origin
tableDataAccess(x, y, mechanism='Laplace')
Privacy-preserving Hyperparameter Tuning for Binary Classification Models
Description
This function implements the privacy-preserving hyperparameter tuning
function for binary classification (Chaudhuri et al. 2011) using the
exponential mechanism. It accepts a list of DP models with various chosen
hyperparameters, a dataset X with corresponding labels y, upper and lower
bounds on the columns of X, and a boolean indicating whether to add bias in
the construction of each of the models. The data are split into m+1 equal
groups, where m is the number of models being compared. One group is set
aside as the validation group, and each of the other m groups is used to
train one of the given m models. The number of errors on the validation set
is counted for each model and used as the utility value in the exponential
mechanism (ExponentialMechanism) to select a tuned model in a
privacy-preserving way.
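As a sketch of the selection step, with utility equal to the negative
validation error count (an error count has sensitivity 1, since changing
one record changes at most one error):

# Sketch: exponential-mechanism selection over validation error counts.
select.sketch <- function(errors, eps) {
  u <- -errors                         # utility: fewer errors is better
  # exp(eps * u / (2 * sensitivity)) with sensitivity 1; shift by max(u)
  # for numerical stability before normalizing
  w <- exp(eps * (u - max(u)) / 2)
  sample(seq_along(errors), 1, prob = w / sum(w))
}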
Usage
tune_classification_model(
DPmodels,
X,
y,
upper.bounds,
lower.bounds,
add.bias = FALSE,
weights = NULL,
weights.upper.bound = NULL
)
Arguments
DPmodels |
Vector of binary classification model objects, each initialized with a different combination of hyperparameter values from the search space for tuning. Each model should be initialized with the same epsilon privacy parameter value eps. The tuned model satisfies eps-level differential privacy. |
X |
Dataframe of data to be used in tuning the model. Note it is assumed the data rows and corresponding labels are randomly shuffled. |
y |
Vector or matrix of true labels for each row of X. |
upper.bounds |
Numeric vector giving upper bounds on the values in each column of X. Should be of length ncol(X). The values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X larger than the corresponding upper bound is clipped at the bound. |
lower.bounds |
Numeric vector giving lower bounds on the values in each column of X. Should be of length ncol(X). The values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X smaller than the corresponding lower bound is clipped at the bound. |
add.bias |
Boolean indicating whether to add a bias term to X. Defaults to FALSE. |
weights |
Numeric vector of observation weights of the same length as y. |
weights.upper.bound |
Numeric value representing the global or public upper bound on the weights. |
Value
Single model object selected from the input list DPmodels with tuned parameters.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in (1:K)){
t <- seq(-.25,.25,length.out = N)
if (j==1) m <- stats::rnorm(N,-.2,.1)
if (j==2) m <- stats::rnorm(N, .2,.1)
Xtemp <- data.frame(x1 = 3*t , x2 = m - t)
ytemp <- data.frame(matrix(j-1, N, 1))
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
y <- as.matrix(y)
weights <- rep(1, nrow(y)) # Uniform weighting
weights[nrow(y)] <- 0.5 # half weight for last observation
wub <- 1 # Public upper bound for weights
# Grid of possible gamma values for tuning the SVM model
grid.search <- c(100, 1, .0001)
# Construct objects for SVM parameter tuning
eps <- 1 # Privacy budget should be the same for all models
svmdp1 <- svmDP$new("l2", eps, grid.search[1], perturbation.method='output')
svmdp2 <- svmDP$new("l2", eps, grid.search[2], perturbation.method='output')
svmdp3 <- svmDP$new("l2", eps, grid.search[3], perturbation.method='output')
DPmodels <- c(svmdp1, svmdp2, svmdp3)
# Tune using data and bounds for X based on its construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
tuned.model <- tune_classification_model(DPmodels, X, y, upper.bounds,
lower.bounds, weights=weights,
weights.upper.bound=wub)
tuned.model$gamma # Gives resulting selected hyperparameter
# tuned.model can be used in the same way as a trained svmDP model
# Predict new data points
predicted.y <- tuned.model$predict(Xtest)
n.errors <- sum(predicted.y!=ytest)
Privacy-preserving Hyperparameter Tuning for Linear Regression Models
Description
This function implements the privacy-preserving hyperparameter tuning
function for linear regression (Kifer et al. 2012) using the exponential
mechanism. It accepts a list of DP models with various chosen
hyperparameters, a dataset X with corresponding values y, upper and lower
bounds on the columns of X and the values of y, and a boolean indicating
whether to add bias in the construction of each of the models. The data are
split into m+1 equal groups, where m is the number of models being
compared. One group is set aside as the validation group, and each of the
other m groups is used to train one of the given m models. The negative sum
of squared errors for each model on the validation set is used as the
utility value in the exponential mechanism (ExponentialMechanism) to select
a tuned model in a privacy-preserving way.
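One possible form of the m+1-way split described above (assumed grouping;
the rows are taken as already shuffled):

# Sketch: split n rows into m + 1 roughly equal groups; group j <= m
# trains candidate model j, and group m + 1 is held out for validation.
m <- 3
n <- 120
groups <- rep(seq_len(m + 1), length.out = n)
validation.rows <- which(groups == m + 1)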
Usage
tune_linear_regression_model(
DPmodels,
X,
y,
upper.bounds,
lower.bounds,
add.bias = FALSE
)
Arguments
DPmodels |
Vector of linear regression model objects, each initialized with a different combination of hyperparameter values from the search space for tuning. Each model should be initialized with the same epsilon privacy parameter value eps. The tuned model satisfies eps-level differential privacy. |
X |
Dataframe of data to be used in tuning the model. Note it is assumed the data rows and corresponding labels are randomly shuffled. |
y |
Vector or matrix of true values for each row of X. |
upper.bounds |
Numeric vector giving upper bounds on the values in each column of X and the values in y. Should be length ncol(X)+1. The first ncol(X) values are assumed to be in the same order as the corresponding columns of X, while the last value in the vector is assumed to be the upper bound on y. Any value in the columns of X and y larger than the corresponding upper bound is clipped at the bound. |
lower.bounds |
Numeric vector giving lower bounds on the values in each column of X and the values in y. Should be length ncol(X)+1. The first ncol(X) values are assumed to be in the same order as the corresponding columns of X, while the last value in the vector is assumed to be the lower bound on y. Any value in the columns of X and y smaller than the corresponding lower bound is clipped at the bound. |
add.bias |
Boolean indicating whether to add a bias term to X. Defaults to FALSE. |
Value
Single model object selected from the input list DPmodels with tuned parameters.
References
Kifer D, Smith A, Thakurta A (2012). “Private Convex Empirical Risk Minimization and High-dimensional Regression.” In Mannor S, Srebro N, Williamson RC (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, 25.1–25.40. https://proceedings.mlr.press/v23/kifer12.html.
Examples
# Build example dataset
n <- 500
X <- data.frame(X=seq(-1,1,length.out = n))
true.theta <- c(-.3,.5) # First element is bias term
p <- length(true.theta)
y <- true.theta[1] + as.matrix(X)%*%true.theta[2:p] + stats::rnorm(n=n,sd=.1)
# Grid of possible gamma values for tuning linear regression model
grid.search <- c(100, 1, .0001)
# Construct objects for linear regression parameter tuning
# Privacy budget should be the same for all models
eps <- 1
delta <- 0.01
linrdp1 <- LinearRegressionDP$new("l2", eps, delta, grid.search[1])
linrdp2 <- LinearRegressionDP$new("l2", eps, delta, grid.search[2])
linrdp3 <- LinearRegressionDP$new("l2", eps, delta, grid.search[3])
DPmodels <- c(linrdp1, linrdp2, linrdp3)
# Tune using data and bounds for X and y based on their construction
upper.bounds <- c( 1, 2) # Bounds for X and y
lower.bounds <- c(-1,-2) # Bounds for X and y
tuned.model <- tune_linear_regression_model(DPmodels, X, y, upper.bounds,
lower.bounds, add.bias=TRUE)
tuned.model$gamma # Gives resulting selected hyperparameter
# tuned.model can be used in the same way as a trained LinearRegressionDP model
tuned.model$coeff # Gives coefficients for tuned model
# Build a test dataset for prediction
Xtest <- data.frame(X=c(-.5, -.25, .1, .4))
predicted.y <- tuned.model$predict(Xtest, add.bias=TRUE)
Differentially Private Variance
Description
This function computes the differentially private variance of a given dataset at user-specified privacy levels of epsilon and delta.
Usage
varDP(
x,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP"
)
Arguments
x |
Numeric vector whose variance is desired. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential
privacy. Currently the following mechanisms are supported: {'Laplace',
'Gaussian', 'analytic'}. Default is Laplace. See |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
Value
Sanitized variance based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
D <- stats::rnorm(500, mean=3, sd=2)
lb <- -3 # 3 std devs below mean
ub <- 9 # 3 std devs above mean
varDP(D, 1, lb, ub)
varDP(D,.5, lb, ub, which.sensitivity='unbounded', mechanism='Gaussian',
delta=0.01)
Differentially Private Variance Data Access Function
Description
This function performs the data access step in the computation of a
differentially private variance. The true values are computed using var,
while the sensitivities are calculated based on bounded and unbounded
differential privacy (Kifer and Machanavajjhala 2011) according to the
theoretical values (Liu 2019). For the variance, the sensitivities based on
bounded and unbounded differential privacy are identical, so only one value
is returned.
Usage
varDataAccess(x, lower.bound, upper.bound)
Arguments
x |
Dataset whose variance is desired. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
Value
List of the true variance and the sensitivity calculated based on theoretical values.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
varDataAccess(c(1,4,3,2), 0, 5)