Type: Package
Title: Differentially Private Statistical Analysis and Machine Learning
Version: 0.2.2
Maintainer: Spencer Giddens <giddens2spencer@gmail.com>
Description: An implementation of common statistical analysis and models with differential privacy (Dwork et al., 2006a) <doi:10.1007/11681878_14> guarantees. The package contains, for example, functions providing differentially private computations of mean, variance, median, histograms, and contingency tables. It also implements some statistical models and machine learning algorithms such as linear regression (Kifer et al., 2012) <https://proceedings.mlr.press/v23/kifer12.html> and SVM (Chaudhuri et al., 2011) <https://jmlr.org/papers/v12/chaudhuri11a.html>. In addition, it implements some popular randomization mechanisms, including the Laplace mechanism (Dwork et al., 2006a) <doi:10.1007/11681878_14>, Gaussian mechanism (Dwork et al., 2006b) <doi:10.1007/11761679_29>, analytic Gaussian mechanism (Balle & Wang, 2018) <https://proceedings.mlr.press/v80/balle18a.html>, and exponential mechanism (McSherry & Talwar, 2007) <doi:10.1109/FOCS.2007.66>.
License: GPL-3 | file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.2
Imports: rmutil (≥ 1.1.5), Rdpack (≥ 2.1.2), R6 (≥ 2.5.1), dplyr (≥ 1.0.1), MASS (≥ 7.3-51.6), nloptr (≥ 1.2.2.2), e1071 (≥ 1.7-9), stats (≥ 4.0.2), graphics (≥ 4.0.2), ggplot2 (≥ 3.3.2)
RdMacros: Rdpack
Suggests: testthat (≥ 3.0.0)
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2024-10-20 02:44:42 UTC; spencergiddens
Author: Spencer Giddens [aut, cre], Fang Liu [ctb]
Repository: CRAN
Date/Publication: 2024-10-20 04:10:05 UTC
Privacy-preserving Empirical Risk Minimization for Binary Classification
Description
This class implements differentially private empirical risk minimization (Chaudhuri et al. 2011). Either the output or the objective perturbation method can be used. It is intended to be a framework for building more specific models via inheritance. See LogisticRegressionDP for an example of this type of structure.
Details
To use this class for empirical risk minimization, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset and data bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.
Note that in order to guarantee differential privacy for empirical risk minimization, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. These conditions depend on the chosen perturbation method. Specifically, the provided loss function must be convex and differentiable with respect to y.hat, and the absolute value of the first derivative of the loss function must be at most 1. If objective perturbation is chosen, the loss function must also be doubly differentiable, and the absolute value of the second derivative of the loss function must be bounded above by a constant c for all possible values of y.hat and y, where y.hat is the predicted label and y is the true label. The regularizer must be 1-strongly convex and differentiable; it must also be doubly differentiable if objective perturbation is chosen. Finally, it is assumed that if x represents a single row of the dataset X, then the l2-norm of x is at most 1 for all x. Note that because of this, a bias term cannot be included without appropriate scaling/preprocessing of the dataset. To ensure privacy, the add.bias argument in the fit() and predict() methods should only be utilized in subclasses within this package where appropriate preprocessing is implemented, not in this class.
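Because the row-norm condition is a precondition for the privacy guarantee rather than something this class enforces, data should be checked or scaled before fitting. The following is a minimal sketch of such preprocessing (illustrative only, not DPpack internals), projecting any row with l2-norm greater than 1 back onto the unit ball:

# Shrink any row of X whose l2-norm exceeds 1 back onto the unit ball
X <- matrix(stats::rnorm(20), nrow = 5)
row.norms <- sqrt(rowSums(X^2))
X.scaled <- X / pmax(row.norms, 1)  # Rows with norm <= 1 are left unchanged
max(sqrt(rowSums(X.scaled^2)))      # Verifies the constraint: at most 1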
Public fields
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y.

mapXy.gr
Function representing the gradient of the map function with respect to the values in coeff, of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector.

loss
Loss function of the form loss(y.hat, y), where y.hat and y are matrices.

loss.gr
Function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y), where y.hat and y are matrices.

regularizer
Regularization function of the form regularizer(coeff), where coeff is a vector or matrix.

regularizer.gr
Function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff).

gamma
Nonnegative real number representing the regularization constant.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation method (Chaudhuri et al. 2011).

c
Positive real number denoting the upper bound on the absolute value of the second derivative of the loss function, as required to ensure differential privacy for the objective perturbation method.

coeff
Numeric vector of coefficients for the model.

kernel
Value only used in child class svmDP. String indicating which kernel to use for SVM. Must be one of {'linear', 'Gaussian'}. If 'linear' (default), linear SVM is used. If 'Gaussian', uses the sampling function corresponding to the Gaussian (radial) kernel approximation.

D
Value only used in child class svmDP. Nonnegative integer indicating the dimensionality of the transform space approximating the kernel. Higher values of D provide better kernel approximations at a cost of computational efficiency.

sampling
Value only used in child class svmDP. Sampling function of the form sampling(d), where d is the input dimension, returning a (d+1)-dimensional vector of samples corresponding to the Fourier transform of the kernel to be approximated.

phi
Value only used in child class svmDP. Function of the form phi(x, theta), where x is an individual row of the original dataset and theta is a (d+1)-dimensional vector sampled from the Fourier transform of the kernel to be approximated, where d is the dimension of x. The function returns a numeric scalar corresponding to the pre-filtered value at the given row with the given sampled vector. See the sketch after this field list for an illustrative sampling/phi pair.

kernel.param
Value only used in child class svmDP. Positive real number corresponding to the Gaussian kernel parameter.

prefilter
Value only used in child class svmDP. Matrix of pre-filter values used in converting data into transform space.
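For intuition on the sampling and phi fields, the following is a hedged sketch in the style of standard random Fourier features (Rahimi and Recht 2007) for a Gaussian kernel. It matches the signatures described above but is not necessarily the exact construction used by svmDP:

# Hypothetical sampling/phi pair for a Gaussian (RBF) kernel
# k(x, x') = exp(-kernel.param * ||x - x'||^2), via random Fourier features
kernel.param <- 0.1
sampling <- function(d) {
  omega <- stats::rnorm(d, mean = 0, sd = sqrt(2 * kernel.param)) # Frequencies
  b <- stats::runif(1, 0, 2 * pi)                                 # Random phase
  c(omega, b)                               # (d+1)-dimensional sample
}
phi <- function(x, theta) {
  d <- length(x)
  cos(sum(x * theta[1:d]) + theta[d + 1])   # Scalar pre-filter value
}
theta <- sampling(2)
phi(c(0.3, -0.1), theta)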
Methods
Public methods
Method new()
Create a new EmpiricalRiskMinimizationDP.CMS object.
Usage
EmpiricalRiskMinimizationDP.CMS$new(
  mapXy,
  loss,
  regularizer,
  eps,
  gamma,
  perturbation.method = "objective",
  c = NULL,
  mapXy.gr = NULL,
  loss.gr = NULL,
  regularizer.gr = NULL
)
Arguments
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y. Should return a column matrix of predicted labels for each row of X. See mapXy.sigmoid for an example.

loss
Loss function of the form loss(y.hat, y), where y.hat and y are matrices. Should be defined such that it returns a matrix of loss values for each element of y.hat and y. See loss.cross.entropy for an example. It must be convex and differentiable, and the absolute value of the first derivative of the loss function must be at most 1. Additionally, if the objective perturbation method is chosen, it must be doubly differentiable and the absolute value of the second derivative of the loss function must be bounded above by a constant c for all possible values of y.hat and y.

regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be 1-strongly convex and differentiable. If the objective perturbation method is chosen, it must also be doubly differentiable.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

gamma
Nonnegative real number representing the regularization constant.

perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation method (Chaudhuri et al. 2011). Defaults to 'objective'.

c
Positive real number denoting the upper bound on the absolute value of the second derivative of the loss function, as required to ensure differential privacy for the objective perturbation method. This input is unnecessary if perturbation.method is 'output', but is required if perturbation.method is 'objective'. Defaults to NULL.

mapXy.gr
Optional function representing the gradient of the map function with respect to the values in coeff. If given, must be of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector. Should be defined such that the ith row of the output represents the gradient with respect to the ith coefficient. See mapXy.gr.sigmoid for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

loss.gr
Optional function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y), where y.hat and y are matrices. Should be defined such that the ith row of the output represents the gradient of the loss function at the ith set of input values. See loss.gr.cross.entropy for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new EmpiricalRiskMinimizationDP.CMS object.
Method fit()
Fit the differentially private empirical risk minimization model. This method runs either the output perturbation or the objective perturbation algorithm (Chaudhuri et al. 2011), depending on the value of perturbation.method used to construct the object, to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data and hyperparameters. The built-in optim function using the "BFGS" optimization method is used. If mapXy.gr, loss.gr, and regularizer.gr are all given in the construction of the object, the gradient of the objective function is utilized by optim as well. Otherwise, non-gradient based optimization methods are used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
EmpiricalRiskMinimizationDP.CMS$fit(X, y, upper.bounds, lower.bounds, add.bias = FALSE)
Arguments
X
Dataframe of data to be fit.

y
Vector or matrix of true labels for each row of X.

upper.bounds
Numeric vector of length ncol(X) giving upper bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X larger than the corresponding upper bound is clipped at the bound.

lower.bounds
Numeric vector of length ncol(X) giving lower bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X smaller than the corresponding lower bound is clipped at the bound.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
Method predict()
Predict label(s) for given X using the fitted coefficients.
Usage
EmpiricalRiskMinimizationDP.CMS$predict(X, add.bias = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of the same form as the X used to fit the coefficients.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If add.bias was set to TRUE when fitting the coefficients, add.bias should be set to TRUE for predictions.
Returns
Matrix of predicted labels corresponding to each row of X.
Method clone()
The objects of this class are cloneable with this method.
Usage
EmpiricalRiskMinimizationDP.CMS$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in (1:K)){
t <- seq(-.25, .25, length.out = N)
if (j==1) m <- stats::rnorm(N,-.2, .1)
if (j==2) m <- stats::rnorm(N, .2, .1)
Xtemp <- data.frame(x1 = 3*t , x2 = m - t)
ytemp <- data.frame(matrix(j-1, N, 1))
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
# Construct object for logistic regression
mapXy <- function(X, coeff) e1071::sigmoid(X%*%coeff)
# Cross entropy loss
loss <- function(y.hat,y) -(y*log(y.hat) + (1-y)*log(1-y.hat))
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
gamma <- 1
perturbation.method <- 'objective'
c <- 1/4 # Required value for logistic regression
mapXy.gr <- function(X, coeff) as.numeric(e1071::dsigmoid(X%*%coeff))*t(X)
loss.gr <- function(y.hat, y) -y/y.hat + (1-y)/(1-y.hat)
regularizer.gr <- function(coeff) coeff
ermdp <- EmpiricalRiskMinimizationDP.CMS$new(mapXy, loss, regularizer, eps,
gamma, perturbation.method, c,
mapXy.gr, loss.gr,
regularizer.gr)
# Fit with data
# Bounds for X based on construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
ermdp$fit(X, y, upper.bounds, lower.bounds) # No bias term
ermdp$coeff # Gets private coefficients
# Predict new data points
predicted.y <- ermdp$predict(Xtest)
n.errors <- sum(round(predicted.y)!=ytest)
Privacy-preserving Empirical Risk Minimization for Regression
Description
This class implements differentially private empirical risk minimization using the objective perturbation technique (Kifer et al. 2012). It is intended to be a framework for building more specific models via inheritance. See LinearRegressionDP for an example of this type of structure.
Details
To use this class for empirical risk minimization, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset and data bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.
Note that in order to guarantee differential privacy for the empirical risk minimization model, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. Specifically, the following constraints must be met. Let l represent the loss function for an individual dataset row x and output value y, and let L represent the average loss over all rows and output values. First, L must be convex with a continuous Hessian. Second, the l2-norm of the gradient of l must be bounded above by some constant zeta for all possible input values in the domain. Third, for all possible inputs to l, the Hessian of l must be of rank at most one and its eigenvalues must be bounded above by some constant lambda. Fourth, the regularizer must be convex. Finally, the provided domain of l must be a closed convex subset of the set of all real-valued vectors of dimension p, where p is the number of columns of X. Note that because of this, a bias term cannot be included without appropriate scaling/preprocessing of the dataset. To ensure privacy, the add.bias argument in the fit() and predict() methods should only be utilized in subclasses within this package where appropriate preprocessing is implemented, not in this class.
Public fields
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y.

mapXy.gr
Function representing the gradient of the map function with respect to the values in coeff, of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector.

loss
Loss function of the form loss(y.hat, y), where y.hat and y are matrices.

loss.gr
Function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y), where y.hat and y are matrices.

regularizer
Regularization function of the form regularizer(coeff), where coeff is a vector or matrix.

regularizer.gr
Function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff).

gamma
Nonnegative real number representing the regularization constant.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

delta
Nonnegative real number defining the delta privacy parameter. If 0, reduces to pure eps-DP.

domain
List of constraint and jacobian functions representing the constraints on the search space for the objective perturbation algorithm used in Kifer et al. (2012).

zeta
Positive real number denoting the upper bound on the l2-norm of the gradient of the loss function, as required to ensure differential privacy.

lambda
Positive real number corresponding to the upper bound on the eigenvalues of the Hessian of the loss function for all possible inputs.

coeff
Numeric vector of coefficients for the model.
Methods
Public methods
Method new()
Create a new EmpiricalRiskMinimizationDP.KST object.
Usage
EmpiricalRiskMinimizationDP.KST$new(
  mapXy,
  loss,
  regularizer,
  eps,
  delta,
  domain,
  zeta,
  lambda,
  gamma,
  mapXy.gr = NULL,
  loss.gr = NULL,
  regularizer.gr = NULL
)
Arguments
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y. Should return a column matrix of predicted labels for each row of X. See mapXy.linear for an example.

loss
Loss function of the form loss(y.hat, y), where y.hat and y are matrices. Should be defined such that it returns a matrix of loss values for each element of y.hat and y. See loss.squared.error for an example. This function must be convex, and the l2-norm of its gradient must be bounded above by zeta for some constant zeta for all possible inputs within the given domain. Additionally, for all possible inputs within the given domain, the Hessian of the loss function must be of rank at most one and its eigenvalues must be bounded above by some constant lambda.

regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be convex.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

delta
Nonnegative real number defining the delta privacy parameter. If 0, reduces to pure eps-DP.

domain
List of functions representing the constraints on the search space for the objective perturbation algorithm. Must contain two functions, labeled "constraints" and "jacobian", respectively. The "constraints" function accepts a vector of coefficients from the search space and returns a value that is nonpositive if and only if the input coefficient vector is within the constrained search space. The "jacobian" function also accepts a vector of coefficients and returns the Jacobian of the constraint function. For example, in linear regression, the square of the l2-norm of the coefficient vector theta is assumed to be bounded above by p, where p is the length of theta (Kifer et al. 2012). So, domain could be defined as domain <- list("constraints"=function(coeff) coeff%*%coeff-length(coeff), "jacobian"=function(coeff) 2*coeff).

zeta
Positive real number denoting the upper bound on the l2-norm of the gradient of the loss function, as required to ensure differential privacy.

lambda
Positive real number corresponding to the upper bound on the eigenvalues of the Hessian of the loss function for all possible inputs.

gamma
Nonnegative real number representing the regularization constant.

mapXy.gr
Optional function representing the gradient of the map function with respect to the values in coeff. If given, must be of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector. Should be defined such that the ith row of the output represents the gradient with respect to the ith coefficient. See mapXy.gr.linear for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

loss.gr
Optional function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y), where y.hat and y are matrices. Should be defined such that the ith row of the output represents the gradient of the loss function at the ith set of input values. See loss.gr.squared.error for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new EmpiricalRiskMinimizationDP.KST object.
Method fit()
Fit the differentially private empirical risk minimization model. The function runs the objective perturbation algorithm (Kifer et al. 2012) to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data and hyperparameters. The nloptr function is used. If mapXy.gr, loss.gr, and regularizer.gr are all given in the construction of the object, the gradient of the objective function and the Jacobian of the constraint function are utilized for the algorithm, and the NLOPT_LD_MMA method is used. If one or more of these gradient functions are not given, the NLOPT_LN_COBYLA method is used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
EmpiricalRiskMinimizationDP.KST$fit(X, y, upper.bounds, lower.bounds, add.bias = FALSE)
Arguments
X
Dataframe of data to be fit.

y
Vector or matrix of true values for each row of X.

upper.bounds
Numeric vector of length ncol(X)+1 giving upper bounds on the values in each column of X and on the values of y. The last value in the vector is assumed to be the upper bound on y, while the first ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X or in y larger than the corresponding upper bound is clipped at the bound.

lower.bounds
Numeric vector of length ncol(X)+1 giving lower bounds on the values in each column of X and on the values of y. The last value in the vector is assumed to be the lower bound on y, while the first ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X or in y smaller than the corresponding lower bound is clipped at the bound.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
Method predict()
Predict y values for given X using the fitted coefficients.
Usage
EmpiricalRiskMinimizationDP.KST$predict(X, add.bias = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of the same form as the X used to fit the coefficients.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If add.bias was set to TRUE when fitting the coefficients, add.bias should be set to TRUE for predictions.
Returns
Matrix of predicted y values corresponding to each row of X.
Method clone()
The objects of this class are cloneable with this method.
Usage
EmpiricalRiskMinimizationDP.KST$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Kifer D, Smith A, Thakurta A (2012). “Private Convex Empirical Risk Minimization and High-dimensional Regression.” In Mannor S, Srebro N, Williamson RC (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, 25.1–25.40. https://proceedings.mlr.press/v23/kifer12.html.
Examples
# Build example dataset
n <- 500
X <- data.frame(X=seq(-1,1,length.out = n))
true.theta <- c(-.3,.5) # First element is bias term
p <- length(true.theta)
y <- true.theta[1] + as.matrix(X)%*%true.theta[2:p] + stats::rnorm(n=n,sd=.1)
# Construct object for linear regression
mapXy <- function(X, coeff) X%*%coeff
loss <- function(y.hat, y) (y.hat-y)^2/2
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
delta <- 1
domain <- list("constraints"=function(coeff) coeff%*%coeff-length(coeff),
"jacobian"=function(coeff) 2*coeff)
# Set p to be the number of predictors desired including intercept term (length of coeff)
zeta <- 2*p^(3/2) # Proper bound for linear regression
lambda <- p # Proper bound for linear regression
gamma <- 1
mapXy.gr <- function(X, coeff) t(X)
loss.gr <- function(y.hat, y) y.hat-y
regularizer.gr <- function(coeff) coeff
ermdp <- EmpiricalRiskMinimizationDP.KST$new(mapXy, loss, 'l2', eps, delta,
domain, zeta, lambda,
gamma, mapXy.gr, loss.gr,
regularizer.gr)
# Fit with data
# We must assume y is a matrix with values between -p and p (-2 and 2
# for this example)
upper.bounds <- c(1, 2) # Bounds for X and y
lower.bounds <- c(-1,-2) # Bounds for X and y
ermdp$fit(X, y, upper.bounds, lower.bounds, add.bias=TRUE)
ermdp$coeff # Gets private coefficients
# Predict new data points
# Build a test dataset
Xtest <- data.frame(X=c(-.5, -.25, .1, .4))
predicted.y <- ermdp$predict(Xtest, add.bias=TRUE)
Exponential Mechanism
Description
This function implements the exponential mechanism for differential privacy by selecting the index of a vector of candidates to return according to a user-specified vector of utility function values, epsilon, and global sensitivity. Sensitivity calculated based either on bounded or unbounded differential privacy can be used (Kifer and Machanavajjhala 2011). If measure is provided, the probabilities of selecting each value are scaled according to the values in measure. If candidates is provided, the function returns the value of candidates at the selected index, rather than the index itself.
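The selection probabilities implied by this description follow the standard exponential mechanism form: index i is chosen with probability proportional to measure[i] * exp(eps * utility[i] / (2 * sensitivity)). The following sketch computes these probabilities directly as an illustration of the standard mechanism (not the package's internal code):

# Selection probabilities of the (standard) exponential mechanism
utility <- c(0, 1, 2, 3, 2, 1, 0)
eps <- 1
sensitivity <- 1
weights <- exp(eps * utility / (2 * sensitivity))  # Unnormalized weights
probs <- weights / sum(weights)                    # Selection probabilities
sample(seq_along(utility), size = 1, prob = probs) # One private draw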
Usage
ExponentialMechanism(
utility,
eps,
sensitivity,
measure = NULL,
candidates = NULL
)
Arguments
utility
Numeric vector giving the utilities of the possible values.

eps
Positive real number defining the epsilon privacy budget.

sensitivity
Real number corresponding to the l1-global sensitivity of the function generating utility.

measure
Optional numeric vector of scaling measures for the probabilities of selecting each value. Should be the same size as utility. Defaults to uniform scaling.

candidates
Optional vector of candidates of the same size as utility. If given, the function returns the candidate at the selected index rather than the index itself.
Value
Indices (or values if candidates given) selected by the mechanism based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
McSherry F, Talwar K (2007). “Mechanism Design via Differential Privacy.” In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07), 94-103. doi:10.1109/FOCS.2007.66.
Examples
candidates <- c('a','b','c','d','e','f','g')
# Release index
idx <- ExponentialMechanism(c(0,1,2,3,2,1,0), 1, 1)
candidates[idx] # Randomly chosen candidate
# Release candidate
ExponentialMechanism(c(0,1,2,3,2,1,0), 1, .5, measure=c(1,1,2,1,2,1,1),
candidates=candidates)
Gaussian Mechanism
Description
This function implements the Gaussian mechanism for differential privacy by adding noise to the true value(s) of a function according to specified values of epsilon, delta, and l2-global sensitivity(-ies). Global sensitivity calculated based either on bounded or unbounded differential privacy can be used (Kifer and Machanavajjhala 2011). If true.values is a vector, the provided epsilon and delta are divided such that (epsilon, delta)-level differential privacy is satisfied across all function values. In the case that each element of true.values comes from its own function with different corresponding sensitivities, a vector of sensitivities may be provided. In this case, if desired, the user can specify how to divide epsilon and delta among the function values using alloc.proportions.
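For reference, the classical (non-analytic) calibration of the Gaussian mechanism for approximate DP with eps < 1 uses noise standard deviation sigma = sqrt(2 * log(1.25 / delta)) * sensitivity / eps. A minimal sketch of this textbook calibration follows; it is not necessarily the exact formula used internally, particularly for 'pDP' or the analytic variant:

# Classical Gaussian mechanism calibration for (eps, delta)-approximate DP
eps <- 0.9
delta <- 0.01
sensitivity <- 0.05                       # Assumed l2-global sensitivity
sigma <- sqrt(2 * log(1.25 / delta)) * sensitivity / eps
true.value <- 7.5                         # Assumed true function value
true.value + stats::rnorm(1, sd = sigma)  # Sanitized output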
Usage
GaussianMechanism(
true.values,
eps,
delta,
sensitivities,
type.DP = "aDP",
alloc.proportions = NULL,
analytic = FALSE,
tol = 1e-12
)
Arguments
true.values
Real number or numeric vector corresponding to the true value(s) of the desired function.

eps
Positive real number defining the epsilon privacy parameter.

delta
Positive real number defining the delta privacy parameter.

sensitivities
Real number or numeric vector corresponding to the l2-global sensitivity(-ies) of the function(s) generating true.values. This value must be of length 1 or of the same length as true.values. If it is of length 1 and true.values is a vector, this indicates that the given sensitivity applies simultaneously to all elements of true.values and that the privacy budget need not be allocated (alloc.proportions is unused in this case). If it is of the same length as true.values, this indicates that each element of true.values comes from its own function with different corresponding sensitivities. In this case, the l2-norm of the provided sensitivities is used to generate the Gaussian noise.

type.DP
String indicating the type of differential privacy desired for the Gaussian mechanism. Can be either 'pDP' for probabilistic DP (Liu 2019) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1.

alloc.proportions
Optional numeric vector giving the allocation proportions of epsilon and delta to the function values in the case of vector-valued sensitivities. For example, if sensitivities is of length two and alloc.proportions = c(.75, .25), then 75% of the privacy budget eps (and 75% of delta) is allocated to the noise computation for the first element of true.values, and the remaining 25% is allocated to the noise computation for the second element of true.values. This ensures (eps, delta)-level privacy across all computations. Input does not need to be normalized, meaning alloc.proportions = c(3, 1) produces the same result as the example above.

analytic
Indicates whether to use the analytic Gaussian mechanism to compute the noise scale (Balle and Wang 2018). Defaults to FALSE.

tol
Error tolerance for the binary search used in determining the noise parameter for the analytic Gaussian mechanism. Unused if analytic is FALSE. Defaults to 1e-12.
Value
Sanitized function values based on the bounded and/or unbounded definitions of differential privacy, sanitized via the Gaussian mechanism.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Balle B, Wang Y (2018). “Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising.” In Dy J, Krause A (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 394–403. https://proceedings.mlr.press/v80/balle18a.html.
Liu F (2019). “Generalized Gaussian Mechanism for Differential Privacy.” IEEE Transactions on Knowledge and Data Engineering, 31(4), 747-756. https://doi.org/10.1109/TKDE.2018.2845388.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Examples
# Simulate dataset
n <- 100
c0 <- 5 # Lower bound
c1 <- 10 # Upper bound
D1 <- stats::runif(n, c0, c1)
# Privacy budget
epsilon <- 0.9 # eps must be in (0, 1) for approximate differential privacy
delta <- 0.01
sensitivity <- (c1-c0)/n
# Approximate differential privacy
private.mean.approx <- GaussianMechanism(mean(D1), epsilon, delta,
sensitivity)
private.mean.approx
# Probabilistic differential privacy
private.mean.prob <- GaussianMechanism(mean(D1), epsilon, delta, sensitivity,
type.DP = 'pDP')
private.mean.prob
# Analytic Gaussian mechanism
epsilon <- 1.1 # epsilon can be > 1 for analytic Gaussian mechanism
private.mean.analytic <- GaussianMechanism(mean(D1), epsilon, delta,
sensitivity, analytic=TRUE)
private.mean.analytic
# Simulate second dataset
d0 <- 3 # Lower bound
d1 <- 6 # Upper bound
D2 <- stats::runif(n, d0, d1)
D <- matrix(c(D1,D2),ncol=2)
sensitivities <- c((c1-c0)/n, (d1-d0)/n)
epsilon <- 0.9 # Total privacy budget for all means
delta <- 0.01
# Here, the l2-norm of the sensitivities is used to generate the Gaussian
# noise. The results satisfy (0.9,0.01)-approximate differential privacy.
private.means <- GaussianMechanism(apply(D, 2, mean), epsilon, delta,
sensitivities)
private.means
# Here, privacy budget is explicitly split so that 75% is given to the first
# vector element and 25% is given to the second.
private.means <- GaussianMechanism(apply(D, 2, mean), epsilon, delta,
sensitivities,
alloc.proportions = c(0.75, 0.25))
private.means
Laplace Mechanism
Description
This function implements the Laplace mechanism for differential privacy by adding noise to the true value(s) of a function according to specified values of epsilon and l1-global sensitivity(-ies). Global sensitivity calculated based either on bounded or unbounded differential privacy can be used (Kifer and Machanavajjhala 2011). If true.values is a vector, the provided epsilon is divided such that epsilon-differential privacy is satisfied across all function values. In the case that each element of true.values comes from its own function with different corresponding sensitivities, a vector of sensitivities may be provided. In this case, if desired, the user can specify how to divide epsilon among the function values using alloc.proportions.
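The Laplace mechanism adds noise with scale b = sensitivity / eps. A minimal sketch of this standard calibration follows (inverse-CDF sampling is shown only for self-containment; the package presumably uses its own Laplace sampler, e.g. via rmutil):

# Standard Laplace mechanism calibration: noise scale b = sensitivity / eps
eps <- 1
sensitivity <- 0.05                         # Assumed l1-global sensitivity
b <- sensitivity / eps
u <- stats::runif(1, -0.5, 0.5)
noise <- -b * sign(u) * log(1 - 2 * abs(u)) # Inverse-CDF Laplace sample
7.5 + noise                                 # Sanitized (assumed) true value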
Usage
LaplaceMechanism(true.values, eps, sensitivities, alloc.proportions = NULL)
Arguments
true.values
Real number or numeric vector corresponding to the true value(s) of the desired function.

eps
Positive real number defining the epsilon privacy parameter.

sensitivities
Real number or numeric vector corresponding to the l1-global sensitivity(-ies) of the function(s) generating true.values. This value must be of length 1 or of the same length as true.values. If it is of length 1 and true.values is a vector, this indicates that the given sensitivity applies simultaneously to all elements of true.values and that the privacy budget need not be allocated (alloc.proportions is unused in this case). If it is of the same length as true.values, this indicates that each element of true.values comes from its own function with different corresponding sensitivities. In this case, the l1-norm of the provided sensitivities is used to generate the Laplace noise.

alloc.proportions
Optional numeric vector giving the allocation proportions of epsilon to the function values in the case of vector-valued sensitivities. For example, if sensitivities is of length two and alloc.proportions = c(.75, .25), then 75% of the privacy budget eps is allocated to the noise computation for the first element of true.values, and the remaining 25% is allocated to the noise computation for the second element of true.values. This ensures eps-level privacy across all computations. Input does not need to be normalized, meaning alloc.proportions = c(3, 1) produces the same result as the example above.
Value
Sanitized function values based on the bounded and/or unbounded definitions of differential privacy, sanitized via the Laplace mechanism.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
# Simulate dataset
n <- 100
c0 <- 5 # Lower bound
c1 <- 10 # Upper bound
D1 <- stats::runif(n, c0, c1)
epsilon <- 1 # Privacy budget
sensitivity <- (c1-c0)/n
private.mean <- LaplaceMechanism(mean(D1), epsilon, sensitivity)
private.mean
# Simulate second dataset
d0 <- 3 # Lower bound
d1 <- 6 # Upper bound
D2 <- stats::runif(n, d0, d1)
D <- matrix(c(D1,D2),ncol=2)
sensitivities <- c((c1-c0)/n, (d1-d0)/n)
epsilon <- 1 # Total privacy budget for all means
# Here, sensitivities are summed and the result is used to generate Laplace
# noise. This is essentially the same as allocating epsilon proportional to
# the corresponding sensitivity. The results satisfy 1-differential privacy.
private.means <- LaplaceMechanism(apply(D, 2, mean), epsilon, sensitivities)
private.means
# Here, privacy budget is explicitly split so that 75% is given to the first
# vector element and 25% is given to the second.
private.means <- LaplaceMechanism(apply(D, 2, mean), epsilon, sensitivities,
alloc.proportions = c(0.75, 0.25))
private.means
Privacy-preserving Linear Regression
Description
This class implements differentially private linear regression using the objective perturbation technique (Kifer et al. 2012).
Details
To use this class for linear regression, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset and data bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.

Note that in order to guarantee differential privacy for linear regression, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. The regularizer must be convex. Additionally, it is assumed that if x represents a single row of the dataset X, then the l2-norm of x is at most p for all x, where p is the number of predictors (including any possible intercept term). In order to ensure this constraint is satisfied, the dataset is preprocessed and scaled, and the resulting coefficients are postprocessed and un-scaled so that the stored coefficients correspond to the original data. Due to this constraint on x, it is best to avoid using an intercept term in the model whenever possible. If an intercept term must be used, the issue can be partially circumvented by adding a constant column to X before fitting the model, which will be scaled along with the rest of X. The fit() method contains functionality to add a column of constant 1s to X before scaling, if desired.
Super class
DPpack::EmpiricalRiskMinimizationDP.KST -> LinearRegressionDP
Methods
Public methods
Inherited methods
Method new()
Create a new LinearRegressionDP object.
Usage
LinearRegressionDP$new(regularizer, eps, delta, gamma, regularizer.gr = NULL)
Arguments
regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be convex.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

delta
Nonnegative real number defining the delta privacy parameter. If 0, reduces to pure eps-DP.

gamma
Nonnegative real number representing the regularization constant.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new LinearRegressionDP object.
Method fit()
Fit the differentially private linear regression model. The function runs the objective perturbation algorithm (Kifer et al. 2012) to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data and hyperparameters. The nloptr function is used. If regularizer is given as 'l2' or if regularizer.gr is given in the construction of the object, the gradient of the objective function and the Jacobian of the constraint function are utilized for the algorithm, and the NLOPT_LD_MMA method is used. If this is not the case, the NLOPT_LN_COBYLA method is used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
LinearRegressionDP$fit(X, y, upper.bounds, lower.bounds, add.bias = FALSE)
Arguments
X
Dataframe of data to be fit.

y
Vector or matrix of true values for each row of X.

upper.bounds
Numeric vector of length ncol(X)+1 giving upper bounds on the values in each column of X and on the values of y. The last value in the vector is assumed to be the upper bound on y, while the first ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X or in y larger than the corresponding upper bound is clipped at the bound.

lower.bounds
Numeric vector of length ncol(X)+1 giving lower bounds on the values in each column of X and on the values of y. The last value in the vector is assumed to be the lower bound on y, while the first ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X or in y smaller than the corresponding lower bound is clipped at the bound.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
Method clone()
The objects of this class are cloneable with this method.
Usage
LinearRegressionDP$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Kifer D, Smith A, Thakurta A (2012). “Private Convex Empirical Risk Minimization and High-dimensional Regression.” In Mannor S, Srebro N, Williamson RC (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, 25.1–25.40. https://proceedings.mlr.press/v23/kifer12.html.
Examples
# Build example dataset
n <- 500
X <- data.frame(X=seq(-1,1,length.out = n))
true.theta <- c(-.3,.5) # First element is bias term
p <- length(true.theta)
y <- true.theta[1] + as.matrix(X)%*%true.theta[2:p] + stats::rnorm(n=n,sd=.1)
# Construct object for linear regression
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
delta <- 0 # Indicates to use pure eps-DP
gamma <- 1
regularizer.gr <- function(coeff) coeff
lrdp <- LinearRegressionDP$new('l2', eps, delta, gamma, regularizer.gr)
# Fit with data
# We must assume y is a matrix with values between -p and p (-2 and 2
# for this example)
upper.bounds <- c(1, 2) # Bounds for X and y
lower.bounds <- c(-1,-2) # Bounds for X and y
lrdp$fit(X, y, upper.bounds, lower.bounds, add.bias=TRUE)
lrdp$coeff # Gets private coefficients
# Predict new data points
# Build a test dataset
Xtest <- data.frame(X=c(-.5, -.25, .1, .4))
predicted.y <- lrdp$predict(Xtest, add.bias=TRUE)
Privacy-preserving Logistic Regression
Description
This class implements differentially private logistic regression (Chaudhuri et al. 2011). Either the output or the objective perturbation method can be used.
Details
To use this class for logistic regression, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset and data bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.

Note that in order to guarantee differential privacy for logistic regression, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. These conditions depend on the chosen perturbation method. The regularizer must be 1-strongly convex and differentiable. It also must be doubly differentiable if objective perturbation is chosen. Additionally, it is assumed that if x represents a single row of the dataset X, then the l2-norm of x is at most 1 for all x. In order to ensure this constraint is satisfied, the dataset is preprocessed and scaled, and the resulting coefficients are postprocessed and un-scaled so that the stored coefficients correspond to the original data. Due to this constraint on x, it is best to avoid using a bias term in the model whenever possible. If a bias term must be used, the issue can be partially circumvented by adding a constant column to X before fitting the model, which will be scaled along with the rest of X. The fit() method contains functionality to add a column of constant 1s to X before scaling, if desired.
Super class
DPpack::EmpiricalRiskMinimizationDP.CMS -> LogisticRegressionDP
Methods
Public methods
Method new()
Create a new LogisticRegressionDP object.
Usage
LogisticRegressionDP$new(
  regularizer,
  eps,
  gamma,
  perturbation.method = "objective",
  regularizer.gr = NULL
)
Arguments
regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be 1-strongly convex and doubly differentiable.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

gamma
Nonnegative real number representing the regularization constant.

perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation method (Chaudhuri et al. 2011). Defaults to 'objective'.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new LogisticRegressionDP object.
Method fit()
Fit the differentially private logistic regression model. This method runs either the output perturbation or the objective perturbation algorithm (Chaudhuri et al. 2011), depending on the value of perturbation.method used to construct the object, to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data and hyperparameters. The built-in optim function using the "BFGS" optimization method is used. If regularizer is given as 'l2' or if regularizer.gr is given in the construction of the object, the gradient of the objective function is utilized by optim as well. Otherwise, non-gradient based optimization methods are used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
LogisticRegressionDP$fit(X, y, upper.bounds, lower.bounds, add.bias = FALSE)
Arguments
X
Dataframe of data to be fit.

y
Vector or matrix of true labels for each row of X.

upper.bounds
Numeric vector of length ncol(X) giving upper bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X larger than the corresponding upper bound is clipped at the bound.

lower.bounds
Numeric vector of length ncol(X) giving lower bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X smaller than the corresponding lower bound is clipped at the bound.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
Method predict()
Predict label(s) for given X using the fitted coefficients.
Usage
LogisticRegressionDP$predict(X, add.bias = FALSE, raw.value = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of the same form as the X used to fit the coefficients.

add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If add.bias was set to TRUE when fitting the coefficients, add.bias should be set to TRUE for predictions.

raw.value
Boolean indicating whether to return the raw predicted value or the rounded class label. If FALSE (default), outputs the predicted labels 0 or 1. If TRUE, returns the raw score from the logistic regression.
Returns
Matrix of predicted labels or scores corresponding to each row of X.
Method clone()
The objects of this class are cloneable with this method.
Usage
LogisticRegressionDP$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Chaudhuri K, Monteleoni C (2009). “Privacy-preserving logistic regression.” In Koller D, Schuurmans D, Bengio Y, Bottou L (eds.), Advances in Neural Information Processing Systems, volume 21. https://proceedings.neurips.cc/paper/2008/file/8065d07da4a77621450aa84fee5656d9-Paper.pdf.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in (1:K)){
t <- seq(-.25, .25, length.out = N)
if (j==1) m <- stats::rnorm(N,-.2, .1)
if (j==2) m <- stats::rnorm(N, .2, .1)
Xtemp <- data.frame(x1 = 3*t , x2 = m - t)
ytemp <- data.frame(matrix(j-1, N, 1))
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
# Construct object for logistic regression
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
gamma <- 1
lrdp <- LogisticRegressionDP$new(regularizer, eps, gamma)
# Fit with data
# Bounds for X based on construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
lrdp$fit(X, y, upper.bounds, lower.bounds) # No bias term
lrdp$coeff # Gets private coefficients
# Predict new data points
predicted.y <- lrdp$predict(Xtest)
n.errors <- sum(predicted.y!=ytest)
Privacy-preserving Weighted Empirical Risk Minimization
Description
This class implements differentially private empirical risk minimization in the case where weighted observation-level losses are desired (such as weighted SVM (Yang et al. 2005)). Currently, only the output perturbation method is implemented.
Details
To use this class for weighted empirical risk minimization, first use the new() method to construct an object of this class with the desired function values and hyperparameters. After constructing the object, the fit() method can be applied with a provided dataset, data bounds, weights, and weight bounds to fit the model. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. These can be released directly, or used in conjunction with the predict() method to privately predict the outcomes of new datapoints.

Note that in order to guarantee differential privacy for weighted empirical risk minimization, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. These conditions depend on the chosen perturbation method, though currently only output perturbation is implemented. Specifically, the provided loss function must be convex and differentiable with respect to y.hat, and the absolute value of the first derivative of the loss function must be at most 1. If objective perturbation is chosen (not currently implemented), the loss function must also be doubly differentiable, and the absolute value of the second derivative of the loss function must be bounded above by a constant c for all possible values of y.hat and y, where y.hat is the predicted label and y is the true label. The regularizer must be 1-strongly convex and differentiable; it must also be doubly differentiable if objective perturbation is chosen. For the data x, it is assumed that if x represents a single row of the dataset X, then the l2-norm of x is at most 1 for all x. Note that because of this, a bias term cannot be included without appropriate scaling/preprocessing of the dataset. To ensure privacy, the add.bias argument in the fit() and predict() methods should only be utilized in subclasses within this package where appropriate preprocessing is implemented, not in this class. Finally, if weights are provided, they should be nonnegative, of the same length as y, and upper bounded by a global or public bound which must also be provided, as illustrated in the sketch below.
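As a concrete illustration of the weight constraints just described (all values here are assumed for the sketch):

# Hypothetical weights satisfying the stated constraints
y <- c(0, 1, 1, 0)                       # Labels for four observations
w <- c(0.5, 1.0, 0.25, 1.0)              # Nonnegative observation weights
weights.upper.bound <- 1                 # Global/public bound on the weights
stopifnot(length(w) == length(y),        # Same length as y
          all(w >= 0),                   # Nonnegative
          all(w <= weights.upper.bound)) # Bounded by the public bound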
Super class
DPpack::EmpiricalRiskMinimizationDP.CMS -> WeightedERMDP.CMS
Methods
Public methods
Method new()
Create a new WeightedERMDP.CMS object.
Usage
WeightedERMDP.CMS$new(
  mapXy,
  loss,
  regularizer,
  eps,
  gamma,
  perturbation.method = "objective",
  c = NULL,
  mapXy.gr = NULL,
  loss.gr = NULL,
  regularizer.gr = NULL
)
Arguments
mapXy
Map function of the form mapXy(X, coeff) mapping input data matrix X and coefficient vector or matrix coeff to output labels y. Should return a column matrix of predicted labels for each row of X. See mapXy.sigmoid for an example.

loss
Loss function of the form loss(y.hat, y, w), where y.hat and y are matrices and w is a matrix or vector of weights of the same length as y. Should be defined such that it returns a matrix of weighted loss values for each element of y.hat and y. If w is not given, the function should operate as if uniform weights were given. See generate.loss.huber for an example. It must be convex and differentiable, and the absolute value of the first derivative of the loss function must be at most 1. Additionally, if the objective perturbation method is chosen, it must be doubly differentiable and the absolute value of the second derivative of the loss function must be bounded above by a constant c for all possible values of y.hat and y.

regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be 1-strongly convex and differentiable. If the objective perturbation method is chosen, it must also be doubly differentiable.

eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs the algorithm without differential privacy.

gamma
Nonnegative real number representing the regularization constant.

perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation method (Chaudhuri et al. 2011). Defaults to 'objective'. Currently, only the output perturbation method is supported.

c
Positive real number denoting the upper bound on the absolute value of the second derivative of the loss function, as required to ensure differential privacy for the objective perturbation method. This input is unnecessary if perturbation.method is 'output', but is required if perturbation.method is 'objective'. Defaults to NULL.

mapXy.gr
Optional function representing the gradient of the map function with respect to the values in coeff. If given, must be of the form mapXy.gr(X, coeff), where X is a matrix and coeff is a matrix or numeric vector. Should be defined such that the ith row of the output represents the gradient with respect to the ith coefficient. See mapXy.gr.sigmoid for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

loss.gr
Optional function representing the gradient of the loss function with respect to y.hat, of the form loss.gr(y.hat, y, w), where y.hat and y are matrices and w is a matrix or vector of weights. Should be defined such that the ith row of the output represents the gradient of the (possibly weighted) loss function at the ith set of input values. See generate.loss.gr.huber for an example. If not given, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.

regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff, of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
Returns
A new WeightedERMDP.CMS object.
Method fit()
Fit the differentially private weighted empirical risk minimization model. This method runs either the output perturbation or the objective perturbation algorithm (Chaudhuri et al. 2011) (only output is currently implemented), depending on the value of perturbation.method used to construct the object, to generate an objective function. A numerical optimization method is then run to find optimal coefficients for fitting the model given the training data, weights, and hyperparameters. The built-in optim function using the "BFGS" optimization method is used. If mapXy.gr, loss.gr, and regularizer.gr are all given in the construction of the object, the gradient of the objective function is utilized by optim as well. Otherwise, non-gradient based optimization methods are used. The resulting privacy-preserving coefficients are stored in coeff.
Usage
WeightedERMDP.CMS$fit( X, y, upper.bounds, lower.bounds, add.bias = FALSE, weights = NULL, weights.upper.bound = NULL )
Arguments
X
Dataframe of data to be fit.
y
Vector or matrix of true labels for each row of X.
upper.bounds
Numeric vector of length ncol(X) giving upper bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X larger than the corresponding upper bound is clipped at the bound.
lower.bounds
Numeric vector of length ncol(X) giving lower bounds on the values in each column of X. The ncol(X) values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X smaller than the corresponding lower bound is clipped at the bound.
add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
weights
Numeric vector of observation weights of the same length as y.
weights.upper.bound
Numeric value representing the global or public upper bound on the weights.
Method predict()
Predict label(s) for given X using the fitted coefficients.
Usage
WeightedERMDP.CMS$predict(X, add.bias = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of same form as X used to fit coefficients.
add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If add.bias was set to TRUE when fitting the coefficients, add.bias should be set to TRUE for predictions.
Returns
Matrix of predicted labels corresponding to each row of X.
Method clone()
The objects of this class are cloneable with this method.
Usage
WeightedERMDP.CMS$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Yang X, Song Q, Cao A (2005). “Weighted support vector machine for data classification.” In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, 859-864. doi:10.1109/IJCNN.2005.1555965.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in (1:K)){
t <- seq(-.25, .25, length.out = N)
if (j==1) m <- stats::rnorm(N,-.2, .1)
if (j==2) m <- stats::rnorm(N, .2, .1)
Xtemp <- data.frame(x1 = 3*t , x2 = m - t)
ytemp <- data.frame(matrix(j-1, N, 1))
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
# Construct object for weighted linear SVM
mapXy <- function(X, coeff) X%*%coeff
# Huber loss from DPpack
huber.h <- 0.5
loss <- generate.loss.huber(huber.h)
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
gamma <- 1
perturbation.method <- 'output'
c <- 1/(2*huber.h) # Required value for SVM
mapXy.gr <- function(X, coeff) t(X)
loss.gr <- generate.loss.gr.huber(huber.h)
regularizer.gr <- function(coeff) coeff
wermdp <- WeightedERMDP.CMS$new(mapXy, loss, regularizer, eps,
gamma, perturbation.method, c,
mapXy.gr, loss.gr,
regularizer.gr)
# Fit with data
# Bounds for X based on construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
weights <- rep(1, nrow(y)) # Uniform weighting
weights[nrow(y)] <- 0.5 # half weight for last observation
wub <- 1 # Public upper bound for weights
wermdp$fit(X, y, upper.bounds, lower.bounds, weights=weights,
weights.upper.bound=wub)
wermdp$coeff # Gets private coefficients
# Predict new data points
predicted.y <- wermdp$predict(Xtest)
n.errors <- sum(round(predicted.y)!=ytest)
Calibrate Analytic Gaussian Mechanism
Description
Calibrate a Gaussian perturbation for differential privacy using the analytic Gaussian mechanism (Balle and Wang 2018).
Usage
calibrateAnalyticGaussianMechanism(epsilon, delta, sensitivity, tol = 1e-12)
Arguments
epsilon |
Positive real number defining the epsilon privacy parameter. |
delta |
Positive real number defining the delta privacy parameter. |
sensitivity |
Real number corresponding to the l2-global sensitivity. |
tol |
Error tolerance for binary search. |
Value
Standard deviation of Gaussian noise needed to achieve
(epsilon, delta)
-DP for given global sensitivity.
References
Balle B, Wang Y (2018). “Improving the Gaussian Mechanism for Differential Privacy: Analytical Calibration and Optimal Denoising.” In Dy J, Krause A (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 394–403. https://proceedings.mlr.press/v80/balle18a.html.
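As a brief illustration (the parameter values here are arbitrary), the function returns the noise scale for a statistic with unit sensitivity; for epsilon less than 1, the result can be compared against the classical Gaussian mechanism calibration:
eps <- 0.5
delta <- 1e-5
sigma <- calibrateAnalyticGaussianMechanism(eps, delta, sensitivity = 1)
sigma
# For comparison, the classical calibration (valid only for eps < 1)
sqrt(2 * log(1.25 / delta)) / eps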
Differentially Private Covariance
Description
This function computes the differentially private covariance of a pair of vectors at user-specified privacy levels of epsilon and delta.
Usage
covDP(
x1,
x2,
eps,
lower.bound1,
upper.bound1,
lower.bound2,
upper.bound2,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP"
)
Arguments
x1 , x2 |
Numeric vectors whose covariance is desired. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound1 , lower.bound2 |
Real numbers giving the global or public lower bounds of x1 and x2, respectively. |
upper.bound1 , upper.bound2 |
Real numbers giving the global or public upper bounds of x1 and x2, respectively. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
Value
Sanitized covariance based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
D1 <- sort(stats::rnorm(500, mean=3, sd=2))
D2 <- sort(stats::rnorm(500, mean=-1,sd=0.5))
lb1 <- -3 # 3 std devs below mean
lb2 <- -2.5 # 3 std devs below mean
ub1 <- 9 # 3 std devs above mean
ub2 <- .5 # 3 std devs above mean
covDP(D1, D2, 1, lb1, ub1, lb2, ub2)
covDP(D1, D2, .5, lb1, ub1, lb2, ub2, which.sensitivity='unbounded',
mechanism='Gaussian', delta=0.01)
Differentially Private Covariance Data Access Function
Description
This function performs the data access step in the computation of a differentially private covariance. The true values are computed using cov, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019). For the covariance, the sensitivities based on bounded and unbounded differential privacy are identical, so only one value is returned.
Usage
covDataAccess(x1, x2, lower.bound1, upper.bound1, lower.bound2, upper.bound2)
Arguments
x1 , x2 |
Numeric vectors whose covariance is desired. |
lower.bound1 , lower.bound2 |
Real numbers giving the lower bounds of x1 and x2, respectively. |
upper.bound1 , upper.bound2 |
Real numbers giving the upper bounds of x1 and x2, respectively. |
Value
List of the true covariance and the sensitivity calculated based on theoretical values.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
covDataAccess(c(1,4,3,2), c(-2,-3,-4,-1), 0, 5, -5, 0)
Generator for Huber Loss Function Gradient
Description
This function generates and returns the Huber loss function gradient used for privacy-preserving SVM at the specified value of h in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
generate.loss.gr.huber(h)
Arguments
h |
Positive real number for the Huber loss parameter. Lower values more closely approximate hinge loss. Higher values produce smoother Huber loss functions. |
Value
Huber loss function gradient with parameter h in the form required by EmpiricalRiskMinimizationDP.CMS.
Examples
h <- 1
huber <- generate.loss.gr.huber(h)
y.hat <- c(-.5, 1.2, -0.9)
y <- c(-1, 1, -1)
huber(y.hat,y)
huber(y.hat, y, w=c(0.1, 0.5, 1)) # Weights observation-level loss gradient
Generator for Huber Loss Function
Description
This function generates and returns the Huber loss function used for privacy-preserving SVM at the specified value of h in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
generate.loss.huber(h)
Arguments
h |
Positive real number for the Huber loss parameter. Lower values more closely approximate hinge loss. Higher values produce smoother Huber loss functions. |
Value
Huber loss function with parameter h in the form required by EmpiricalRiskMinimizationDP.CMS.
Examples
h <- 0.5
huber <- generate.loss.huber(h)
y.hat <- c(-.5, 1.2, -0.9)
y <- c(-1, 1, -1)
huber(y.hat,y)
huber(y.hat, y, w=c(0.1, 0.5, 1)) # Weights observation-level loss
Generator for Sampling Distribution Function for Gaussian Kernel
Description
This function generates and returns a sampling function corresponding to the Fourier transform of a Gaussian kernel with given parameter (Chaudhuri et al. 2011) in the form needed for svmDP.
Usage
generate.sampling(kernel.param)
Arguments
kernel.param |
Positive real number for the Gaussian (radial) kernel parameter. |
Value
Sampling function for the Gaussian kernel in the form required by svmDP.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Examples
kernel.param <- 0.1
sample <- generate.sampling(kernel.param)
d <- 5
sample(d)
Differentially Private Histogram
Description
This function computes a differentially private histogram from a vector at user-specified privacy levels of epsilon and delta. A histogram object is returned with sanitized values for the counts for easy plotting.
Usage
histogramDP(
x,
eps,
lower.bound,
upper.bound,
breaks = "Sturges",
normalize = FALSE,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP",
allow.negative = FALSE
)
Arguments
x |
Numeric vector from which the histogram will be formed. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
breaks |
Identical to the argument with the same name from hist. |
normalize |
Logical value. If FALSE (default), returned histogram counts correspond to frequencies. If TRUE, returned histogram counts correspond to densities (i.e. area of histogram is one). |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
allow.negative |
Logical value. If FALSE (default), any negative values in the sanitized histogram due to the added noise will be set to 0. If TRUE, the negative values (if any) will be returned. |
Value
Sanitized histogram based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Examples
x <- stats::rnorm(500)
graphics::hist(x) # Non-private histogram
result <- histogramDP(x, 1, -3, 3)
plot(result) # Private histogram
graphics::hist(x, freq=FALSE) # Normalized non-private histogram
result <- histogramDP(x, .5, -3, 3, normalize=TRUE,
which.sensitivity='unbounded', mechanism='Gaussian', delta=0.01,
allow.negative=TRUE)
plot(result) # Normalized private histogram (note negative values allowed)
Differentially Private Histogram Data Access Function
Description
This function performs the data access step in the computation of a differentially private histogram. The true values are computed using hist, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019).
Usage
histogramDataAccess(x, breaks, mechanism)
Arguments
x |
Numeric vector from which the histogram will be formed. |
breaks |
Identical to the argument with the same name from hist. |
mechanism |
String indicating which mechanism to use for differential privacy. If the 'Laplace' mechanism is chosen, l1 sensitivities are returned. If the 'Gaussian' or 'analytic' mechanisms are chosen, l2 sensitivities are returned. |
Value
List of the true histogram and the sensitivities calculated based on bounded and unbounded differential privacy.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
histogramDataAccess(c(1,4,3,2,3), 'Sturges', 'Laplace')
Cross Entropy Loss Function
Description
This function implements cross entropy loss used for logistic regression in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
loss.cross.entropy(y.hat, y)
Arguments
y.hat |
Vector or matrix of estimated labels. |
y |
Vector or matrix of true labels. |
Value
Vector or matrix of the cross entropy loss for each element of y.hat and y.
Examples
y.hat <- c(0.1, 0.88, 0.02)
y <- c(0, 1, 0)
loss.cross.entropy(y.hat,y)
Cross Entropy Loss Function Gradient
Description
This function implements the cross entropy loss gradient with respect to y.hat used for logistic regression in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
loss.gr.cross.entropy(y.hat, y)
Arguments
y.hat |
Vector or matrix of estimated labels. |
y |
Vector or matrix of true labels. |
Value
Vector or matrix of the cross entropy loss gradient for each element of y.hat and y.
Examples
y.hat <- c(0.1, 0.88, 0.02)
y <- c(0, 1, 0)
loss.gr.cross.entropy(y.hat,y)
Squared Error Loss Function Gradient
Description
This function implements the squared error loss gradient with respect to y.hat used for linear regression in the form required by EmpiricalRiskMinimizationDP.KST.
Usage
loss.gr.squared.error(y.hat, y)
Arguments
y.hat |
Vector or matrix of estimated values. |
y |
Vector or matrix of true values. |
Value
Vector or matrix of the squared error loss gradient for each element of y.hat and y.
Examples
y.hat <- c(0.1, 0.88, 0.02)
y <- c(-0.1, 1, .2)
loss.gr.squared.error(y.hat,y)
Squared Error Loss Function
Description
This function implements squared error loss used for linear regression in the form required by EmpiricalRiskMinimizationDP.KST.
Usage
loss.squared.error(y.hat, y)
Arguments
y.hat |
Vector or matrix of estimated values. |
y |
Vector or matrix of true values. |
Value
Vector or matrix of the squared error loss for each element of y.hat and y.
Examples
y.hat <- c(0.1, 0.88, 0.02)
y <- c(0, 1, 0)
loss.squared.error(y.hat,y)
Linear Map Function Gradient
Description
This function implements the gradient of the linear map function with respect to coeff used for linear SVM and linear regression in the form required by EmpiricalRiskMinimizationDP.CMS and EmpiricalRiskMinimizationDP.KST.
Usage
mapXy.gr.linear(X, coeff)
Arguments
X |
Matrix of data. |
coeff |
Vector or matrix of coefficients or weights. |
Value
Matrix of values of the gradient of the linear map function with respect to each value of coeff.
Examples
X <- matrix(c(1,2,3,4,5,6),nrow=2)
coeff <- c(0.5,-1,2)
mapXy.gr.linear(X,coeff)
Sigmoid Map Function Gradient
Description
This function implements the gradient of the sigmoid map function with respect to coeff used for logistic regression in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
mapXy.gr.sigmoid(X, coeff)
Arguments
X |
Matrix of data. |
coeff |
Vector or matrix of coefficients or weights. |
Value
Matrix of values of the gradient of the sigmoid function with respect to each value of coeff.
Examples
X <- matrix(c(1,2,3,4,5,6),nrow=2)
coeff <- c(0.5,-1,2)
mapXy.gr.sigmoid(X,coeff)
Linear Map Function
Description
This function implements the linear map function from data X to labels y used for linear SVM and linear regression in the form required by EmpiricalRiskMinimizationDP.CMS and EmpiricalRiskMinimizationDP.KST.
Usage
mapXy.linear(X, coeff)
Arguments
X |
Matrix of data. |
coeff |
Vector or matrix of coefficients or weights. |
Value
Matrix of values of the linear map function corresponding to each row of X.
Examples
X <- matrix(c(1,2,3,4,5,6),nrow=2)
coeff <- c(0.5,-1,2)
mapXy.linear(X,coeff)
Sigmoid Map Function
Description
This function implements the sigmoid map function from data X to labels y used for logistic regression in the form required by EmpiricalRiskMinimizationDP.CMS.
Usage
mapXy.sigmoid(X, coeff)
Arguments
X |
Matrix of data. |
coeff |
Vector or matrix of coefficients or weights. |
Value
Matrix of values of the sigmoid function corresponding to each row of X.
Examples
X <- matrix(c(1,2,3,4,5,6),nrow=2)
coeff <- c(0.5,-1,2)
mapXy.sigmoid(X,coeff)
Differentially Private Mean
Description
This function computes the differentially private mean of a given dataset at user-specified privacy levels of epsilon and delta.
Usage
meanDP(
x,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP"
)
Arguments
x |
Dataset whose mean is desired. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
Value
Sanitized mean based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Examples
D <- stats::rnorm(500, mean=3, sd=2)
lb <- -3 # 3 std devs below mean
ub <- 9 # 3 std devs above mean
meanDP(D, 1, lb, ub)
meanDP(D, .5, lb, ub, which.sensitivity='unbounded', mechanism='Gaussian',
delta=0.01)
Differentially Private Mean Data Access Function
Description
This function performs the data access step in the computation of a differentially private mean. The true values are computed using mean, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019). For the mean, the sensitivities based on bounded and unbounded differential privacy are identical, so only one value is returned.
Usage
meanDataAccess(x, lower.bound, upper.bound)
Arguments
x |
Dataset whose mean is desired. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
Value
List of the true mean and the sensitivity calculated based on theoretical values.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
meanDataAccess(c(1,4,3,2), 0, 5)
Differentially Private Median
Description
This function computes the differentially private median of an input vector at a user-specified privacy level of epsilon.
Usage
medianDP(
x,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "exponential"
)
Arguments
x |
Numeric vector of which the median will be taken. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Real number giving the global or public lower bound of x. |
upper.bound |
Real number giving the global or public upper bound of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, 0)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'exponential'}. |
Value
Sanitized median based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Smith A (2011). “Privacy-Preserving Statistical Estimation with Optimal Convergence Rates.” In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, 813–822. ISBN 9781450306911, doi:10.1145/1993636.1993743.
Examples
D <- stats::rnorm(500)
lower.bound <- -3 # 3 standard deviations below mean
upper.bound <- 3 # 3 standard deviations above mean
eps <- 1
# Get median satisfying pure 1-differential privacy
private.median <- medianDP(D, eps, lower.bound, upper.bound)
private.median
Transform Function for Gaussian Kernel Approximation
Description
This function maps an input data row x with a given prefilter to an output value in such a way as to approximate the Gaussian kernel (Chaudhuri et al. 2011).
Usage
phi.gaussian(x, theta)
Arguments
x |
Vector or matrix corresponding to one row of the dataset X. |
theta |
Randomly sampled prefilter vector of length n+1, where n is the length of x. |
Value
Mapped value corresponding to one element of the transformed space.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Examples
x <- c(1,2,3)
theta <- c(0.1, 1.1, -0.8, 3)
phi.gaussian(x, theta)
Differentially Private Pooled Covariance
Description
This function computes the differentially private pooled covariance from two or more two-column matrices of data at user-specified privacy levels of epsilon and delta.
Usage
pooledCovDP(
...,
eps = 1,
lower.bound1,
upper.bound1,
lower.bound2,
upper.bound2,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP",
approx.n.max = FALSE
)
Arguments
... |
Two or more matrices, each with two columns from which to compute the pooled covariance. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound1 , lower.bound2 |
Real numbers giving the global or public lower bounds over the first and second columns of all input data, respectively. |
upper.bound1 , upper.bound2 |
Real numbers giving the global or public upper bounds over the first and second columns of all input data, respectively. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
approx.n.max |
Logical indicating whether to approximate n.max (defined to be the length of the largest input vector) in the computation of the global sensitivity based on the upper and lower bounds of the data (Liu 2019). Approximation is best if n.max is very large. |
Value
Sanitized pooled covariance based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
# Build datasets
D1 <- sort(stats::rnorm(500, mean=3, sd=2))
D2 <- sort(stats::rnorm(500, mean=-1, sd=0.5))
D3 <- sort(stats::rnorm(200, mean=3, sd=2))
D4 <- sort(stats::rnorm(200, mean=-1, sd=0.5))
M1 <- matrix(c(D1, D2), ncol=2)
M2 <- matrix(c(D3, D4), ncol=2)
lb1 <- -3 # 3 std devs below mean
lb2 <- -2.5 # 3 std devs below mean
ub1 <- 9 # 3 std devs above mean
ub2 <- .5 # 3 std devs above mean
# Pooled covariance satisfying (1,0)-differential privacy
private.pooled.cov <- pooledCovDP(M1, M2, eps = 1, lower.bound1 = lb1,
lower.bound2 = lb2, upper.bound1 = ub1,
upper.bound2 = ub2)
private.pooled.cov
# Pooled covariance satisfying approximate (0.9, 0.01)-differential privacy
# and approximating n.max in the sensitivity calculation
private.pooled.cov <- pooledCovDP(M1, M2, eps = 0.9, lower.bound1 = lb1,
lower.bound2 = lb2, upper.bound1 = ub1,
upper.bound2 = ub2, mechanism = 'Gaussian',
delta = 0.01, approx.n.max = TRUE)
private.pooled.cov
Differentially Private Pooled Covariance Data Access Function
Description
This function performs the data access step in the computation of a differentially private pooled covariance. The true values are computed using the theoretical formula and cov, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019).
Usage
pooledCovDataAccess(
samples,
lower.bound1,
upper.bound1,
lower.bound2,
upper.bound2,
approx.n.max
)
Arguments
samples |
List of two-column matrices from which to compute the pooled covariance. |
lower.bound1 , lower.bound2 |
Real numbers giving the lower bounds of the first and second columns of samples, respectively. |
upper.bound1 , upper.bound2 |
Real numbers giving the upper bounds of the first and second columns of samples, respectively. |
approx.n.max |
Logical indicating whether to approximate n.max, which is defined to be the length of the largest input vector. Approximation is best if n.max is very large. |
Value
List of the true pooled covariance and the sensitivities calculated based on bounded and unbounded differential privacy.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
x1 <- matrix(c(1,4,-2,8,-6,-3),ncol=2)
x2 <- matrix(c(1,2,-5,7),ncol=2)
pooledCovDataAccess(list(x1,x2),-10,10,-10,10,FALSE)
Differentially Private Pooled Variance
Description
This function computes the differentially private pooled variance from two or more vectors of data at user-specified privacy levels of epsilon and delta.
Usage
pooledVarDP(
...,
eps = 1,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP",
approx.n.max = FALSE
)
Arguments
... |
Two or more vectors from which to compute the pooled variance. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Real number giving the global or public lower bound of the input data. |
upper.bound |
Real number giving the global or public upper bound of the input data. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
approx.n.max |
Logical indicating whether to approximate n.max (defined to be the length of the largest input vector) in the computation of the global sensitivity based on the upper and lower bounds of the data (Liu 2019). Approximation is best if n.max is very large. |
Value
Sanitized pooled variance based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
# Build datasets
D1 <- stats::rnorm(500, mean=3, sd=2)
D2 <- stats::rnorm(200, mean=3, sd=2)
D3 <- stats::rnorm(100, mean=3, sd=2)
lower.bound <- -3 # 3 standard deviations below mean
upper.bound <- 9 # 3 standard deviations above mean
# Get private pooled variance without approximating n.max
private.pooled.var <- pooledVarDP(D1, D2, D3, eps=1, lower.bound=lower.bound,
upper.bound = upper.bound)
private.pooled.var
# If n.max is sensitive, we can also use
private.pooled.var <- pooledVarDP(D1, D2, D3, eps=1, lower.bound=lower.bound,
upper.bound = upper.bound,
approx.n.max = TRUE)
private.pooled.var
Differentially Private Pooled Variance Data Access Function
Description
This function performs the data access step in the computation of a differentially private pooled variance. The true values are computed using the theoretical formula and var, while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Liu 2019).
Usage
pooledVarDataAccess(samples, lower.bound, upper.bound, approx.n.max)
Arguments
samples |
List of vectors from which to compute the pooled variance. |
lower.bound |
Real number giving the lower bound of the input data. |
upper.bound |
Real number giving the upper bound of the input data. |
approx.n.max |
Logical indicating whether to approximate n.max, which is defined to be the length of the largest input vector. Approximation is best if n.max is very large. |
Value
List of the true pooled variance and the sensitivities calculated based on bounded and unbounded differential privacy.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
pooledVarDataAccess(list(c(1,4,-2,8,-6),c(1,2),c(-5,-7)),-10,10,FALSE)
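For reference, a non-private sketch of one common form of the pooled variance formula (assuming this is the "theoretical formula" referenced above) is shown below; it can be compared against the true value returned by the data access function:
# Pooled variance: sum((n_i - 1) * var_i) / (sum(n_i) - k) for k samples
pooled.var <- function(samples) {
  n <- sapply(samples, length)
  v <- sapply(samples, stats::var)
  sum((n - 1) * v) / (sum(n) - length(samples))
}
pooled.var(list(c(1,4,-2,8,-6), c(1,2), c(-5,-7)))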
Differentially Private Quantile
Description
This function computes the differentially private quantile of an input vector at a user-specified privacy level of epsilon.
Usage
quantileDP(
x,
quant,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "exponential"
)
Arguments
x |
Numeric vector of which the quantile will be taken. |
quant |
Real number between 0 and 1 indicating which quantile to return. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Real number giving the global or public lower bound of x. |
upper.bound |
Real number giving the global or public upper bound of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, 0)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'exponential'}. |
Value
Sanitized quantile based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Smith A (2011). “Privacy-Preserving Statistical Estimation with Optimal Convergence Rates.” In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, 813–822. ISBN 9781450306911, doi:10.1145/1993636.1993743.
Examples
D <- stats::rnorm(500)
lower.bound <- -3 # 3 standard deviations below mean
upper.bound <- 3 # 3 standard deviations above mean
quant <- 0.25
eps <- 1
# Get 25th quantile satisfying pure 1-differential privacy
private.quantile <- quantileDP(D, quant, eps, lower.bound, upper.bound)
private.quantile
Differentially Private Quantile Data Access Function
Description
This function performs the data access step in the computation of a differentially private quantile. The utility vector is computed as in Smith (2011), while the sensitivities are calculated based on bounded and unbounded differential privacy (Kifer and Machanavajjhala 2011) according to the theoretical values (Gillenwater et al. 2021).
Usage
quantileDataAccess(x, quant, lower.bound, upper.bound)
Arguments
x |
Numeric vector. |
quant |
Real number between 0 and 1 indicating which quantile to return. |
lower.bound |
Real number giving the lower bound of the input data. |
upper.bound |
Real number giving the upper bound of the input data. |
Value
List of a vector corresponding to the utility function, the sorted and clipped vector of inputs, and the sensitivity calculated based on theoretical values.
References
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Smith A (2011). “Privacy-Preserving Statistical Estimation with Optimal Convergence Rates.” In Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing, STOC '11, 813–822. ISBN 9781450306911, doi:10.1145/1993636.1993743.
Gillenwater J, Joseph M, Kulesza A (2021). “Differentially Private Quantiles.” In Meila M, Zhang T (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, 3713–3722. http://proceedings.mlr.press/v139/gillenwater21a/gillenwater21a.pdf.
Examples
quantileDataAccess(c(1,1,-2,8,-6),.25,-10,10)
l2 Regularizer Gradient
Description
This function implements the l2 regularizer gradient in the form required by EmpiricalRiskMinimizationDP.CMS and EmpiricalRiskMinimizationDP.KST.
Usage
regularizer.gr.l2(coeff)
Arguments
coeff |
Vector or matrix of coefficients or weights. |
Value
Regularizer gradient value at the given coefficient vector.
Examples
coeff <- c(0.5,-1,2)
regularizer.gr.l2(coeff)
l2 Regularizer
Description
This function implements the l2 regularizer in the form required by EmpiricalRiskMinimizationDP.CMS and EmpiricalRiskMinimizationDP.KST.
Usage
regularizer.l2(coeff)
Arguments
coeff |
Vector or matrix of coefficients or weights. |
Value
Regularizer value at the given coefficient vector.
Examples
coeff <- c(0.5,-1,2)
regularizer.l2(coeff)
Differentially Private Standard Deviation
Description
This function computes the differentially private standard deviation of a given dataset at user-specified privacy levels of epsilon and delta.
Usage
sdDP(
x,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP"
)
Arguments
x |
Numeric vector whose standard deviation is desired. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential privacy. Currently the following mechanisms are supported: {'Laplace', 'Gaussian', 'analytic'}. Default is Laplace. |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
Value
Sanitized standard deviation based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
D <- stats::rnorm(500, mean=3, sd=2)
lb <- -3 # 3 std devs below mean
ub <- 9 # 3 std devs above mean
sdDP(D, 1, lb, ub)
sdDP(D,.5, lb, ub, which.sensitivity='unbounded', mechanism='Gaussian',
delta=0.01)
Privacy-preserving Support Vector Machine
Description
This class implements differentially private support vector machine (SVM) (Chaudhuri et al. 2011). It can be either weighted (Yang et al. 2005) or unweighted. Either the output or the objective perturbation method can be used for unweighted SVM, though only the output perturbation method is currently supported for weighted SVM.
Details
To use this class for SVM, first use the new method to construct an object of this class with the desired function values and hyperparameters, including a choice of the desired kernel. After constructing the object, the fit method can be applied to fit the model with a provided dataset, data bounds, and optional observation weights and weight upper bound. In fitting, the model stores a vector of coefficients coeff which satisfy differential privacy. Additionally, if a nonlinear kernel is chosen, the model stores a mapping function from the input data X to a higher dimensional embedding V in the form of a method XtoV as required (Chaudhuri et al. 2011). These can be released directly, or used in conjunction with the predict method to privately predict the label of new datapoints. Note that the mapping function XtoV is based on an approximation method via Fourier transforms (Rahimi and Recht 2007; Rahimi and Recht 2008).
Note that in order to guarantee differential privacy for the SVM model, certain constraints must be satisfied for the values used to construct the object, as well as for the data used to fit. These conditions depend on the chosen perturbation method. First, the loss function is assumed to be differentiable (and doubly differentiable if the objective perturbation method is used). The hinge loss, which is typically used for SVM, is not differentiable at 1. Thus, to satisfy this constraint, this class utilizes the Huber loss, a smooth approximation to the hinge loss (Chapelle 2007). The level of approximation to the hinge loss is determined by a user-specified constant, h, which defaults to 0.5, a typical value. Additionally, the regularizer must be 1-strongly convex and differentiable. It also must be doubly differentiable if objective perturbation is chosen. If weighted SVM is desired, the provided weights must be nonnegative and bounded above by a global or public value, which must also be provided.
Finally, it is assumed that if x represents a single row of the dataset X,
then the l2-norm of x is at most 1 for all x. In order to ensure this
constraint is satisfied, the dataset is preprocessed and scaled, and the
resulting coefficients are postprocessed and un-scaled so that the stored
coefficients correspond to the original data. Due to this constraint on x,
it is best to avoid using a bias term in the model whenever possible. If a
bias term must be used, the issue can be partially circumvented by adding a
constant column to X before fitting the model, which will be scaled along
with the rest of X. The fit
method contains functionality to add a
column of constant 1s to X before scaling, if desired.
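The following minimal sketch (using an arbitrary synthetic dataset) illustrates the workflow described above; see the constructor and method documentation below for the full argument lists:
# Synthetic binary classification data (illustrative only)
N <- 100
X <- data.frame(x1 = stats::rnorm(N), x2 = stats::rnorm(N))
y <- as.numeric(X$x1 + X$x2 > 0)
# Linear SVM with output perturbation and l2 regularization
model <- svmDP$new('l2', eps = 1, gamma = 0.1, perturbation.method = 'output')
upper.bounds <- c(3, 3) # Public bounds on the columns of X
lower.bounds <- c(-3, -3)
model$fit(X, y, upper.bounds, lower.bounds)
model$coeff # Private coefficients
predicted.y <- model$predict(X)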
Super classes
DPpack::EmpiricalRiskMinimizationDP.CMS
-> DPpack::WeightedERMDP.CMS
-> svmDP
Methods
Public methods
Method new()
Create a new svmDP object.
Usage
svmDP$new( regularizer, eps, gamma, perturbation.method = "objective", kernel = "linear", D = NULL, kernel.param = NULL, regularizer.gr = NULL, huber.h = 0.5 )
Arguments
regularizer
String or regularization function. If a string, must be 'l2', indicating to use l2 regularization. If a function, must have form regularizer(coeff), where coeff is a vector or matrix, and return the value of the regularizer at coeff. See regularizer.l2 for an example. Additionally, in order to ensure differential privacy, the function must be 1-strongly convex and doubly differentiable.
eps
Positive real number defining the epsilon privacy budget. If set to Inf, runs algorithm without differential privacy.
gamma
Nonnegative real number representing the regularization constant.
perturbation.method
String indicating whether to use the 'output' or the 'objective' perturbation methods (Chaudhuri et al. 2011). Defaults to 'objective'.
kernel
String indicating which kernel to use for SVM. Must be one of {'linear', 'Gaussian'}. If 'linear' (default), linear SVM is used. If 'Gaussian', uses the sampling function corresponding to the Gaussian (radial) kernel approximation.
D
Nonnegative integer indicating the dimensionality of the transform space approximating the kernel if a nonlinear kernel is used. Higher values of D provide better kernel approximations at a cost of computational efficiency. This value must be specified if a nonlinear kernel is used.
kernel.param
Positive real number corresponding to the Gaussian kernel parameter. Defaults to 1/p, where p is the number of predictors.
regularizer.gr
Optional function representing the gradient of the regularization function with respect to coeff and of the form regularizer.gr(coeff). Should return a vector. See regularizer.gr.l2 for an example. If regularizer is given as a string, this value is ignored. If not given and regularizer is a function, non-gradient based optimization methods are used to compute the coefficient values in fitting the model.
huber.h
Positive real number indicating the degree to which the Huber loss approximates the hinge loss. Defaults to 0.5 (Chapelle 2007).
Returns
A new svmDP object.
Method fit()
Fit the differentially private SVM model. This method runs either the
output perturbation or the objective perturbation algorithm (Chaudhuri et
al. 2011), depending on the value of perturbation.method used to construct
the object, to generate an objective function. A numerical optimization
method is then run to find optimal coefficients for fitting the model given
the training data, weights, and hyperparameters. The built-in optim
function with the "BFGS" optimization method is used. If regularizer is
given as 'l2' or if regularizer.gr is given in the construction of the
object, the gradient of the objective function is supplied to optim as
well; otherwise, non-gradient based optimization methods are used. The
resulting privacy-preserving coefficients are stored in coeff.
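As an illustration of this step only, a non-private sketch of the
optimization might look as follows; fit.erm.sketch and its objective are
assumed names, and the private algorithms additionally apply output or
objective perturbation.

# Non-private sketch of the optimization core of fit(); assumed names.
huber.hinge <- function(z, h = 0.5) {
  ifelse(z > 1 + h, 0, ifelse(z < 1 - h, 1 - z, (1 + h - z)^2 / (4 * h)))
}
fit.erm.sketch <- function(X, y, gamma, h = 0.5) {
  X <- as.matrix(X)
  y2 <- 2 * as.numeric(y) - 1          # recode labels {0,1} -> {-1,1}
  obj <- function(w) {
    mean(huber.hinge(y2 * drop(X %*% w), h)) + gamma * sum(w^2) / 2
  }
  # BFGS with a numerically approximated gradient; supplying the exact
  # gradient (as when regularizer.gr is available) speeds this step up.
  stats::optim(rep(0, ncol(X)), obj, method = "BFGS")$par
}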
Usage
svmDP$fit(
  X,
  y,
  upper.bounds,
  lower.bounds,
  add.bias = FALSE,
  weights = NULL,
  weights.upper.bound = NULL
)
Arguments
X
Dataframe of data to be fit.
y
Vector or matrix of true labels for each row of X.
upper.bounds
Numeric vector of length ncol(X) giving upper bounds on the values in each
column of X. The ncol(X) values are assumed to be in the same order as the
corresponding columns of X. Any value in the columns of X larger than the
corresponding upper bound is clipped at the bound.
lower.bounds
Numeric vector of length ncol(X) giving lower bounds on the values in each
column of X. The ncol(X) values are assumed to be in the same order as the
corresponding columns of X. Any value in the columns of X smaller than the
corresponding lower bound is clipped at the bound.
add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE.
weights
Numeric vector of observation weights of the same length as y. If not
given, no observation weighting is performed.
weights.upper.bound
Numeric value representing the global or public upper bound on the weights.
Required if weights are given.
Method XtoV()
Convert input data X into transformed data V. Uses sampled pre-filter values and a mapping function based on the chosen kernel to produce D-dimensional data V on which to train the model or predict future values. This method is only used if the kernel is nonlinear. See Chaudhuri et al. (2011) for more details.
Usage
svmDP$XtoV(X)
Arguments
X
Matrix corresponding to the original dataset.
Returns
Matrix V of size n by D representing the transformed dataset, where n is the number of rows of X, and D is the provided transformed space dimension.
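For intuition, below is a sketch of one standard random Fourier feature map
for the Gaussian kernel (Rahimi and Recht 2007); the function name and
sampling details are illustrative, not the package's internal
implementation.

# Sketch: random Fourier features for the Gaussian (radial) kernel
# k(x, x') = exp(-kernel.param * ||x - x'||^2). The sampled W and b
# play the role of the pre-filter values and are reused at predict time.
rff.sketch <- function(X, D, kernel.param) {
  X <- as.matrix(X)
  W <- matrix(stats::rnorm(ncol(X) * D, sd = sqrt(2 * kernel.param)),
              ncol(X), D)
  b <- stats::runif(D, 0, 2 * pi)
  sqrt(2 / D) * cos(sweep(X %*% W, 2, b, "+"))   # n-by-D matrix V
}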
Method predict()
Predict label(s) for given X using the fitted coefficients.
Usage
svmDP$predict(X, add.bias = FALSE, raw.value = FALSE)
Arguments
X
Dataframe of data on which to make predictions. Must be of the same form as
the X used to fit the coefficients.
add.bias
Boolean indicating whether to add a bias term to X. Defaults to FALSE. If
add.bias was set to TRUE when fitting the coefficients, add.bias should
also be set to TRUE for predictions.
raw.value
Boolean indicating whether to return the raw predicted value or the rounded
class label. If FALSE (default), outputs the predicted labels 0 or 1. If
TRUE, returns the raw score from the SVM model.
Returns
Matrix of predicted labels or scores corresponding to each row of X.
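The effect of raw.value can be sketched as follows (assumed form; scaling
and any kernel transform are handled internally by the class):

# Sketch of the decision rule: the raw score is the linear score on the
# (possibly transformed) data; labels are 1 for positive scores.
predict.sketch <- function(X, coeff, raw.value = FALSE) {
  score <- as.matrix(X) %*% coeff
  if (raw.value) score else (score > 0) * 1
}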
Method clone()
The objects of this class are cloneable with this method.
Usage
svmDP$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Yang X, Song Q, Cao A (2005). “Weighted support vector machine for data classification.” In Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., volume 2, 859-864 vol. 2. doi:10.1109/IJCNN.2005.1555965.
Chapelle O (2007). “Training a Support Vector Machine in the Primal.” Neural Computation, 19(5), 1155-1178. doi:10.1162/neco.2007.19.5.1155.
Rahimi A, Recht B (2007). “Random Features for Large-Scale Kernel Machines.” In Platt J, Koller D, Singer Y, Roweis S (eds.), Advances in Neural Information Processing Systems, volume 20. https://proceedings.neurips.cc/paper/2007/file/013a006f03dbc5392effeb8f18fda755-Paper.pdf.
Rahimi A, Recht B (2008). “Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning.” In Koller D, Schuurmans D, Bengio Y, Bottou L (eds.), Advances in Neural Information Processing Systems, volume 21. https://proceedings.neurips.cc/paper/2008/file/0efe32849d230d7f53049ddc4a4b0c60-Paper.pdf.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 400
X <- data.frame()
y <- data.frame()
for (i in (1:N)){
Xtemp <- data.frame(x1 = stats::rnorm(1,sd=.28) , x2 = stats::rnorm(1,sd=.28))
if (sum(Xtemp^2)<.15) ytemp <- data.frame(y=0)
else ytemp <- data.frame(y=1)
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,N,10),]
ytest <- y[seq(1,N,10),,drop=FALSE]
X <- X[-seq(1,N,10),]
y <- y[-seq(1,N,10),,drop=FALSE]
# Construct object for SVM
regularizer <- 'l2' # Alternatively, function(coeff) coeff%*%coeff/2
eps <- 1
gamma <- 1
perturbation.method <- 'output'
kernel <- 'Gaussian'
D <- 20
svmdp <- svmDP$new(regularizer, eps, gamma, perturbation.method,
kernel=kernel, D=D)
# Fit with data
# Bounds for X based on construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
weights <- rep(1, nrow(y)) # Uniform weighting
weights[nrow(y)] <- 0.5 # Half weight for last observation
wub <- 1 # Public upper bound for weights
svmdp$fit(X, y, upper.bounds, lower.bounds, weights=weights,
weights.upper.bound=wub) # No bias term
# Predict new data points
predicted.y <- svmdp$predict(Xtest)
n.errors <- sum(predicted.y!=ytest)
Differentially Private Contingency Table
Description
This function computes a differentially private contingency table from given vectors of data at user-specified privacy levels of epsilon and delta.
Usage
tableDP(
...,
eps = 1,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP",
allow.negative = FALSE
)
Arguments
... |
Vectors of data from which to create the contingency table. |
eps |
Positive real number defining the epsilon privacy budget. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential
privacy. Currently the following mechanisms are supported: {'Laplace',
'Gaussian', 'analytic'}. Default is Laplace. See |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
allow.negative |
Logical value. If FALSE (default), any negative values in the sanitized table due to the added noise will be set to 0. If TRUE, the negative values (if any) will be returned. |
Value
Sanitized contingency table based on the bounded and/or unbounded definitions of differential privacy.
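For intuition, a minimal sketch of the Laplace approach under the bounded
definition follows. Under bounded DP, the l1 sensitivity of a contingency
table is 2, since editing one record moves a unit of count between two
cells; this sketch is not the package's internal code.

# Sketch: Laplace-sanitized contingency table under bounded DP.
# rmutil::rlaplace(n, m, s) draws Laplace noise with location m, scale s.
laplace.table.sketch <- function(..., eps = 1) {
  tab <- table(...)
  sens <- 2                              # bounded-DP l1 sensitivity
  noisy <- tab + rmutil::rlaplace(length(tab), 0, sens / eps)
  # Mirror allow.negative = FALSE: set negative sanitized counts to 0
  pmax(noisy, 0)
}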
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Examples
x <- MASS::Cars93$Type
y <- MASS::Cars93$Origin
z <- MASS::Cars93$AirBags
tableDP(x,y,eps=1,which.sensitivity='bounded',mechanism='Laplace',
type.DP='pDP')
tableDP(x,y,z,eps=.5,which.sensitivity='unbounded',mechanism='Gaussian',
delta=0.01)
Differentially Private Contingency Table Data Access Function
Description
This function performs the data access step in the computation of a
differentially private contingency table. The true values are computed
using table, while the sensitivities are calculated based on bounded and
unbounded differential privacy (Kifer and Machanavajjhala 2011) according
to the theoretical values (Liu 2019).
Usage
tableDataAccess(..., mechanism = "Laplace")
Arguments
... |
Vectors of data from which to create the contingency table. |
mechanism |
String indicating which mechanism to use for differential privacy. If the 'Laplace' mechanism is chosen, l1 sensitivities are returned. If the 'Gaussian' or 'analytic' mechanisms are chosen, l2 sensitivities are returned. |
Value
List of the true contingency table and the sensitivities calculated based on bounded and unbounded differential privacy.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
x <- MASS::Cars93$Type
y <- MASS::Cars93$Origin
tableDataAccess(x, y, mechanism='Laplace')
Privacy-preserving Hyperparameter Tuning for Binary Classification Models
Description
This function implements the privacy-preserving hyperparameter tuning
function for binary classification (Chaudhuri et al. 2011) using the
exponential mechanism. It accepts a list of DP models with various chosen
hyperparameters, a dataset X with corresponding labels y, upper and lower
bounds on the columns of X, and a boolean indicating whether to add bias in
the construction of each of the models. The data are split into m+1 equal
groups, where m is the number of models being compared. One group is set
aside as the validation group, and each of the other m groups is used to
train one of the given m models. The number of errors on the validation set
is counted for each model and used as the utility value in the exponential
mechanism (ExponentialMechanism) to select a tuned model in a
privacy-preserving way.
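As a sketch of the selection step, with utility equal to the negative
validation error count (an error count has sensitivity 1, since changing
one record changes at most one error):

# Sketch: exponential-mechanism selection over validation error counts.
select.sketch <- function(errors, eps) {
  u <- -errors                         # utility: fewer errors is better
  # exp(eps * u / (2 * sensitivity)) with sensitivity 1; shift by max(u)
  # for numerical stability before normalizing
  w <- exp(eps * (u - max(u)) / 2)
  sample(seq_along(errors), 1, prob = w / sum(w))
}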
Usage
tune_classification_model(
DPmodels,
X,
y,
upper.bounds,
lower.bounds,
add.bias = FALSE,
weights = NULL,
weights.upper.bound = NULL
)
Arguments
DPmodels |
Vector of binary classification model objects, each initialized with a different combination of hyperparameter values from the search space for tuning. Each model should be initialized with the same epsilon privacy parameter value eps. The tuned model satisfies eps-level differential privacy. |
X |
Dataframe of data to be used in tuning the model. Note it is assumed the data rows and corresponding labels are randomly shuffled. |
y |
Vector or matrix of true labels for each row of X. |
upper.bounds |
Numeric vector giving upper bounds on the values in each column of X. Should be of length ncol(X). The values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X larger than the corresponding upper bound is clipped at the bound. |
lower.bounds |
Numeric vector giving lower bounds on the values in each column of X. Should be of length ncol(X). The values are assumed to be in the same order as the corresponding columns of X. Any value in the columns of X smaller than the corresponding lower bound is clipped at the bound. |
add.bias |
Boolean indicating whether to add a bias term to X. Defaults to FALSE. |
weights |
Numeric vector of observation weights of the same length as y. |
weights.upper.bound |
Numeric value representing the global or public upper bound on the weights. |
Value
Single model object selected from the input list DPmodels with tuned parameters.
References
Chaudhuri K, Monteleoni C, Sarwate AD (2011). “Differentially Private Empirical Risk Minimization.” Journal of Machine Learning Research, 12(29), 1069-1109. https://jmlr.org/papers/v12/chaudhuri11a.html.
Examples
# Build train dataset X and y, and test dataset Xtest and ytest
N <- 200
K <- 2
X <- data.frame()
y <- data.frame()
for (j in (1:K)){
t <- seq(-.25,.25,length.out = N)
if (j==1) m <- stats::rnorm(N,-.2,.1)
if (j==2) m <- stats::rnorm(N, .2,.1)
Xtemp <- data.frame(x1 = 3*t , x2 = m - t)
ytemp <- data.frame(matrix(j-1, N, 1))
X <- rbind(X, Xtemp)
y <- rbind(y, ytemp)
}
Xtest <- X[seq(1,(N*K),10),]
ytest <- y[seq(1,(N*K),10),,drop=FALSE]
X <- X[-seq(1,(N*K),10),]
y <- y[-seq(1,(N*K),10),,drop=FALSE]
y <- as.matrix(y)
weights <- rep(1, nrow(y)) # Uniform weighting
weights[nrow(y)] <- 0.5 # half weight for last observation
wub <- 1 # Public upper bound for weights
# Grid of possible gamma values for tuning the SVM model
grid.search <- c(100, 1, .0001)
# Construct objects for SVM parameter tuning
eps <- 1 # Privacy budget should be the same for all models
svmdp1 <- svmDP$new("l2", eps, grid.search[1], perturbation.method='output')
svmdp2 <- svmDP$new("l2", eps, grid.search[2], perturbation.method='output')
svmdp3 <- svmDP$new("l2", eps, grid.search[3], perturbation.method='output')
DPmodels <- c(svmdp1, svmdp2, svmdp3)
# Tune using data and bounds for X based on its construction
upper.bounds <- c( 1, 1)
lower.bounds <- c(-1,-1)
tuned.model <- tune_classification_model(DPmodels, X, y, upper.bounds,
lower.bounds, weights=weights,
weights.upper.bound=wub)
tuned.model$gamma # Gives resulting selected hyperparameter
# tuned.model can be used in the same way as a trained svmDP model
# Predict new data points
predicted.y <- tuned.model$predict(Xtest)
n.errors <- sum(predicted.y!=ytest)
Privacy-preserving Hyperparameter Tuning for Linear Regression Models
Description
This function implements the privacy-preserving hyperparameter tuning
function for linear regression (Kifer et al. 2012) using the exponential
mechanism. It accepts a list of DP models with various chosen
hyperparameters, a dataset X with corresponding values y, upper and lower
bounds on the columns of X and the values of y, and a boolean indicating
whether to add bias in the construction of each of the models. The data are
split into m+1 equal groups, where m is the number of models being
compared. One group is set aside as the validation group, and each of the
other m groups is used to train one of the given m models. The negative sum
of squared errors for each model on the validation set is used as the
utility value in the exponential mechanism (ExponentialMechanism) to select
a tuned model in a privacy-preserving way.
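One possible form of the m+1-way split described above (assumed grouping;
the rows are taken as already shuffled):

# Sketch: split n rows into m + 1 roughly equal groups; group j <= m
# trains candidate model j, and group m + 1 is held out for validation.
m <- 3
n <- 120
groups <- rep(seq_len(m + 1), length.out = n)
validation.rows <- which(groups == m + 1)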
Usage
tune_linear_regression_model(
DPmodels,
X,
y,
upper.bounds,
lower.bounds,
add.bias = FALSE
)
Arguments
DPmodels |
Vector of linear regression model objects, each initialized with a different combination of hyperparameter values from the search space for tuning. Each model should be initialized with the same epsilon privacy parameter value eps. The tuned model satisfies eps-level differential privacy. |
X |
Dataframe of data to be used in tuning the model. Note it is assumed the data rows and corresponding labels are randomly shuffled. |
y |
Vector or matrix of true values for each row of X. |
upper.bounds |
Numeric vector giving upper bounds on the values in each column of X and the values in y. Should be length ncol(X)+1. The first ncol(X) values are assumed to be in the same order as the corresponding columns of X, while the last value in the vector is assumed to be the upper bound on y. Any value in the columns of X and y larger than the corresponding upper bound is clipped at the bound. |
lower.bounds |
Numeric vector giving lower bounds on the values in each column of X and the values in y. Should be length ncol(X)+1. The first ncol(X) values are assumed to be in the same order as the corresponding columns of X, while the last value in the vector is assumed to be the lower bound on y. Any value in the columns of X and y smaller than the corresponding lower bound is clipped at the bound. |
add.bias |
Boolean indicating whether to add a bias term to X. Defaults to FALSE. |
Value
Single model object selected from the input list DPmodels with tuned parameters.
References
Kifer D, Smith A, Thakurta A (2012). “Private Convex Empirical Risk Minimization and High-dimensional Regression.” In Mannor S, Srebro N, Williamson RC (eds.), Proceedings of the 25th Annual Conference on Learning Theory, volume 23 of Proceedings of Machine Learning Research, 25.1–25.40. https://proceedings.mlr.press/v23/kifer12.html.
Examples
# Build example dataset
n <- 500
X <- data.frame(X=seq(-1,1,length.out = n))
true.theta <- c(-.3,.5) # First element is bias term
p <- length(true.theta)
y <- true.theta[1] + as.matrix(X)%*%true.theta[2:p] + stats::rnorm(n=n,sd=.1)
# Grid of possible gamma values for tuning linear regression model
grid.search <- c(100, 1, .0001)
# Construct objects for linear regression parameter tuning
# Privacy budget should be the same for all models
eps <- 1
delta <- 0.01
linrdp1 <- LinearRegressionDP$new("l2", eps, delta, grid.search[1])
linrdp2 <- LinearRegressionDP$new("l2", eps, delta, grid.search[2])
linrdp3 <- LinearRegressionDP$new("l2", eps, delta, grid.search[3])
DPmodels <- c(linrdp1, linrdp2, linrdp3)
# Tune using data and bounds for X and y based on their construction
upper.bounds <- c( 1, 2) # Bounds for X and y
lower.bounds <- c(-1,-2) # Bounds for X and y
tuned.model <- tune_linear_regression_model(DPmodels, X, y, upper.bounds,
lower.bounds, add.bias=TRUE)
tuned.model$gamma # Gives resulting selected hyperparameter
# tuned.model can be used in the same way as a trained LinearRegressionDP model
tuned.model$coeff # Gives coefficients for tuned model
# Build a test dataset for prediction
Xtest <- data.frame(X=c(-.5, -.25, .1, .4))
predicted.y <- tuned.model$predict(Xtest, add.bias=TRUE)
Differentially Private Variance
Description
This function computes the differentially private variance of a given dataset at user-specified privacy levels of epsilon and delta.
Usage
varDP(
x,
eps,
lower.bound,
upper.bound,
which.sensitivity = "bounded",
mechanism = "Laplace",
delta = 0,
type.DP = "aDP"
)
Arguments
x |
Numeric vector whose variance is desired. |
eps |
Positive real number defining the epsilon privacy budget. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
which.sensitivity |
String indicating which type of sensitivity to use. Can be one of {'bounded', 'unbounded', 'both'}. If 'bounded' (default), returns result based on bounded definition for differential privacy. If 'unbounded', returns result based on unbounded definition. If 'both', returns result based on both methods (Kifer and Machanavajjhala 2011). Note that if 'both' is chosen, each result individually satisfies (eps, delta)-differential privacy, but may not do so collectively and in composition. Care must be taken not to violate differential privacy in this case. |
mechanism |
String indicating which mechanism to use for differential
privacy. Currently the following mechanisms are supported: {'Laplace',
'Gaussian', 'analytic'}. Default is Laplace. See |
delta |
Nonnegative real number defining the delta privacy parameter. If 0 (default), reduces to eps-DP. |
type.DP |
String indicating the type of differential privacy desired for the Gaussian mechanism (if selected). Can be either 'pDP' for probabilistic DP (Machanavajjhala et al. 2008) or 'aDP' for approximate DP (Dwork et al. 2006). Note that if 'aDP' is chosen, epsilon must be strictly less than 1. |
Value
Sanitized variance based on the bounded and/or unbounded definitions of differential privacy.
References
Dwork C, McSherry F, Nissim K, Smith A (2006). “Calibrating Noise to Sensitivity in Private Data Analysis.” In Halevi S, Rabin T (eds.), Theory of Cryptography, 265–284. ISBN 978-3-540-32732-5, https://doi.org/10.1007/11681878_14.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Machanavajjhala A, Kifer D, Abowd J, Gehrke J, Vilhuber L (2008). “Privacy: Theory meets Practice on the Map.” In 2008 IEEE 24th International Conference on Data Engineering, 277-286. doi:10.1109/ICDE.2008.4497436.
Dwork C, Kenthapadi K, McSherry F, Mironov I, Naor M (2006). “Our Data, Ourselves: Privacy Via Distributed Noise Generation.” In Vaudenay S (ed.), Advances in Cryptology - EUROCRYPT 2006, 486–503. ISBN 978-3-540-34547-3, doi:10.1007/11761679_29.
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Examples
D <- stats::rnorm(500, mean=3, sd=2)
lb <- -3 # 3 std devs below mean
ub <- 9 # 3 std devs above mean
varDP(D, 1, lb, ub)
varDP(D,.5, lb, ub, which.sensitivity='unbounded', mechanism='Gaussian',
delta=0.01)
Differentially Private Variance Data Access Function
Description
This function performs the data access step in the computation of a
differentially private variance. The true values are computed using var,
while the sensitivities are calculated based on bounded and unbounded
differential privacy (Kifer and Machanavajjhala 2011) according to the
theoretical values (Liu 2019). For the variance, the sensitivities based on
bounded and unbounded differential privacy are identical, so only one value
is returned.
Usage
varDataAccess(x, lower.bound, upper.bound)
Arguments
x |
Dataset whose variance is desired. |
lower.bound |
Scalar representing the global or public lower bound on values of x. |
upper.bound |
Scalar representing the global or public upper bound on values of x. |
Value
List of the true variance and the sensitivity calculated based on theoretical values.
References
Liu F (2019). “Statistical Properties of Sanitized Results from Differentially Private Laplace Mechanism with Univariate Bounding Constraints.” Transactions on Data Privacy, 12(3), 169-195. http://www.tdp.cat/issues16/tdp.a316a18.pdf.
Kifer D, Machanavajjhala A (2011). “No Free Lunch in Data Privacy.” In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, 193–204. ISBN 9781450306614, doi:10.1145/1989323.1989345.
Examples
varDataAccess(c(1,4,3,2), 0, 5)