Type: Package
Title: Heckman Selection Models Based on Bayesian Analysis
Version: 1.0.0
Maintainer: Heeju Lim <heeju.lim@uconn.edu>
Description: Implements Heckman selection models using a Bayesian approach via 'Stan' and compares the performance of normal, Student’s t, and contaminated normal distributions in addressing complexities and selection bias (Heeju Lim, Victor E. Lachos, and Victor H. Lachos, Bayesian analysis of flexible Heckman selection models using Hamiltonian Monte Carlo, 2025, under submission).
Imports: rstan (≥ 2.26.23), mvtnorm (≥ 1.2-3), loo, stats
License: GPL-3
Encoding: UTF-8
LazyData: true
RoxygenNote: 7.3.2
NeedsCompilation: no
Packaged: 2025-05-02 19:55:20 UTC; heeju
Author: Heeju Lim [aut, cre], Victor E. Lachos [aut], Victor H. Lachos [aut]
Depends: R (≥ 3.5.0)
Repository: CRAN
Date/Publication: 2025-05-06 08:50:05 UTC

Fit the Heckman Selection Stan model using the Normal, Student-t or Contaminated Normal distributions.

Description

'HeckmanStan()' fits the Heckman selection model using a Bayesian approach to address sample selection bias.

Usage

HeckmanStan(
  y,
  x,
  w,
  cc,
  family = "CN",
  init = "random",
  thin = 5,
  chains = 1,
  iter = 10,
  warmup = 5
)

Arguments

y

A response vector.

x

A covariate matrix for the response y.

w

A covariate matrix for the missing indicator cc.

cc

A missing indicator vector (1 = observed, 0 = missing).

family

The distribution family to be used ("Normal", "T", or "CN").

init

Specifies the initial values for the model parameters.

thin

The interval at which samples are retained from the MCMC process to reduce autocorrelation.

chains

The number of chains to run during the MCMC sampling. Running multiple chains is useful for checking convergence.

iter

The total number of iterations for the MCMC sampling, determining how many samples will be drawn.

warmup

The number of initial iterations that will be discarded as the algorithm stabilizes before collecting samples.
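
As a rough illustration of how iter, warmup, thin and chains interact, the following sketch computes the approximate number of post-warmup draws that are kept. The helper retained_draws() is hypothetical and not part of the package, and the exact count reported by 'Stan' may differ slightly from this simple formula.

```r
# Hypothetical helper (not part of the package): approximate number of
# post-warmup draws retained across all chains.
retained_draws <- function(iter, warmup, thin, chains = 1) {
  stopifnot(iter > warmup, thin >= 1, chains >= 1)
  chains * floor((iter - warmup) / thin)
}

# Defaults shown in Usage: iter = 10, warmup = 5, thin = 5, chains = 1
retained_draws(iter = 10, warmup = 5, thin = 5)        # 1

# Settings used in the example: iter = 10000, warmup = 1000, thin = 5
retained_draws(iter = 10000, warmup = 1000, thin = 5)  # 1800
```

With the defaults only about one draw is kept, so iter and warmup would normally be raised for any real analysis, as the example does.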

Value

An object of class HeckmanStan, which is a list containing two elements:

Examples


################################################################################
# Simulation
################################################################################
library(mvtnorm)
n <- 100
# Selection covariates: intercept plus two predictors.
# The outcome equation uses the first two columns only.
w <- cbind(1, rnorm(n), rnorm(n))
x <- cbind(w[, 1:2])
family <- "CN"
sigma2 <- 1
rho <- 0.7
beta <- c(1, 0.5)
gamma <- c(1, 0.3, -0.5)
nu <- c(0.1, 0.1)
data <- geraHeckman(x, w, beta, gamma, sigma2, rho, nu, family = family)
y <- data$y
cc <- data$cc
# Fit Heckman Normal Stan model
fit.n_stan <- HeckmanStan(y, x, w, cc, family = "Normal",
                          thin = 5, chains = 1, iter = 10000, warmup = 1000)
# Quantities of interest to summarise and plot
qoi <- c("beta", "gamma", "sigma_e", "sigma2", "rho", "EAIC", "EBIC")
print(fit.n_stan[[1]], pars = qoi)
print(fit.n_stan[[2]])

library(rstan)
plot(fit.n_stan[[1]], pars = qoi)
plot(fit.n_stan[[1]], plotfun = "hist", pars = qoi)
plot(fit.n_stan[[1]], plotfun = "trace", pars = qoi)
plot(fit.n_stan[[1]], plotfun = "rhat")




MEPS 2001: Ambulatory Expenditures Data

Description

This dataset is an extract from the 2001 Medical Expenditure Panel Survey (MEPS), providing information on ambulatory expenditures and various demographic and health-related variables. It has been used for illustrative examples by Cameron and Trivedi (2009, Chapter 16).

Usage

data(MEPS2001)

Format

A data frame with 3,328 observations on the following 22 variables.

educ

Education status

age

Age

income

Income

female

Gender

vgood

Self-reported health status, very good

good

Self-reported health status, good

hospexp

Hospital expenditures

totchr

Total number of chronic diseases

ffs

Family support

dhospexp

Dummy variable for hospital expenditures

age2

Age squared

agefem

Interaction between age and gender

fairpoor

Self-reported health status, fair or poor

year01

Year of survey

instype

Type of insurance

ambexp

Ambulatory expenditures

lambexp

Log of ambulatory expenditures

blhisp

Ethnicity

instype_s1

Insurance type, version 1

dambexp

Dummy variable for ambulatory expenditures

lnambx

Log-transformed ambulatory expenditures

ins

Insurance status

Source

2001 Medical Expenditure Panel Survey by the Agency for Healthcare Research and Quality.

References

Cameron, A.C. and Trivedi, P.K. (2009). *Microeconometrics Using Stata*. College Station, TX: Stata Press.

Examples


data(MEPS2001)
head(MEPS2001)


Panel Study of Income Dynamics 1976 Extract

Description

Cross-section data originating from the 1976 Panel Study of Income Dynamics (PSID). The dataset includes demographic and economic characteristics of married women and their husbands, and is commonly used for analyzing female labor force participation.

Usage

data(PSID1976)

Format

A data frame with 753 observations on the following 21 variables.

age

age of the woman

city

dummy for living in a city

college

dummy for college education (woman)

education

years of education (woman)

experience

years of labor market experience

feducation

father's years of education

fincome

family income in 1,000s

hage

husband's age

hcollege

dummy for husband's college education

heducation

husband's years of education

hhours

husband's weekly working hours

hours

woman's weekly working hours

hwage

husband's log hourly wage

meducation

mother's years of education

oldkids

number of children older than 6

participation

dummy for woman's labor force participation

repwage

replacement wage (predicted wage if not employed)

tax

marginal tax rate

unemp

state unemployment rate

wage

log hourly wage of the woman

youngkids

number of children 6 or younger

References

Mroz, T. A. (1987). The sensitivity of an empirical model of married women's hours of work to economic and statistical assumptions. *Econometrica*, 55(4), 765–799.

Examples


data(PSID1976)
head(PSID1976)
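
To illustrate how a data set of this shape could be mapped onto the y, x, w and cc arguments of 'HeckmanStan()', here is a minimal sketch on a toy data frame. All values are made up, the column names merely mirror those documented above, and both the 0/1 coding of participation and the placeholder used for unobserved responses are assumptions rather than documented behaviour of the package.

```r
# Toy data with made-up values; column names mirror the PSID1976 documentation.
set.seed(42)
n <- 8
toy <- data.frame(
  participation = c(1, 1, 0, 1, 0, 1, 1, 0),  # assumed 0/1 coding
  education     = c(12, 16, 10, 14, 12, 17, 12, 11),
  experience    = c(10, 5, 0, 8, 2, 12, 20, 1),
  age           = c(35, 30, 42, 38, 50, 33, 45, 29),
  youngkids     = c(0, 1, 2, 0, 0, 1, 0, 2)
)
# Wage is only observed for participants.
toy$wage <- ifelse(toy$participation == 1, round(rnorm(n, 1.2, 0.5), 2), NA)

# Missing indicator: 1 = observed, 0 = missing.
cc <- toy$participation

# Design matrices (model.matrix adds the intercept column).
x <- model.matrix(~ education + experience, data = toy)
w <- model.matrix(~ education + experience + age + youngkids, data = toy)

# Response vector; 0 is an assumed placeholder where the wage is unobserved.
y <- ifelse(cc == 1, toy$wage, 0)
```

Here w contains the covariates of x plus additional selection covariates, the same pattern as in the simulation examples, where x is built from the first columns of w.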


Generating Heckman data: Normal, Student-t and Contaminated Normal

Description

'geraHeckman()' generates a random sample from the Heckman selection model (Normal, Student-t or CN).

Usage

geraHeckman(x, w, beta, gamma, sigma2, rho, nu, family = "T")

Arguments

x

A covariate matrix for the response y.

w

A covariate matrix for the missing indicator cc.

beta

Values for the beta vector.

gamma

Values for the gamma vector.

sigma2

Value for the variance.

rho

Value for the dependence (correlation) between the response and the missing indicator.

nu

When using the t-distribution, the initial value for the degrees of freedom.

family

The distribution family to be used (Normal, T, or CN).

Value

Returns an object containing the response (y) and the missing indicator (cc).
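
For intuition about the kind of data involved, the following base R sketch simulates from a textbook normal Heckman selection model. It is not the implementation of 'geraHeckman()': the parameterisation (unit variance in the selection equation, correlation rho between the two error terms) and the use of NA for unobserved responses are assumptions made for illustration only.

```r
# Textbook normal Heckman selection model, simulated with base R only.
set.seed(1)
n <- 500
w <- cbind(1, rnorm(n), rnorm(n))   # selection covariates
x <- w[, 1:2]                       # outcome covariates
beta <- c(1, 0.5)
gamma <- c(1, 0.3, -0.5)
sigma2 <- 1
rho <- 0.7

# Covariance of (outcome error, selection error).
Sigma <- matrix(c(sigma2,             rho * sqrt(sigma2),
                  rho * sqrt(sigma2), 1), nrow = 2)

# Correlated normal errors via the Cholesky factor of Sigma.
errors <- matrix(rnorm(2 * n), n, 2) %*% chol(Sigma)

y_latent <- drop(x %*% beta) + errors[, 1]    # outcome equation
s_latent <- drop(w %*% gamma) + errors[, 2]   # selection equation

cc <- as.integer(s_latent > 0)        # 1 = observed, 0 = missing
y <- ifelse(cc == 1, y_latent, NA)    # response seen only when selected

mean(cc)   # observed fraction
```

Because the two errors are correlated, the observed responses are not a random subsample of the latent ones, which is the selection bias the model is meant to address.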

References

Lachos, V. H., Prates, M. O., & Dey, D. K. (2021). Heckman selection-t model: Parameter estimation via the EM-algorithm. Journal of Multivariate Analysis, 184, 104737.

Examples


n <- 100
rho <- 0.6
cens <- 0.25
nu <- 4
set.seed(20200527)
w <- cbind(1, runif(n, -1, 1), rnorm(n))
x <- cbind(w[, 1:2])

family <- "T"
# t quantile at level cens (df = nu); used to set the selection intercept
q_cens <- qt(cens, df = nu)

sigma2 <- 1
beta <- c(1, 0.5)
gamma <- c(1, 0.3, -0.5)
gamma[1] <- -q_cens * sqrt(sigma2)

data <- geraHeckman(x, w, beta, gamma, sigma2, rho, nu, family = family)