Help for package FPDclustering

Type:

Package

Title:

PD-Clustering and Related Methods

Version:

2.3.5

Date:

2025-03-05

Maintainer:

Cristina Tortora <grikris1@gmail.com>

Description:

Probabilistic distance clustering (PD-clustering) is an iterative, distribution-free, probabilistic clustering method. PD-clustering assigns units to a cluster according to their probability of membership under the constraint that the product of the probability and the distance of each point to any cluster center is a constant. PD-clustering is a flexible method that can be used with elliptical clusters, outliers, or noisy data. PDQ is an extension of the algorithm for clusters of different sizes. GPDC and TPDC use a dissimilarity measure based on densities. Factor PD-clustering (FPDC) is a factor clustering method that involves a linear transformation of variables and a cluster optimizing the PD-clustering criterion. It works on high-dimensional data sets.

Depends:

ThreeWay, mvtnorm, R (≥ 4.1.0)

Imports:

ExPosition, cluster, rootSolve, MASS, klaR, GGally, ggplot2, ggeasy

License:

GPL-2 | GPL-3 [expanded from: GPL (≥ 2)]

NeedsCompilation:

Packaged:

2025-03-06 03:15:00 UTC; 011543324

Repository:

CRAN

Date/Publication:

2025-03-06 03:40:04 UTC

Author:

Cristina Tortora [aut, cre, cph], Noe Vidales [aut], Francesco Palumbo [aut], Tina Kalra [aut], Paul D. McNicholas [fnd]

Unsupervised Learning on Country Data

Description

Ten variables recorded on 167 countries. The goal is to categorize the countries using socio-economic and health indicators that determine each country's overall development. The data set was donated by HELP International, an international humanitarian NGO that needs to identify the countries in need of aid and asked analysts to categorize them.

Usage

data(Country_data)

Format

A data frame with 167 observations and 10 variables.

country: country name
child_mort: Death of children under 5 years of age per 1000 live births
exports: Exports of goods and services per capita, given as a percentage of the GDP per capita
health: Total health spending per capita, given as a percentage of the GDP per capita
imports: Imports of goods and services per capita, given as a percentage of the GDP per capita
income: Net income per person
inflation: The measurement of the annual growth rate of the Total GDP
life_expec: The average number of years a new born child would live if the current mortality patterns are to remain the same
total_fer: The number of children that would be born to each woman if the current age-fertility rates remain the same.
gdpp: The GDP per capita. Calculated as the Total GDP divided by the total population.

Source

https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data/metadata?resource=download

References

R. Kokkula. Unsupervised learning on country data. kaggle, 2022. URL https://www.kaggle.com/datasets/rohan0301/unsupervised-learning-on-country-data/metadata?resource=download

Examples

data(Country_data)
pairs(Country_data[,2:10])

Factor probabilistic distance clustering

Description

An implementation of FPDC, a probabilistic factor clustering algorithm that involves a linear transformation of variables and a cluster optimizing the PD-clustering criterion.

Usage

FPDC(data = NULL, k = 2, nf = 2, nu = 2)

Arguments

data

A matrix or data frame such that rows correspond to observations and columns correspond to variables.

k

A numerical parameter giving the number of clusters

nf

A numerical parameter giving the number of factors for variables

nu

A numerical parameter giving the number of factors for units

Value

A class FPDclustering list with components

label

A vector of integers indicating the cluster membership for each unit

centers

A matrix of cluster centers

probability

A matrix of probability of each point belonging to each cluster

JDF

The value of the Joint distance function

iter

The number of iterations

explained

The explained variability

data

the data set

Author(s)

Cristina Tortora and Paul D. McNicholas

References

Tortora, C., M. Gettler Summa, M. Marino, and F. Palumbo. Factor probabilistic distance clustering (FPDC): a new clustering method for high dimensional data sets. Advances in Data Analysis and Classification, 10(4), 441-464, 2016. doi:10.1007/s11634-015-0219-5.

Tortora C., Gettler Summa M., and Palumbo F. Factor PD-clustering. In Lausen et al., editors, Algorithms from and for Nature and Life, Studies in Classification, Data Analysis, and Knowledge Organization, 115-123, 2013. doi:10.1007/978-3-319-00035-0_11.

Tortora C., Non-hierarchical clustering methods on factorial subspaces, 2012.

Examples


# Asymmetric data set clustering example (with shape 3).
data('asymmetric3')
x<-asymmetric3[,-1]

#Clustering
fpdas3=FPDC(x,4,3,3)

#Results
table(asymmetric3[,1],fpdas3$label)
Silh(fpdas3$probability)
summary(fpdas3)
plot(fpdas3)



# Asymmetric data set clustering example (with shape 20).
data('asymmetric20')
x<-asymmetric20[,-1]

#Clustering
fpdas20=FPDC(x,4,3,3)

#Results
table(asymmetric20[,1],fpdas20$label)
Silh(fpdas20$probability)
summary(fpdas20)
plot(fpdas20)



# Clustering example with outliers.
data('outliers')
x<-outliers[,-1]

#Clustering
fpdout=FPDC(x,4,5,4)

#Results
table(outliers[,1],fpdout$label)
Silh(fpdout$probability)
summary(fpdout)
plot(fpdout)

Gaussian PD-Clustering

Description

An implementation of Gaussian PD-Clustering (GPDC), an extension of PD-clustering adjusted for cluster size that uses a dissimilarity measure based on the Gaussian density.

Usage

GPDC(data=NULL,k=2,ini="kmedoids", nr=5,iter=100)

Arguments

data

A matrix or data frame such that rows correspond to observations and columns correspond to variables.

k

A numerical parameter giving the number of clusters

ini

A parameter that selects center starts. Options available are random ("random"), kmedoids ("kmedoids", by default), and PDC ("PDclust").

nr

Number of random starts when ini set to "random"

iter

Maximum number of iterations

Value

A class FPDclustering list with components

label

A vector of integers indicating the cluster membership for each unit

centers

A matrix of cluster means

sigma

A list of K elements, with the variance-covariance matrix per cluster

probability

A matrix of probability of each point belonging to each cluster

JDF

The value of the Joint distance function

iter

The number of iterations

data

the data set

Author(s)

Cristina Tortora and Francesco Palumbo

References

Tortora C., McNicholas P.D., and Palumbo F. A probabilistic distance clustering algorithm using Gaussian and Student-t multivariate density distributions. SN Computer Science, 1:65, 2020.

C. Rainey, C. Tortora, and F. Palumbo. A parametric version of probabilistic distance clustering. In: Greselin F., Deldossi L., Bagnato L., Vichi M. (eds) Statistical Learning of Complex Data. CLADAG 2017. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham, 33-43, 2019. doi:10.1007/978-3-030-21140-0_4

Examples

#Load the data
data(ais)
dataSEL=ais[,c(10,3,5,8)]

#Clustering
res=GPDC(dataSEL,k=2,ini = "kmedoids")

#Results
table(res$label,ais$sex)
plot(res)
summary(res)

Probabilistic Distance Clustering

Description

Probabilistic distance clustering (PD-clustering) is an iterative, distribution-free, probabilistic clustering method. PD-clustering is based on the constraint that, for each point, the product of its membership probability and its distance to any cluster center is a constant.
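The constraint can be illustrated with a minimal base R sketch. It is not the code used inside PDC: the toy data, the two centers, and the use of the Euclidean distance are assumptions made for illustration. When the probabilities of a point sum to one and the product of probability and distance is the same for every cluster, the probabilities are proportional to the inverse distances.

```r
# Toy data: 5 points in 2 dimensions and 2 hypothetical cluster centers
set.seed(1)
x <- matrix(rnorm(10), ncol = 2)
centers <- rbind(c(-1, -1), c(1, 1))

# Euclidean distance of every point to every center (5 x 2 matrix)
d <- sapply(seq_len(nrow(centers)), function(k) {
  sqrt(rowSums(sweep(x, 2, centers[k, ])^2))
})

# Membership probabilities proportional to 1/d, normalised per point
p <- (1 / d) / rowSums(1 / d)

# For each point, p * d takes the same value in every column
const <- p * d
```

A point that coincides with a center has distance zero and would need separate handling, which this sketch does not attempt.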

Usage

PDC(data = NULL, k = 2)

Arguments

data

A matrix or data frame such that rows correspond to observations and columns correspond to variables.

k

A numerical parameter giving the number of clusters

Value

A class FPDclustering list with components

label

A vector of integers indicating the cluster membership for each unit

centers

A matrix of cluster centers

probability

A matrix of probability of each point belonging to each cluster

JDF

The value of the Joint distance function

iter

The number of iterations

data

the data set

Author(s)

Cristina Tortora and Paul D. McNicholas

References

Ben-Israel A. and Iyigun C. Probabilistic D-clustering. Journal of Classification, 25(1), 5-26, 2008.

Examples


#Normally generated clusters
c1 = c(+2,+2,2,2)
c2 = c(-2,-2,-2,-2)
c3 = c(-3,3,-3,3)
n=200
x1 = cbind(rnorm(n, c1[1]), rnorm(n, c1[2]), rnorm(n, c1[3]), rnorm(n, c1[4]) )
x2 = cbind(rnorm(n, c2[1]), rnorm(n, c2[2]),rnorm(n, c2[3]), rnorm(n, c2[4]) )
x3 = cbind(rnorm(n, c3[1]), rnorm(n, c3[2]),rnorm(n, c3[3]), rnorm(n, c3[4]) )
x = rbind(x1,x2,x3)

#Clustering
pdn=PDC(x,3)

#Results
plot(pdn)

Probabilistic Distance Clustering Adjusted for Cluster Size

Description

An implementation of probabilistic distance clustering adjusted for cluster size (PDQ), a probabilistic distance clustering algorithm that involves optimizing the PD-clustering criterion. The algorithm can be used on continuous, count, or mixed-type data by setting Euclidean, chi-square, or Gower as the dissimilarity measure.

Usage

PDQ(data=NULL,k=2,ini='kmd',dist='euc',cent=NULL,
ord=NULL,cat=NULL,bin=NULL,cont=NULL,w=NULL)

Arguments

data

A matrix or data frame such that rows correspond to observations and columns correspond to variables.

k

A numerical parameter giving the number of clusters.

ini

A parameter that selects center starts. Options available are random ("random"), kmedoid ("kmd", by default), center ("center", the user inputs the centers), and kmode ("kmode", for categorical data sets).

dist

A parameter that selects the distance measure used. Options available are Euclidean ("euc"), Gower ("gower"), and chi-square ("chi").

cent

User inputted centers if ini is set to "center".

ord

Column indices of the data indicating which columns are ordinal variables.

cat

Column indices of the data indicating which columns are categorical variables.

bin

Column indices of the data indicating which columns are binary variables.

cont

Column indices of the data indicating which columns are continuous variables.

w

numerical vector same length as the columns of the data, containing the variable weights when using Gower distance, equal weights by default.
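To show what the arguments cont, cat, and w refer to, here is a minimal base R sketch of a weighted Gower dissimilarity between each row and one reference row. It follows the standard Gower coefficient and is not the code PDQ runs internally; the helper gower_to_center and the toy data frame are hypothetical, and ordinal and binary variables are left out for brevity.

```r
# Weighted Gower dissimilarity of every row of `data` to one row `center`.
# cont: indices of continuous columns, cat: indices of categorical columns,
# w: one weight per column (equal weights by default).
gower_to_center <- function(data, center, cont, cat, w = rep(1, ncol(data))) {
  d <- matrix(0, nrow(data), ncol(data))
  for (j in cont) {
    # Absolute difference scaled by the observed range of the variable
    rng <- diff(range(data[[j]]))
    d[, j] <- abs(data[[j]] - center[[j]]) / rng
  }
  for (j in cat) {
    # Simple mismatch: 0 if the categories agree, 1 otherwise
    d[, j] <- as.numeric(data[[j]] != center[[j]])
  }
  used <- c(cont, cat)
  as.vector(d[, used, drop = FALSE] %*% w[used] / sum(w[used]))
}

toy <- data.frame(height = c(150, 160, 170, 180),
                  weight = c(50, 60, 70, 90),
                  group  = c("a", "a", "b", "b"))
g <- gower_to_center(toy, toy[1, ], cont = 1:2, cat = 3)
```

Each per-variable contribution lies between 0 and 1, so the weighted average does too. A constant continuous column has range zero and would need to be excluded before calling the helper.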

Value

A class FPDclustering list with components

label

A vector of integers indicating the cluster membership for each unit

centers

A matrix of cluster centers

probability

A matrix of probability of each point belonging to each cluster

JDF

The value of the Joint distance function

iter

The number of iterations

jdfvector

A vector collecting the JDF values computed at each iteration

data

the data set

Author(s)

Cristina Tortora and Noe Vidales

References

Iyigun, Cem, and Adi Ben-Israel. Probabilistic distance clustering adjusted for cluster size. Probability in the Engineering and Informational Sciences 22.4 (2008): 603-621. doi.org/10.1017/S0269964808000351.

Tortora and Palumbo. Clustering mixed-type data using a probabilistic distance algorithm. submitted.

Examples


#Mixed type data

sig=matrix(0.7,4,4)
diag(sig)=1 # create a correlation matrix
x1=rmvnorm(200,c(0,0,3,3)) # cluster 1
x2=rmvnorm(200,c(4,4,6,6),sigma=sig) # cluster 2
l=c(rep(1,200),rep(2,200)) # creating the labels
x1=cbind(x1,rbinom(200,4,0.2),rbinom(200,4,0.2)) # categorical variables
x2=cbind(x2,rbinom(200,4,0.7),rbinom(200,4,0.7))
x=rbind(x1,x2) # data set with 2 clusters

#### Performing PDQ
pdq_class<-PDQ(data=x,k=2, ini="random", dist="gower", cont= 1:4, cat = 5:6)

###Output
table(l,pdq_class$label)
plot(pdq_class)
summary(pdq_class)



### Continuous data example
# Gaussian generated data, no overlap
x<-rmvnorm(100, mean=c(1,5,10), sigma=diag(1,3))
y<-rmvnorm(100, mean=c(4,8,13), sigma=diag(1,3))
data<-rbind(x,y)

#### Performing PDQ
pdq1=PDQ(data,2,ini="random",dist="euc")
table(rep(c(2,1),each=100),pdq1$label)
Silh(pdq1$probability)
plot(pdq1)
summary(pdq1)


# Gaussian generated data with overlap
x2<-rmvnorm(100, mean=c(1,5,10), sigma=diag(1,3))
y2<-rmvnorm(100, mean=c(2,6,11), sigma=diag(1,3))
data2<-rbind(x2,y2)

#### Performing PDQ
pdq2=PDQ(data2,2,ini="random",dist="euc")
table(rep(c(1,2),each=100),pdq2$label)
plot(pdq2)
summary(pdq2)

Probabilistic silhouette plot

Description

Graphical tool to evaluate the clustering partition.

Usage

Silh(p)

Arguments

p

A matrix of probabilities such that rows correspond to observations and columns correspond to clusters.

Details

The probabilistic silhouettes are an adaptation of the ones proposed by Menardi (2011) according to the following formula:

dbs_i = (log(p_{im_k}/p_{im_1}))/max_i |log(p_{im_k}/p_{im_1})|

where m_k is such that x_i belongs to cluster k and m_1 is such that p_{im_1} is maximum for m different from m_k.
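The formula can be traced by hand with a short base R sketch. It is not the code of Silh, which also draws the plot; the probability matrix below is made up, and assigning each unit to the cluster with the largest probability is an assumption.

```r
# Hypothetical membership probabilities: 4 units (rows), 3 clusters (columns)
p <- rbind(c(0.80, 0.15, 0.05),
           c(0.40, 0.35, 0.25),
           c(0.10, 0.20, 0.70),
           c(0.30, 0.60, 0.10))

# m_k: cluster of each unit, assumed to be the one with the largest probability
assigned <- apply(p, 1, which.max)
p_mk <- p[cbind(seq_len(nrow(p)), assigned)]

# m_1: largest probability among the remaining clusters
p_m1 <- sapply(seq_len(nrow(p)), function(i) max(p[i, -assigned[i]]))

# Log-ratio scaled by its largest absolute value over all units
log_ratio <- log(p_mk / p_m1)
dbs <- log_ratio / max(abs(log_ratio))
```

Under this assignment rule every value is positive and the most clearly assigned unit gets the value 1; values near 0 flag units that sit between two clusters.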

Value

Probabilistic silhouette plot

Author(s)

Cristina Tortora

References

Menardi G. Density-based Silhouette diagnostics for clustering methods. Statistics and Computing, 21, 295-308, 2011.

Examples


# Asymmetric data set silhouette example (with shape=3).
data('asymmetric3')
x<-asymmetric3[,-1]
fpdas3=FPDC(x,4,3,3)
Silh(fpdas3$probability)



# Asymmetric data set silhouette example (with shape=20).
data('asymmetric20')
x<-asymmetric20[,-1]
fpdas20=FPDC(x,4,3,3)
Silh(fpdas20$probability)



# Silhouette example with outliers.
data('outliers')
x<-outliers[,-1]
fpdout=FPDC(x,4,4,3)
Silh(fpdout$probability)

Star dataset to predict star types

Description

A six-class star data set for star classification with deep learning approaches.

Usage

data(Star)

Format

A data frame with the following 7 variables.

K: Absolute Temperature (in K)
Lum: Relative Luminosity (L/Lo)
Rad: Relative Radius (R/Ro)
Mag: Absolute Magnitude (Mv)
Col: Star Color (White, Red, Blue, Yellow, Yellow-Orange, etc.)
Spect: Spectral Class (O, B, A, F, G, K, M)
Type: Star Type (Red Dwarf, Brown Dwarf, White Dwarf, Main Sequence , SuperGiants, HyperGiants)

Source

https://www.kaggle.com/deepu1109/star-dataset

Examples

data(Star)

Statistics 1 students

Description

Data set collected in 2022 that contains 10 variables recorded on a convenience sample of 253 first-year students at the University of Naples Federico II attending an introductory Statistics course.

Usage

data(Students)

Format

A data frame with 253 observations and 10 variables.

Sex: gender, binary
HS_qual: high school type, categorical
Stud_stat: prior knowledge of statistics, binary
Course_modality: course modality of attendance (in presence, online, mixed), categorical
HE_Parents: parents' education degree, categorical
PMP: mathematical prerequisites for psychometrics, continuous
SAS: statistical anxiety scale, continuous
RAI: relative autonomy index, continuous
S_EFF: self-efficacy, continuous
COG: cognitive competence, continuous

References

R. Fabbricatore. Latent class analysis for proficiency assessment in higher education: integrating multidimensional latent traits and learning topics. Ph.D. thesis, University of Naples Federico II, 2023

Examples

data(Students)

Student-t PD-Clustering

Description

An implementation of Student-t PD-Clustering (TPDC), an extension of PD-clustering adjusted for cluster size that uses a dissimilarity measure based on the multivariate Student-t density.

Usage

TPDC(data=NULL,k=2,ini="kmedoids", nr=5,iter=100)

Arguments

data

A matrix or data frame such that rows correspond to observations and columns correspond to variables.

k

A numerical parameter giving the number of clusters

ini

A parameter that selects center starts. Options available are random ("random"), kmedoids ("kmedoids", by default), and PDC ("PDclust").

nr

Number of random starts if ini is "random"

iter

Maximum number of iterations

Value

A class FPDclustering list with components

label

A vector of integers indicating the cluster membership for each unit

centers

A matrix of cluster means

sigma

A list of K elements, with the variance-covariance matrix per cluster

df

A vector of K degrees of freedom

probability

A matrix of probability of each point belonging to each cluster

JDF

The value of the Joint distance function

iter

The number of iterations

data

the data set

Author(s)

Cristina Tortora and Francesco Palumbo

References

Tortora C., McNicholas P.D., and Palumbo F. A probabilistic distance clustering algorithm using Gaussian and Student-t multivariate density distributions. SN Computer Science, 1:65, 2020.

Examples

#Load the data
data(ais)
dataSEL=ais[,c(10,3,5,8)]

#Clustering
res=TPDC(dataSEL,k=2,ini = "kmedoids")

#Results
table(res$label,ais$sex)
summary(res)
plot(res)

Choice of the number of Tucker 3 factors for FPDC

Description

An empirical way of choosing the number of factors for FPDC. The function returns a graph and a table representing the explained variability varying the number of factors.

Usage

TuckerFactors(data = NULL, k = 2)

Arguments

data

A matrix or data frame such that rows correspond to observations and columns correspond to variables.

k

A numerical parameter giving the number of clusters

Value

A table containing the explained variability varying the number of factors for units (column) and for variables (row) and the corresponding plot

Author(s)

Cristina Tortora

References

Kiers H, Kinderen A. A fast method for choosing the numbers of components in Tucker3 analysis. British Journal of Mathematical and Statistical Psychology, 56(1), 119-125, 2003.

Kroonenberg P. Applied Multiway Data Analysis. Ebooks Corporation, Hoboken, New Jersey, 2008.

Examples



# Asymmetric data set example (with shape=20).
data('asymmetric20')
xp=TuckerFactors(asymmetric20[,-1], k = 4)

Australian institute of sport data

Description

Data obtained to study sex, sport and body-size dependency of hematology in highly trained athletes.

Usage

data(ais)

Format

A data frame with 202 observations and 13 variables.

rcc: red blood cell count, in
wcc: white blood cell count, in per liter
hc: hematocrit, percent
hg: hemoglobin concentration, in g per decaliter
ferr: plasma ferritins, ng
bmi: Body mass index, kg
ssf: sum of skin folds
pcBfat: percent Body fat
lbm: lean body mass, kg
ht: height, cm
wt: weight, kg
sex: a factor with levels f m
sport: a factor with levels B_Ball Field Gym Netball Row Swim T_400m T_Sprnt Tennis W_Polo

Source

R package DAAG

References

Telford, R.D. and Cunningham, R.B. 1991. Sex, sport and body-size dependency of hematology in highly trained athletes. Medicine and Science in Sports and Exercise 23: 788-794.

Examples

data(ais)
pairs(ais[,1:11],col=ais$sex)

Asymmetric data set shape 20

Description

Each cluster has been generated according to a multivariate asymmetric Gaussian distribution, with shape 20, covariance matrix equal to the identity matrix and randomly generated centres.
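A smaller data set with the same flavor can be simulated in base R without the package sn. The sketch below is an illustration, not the script that produced asymmetric20: it draws each coordinate independently from a univariate skew-normal distribution through its standard stochastic representation, and the helper rskew, the number of variables, the 200 units per cluster, and the range of the random centres are all assumptions.

```r
# Draw n values from a univariate skew-normal with location xi and shape alpha,
# using delta * |U0| + sqrt(1 - delta^2) * U1 with U0, U1 independent N(0, 1)
rskew <- function(n, xi = 0, alpha = 0) {
  delta <- alpha / sqrt(1 + alpha^2)
  xi + delta * abs(rnorm(n)) + sqrt(1 - delta^2) * rnorm(n)
}

set.seed(123)
n_per_cluster <- 200   # assumed cluster size
n_var <- 5             # far fewer variables than the real data set
centres <- matrix(runif(4 * n_var, -10, 10), nrow = 4)  # hypothetical centres

# One block per cluster: membership label first, then the skewed coordinates
toy <- do.call(rbind, lapply(1:4, function(k) {
  vals <- sapply(centres[k, ], function(m) {
    rskew(n_per_cluster, xi = m, alpha = 20)
  })
  cbind(membership = k, vals)
}))
```

With shape 20 almost all the mass of each coordinate lies on one side of its location, which is what makes the clusters strongly asymmetric.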

Usage

data(asymmetric20)

Format

A data frame with 800 observations on the following 101 variables. The first variable is the membership.

Source

Generated with R using the package sn (The skew-normal and skew-t distributions), function rsn

Examples

data(asymmetric20)
plot(asymmetric20[,2:3])

Asymmetric data set shape 3

Description

Each cluster has been generated according to a multivariate asymmetric Gaussian distribution, with shape 3, covariance matrix equal to the identity matrix and randomly generated centres.

Usage

data(asymmetric3)

Format

A data frame with 800 observations on 101 variables. The first variable contains the membership labels.

Source

Generated with R using the package sn (The skew-normal and skew-t distributions), function rsn

Examples

data(asymmetric3)
plot(asymmetric3[,2:3])

Data set with outliers

Description

Each cluster has been generated according to a multivariate Gaussian distribution, with randomly generated centers c. For each cluster, 20% of uniformly distributed outliers have been generated at a distance between max(x-c) and max(x-c)+5 from the center.
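The construction of one such cluster can be sketched in base R. This is an illustration and not the script that produced the data set: reading max(x-c) as the largest Euclidean distance of a regular point from the center, drawing the outlier directions uniformly, and using 200 regular points in 2 dimensions are all assumptions.

```r
set.seed(42)
n <- 200   # regular points in the cluster (assumed)
p <- 2     # number of variables (the real data set has 100)
centre <- runif(p, -10, 10)   # hypothetical random center

# Regular points: standard Gaussian noise around the center
x <- sweep(matrix(rnorm(n * p), n, p), 2, centre, "+")

# Largest Euclidean distance of a regular point from the center
d_max <- max(sqrt(rowSums(sweep(x, 2, centre)^2)))

# Outliers: 20% of n, with a uniform radius between d_max and d_max + 5
n_out <- round(0.2 * n)
r <- runif(n_out, d_max, d_max + 5)
dir <- matrix(rnorm(n_out * p), n_out, p)
dir <- dir / sqrt(rowSums(dir^2))          # unit-length random directions
out <- sweep(dir * r, 2, centre, "+")

cluster <- rbind(x, out)
```

Four clusters built this way give 4 x (200 + 40) = 960 rows, which matches the number of observations reported in the Format section.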

Usage

data(outliers)

Format

A data frame with 960 observations on the following 101 variables. The first variable corresponds to the membership.

Source

Generated with R

Examples

data(outliers)
plot(outliers[,2:3])

Plots for FPDclustering objects

Description

Probability silhouette plot, scatterplot of up to maxVar variables, and parallel coordinate plot of up to maxVar variables, for objects of class FPDclustering.

Usage

## S3 method for class 'FPDclustering'
plot(x, maxVar=30, ... )

Arguments

x

an object of class FPDclustering

maxVar

a scalar indicating the maximum number of variables to display on the parallel plot, 30 by default

...

Additional parameters for the function pairs

Author(s)

Cristina Tortora

Print for FPDclustering objects

Description

Lists the available components for the given object

Usage

## S3 method for class 'FPDclustering'
print(x,...)

Arguments

x

an object of class FPDclustering

...

Additional parameters for the function ls

Author(s)

Cristina Tortora

Summary for FPDclustering objects

Description

Number of elements per cluster.

Usage

## S3 method for class 'FPDclustering'
summary(object, ... )

Arguments

object

an object of class FPDclustering

...

Additional parameters for the function pairs

Author(s)

Cristina Tortora
