Title: | Fast Clustering Using Adaptive Density Peak Detection |
---|---|
Description: | An implementation of ADPclust clustering procedures (Fast Clustering Using Adaptive Density Peak Detection). The work is built and improved upon the idea of Rodriguez and Laio (2014)<DOI:10.1126/science.1242072>. ADPclust clusters data by finding density peaks in a density-distance plot generated from local multivariate Gaussian density estimation. It includes an automatic centroids selection and parameter optimization algorithm, which finds the number of clusters and cluster centroids by comparing average silhouettes on a grid of testing clustering results; It also includes a user interactive algorithm that allows the user to manually selects cluster centroids from a two dimensional "density-distance plot". Here is the research article associated with this package: "Wang, Xiao-Feng, and Yifan Xu (2015)<DOI:10.1177/0962280215609948> Fast clustering using adaptive density peak detection." Statistical methods in medical research". url: <http://smm.sagepub.com/content/early/2015/10/15/0962280215609948.abstract>. |
Authors: | Ethan Yifan Xu [aut, cre], Xiao-Feng Wang [aut] |
Maintainer: | Ethan Yifan Xu <[email protected]> |
License: | GPL(>=2) |
Version: | 0.8 |
Built: | 2025-02-24 03:17:09 UTC |
Source: | https://github.com/ethanyxu/adpclust |
Clustering of data by finding cluster centers from estimated density peaks. ADPclust is a non-iterative procedure that incorporates multivariate Gaussian density estimation. The number of clusters as well as bandwidths can either be selected by the user or selected automatically through an internal clustering criterion.
adpclust( x = NULL, distm = NULL, p = NULL, centroids = "auto", h = NULL, htype = "amise", nclust = 2:10, ac = 1, f.cut = c(0.1, 0.2, 0.3), fdelta = "mnorm", dmethod = "euclidean", draw = FALSE, f = NULL )
adpclust( x = NULL, distm = NULL, p = NULL, centroids = "auto", h = NULL, htype = "amise", nclust = 2:10, ac = 1, f.cut = c(0.1, 0.2, 0.3), fdelta = "mnorm", dmethod = "euclidean", draw = FALSE, f = NULL )
x |
numeric data frame where rows are observations and columns are variables. One of x and distm must be provided. |
distm |
distance matrix of class 'dist'. distm is ignored if x is given. |
p |
number of variables (ncol(x)). This is only needed if neither x nor h is given. |
centroids |
character string specifying how cluster centroids are selected. Valid options are "user" and "auto". |
h |
nonnegative number specifying the bandwidth in density estimation. If h is NULL, the algorithm attempts to find h in a neighborhood centered at either the AMISE bandwidth or ROT bandwidth (see htype). |
htype |
character string specifying the method used to calculate a reference bandwidth for the density estimation. htype is ignored if h is given. Valid options of are "ROT" and "AMISE" (see details). |
nclust |
integer, or a vector of integers specifying the pool of the number of clusters in automatic variation. The default is 2:10. |
ac |
integer indicating which automatic cut method is used. This is ignored if centroids = 'user'. The valid options are:
|
f.cut |
number between (0, 1) or numeric vector of numbers between (0, 1). f.cut is used when centroids = "auto" and ac = 1 to automatically select cluster centroids from the decision plot (see ac). The default is c(0.1, 0.2, 0.3). |
fdelta |
character string that specifies the method used to estimate local density f(x) at each data point x. The default (recommended) is "mnorm" that uses a multivariate Gaussian density estimation to calculate f. Other options are listed below. Here 'distm' denotes the distance matrix.
|
dmethod |
character string that is passed to the 'method' argument in function dist(), which is used to calculate the distance matrix if 'distm' is not given. The default is "euclidean". |
draw |
boolean. If draw = TRUE the clustering result is plotted after the algorithm finishes. The plot is produced by by plot.adpclust(ans), where 'ans' is the outcome of 'adpclust()' |
f |
vector of numerical values. Density values of x. If this is provided then ADPclust will not estimate the densities. |
Given n data points x's in p dimensions, adpclust() calculates f(x) and delta(x) for each data point x, where f(x) is the local density at x, and delta(x) is the shortest distance between x and y for all y such that f(x) <= f(y). Data points with large f and large delta values are labeled class centroids. In other words, they appear as isolated points in the upper right corner of the f vs. delta plot (the decision plot). After cluster centroids are determined, other data points are clustered according to their distances to the closes centroids.
A bandwidth (smoothing parameter) h is used to calculate local density f(x) in various ways. See parameter 'fdelta' for details. If centroids = 'user', then h must be explicitly provided. If centroids = 'auto' and h is not specified, then it is automatically selected from a range of testing values: First a reference bandwidth h0 is calculated by one of the two methods: Scott's Rule-of-Thumb value (htype = "ROT") or Wand's Asymptotic-Mean-Integrated-Squared-Error value (htype = "AMISE"), then 10 values equally spread in the range [1/3h0, 3h0] are tested. The value that yields the highest silhouette score is chosen as the final h.
An 'adpclust' object that contains the list of the following items.
clusters Cluster assignments. A vector of the same length as the number of observations.
centers: Indices of the clustering centers.
silhouette: Silhouette score from the final clustering result.
nclust: Number of clusters.
h: Final bandwidth.
f: Final density vector f(x).
delta: Final delta vector delta(x).
selection.type: 'user' or 'auto'.
Xiao-Feng Wang, and Yifan Xu, (2015) "Fast Clustering Using Adaptive Density Peak Detection." Statistical Methods in Medical Research, doi:10.1177/0962280215609948.
# Load a data set with 3 clusters data(clust3) # Automatically select cluster centroids ans <- adpclust(clust3, centroids = "auto", draw = FALSE) summary(ans) plot(ans) # Specify distm instead of data distm <- FindDistm(clust3, normalize = TRUE) ans.distm <- adpclust(distm = distm, p = 2, centroids = "auto", draw = FALSE) identical(ans, ans.distm) # Specify the grid of h and nclust ans <- adpclust(clust3, centroids = "auto", h = c(0.1, 0.2, 0.3), nclust = 2:6) # Specify that bandwidths should be searched around # Wand's Asymptotic-Mean-Integrated-Squared-Error bandwidth # Also test 3 to 6 clusters. ans <- adpclust(clust3, centroids = "auto", htype = "AMISE", nclust = 3:6) # Set a specific bandwidth value. ans <- adpclust(clust3, centroids = "auto", h = 5) # Change method of automatic selection of centers ans <- adpclust(clust3, centroids = "auto", nclust = 2:6, ac = 2) # Specify that the single "ROT" bandwidth value by # using the 'ROT()' function ans <- adpclust(clust3, centroids = "auto", h = ROT(clust3)) # Specify distance matrix distm and density values f distm <- FindDistm(clust3, normalize = TRUE) f <- FindFD(distm, h = 0.2, fdelta="mnorm")[["f"]] ans <- adpclust(distm = distm, centroids = "auto", nclust = 2:6, f = f) # Centroids selected by user ## Not run: ans <- adpclust(clust3, centroids = "user", h = ROT(clust3)) ## End(Not run) # A larger data set data(clust5) ans <- adpclust(clust5, centroids = "auto", htype = "ROT", nclust = 3:5) summary(ans) plot(ans)
# Load a data set with 3 clusters data(clust3) # Automatically select cluster centroids ans <- adpclust(clust3, centroids = "auto", draw = FALSE) summary(ans) plot(ans) # Specify distm instead of data distm <- FindDistm(clust3, normalize = TRUE) ans.distm <- adpclust(distm = distm, p = 2, centroids = "auto", draw = FALSE) identical(ans, ans.distm) # Specify the grid of h and nclust ans <- adpclust(clust3, centroids = "auto", h = c(0.1, 0.2, 0.3), nclust = 2:6) # Specify that bandwidths should be searched around # Wand's Asymptotic-Mean-Integrated-Squared-Error bandwidth # Also test 3 to 6 clusters. ans <- adpclust(clust3, centroids = "auto", htype = "AMISE", nclust = 3:6) # Set a specific bandwidth value. ans <- adpclust(clust3, centroids = "auto", h = 5) # Change method of automatic selection of centers ans <- adpclust(clust3, centroids = "auto", nclust = 2:6, ac = 2) # Specify that the single "ROT" bandwidth value by # using the 'ROT()' function ans <- adpclust(clust3, centroids = "auto", h = ROT(clust3)) # Specify distance matrix distm and density values f distm <- FindDistm(clust3, normalize = TRUE) f <- FindFD(distm, h = 0.2, fdelta="mnorm")[["f"]] ans <- adpclust(distm = distm, centroids = "auto", nclust = 2:6, f = f) # Centroids selected by user ## Not run: ans <- adpclust(clust3, centroids = "user", h = ROT(clust3)) ## End(Not run) # A larger data set data(clust5) ans <- adpclust(clust5, centroids = "auto", htype = "ROT", nclust = 3:5) summary(ans) plot(ans)
Calculate the AMISE bandwidth from either a data frame, or from the number of observations and the dimension of the data.
AMISE(x, y = NULL)
AMISE(x, y = NULL)
x |
the number of variables (if y is given), or a data frame or a matrix (if y is missing). |
y |
the number of observations. If y is missing then x is interpreted as the data matrix. |
IMPORTANT NOTE: The standard deviation of each variable is omitted in this formula.
AMISE bandwidth.
Generated from the genRandomClust() function of the "clusterGeneration" package with separation value 0.2.
data frame
Randomly generated from three normal distributions.
data frame
500 5-dim points in 5 clusters. Generated from the genRandomClust() function of the "clusterGeneration" package with separation value 0.1.
data frame
Generated from the genRandomClust() function of the "clusterGeneration" package with separation value 0.01 (tightly clustered).
data frame
38 by 243 matrix. Each row represents a patient. Each column represents a gene.
matrix
Automatically finds centers with diagonal f(x) vs delta(x) thresholds. This is used in adpclust() with ac = 2. It finds points that are above and farthest from the diagonal line in the f vs. delta plots, and label them to be centers.
FindCentersAutoD(f, delta, nclust)
FindCentersAutoD(f, delta, nclust)
f |
vector of local distance. |
delta |
vector of minimal distances to higher ground. |
nclust |
number of clusters. Can be a single integer or a vector of integers. Duplicates are silently removed. |
a list of vectors. Each vector gives the locations of centers.
Ethan Xu
Automatically find centers with vertical threshold vertical f(x) thresholds.
FindCentersAutoV(f, delta, f.cut = c(0.1, 0.2, 0.3), nclust, rm.dup = TRUE)
FindCentersAutoV(f, delta, f.cut = c(0.1, 0.2, 0.3), nclust, rm.dup = TRUE)
f |
vector of local distance f(x). See the detail section of the help(adpclust). |
delta |
vector of minimal distances to higher ground delta(x). See the detail section of the help(adpclust). |
f.cut |
number between (0, 1) or numeric vector of numbers between (0, 1). Data points whose f values are larger than f.cut with large delta values are selected as centers. The default is c(0.1, 0.2, 0.3). |
nclust |
number of clusters. It can be either a single integer or a vector of integers. |
rm.dup |
boolean. If TRUE (default) duplicated centers vectors are removed from returned list. |
Given f's and delta's, cluster centers are chosen to be the data points whose delta values are high and f values are larger than a fixed threshold. To be more specific, let F denote the set of all f(x). centers are selected as points with the largest m delta values in the set x | f(x) > a'th percentile of F. The number of centers m is given by the parameter nclust. The cutting percentile a is given by the parameter f.cut. When at least one of these two parameters are vectors, centers are selected based all combinations of them, and returned in a list.
a list of vectors. Each vector contains the indices of selected centers.
Ethan Xu
This is the subroutine that automatically finds cluster assignments from given f and delta by testing various parameter settings and find the one that maximizes the silhouette.
FindClustersAuto( distm, f, delta, ac = 1, nclust = 2:10, f.cut = c(0.1, 0.2, 0.3) )
FindClustersAuto( distm, f, delta, ac = 1, nclust = 2:10, f.cut = c(0.1, 0.2, 0.3) )
distm |
the distance matrix |
f |
vector of local distance f(x). See the help of adpclust() for details. |
delta |
vector of minimal distances to higher ground delta(x). See the help of adpclust() for details. |
ac |
type of auto selection. The valid options are 1 and 2. See the help of adpclust() for details. |
nclust |
number of clusters to test. Either a single integer or a vector of integers. |
f.cut |
number between (0, 1) or numeric vector of numbers between (0, 1). Data points whose f values are larger than f.cut with large delta values are selected as centers. The default is c(0.1, 0.2, 0.3). See the help of FindCentersAutoV() for more details. |
list of four elements:
clusters Cluster assignments. A vector of the same length as the number of observations.
centers: Indices of the clustering centers.
silhouette: Silhouette score from the final clustering result.
nclust: Number of clusters.
Ethan Xu
Find cluster assignments from given centers and distance matrix. Each point is assigned to the center that has the shortest Euclidean distance.
FindClustersGivenCenters(distm, centers)
FindClustersGivenCenters(distm, centers)
distm |
distance matrix |
centers |
vector of integers that gives the indices of centers. Duplications will be silently dropped. |
Cluster assignments. A vector of the same length as the number of observations.
Plot the f vs. delta plot, then wait for the user to select centers of clusters by left clicking the points. In general points with both large f and large delta are good candidates of cluster centroids. Selected centers are highlighted. Press ESC to end the selection.
FindClustersManual(distm, f, delta)
FindClustersManual(distm, f, delta)
distm |
distance matrix. |
f |
vector of local densities f(x). Same length of the number of observations. |
delta |
vector of distances to the closest high ground delta(x). Same length of the number of observations. |
a list of the following items:
clusters Cluster assignments. A vector of the same length as the number of observations.
centers: Indices of the clustering centers.
silhouette: Silhouette score from the final clustering result.
nclust: Number of clusters.
data(clust3) distm <- FindDistm(clust3, normalize = TRUE) ## Not run: fd <- FindFD(distm, 2, "mnorm") ans <- FindClustersManual(distm, fd$f, fd$delta) names(ans) ans$centers ## End(Not run)
data(clust3) distm <- FindDistm(clust3, normalize = TRUE) ## Not run: fd <- FindFD(distm, 2, "mnorm") ans <- FindClustersManual(distm, fd$f, fd$delta) names(ans) ans$centers ## End(Not run)
A wrapper of the dist() method, with the option to rescale the data with standard deviation of each dimension before calculating the distance matrix. NOTE: If fdelta='mnorm' is passed to adpclust(), then the distm is calculated from rescaled data internally, i.e. distm <- FindDistm(x, normalize = TRUE).
FindDistm(x, normalize = FALSE, method = "euclidean")
FindDistm(x, normalize = FALSE, method = "euclidean")
x |
data |
normalize |
boolean. Normalize data before calculating distance? |
method |
passed to 'dist()' |
distance matrix of class dist.
Ethan Xu
Calculate f(x) and delta(x) from distm and h.
FindFD(distm, h, fdelta, f = NULL)
FindFD(distm, h, fdelta, f = NULL)
distm |
distance matrix of class 'dist'. |
h |
bandwidth. |
fdelta |
character string that specifies the method used to estimate local density f(x) at each data point x. The default is "mnorm" that uses a multivariate Gaussian density estimation to calculate f. Other options are listed below. Here 'distm' denotes the distance matrix. |
f |
numerical vector. Density values of data points. If not given (default) it is estimated by the method specified in "fdelta"
|
list of two items: f and delta.
Find bandwidth h from the number of observations n and the dimension p.
FindH(p, n, htype)
FindH(p, n, htype)
p |
dimension of data. The number of variables. |
n |
the number of observations. |
htype |
methods to calculate h. The valid options are (case insensitive) "amise" or "rot". |
bandwidth h.
Plot the f vs. delta plot with selected centroids.
## S3 method for class 'adpclust' plot(x, cols = "default", to.plot = c("cluster.sil", "fd"), ...)
## S3 method for class 'adpclust' plot(x, cols = "default", to.plot = c("cluster.sil", "fd"), ...)
x |
an object of class "adpclust". Result of adpclust(). |
cols |
vector of colors used to distinguish different clusters. Recycled if necessary. |
to.plot |
string vector that indicate which plot(s) to show. The two options are 'cluster.sil' (nclust vs. silhouette) and 'fd' (f vs. delta). |
... |
Not used. |
## Load a data set with 3 clusters data(clust3) ## Automatically select cluster centroids ans <- adpclust(clust3, centroids = "auto") plot(ans) plot(ans, to.plot = "fd") plot(ans, to.plot = "cluster.sil") plot(ans, to.plot = c("cluster.sil", "fd")) #Default
## Load a data set with 3 clusters data(clust3) ## Automatically select cluster centroids ans <- adpclust(clust3, centroids = "auto") plot(ans) plot(ans, to.plot = "fd") plot(ans, to.plot = "cluster.sil") plot(ans, to.plot = c("cluster.sil", "fd")) #Default
Calculate the ROT bandwidth either from a data frame, or from p and n.
ROT(x, y = NULL)
ROT(x, y = NULL)
x |
the number of variables (if y is missing), or a data frame or a matrix (if y is not missing). |
y |
the number of observations. If y is missing, x should be the data matrix. |
IMPORTANT NOTE: The standard deviation of each variable is omitted in this formula.
ROT bandwidth.
Summarizes the result from the adpclust() function.
## S3 method for class 'adpclust' summary(object, ...)
## S3 method for class 'adpclust' summary(object, ...)
object |
object of class "adpclust" that is returned from adpclust(). |
... |
other arguments. NOT used. |