dasf.ml.cluster.hdbscan

HDBSCAN algorithm module.

Classes

HDBSCAN

Perform HDBSCAN clustering from vector array or distance matrix.

Module Contents

class dasf.ml.cluster.hdbscan.HDBSCAN(alpha=1.0, gen_min_span_tree=False, leaf_size=40, metric='euclidean', min_cluster_size=5, min_samples=None, p=None, algorithm='auto', approx_min_span_tree=True, core_dist_n_jobs=4, cluster_selection_method='eom', allow_single_cluster=False, prediction_data=False, match_reference_implementation=False, connectivity='knn', output_type=None, verbose=0, **kwargs)

Bases: dasf.ml.cluster.classifier.ClusterClassifier

Perform HDBSCAN clustering from vector array or distance matrix.

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

Parameters

min_cluster_size: int, optional (default=5)

The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.
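The rule above can be sketched as a small decision function. This is an illustrative simplification and not the library's code; the name `classify_split` and its return strings are hypothetical.

```python
def classify_split(left_size, right_size, min_cluster_size=5):
    """Classify one single-linkage split by the sizes of its two children."""
    left_ok = left_size >= min_cluster_size
    right_ok = right_size >= min_cluster_size
    if left_ok and right_ok:
        # Both sides are large enough: the cluster splits into two new clusters.
        return "true split"
    if left_ok or right_ok:
        # Only one side is large enough: the small side "falls out" of the cluster.
        return "points fall out"
    # Neither side is large enough: all remaining points fall out.
    return "all points fall out"


print(classify_split(40, 3))   # points fall out
print(classify_split(40, 12))  # true split
```

Raising min_cluster_size therefore turns more splits into "fall out" events, which tends to give fewer and larger clusters.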

min_samples: int, optional (default=None)

The number of samples in a neighbourhood for a point to be considered a core point.

metric: string or callable, optional (default='euclidean')

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square.
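As a minimal sketch of what a "precomputed" input looks like, the snippet below builds a square Euclidean distance matrix with the standard library only. The helper name `pairwise_euclidean` is hypothetical; in practice such a matrix would be passed as X together with metric='precomputed'.

```python
import math


def pairwise_euclidean(points):
    """Return the full n x n matrix of Euclidean distances between points."""
    return [[math.dist(a, b) for b in points] for a in points]


X = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
D = pairwise_euclidean(X)

n = len(D)
# A precomputed distance matrix must be square, with one row and column per sample.
is_square = all(len(row) == n for row in D)
# Distances are symmetric and the diagonal is zero.
is_symmetric = all(
    math.isclose(D[i][j], D[j][i]) for i in range(n) for j in range(n)
)
print(n, is_square, is_symmetric)
```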

p: int, optional (default=None)

The p value (exponent) to use with the Minkowski metric.
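For reference, the Minkowski distance of order p can be sketched in plain Python as follows. The function name `minkowski` is illustrative only; p=1 gives the Manhattan distance and p=2 the Euclidean distance.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = (1, 2), (4, 6)
d1 = minkowski(a, b, 1)  # |3| + |4|
d2 = minkowski(a, b, 2)  # sqrt(3**2 + 4**2)
print(d1, d2)
```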

alpha: float, optional (default=1.0)

A distance scaling parameter as used in robust single linkage. See [3] for more information.

cluster_selection_epsilon: float, optional (default=0.0)

A distance threshold. Clusters below this value will be merged.

See [5] for more information.

algorithm: string, optional (default='best')

Exactly which algorithm to use; hdbscan has variants specialised for different characteristics of the data. By default this is set to best which chooses the “best” algorithm given the nature of the data. You can force other options if you believe you know better. Options are:

best

generic

prims_kdtree

prims_balltree

boruvka_kdtree

boruvka_balltree

leaf_size: int, optional (default=40)

If using a space tree algorithm (kdtree or balltree), the number of points in a leaf node of the tree. This does not alter the resulting clustering, but may have an effect on the runtime of the algorithm.

memory: instance of joblib.Memory or string, optional

Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

approx_min_span_tree: bool, optional (default=True)

Whether to accept an approximate minimum spanning tree. For some algorithms this can provide a significant speedup, but the resulting clustering may be of marginally lower quality. If you are willing to sacrifice speed for correctness, consider setting this to False; in general it should be left at the default True.

gen_min_span_tree: bool, optional (default=False)

Whether to generate the minimum spanning tree with regard to mutual reachability distance for later analysis.
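The mutual reachability distance mentioned above is commonly defined as the maximum of the two points' core distances and their plain distance. The sketch below illustrates that definition with the standard library only; the function names are hypothetical, and whether a point counts as its own neighbour when computing the core distance varies between implementations (here it does not).

```python
import math


def core_distance(points, i, k):
    """Distance from point i to its k-th nearest neighbour, excluding itself."""
    dists = sorted(
        math.dist(points[i], q) for j, q in enumerate(points) if j != i
    )
    return dists[k - 1]


def mutual_reachability(points, i, j, k):
    """max(core_k(i), core_k(j), d(i, j)) for two distinct points i and j."""
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        math.dist(points[i], points[j]),
    )


X = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
# Points 0 and 1 are 1.0 apart, but point 0's 2nd neighbour is sqrt(2) away,
# so the mutual reachability distance is stretched to sqrt(2).
print(mutual_reachability(X, 0, 1, k=2))
```

The minimum spanning tree is then built over the complete graph whose edge weights are these distances.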

core_dist_n_jobs: int, optional (default=4)

Number of parallel jobs to run in core distance computations (if supported by the specific algorithm). For core_dist_n_jobs below -1, (n_cpus + 1 + core_dist_n_jobs) are used.
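The formula above can be made concrete with a small helper. This is a sketch, not the library's implementation: the name `effective_n_jobs` is hypothetical, and treating -1 as "all CPUs" is an assumption based on the usual joblib convention rather than something stated in this page.

```python
import os


def effective_n_jobs(core_dist_n_jobs, n_cpus=None):
    """Resolve a core_dist_n_jobs value to a concrete worker count."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if core_dist_n_jobs < -1:
        # Documented rule: n_cpus + 1 + core_dist_n_jobs, e.g. -2 means all but one.
        return max(1, n_cpus + 1 + core_dist_n_jobs)
    if core_dist_n_jobs == -1:
        # Assumption (joblib convention): -1 means use every CPU.
        return n_cpus
    return core_dist_n_jobs


print(effective_n_jobs(-2, n_cpus=8))  # 7
print(effective_n_jobs(4, n_cpus=8))   # 4
```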

cluster_selection_method: string, optional (default='eom')

The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

eom

leaf

allow_single_cluster: bool, optional (default=False)

By default HDBSCAN* will not produce a single cluster. Setting this to True overrides that behaviour and allows single-cluster results when you consider them valid for your dataset.

prediction_data: bool, optional (default=False)

Whether to generate extra cached data for predicting labels or membership vectors for new, unseen points later. If you wish to persist the clustering object for later re-use, you probably want to set this to True.

match_reference_implementation: bool, optional (default=False)

There are some interpretational differences between this HDBSCAN* implementation and the original authors' reference implementation in Java. These can result in very minor differences in clustering results. Setting this flag to True will, at some performance cost, ensure that the clustering results match the reference implementation.

connectivity: {'pairwise', 'knn'}, default='knn'

The type of connectivity matrix to compute.

‘pairwise’ will compute the entire fully-connected graph of pairwise distances between each set of points. This is the fastest to compute and can be very fast for smaller datasets but requires O(n^2) space.

‘knn’ will sparsify the fully-connected connectivity matrix to save memory and enable much larger inputs. “n_neighbors” will control the amount of memory used and the graph will be connected automatically in the event “n_neighbors” was not large enough to connect it.
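To get a feel for the memory trade-off between the two options, the following back-of-the-envelope sketch counts stored graph entries. The helper `connectivity_entries` and the value n_neighbors=15 are illustrative assumptions, and real memory use also depends on dtype and sparse-format overhead.

```python
def connectivity_entries(n_samples, connectivity="knn", n_neighbors=15):
    """Rough number of stored edge weights for each connectivity type."""
    if connectivity == "pairwise":
        # Dense graph: one entry per ordered pair of points, O(n^2).
        return n_samples * n_samples
    if connectivity == "knn":
        # Sparse graph: about n_neighbors entries per point, O(n * k).
        return n_samples * n_neighbors
    raise ValueError(f"unknown connectivity: {connectivity!r}")


n = 100_000
dense = connectivity_entries(n, "pairwise")
sparse = connectivity_entries(n, "knn")
print(dense, sparse, dense // sparse)
```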

output_type: {'input', 'cudf', 'cupy', 'numpy', 'numba'}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Examples

>>> from dasf.ml.cluster import HDBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = HDBSCAN(min_cluster_size=30, min_samples=2).fit(X)
>>> clustering
HDBSCAN(min_cluster_size=30, min_samples=2)

For further information see:

- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
- https://docs.rapids.ai/api/cuml/stable/api.html#dbscan
- https://docs.rapids.ai/api/cuml/stable/api.html#dbscan-clustering

References

Constructor of the class HDBSCAN.

alpha

gen_min_span_tree

leaf_size

metric

min_cluster_size

min_samples

p

algorithm

approx_min_span_tree

core_dist_n_jobs

cluster_selection_method

allow_single_cluster

prediction_data

match_reference_implementation

connectivity

output_type

verbose

__hdbscan_cpu

_fit_cpu(X, y=None)

Perform HDBSCAN clustering from features or distance matrix using CPU only.

Parameters

X: {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

Returns

self: object

Fitted estimator.

_fit_gpu(X, y=None, convert_dtype=True)

Perform HDBSCAN clustering from features or distance matrix using GPU only (from CuML).

Parameters

X: {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

Returns

self: object

Fitted estimator.

_fit_predict_cpu(X, y=None)

Performs clustering on X and returns cluster labels using only CPU.

Parameters

X: {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

Returns

y: ndarray of shape (n_samples,)

Cluster labels.

_fit_predict_gpu(X, y=None)

Performs clustering on X and returns cluster labels using only GPU (from CuML).

Parameters

X: {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

Returns

y: ndarray of shape (n_samples,)

Cluster labels.
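Once labels are returned, a quick summary is often the first thing to check. The sketch below works on any sequence of integer labels; the helper `summarize_labels` is hypothetical, and it assumes the upstream HDBSCAN convention that noise points are labelled -1, which this page does not state explicitly.

```python
from collections import Counter


def summarize_labels(labels, noise_label=-1):
    """Count clusters, noise points, and per-cluster sizes from a label array."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(noise_label, 0)  # assumption: -1 marks noise
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(counts),
    }


# Hypothetical labels, standing in for the array returned by a fit/predict call.
labels = [0, 0, 1, -1, 1, 0]
summary = summarize_labels(labels)
print(summary)
```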