dasf.ml.cluster

Init module for Clustering ML algorithms.

Submodules

Classes

AgglomerativeClustering

Agglomerative Clustering

DBSCAN

Perform DBSCAN clustering from vector array or distance matrix.

HDBSCAN

Perform HDBSCAN clustering from vector array or distance matrix.

KMeans

K-Means clustering.

SOM

Initializes a Self-Organizing Map.

SpectralClustering

Apply clustering to a projection of the normalized Laplacian.

Package Contents

class dasf.ml.cluster.AgglomerativeClustering(n_clusters=2, metric='euclidean', connectivity=None, linkage='single', memory=None, compute_full_tree='auto', distance_threshold=None, compute_distances=False, handle=None, verbose=False, n_neighbors=10, output_type=None, **kwargs)[source]

Bases: dasf.ml.cluster.classifier.ClusterClassifier

Agglomerative Clustering

Recursively merges the pair of clusters that minimally increases a given linkage distance.
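The merge loop described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, assuming single linkage and Euclidean distance; `naive_agglomerate` and `single_linkage` are hypothetical helper names, not part of dasf, scikit-learn, or cuML, and the real backends use far more efficient algorithms.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b, points):
    """Smallest point-to-point distance between clusters a and b (lists of indices)."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def naive_agglomerate(points, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain (illustrative only)."""
    clusters = [[i] for i in range(len(points))]  # every sample starts alone
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: single_linkage(clusters[ab[0]], clusters[ab[1]], points),
        )
        clusters[a].extend(clusters[b])  # a < b, so deleting b keeps index a valid
        del clusters[b]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


X = [(1, 2), (1, 4), (1, 0), (4, 2), (4, 4), (4, 0)]
labels = naive_agglomerate(X, n_clusters=2)
```

With these six points the two columns (x = 1 and x = 4) end up in separate clusters, because every within-column gap is smaller than the gap between columns.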

Read more in the User Guide.

Parameters

n_clusters : int or None, default=2

The number of clusters to find. It must be None if distance_threshold is not None.

metric : str or callable, default=“euclidean”

Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix is needed as input for the fit method.

memory : str or object with the joblib.Memory interface, default=None

Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

connectivity : array-like or callable, default=None

Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as one derived from kneighbors_graph. Default is None, i.e., the hierarchical clustering algorithm is unstructured.

compute_full_tree : ‘auto’ or bool, default=‘auto’

Stop the construction of the tree early at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.
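The “auto” rule above is easy to misread, so here is a minimal sketch of it as a plain function. `resolve_compute_full_tree` is a hypothetical helper written for this page; it only mirrors the rule as stated in the parameter description and is not the code used by the backends.

```python
def resolve_compute_full_tree(n_clusters, n_samples,
                              compute_full_tree="auto",
                              distance_threshold=None):
    """Return the effective boolean value of compute_full_tree (illustrative)."""
    if distance_threshold is not None:
        # The full tree is required whenever a distance threshold is used.
        if compute_full_tree is False:
            raise ValueError(
                "compute_full_tree must be True if distance_threshold is not None"
            )
        return True
    if compute_full_tree == "auto":
        # Full tree only when few clusters are requested relative to the data size.
        return n_clusters < max(100, 0.02 * n_samples)
    return bool(compute_full_tree)


small_request = resolve_compute_full_tree(n_clusters=2, n_samples=1_000)
large_request = resolve_compute_full_tree(n_clusters=150, n_samples=1_000)
```

For 1,000 samples the cutoff is max(100, 20) = 100, so asking for 2 clusters builds the full tree while asking for 150 clusters stops early.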

linkage : {‘ward’, ‘complete’, ‘average’, ‘single’}, default=‘single’

Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.

  • ‘ward’ minimizes the variance of the clusters being merged.

  • ‘average’ uses the average of the distances of each observation of the two sets.

  • ‘complete’ or ‘maximum’ linkage uses the maximum distances between all observations of the two sets.

  • ‘single’ uses the minimum of the distances between all observations of the two sets.

Added in version 0.20: Added the ‘single’ option
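To make the difference between the distance-based criteria concrete, the sketch below computes ‘single’, ‘complete’ and ‘average’ linkage between two tiny point sets using Euclidean distance. The point sets and the `pairwise` helper are made up for illustration; ‘ward’ is omitted because it is defined through cluster variance rather than through pairwise distances.

```python
from itertools import product
from math import dist


def pairwise(a, b):
    """All Euclidean distances between a point of a and a point of b."""
    return [dist(p, q) for p, q in product(a, b)]


left = [(0, 0), (0, 1)]
right = [(3, 0), (4, 0)]

d = pairwise(left, right)
single = min(d)             # closest pair of observations
complete = max(d)           # farthest pair of observations
average = sum(d) / len(d)   # mean over all pairs
```

Here `single` is the gap between (0, 0) and (3, 0), `complete` is the gap between (0, 1) and (4, 0), and `average` lies between the two.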

distance_threshold : float, default=None

The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.

Added in version 0.21.

compute_distances : bool, default=False

Computes distances between clusters even if distance_threshold is not used. This can be used to draw a dendrogram, but introduces a computational and memory overhead.

Added in version 0.24.

n_neighbors : int, default=10

The number of neighbors to compute when connectivity = “knn”.

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Examples

>>> from dasf.ml.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()

For further information, see:

  • https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html

  • https://docs.rapids.ai/api/cuml/stable/api.html#agglomerative-clustering

Constructor of the class AgglomerativeClustering.

n_clusters
metric
connectivity
linkage
memory
compute_full_tree
distance_threshold
compute_distances
handle
verbose
n_neighbors
output_type
__agg_cluster_cpu
_fit_cpu(X, y=None, convert_dtype=True)[source]

Fit without validation using CPU only.

Parameters

X : ndarray of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'.

Returns

self : object

Returns the fitted instance.

_fit_gpu(X, y=None, convert_dtype=True)[source]

Fit without validation using GPU only.

Parameters

X : ndarray of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'.

Returns

self : object

Returns the fitted instance.

_fit_predict_cpu(X, y=None)[source]

Fit and return the result of each sample’s clustering assignment using CPU only.

In addition to fitting, this method also returns the clustering assignment for each sample in the training set.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'.

y : Ignored

Not used, present here for API consistency by convention.

Returns

labels : ndarray of shape (n_samples,)

Cluster labels.

_fit_predict_gpu(X, y=None)[source]

Fit and return the result of each sample’s clustering assignment using GPU only.

In addition to fitting, this method also returns the clustering assignment for each sample in the training set.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'.

y : Ignored

Not used, present here for API consistency by convention.

Returns

labels : ndarray of shape (n_samples,)

Cluster labels.

class dasf.ml.cluster.DBSCAN(eps=0.5, leaf_size=40, metric='euclidean', min_samples=5, p=None, output_type=None, calc_core_sample_indices=True, verbose=False, **kwargs)[source]

Bases: dasf.ml.cluster.classifier.ClusterClassifier

Perform DBSCAN clustering from vector array or distance matrix.

DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.

Read more in the User Guide.

Parameters

eps : float, default=0.5

The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.

min_samples : int, default=5

The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
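The interaction of eps and min_samples can be illustrated with a small brute-force check of which points qualify as core points. This is a sketch under stated assumptions: Euclidean distance, a neighborhood that includes points at distance exactly eps, and a hypothetical helper name `core_sample_mask`. It does not reproduce the cluster-expansion step of DBSCAN.

```python
from math import dist


def core_sample_mask(points, eps, min_samples, sample_weight=None):
    """Flag points whose eps-neighborhood (the point itself included) is heavy enough."""
    weights = sample_weight if sample_weight is not None else [1.0] * len(points)
    mask = []
    for p in points:
        # Total weight of all samples within eps of p; p counts toward its own total.
        total = sum(w for q, w in zip(points, weights) if dist(p, q) <= eps)
        mask.append(total >= min_samples)
    return mask


X = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
mask = core_sample_mask(X, eps=3, min_samples=2)
weighted = core_sample_mask(X, eps=3, min_samples=2,
                            sample_weight=[1, 1, 1, 1, 1, 2])
```

The isolated point (25, 80) has only itself in its neighborhood, so it is not a core point with unit weights; giving it a weight of 2 makes it a core point on its own.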

metric : str or callable, default=‘euclidean’

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances() for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors for DBSCAN.

Added in version 0.17: metric precomputed to accept precomputed sparse matrix.

metric_params : dict, default=None

Additional keyword arguments for the metric function.

Added in version 0.19.

algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=‘auto’

The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.

leaf_size : int, default=40

Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.

p : float, default=None

The power of the Minkowski metric to be used to calculate distance between points. If None, then p=2 (equivalent to the Euclidean distance).
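As a quick illustration of what p controls, the sketch below implements the Minkowski distance directly from its definition. `minkowski` is a hypothetical helper for this page; treating p=None as p=2 follows the parameter description above.

```python
def minkowski(a, b, p=None):
    """Minkowski distance between two equal-length sequences; p=None means p=2."""
    p = 2 if p is None else p
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


euclidean = minkowski((0, 0), (3, 4))        # p defaults to 2
manhattan = minkowski((0, 0), (3, 4), p=1)   # sum of absolute differences
```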

n_jobs : int, default=None

The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

calc_core_sample_indices : bool, optional, default=True

Indicates whether the indices of the core samples should be calculated. If the attribute core_sample_indices_ will not be used, setting this to False avoids unnecessary kernel launches.

Examples

>>> from dasf.ml.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering
DBSCAN(eps=3, min_samples=2)

For further information, see:

  • https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN

  • https://docs.rapids.ai/api/cuml/stable/api.html#dbscan

  • https://docs.rapids.ai/api/cuml/stable/api.html#dbscan-clustering

See Also

OPTICS : A similar clustering at multiple values of eps. Our implementation is optimized for memory usage.

References

Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996

Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.

Constructor of the class DBSCAN.

eps
leaf_size
metric
min_samples
p
output_type
calc_core_sample_indices
verbose
__dbscan_cpu
_lazy_fit_gpu(X, y=None, out_dtype='int32')[source]

Perform DBSCAN clustering from features, or distance matrix using Dask with GPUs only (from CuML).

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

Returns

self : object

Returns a fitted instance of self.

_fit_cpu(X, y=None, sample_weight=None)[source]

Perform DBSCAN clustering from features, or distance matrix using CPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

Returns

self : object

Returns a fitted instance of self.

_fit_gpu(X, y=None, out_dtype='int32')[source]

Perform DBSCAN clustering from features, or distance matrix using GPU only (from CuML).

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

Returns

self : object

Returns a fitted instance of self.

_lazy_fit_predict_gpu(X, y=None, out_dtype='int32')[source]

Compute clusters from a data or distance matrix and predict labels using Dask and GPUs (from CuML).

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

Returns

labels : ndarray of shape (n_samples,)

Cluster labels. Noisy samples are given the label -1.

_fit_predict_cpu(X, y=None, sample_weight=None)[source]

Compute clusters from a data or distance matrix and predict labels using CPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

Returns

labels : ndarray of shape (n_samples,)

Cluster labels. Noisy samples are given the label -1.
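Because noise is encoded as -1, the label vector needs a little care when counting clusters. The sketch below shows one way to summarize it; `summarize_labels` is a hypothetical helper, and the label values are illustrative rather than the output of an actual run.

```python
from collections import Counter


def summarize_labels(labels):
    """Count clusters and noise points in a DBSCAN-style label vector."""
    counts = Counter(labels)
    n_noise = counts.pop(-1, 0)  # -1 marks noise and is not a cluster
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}


summary = summarize_labels([0, 0, 0, 1, 1, -1])
```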

_fit_predict_gpu(X, y=None, out_dtype='int32')[source]

Compute clusters from a data or distance matrix and predict labels using GPU only (from CuML).

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)

Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.

Returns

labels : ndarray of shape (n_samples,)

Cluster labels. Noisy samples are given the label -1.

class dasf.ml.cluster.HDBSCAN(alpha=1.0, gen_min_span_tree=False, leaf_size=40, metric='euclidean', min_cluster_size=5, min_samples=None, p=None, algorithm='auto', approx_min_span_tree=True, core_dist_n_jobs=4, cluster_selection_method='eom', allow_single_cluster=False, prediction_data=False, match_reference_implementation=False, connectivity='knn', output_type=None, verbose=0, **kwargs)[source]

Bases: dasf.ml.cluster.classifier.ClusterClassifier

Perform HDBSCAN clustering from vector array or distance matrix.

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

Parameters

min_cluster_size : int, optional (default=5)

The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.

min_samples : int, optional (default=None)

The number of samples in a neighbourhood for a point to be considered a core point.

metric : str or callable, optional (default=‘euclidean’)

The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square.

p : int, optional (default=None)

p value to use if using the minkowski metric.

alpha : float, optional (default=1.0)

A distance scaling parameter as used in robust single linkage. See [3] for more information.

cluster_selection_epsilon : float, optional (default=0.0)

A distance threshold. Clusters below this value will be merged.

See [5] for more information.

algorithm : str, optional (default=‘best’)

Exactly which algorithm to use; hdbscan has variants specialised for different characteristics of the data. By default this is set to best which chooses the “best” algorithm given the nature of the data. You can force other options if you believe you know better. Options are:

  • best

  • generic

  • prims_kdtree

  • prims_balltree

  • boruvka_kdtree

  • boruvka_balltree

leaf_size : int, optional (default=40)

If using a space tree algorithm (kdtree or balltree), the number of points in a leaf node of the tree. This does not alter the resulting clustering, but may have an effect on the runtime of the algorithm.

memory : instance of joblib.Memory or str, optional

Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.

approx_min_span_tree : bool, optional (default=True)

Whether to accept an only approximate minimum spanning tree. For some algorithms this can provide a significant speedup, but the resulting clustering may be of marginally lower quality. If you are willing to sacrifice speed for correctness you may want to explore this; in general this should be left at the default True.

gen_min_span_tree : bool, optional (default=False)

Whether to generate the minimum spanning tree with regard to mutual reachability distance for later analysis.

core_dist_n_jobs : int, optional (default=4)

Number of parallel jobs to run in core distance computations (if supported by the specific algorithm). For core_dist_n_jobs below -1, (n_cpus + 1 + core_dist_n_jobs) are used.
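The negative-value rule above is simple arithmetic, sketched here as a function. `effective_core_dist_jobs` is a hypothetical helper; applying the same formula to -1 (which then yields all CPUs) is an assumption that matches the usual joblib convention, and clamping the result to at least 1 is a design choice of this sketch.

```python
def effective_core_dist_jobs(core_dist_n_jobs, n_cpus):
    """Translate core_dist_n_jobs into a worker count for a machine with n_cpus."""
    if core_dist_n_jobs < 0:
        # n_cpus + 1 + core_dist_n_jobs, never fewer than one worker.
        return max(1, n_cpus + 1 + core_dist_n_jobs)
    return core_dist_n_jobs


all_but_one = effective_core_dist_jobs(-2, n_cpus=8)
default_jobs = effective_core_dist_jobs(4, n_cpus=8)
```

On an 8-CPU machine, -2 therefore means 7 workers, while positive values are used as given.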

cluster_selection_method : str, optional (default=‘eom’)

The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree – this provides the most fine grained and homogeneous clusters. Options are:

  • eom

  • leaf

allow_single_cluster : bool, optional (default=False)

By default HDBSCAN* will not produce a single cluster. Setting this to True overrides that and allows single-cluster results when you consider this a valid result for your dataset.

prediction_data : bool, optional (default=False)

Whether to generate extra cached data for predicting labels or membership vectors for new unseen points later. If you wish to persist the clustering object for later re-use you probably want to set this to True.

match_reference_implementation : bool, optional (default=False)

There exist some interpretational differences between this HDBSCAN* implementation and the original authors' reference implementation in Java. This can result in very minor differences in clustering results. Setting this flag to True will, at some performance cost, ensure that the clustering results match the reference implementation.

connectivity : {‘pairwise’, ‘knn’}, default=‘knn’

The type of connectivity matrix to compute.

  • ‘pairwise’ will compute the entire fully-connected graph of pairwise distances between each set of points. This is the fastest to compute and can be very fast for smaller datasets but requires O(n^2) space.

  • ‘knn’ will sparsify the fully-connected connectivity matrix to save memory and enable much larger inputs. “n_neighbors” will control the amount of memory used and the graph will be connected automatically in the event “n_neighbors” was not large enough to connect it.
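A rough back-of-the-envelope comparison shows why ‘knn’ scales to larger inputs. The sketch below only counts stored distance entries; `connectivity_entries` is a hypothetical helper, and it ignores any extra edges added to connect the graph as well as per-entry index overhead in a sparse format.

```python
def connectivity_entries(n_samples, connectivity, n_neighbors):
    """Approximate number of distance entries held for each connectivity type."""
    if connectivity == "pairwise":
        return n_samples * n_samples       # dense, O(n^2)
    if connectivity == "knn":
        return n_samples * n_neighbors     # sparse, O(n * k)
    raise ValueError(f"unknown connectivity: {connectivity!r}")


dense = connectivity_entries(100_000, "pairwise", n_neighbors=10)
sparse = connectivity_entries(100_000, "knn", n_neighbors=10)
ratio = dense // sparse
```

For 100,000 samples and 10 neighbors the dense graph holds 10,000 times as many entries as the sparse one.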

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

Examples

>>> from dasf.ml.cluster import HDBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = HDBSCAN(min_cluster_size=30, min_samples=2).fit(X)
>>> clustering
HDBSCAN(min_cluster_size=30, min_samples=2)

For further information, see:

  • https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN

  • https://docs.rapids.ai/api/cuml/stable/api.html#dbscan

  • https://docs.rapids.ai/api/cuml/stable/api.html#dbscan-clustering

References

Constructor of the class HDBSCAN.

alpha
gen_min_span_tree
leaf_size
metric
min_cluster_size
min_samples
p
algorithm
approx_min_span_tree
core_dist_n_jobs
cluster_selection_method
allow_single_cluster
prediction_data
match_reference_implementation
connectivity
output_type
verbose
__hdbscan_cpu
_fit_cpu(X, y=None)[source]

Perform HDBSCAN clustering from features or distance matrix using CPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

Returns

self : object

Fitted estimator.

_fit_gpu(X, y=None, convert_dtype=True)[source]

Perform HDBSCAN clustering from features or distance matrix using GPU only (from CuML).

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

Returns

self : object

Fitted estimator.

_fit_predict_cpu(X, y=None)[source]

Performs clustering on X and returns cluster labels using only CPU.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

Returns

y : ndarray of shape (n_samples,)

Cluster labels.

_fit_predict_gpu(X, y=None)[source]

Performs clustering on X and returns cluster labels using only GPU (from CuML).

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)

A feature array, or array of distances between samples if metric='precomputed'.

Returns

y : ndarray of shape (n_samples,)

Cluster labels.

class dasf.ml.cluster.KMeans(n_clusters=8, init=None, n_init=None, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd', oversampling_factor=2.0, n_jobs=1, init_max_iter=None, max_samples_per_batch=32768, precompute_distances='auto', output_type=None, **kwargs)[source]

Bases: dasf.ml.cluster.classifier.ClusterClassifier

K-Means clustering.

Read more in the User Guide.

Parameters

n_clusters : int, default=8

The number of clusters to form as well as the number of centroids to generate.

init : {‘k-means++’, ‘random’}, callable or array-like of shape (n_clusters, n_features), default=’k-means++’

Method for initialization:

‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.

If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.

n_init : int, default=10

Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.

max_iter : int, default=300

Maximum number of iterations of the k-means algorithm for a single run.

tol : float, default=1e-4

Relative tolerance with regard to the Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.

precompute_distances : {‘auto’, True, False}, default=‘auto’

Precompute distances (faster but takes more memory).

‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision. IMPORTANT: This is used only in Dask ML version.

True : always precompute distances.

False : never precompute distances.
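The ‘auto’ threshold and the quoted memory figure can be checked with a few lines of arithmetic. `should_precompute` is a hypothetical helper that mirrors the rule as described above; it is not the Dask ML implementation.

```python
def should_precompute(n_samples, n_clusters, precompute_distances="auto"):
    """Decide whether distances are precomputed, following the rule described above."""
    if precompute_distances == "auto":
        # Precompute only up to 12 million sample-cluster pairs.
        return n_samples * n_clusters <= 12_000_000
    return bool(precompute_distances)


BYTES_PER_DOUBLE = 8
# Size of a 12-million-entry distance array in double precision, in megabytes.
overhead_mb = 12_000_000 * BYTES_PER_DOUBLE / 1e6

fits = should_precompute(n_samples=1_000_000, n_clusters=8)
too_big = should_precompute(n_samples=2_000_000, n_clusters=8)
```

Twelve million doubles occupy 96 MB, which is where the "about 100MB" figure comes from.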

verbose : int, default=0

Verbosity mode.

random_state : int, RandomState instance or None, default=None

Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.

copy_x : bool, default=True

When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.

n_jobs : int, default=1

The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main cython loop which assigns each sample to its closest center. IMPORTANT: This is used only in Dask ML version.

None or -1 means using all processors.

init_max_iter : int, default=None

Number of iterations for init step.

algorithm : {“lloyd”, “elkan”}, default=“lloyd”

K-means algorithm to use. The classical EM-style algorithm is “lloyd”. The “elkan” variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

Changed in version 0.18: Added Elkan algorithm

oversampling_factor : float, default=2.0

The number of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.

max_samples_per_batch : int, default=32768

The number of data samples to use for batches of the pairwise distance computation. This computation is done during both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.
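The two size formulas given for oversampling_factor and max_samples_per_batch can be evaluated directly for the default values. The helper names below are hypothetical and exist only to spell out the arithmetic.

```python
def sampled_centroid_candidates(n_clusters, oversampling_factor=2.0):
    """Total centroids sampled: oversampling_factor * n_clusters * 8."""
    return int(oversampling_factor * n_clusters * 8)


def batch_distance_elements(n_clusters, max_samples_per_batch=32768):
    """Elements per batched pairwise distance computation."""
    return max_samples_per_batch * n_clusters


candidates = sampled_centroid_candidates(n_clusters=8)
elements = batch_distance_elements(n_clusters=8)
```

With the defaults and 8 clusters this gives 128 candidate centroids and 262,144 distance elements per batch; the second number grows linearly with n_clusters, which is why very large cluster counts may call for a smaller batch size.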

output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None

Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.

See Also

MiniBatchKMeans : Alternative online implementation that does incremental updates of the centers' positions using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than the default batch implementation.

Notes

The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.

The average complexity is given by O(k n T), where k is the number of clusters, n is the number of samples and T is the number of iterations.

The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features. (D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG2006)

In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it can converge to a local minimum. That’s why it can be useful to restart it several times.

If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.
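For readers who want to see the moving parts, here is a minimal pure-Python sketch of Lloyd's algorithm with a tolerance check and a final label reassignment. It is an illustration under simplifying assumptions, not the scikit-learn, Dask ML, or cuML implementation: initial centers are passed in explicitly, `tol` is treated as an absolute threshold on the Frobenius norm of the center shift, and `lloyd` and `assign` are hypothetical names.

```python
from math import dist, sqrt


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points]


def lloyd(points, centers, max_iter=300, tol=1e-4):
    """Alternate assignment and mean update until the centers stop moving."""
    centers = [tuple(float(v) for v in c) for c in centers]
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        labels = assign(points, centers)
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # keep the center of an empty cluster
        # Frobenius norm of the difference between old and new centers.
        shift = sqrt(sum(dist(a, b) ** 2 for a, b in zip(centers, new_centers)))
        centers = new_centers
        if shift <= tol:
            break
    labels = assign(points, centers)  # final reassignment against final centers
    inertia = sum(dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return centers, labels, inertia, n_iter


X = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
centers, labels, inertia, n_iter = lloyd(X, centers=[(1, 0), (10, 4)])
```

Running the sketch from several different initial centers and keeping the result with the lowest inertia is the idea behind n_init.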

Examples

>>> from dasf.ml.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)

For further information, see:

  • https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

  • https://ml.dask.org/modules/generated/dask_ml.cluster.KMeans.html

  • https://docs.rapids.ai/api/cuml/stable/api.html#k-means-clustering

  • https://docs.rapids.ai/api/cuml/stable/api.html#cuml.dask.cluster.KMeans

Constructor of the class KMeans.

n_clusters
random_state
max_iter
init
n_init
tol
verbose
copy_x
algorithm
oversampling_factor
n_jobs
init_max_iter
max_samples_per_batch
precompute_distances
output_type
__kmeans_cpu
__kmeans_mcpu
_lazy_fit_cpu(X, y=None, sample_weight=None)[source]

Compute Dask k-means clustering.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : Ignored

Not used, present here for API consistency by convention.

Returns

self : object

Fitted estimator.

_lazy_fit_gpu(X, y=None, sample_weight=None)[source]

Compute Dask CuML k-means clustering.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.

y : Ignored

Not used, present here for API consistency by convention.

sample_weight : array-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

self : object

Fitted estimator.

_fit_cpu(X, y=None, sample_weight=None)[source]

Compute Scikit Learn k-means clustering.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.

yIgnored

Not used, present here for API consistency by convention.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

selfobject

Fitted estimator.

_fit_gpu(X, y=None, sample_weight=None)[source]

Compute CuML k-means clustering.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.

yIgnored

Not used, present here for API consistency by convention.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

selfobject

Fitted estimator.

_lazy_fit_predict_cpu(X, y=None, sample_weight=None)[source]

Compute cluster centers and predict cluster index for each sample using Dask ML.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

yIgnored

Not used, present here for API consistency by convention.

sample_weightIgnored

Not used, present here for API consistency by convention.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_lazy_fit_predict_gpu(X, y=None, sample_weight=None)[source]

Compute cluster centers and predict cluster index for each sample using Dask CuML.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

yIgnored

Not used, present here for API consistency by convention.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_fit_predict_cpu(X, y=None, sample_weight=None)[source]

Compute cluster centers and predict cluster index for each sample using Scikit Learn.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

yIgnored

Not used, present here for API consistency by convention.

sample_weightIgnored

Not used, present here for API consistency by convention.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_fit_predict_gpu(X, y=None, sample_weight=None)[source]

Compute cluster centers and predict cluster index for each sample using CuML.

Convenience method; equivalent to calling fit(X) followed by predict(X).

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to transform.

yIgnored

Not used, present here for API consistency by convention.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_lazy_predict_cpu(X, sample_weight=None)[source]

Predict the closest cluster each sample in X belongs to using Dask ML.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weightIgnored

Not used, present here for API consistency by convention.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.
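
The code book description above can be illustrated with a minimal sketch. It is not the dasf, Dask ML, or scikit-learn implementation; the helper name predict_labels and the centroid values are hypothetical, and Euclidean distance is assumed.

```python
import math


def predict_labels(samples, code_book):
    """Return, for each sample, the index of the closest code (centroid)."""
    labels = []
    for sample in samples:
        # Euclidean distance from this sample to every code in the code book.
        distances = [math.dist(sample, code) for code in code_book]
        # The predicted label is the position of the nearest code.
        labels.append(distances.index(min(distances)))
    return labels


# Hypothetical code book with two centroids (illustrative values only).
code_book = [[10.0, 2.0], [1.0, 2.0]]
labels = predict_labels([[0, 0], [12, 3]], code_book)
print(labels)
```

Each returned value is an index into code_book, which is what the Returns entry above describes as the index of the cluster each sample belongs to.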

_lazy_predict_gpu(X, sample_weight=None)[source]

Predict the closest cluster each sample in X belongs to using Dask CuML.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_predict_cpu(X, sample_weight=None)[source]

Predict the closest cluster each sample in X belongs to using Scikit Learn.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_predict_gpu(X, sample_weight=None)[source]

Predict the closest cluster each sample in X belongs to using CuML.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_lazy_predict2_cpu(X, sample_weight=None)[source]

A block predict using Scikit Learn variant but for Dask.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_lazy_predict2_gpu(X, sample_weight=None)[source]

A block predict using CuML variant but for Dask.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

_predict2_cpu(X, sample_weight=None, compat=True)[source]

A block predict using Scikit Learn variant as a placeholder.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

compatbool, default=True

predict2 has no single CPU/GPU implementation. When compat is True, the original predict method is used instead; otherwise, an exception is raised.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.
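
The fallback behaviour described for compat could look roughly like the sketch below. Everything in it is an assumption made for illustration: TinyClusterer is a hypothetical stand-in rather than the dasf class, its predict logic is a placeholder, and the exception type is a guess because the documentation only says that an exception is raised.

```python
class TinyClusterer:
    """Hypothetical stand-in used only to illustrate the compat flag."""

    def predict(self, X):
        # Placeholder logic: label each row by the sign of its first feature.
        return [0 if row[0] < 0 else 1 for row in X]

    def predict2(self, X, compat=True):
        # With compat enabled, reuse the ordinary predict method.
        if compat:
            return self.predict(X)
        # Assumed exception type; the real one is not documented here.
        raise NotImplementedError("predict2 has no single CPU/GPU version")


model = TinyClusterer()
print(model.predict2([[-1.0], [2.0]]))
```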

_predict2_gpu(X, sample_weight=None, compat=True)[source]

A block predict using CuML variant as a placeholder.

In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features)

New data to predict.

sample_weightarray-like of shape (n_samples,), default=None

The weights for each observation in X. If None, all observations are assigned equal weight.

compatbool, default=True

predict2 has no single CPU/GPU implementation. When compat is True, the original predict method is used instead; otherwise, an exception is raised.

Returns

labelsndarray of shape (n_samples,)

Index of the cluster each sample belongs to.

predict2(sample_weight=None)[source]

Generic predict2 function, dispatched according to the executor (for some ML methods only).

class dasf.ml.cluster.SOM(x, y, input_len, num_epochs=100, sigma=0, sigmaN=1, learning_rate=0.5, learning_rateN=0.01, decay_function='exponential', neighborhood_function='gaussian', std_coeff=0.5, topology='rectangular', activation_distance='euclidean', random_seed=None, n_parallel=0, compact_support=False, **kwargs)[source]

Bases: dasf.ml.cluster.classifier.ClusterClassifier

Initializes a Self-Organizing Map (SOM).

A rule of thumb to set the size of the grid for a dimensionality reduction task is that it should contain 5*sqrt(N) neurons where N is the number of samples in the dataset to analyze.

E.g., if your dataset has 150 samples, 5*sqrt(150) ≈ 61.24, hence an 8-by-8 map (64 neurons) should perform well.
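
The rule of thumb can be checked with a few lines of Python. The helper name suggested_grid_side is hypothetical, and rounding up to a square grid is an assumption made here; the rule itself only fixes the approximate number of neurons.

```python
import math


def suggested_grid_side(n_samples):
    """Return (target neuron count, side of a square grid that holds it)."""
    # Rule of thumb: roughly 5 * sqrt(N) neurons.
    neurons = 5 * math.sqrt(n_samples)
    # Assumption: use a square map and round the side length up.
    side = math.ceil(math.sqrt(neurons))
    return neurons, side


neurons, side = suggested_grid_side(150)
print(round(neurons, 2), side)
```

The resulting side could then be passed as both x and y when constructing the SOM.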

Parameters

xint

x dimension of the SOM.

yint

y dimension of the SOM.

input_lenint

Number of the elements of the vectors in input.

sigmafloat, default=min(x,y)/2

Spread of the neighborhood function, needs to be adequate to the dimensions of the map.

sigmaNfloat, default=1

Spread of the neighborhood function at last iteration.

learning_ratefloat, default=0.5

Initial learning rate.

learning_rateNfloat, default=0.01

Final learning rate.

decay_functionstring, default=’exponential’

Function that reduces learning_rate and sigma at each iteration. Possible values: ‘exponential’, ‘linear’, ‘asymptotic’

neighborhood_functionstring, default=’gaussian’

Function that weights the neighborhood of a position in the map. Possible values: ‘gaussian’, ‘mexican_hat’, ‘bubble’, ‘triangle’

topologystring, default=’rectangular’

Topology of the map. Possible values: ‘rectangular’, ‘hexagonal’

activation_distancestring, default=’euclidean’

Distance used to activate the map. Possible values: ‘euclidean’, ‘cosine’, ‘manhattan’

random_seedint, default=None

Random seed to use.

n_paralleluint, default=#max_CUDA_threads or 500*#CPUcores

Number of samples to be processed at a time. Setting a too low value may drastically lower performance due to under-utilization, setting a too high value increases memory usage without granting any significant performance benefit.

xpnumpy or cupy, default=cupy if can be imported else numpy

Use numpy (CPU) or cupy (GPU) for computations.

std_coeff: float, default=0.5

Used to calculate the Gaussian exponent denominator: d = 2*std_coeff**2*sigma**2

compact_support: bool, default=False

Cut the neighborhood function to 0 beyond the neighborhood radius sigma.
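
A minimal sketch of how std_coeff and compact_support might enter a Gaussian neighborhood weight is given below. Only the denominator d = 2*std_coeff**2*sigma**2 comes from the parameter description; the exp(-distance**2 / d) form and the helper name gaussian_neighborhood are assumptions, not the library code.

```python
import math


def gaussian_neighborhood(distance, sigma, std_coeff=0.5, compact_support=False):
    """Weight of a neuron at the given grid distance from the winner."""
    # Optional hard cut-off beyond the neighborhood radius sigma.
    if compact_support and distance > sigma:
        return 0.0
    # Denominator as documented for std_coeff.
    d = 2 * std_coeff**2 * sigma**2
    # Assumed Gaussian form of the neighborhood function.
    return math.exp(-distance**2 / d)


print(gaussian_neighborhood(1.0, sigma=1.0))
print(gaussian_neighborhood(2.0, sigma=1.0, compact_support=True))
```

With the default std_coeff of 0.5, a neuron at distance sigma from the winner receives weight exp(-2), so a smaller std_coeff makes the neighborhood fall off faster.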

Examples

>>> from dasf.ml.cluster import SOM
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [1, 0],
...               [4, 7], [3, 5], [3, 6]])
>>> som = SOM(x=3, y=2, input_len=2,
...           num_epochs=100).fit(X)
>>> som
SOM(x=3, y=2, input_len=2, num_epochs=100)

Constructor of the class SOM.

x
y
input_len
num_epochs
sigma
sigmaN
learning_rate
learning_rateN
decay_function
neighborhood_function
std_coeff
topology
activation_distance
random_seed
n_parallel
compact_support
__som_cpu
__som_mcpu
_lazy_fit_cpu(X, y=None, sample_weight=None)[source]

Fit SOM method using Dask with CPUs only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.

Returns

selfobject

Returns a fitted instance of self.

_lazy_fit_gpu(X, y=None, sample_weight=None)[source]

Fit SOM method using Dask with GPUs only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.

Returns

selfobject

Returns a fitted instance of self.

_fit_cpu(X, y=None, sample_weight=None)[source]

Fit SOM method using CPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.

Returns

selfobject

Returns a fitted instance of self.

_fit_gpu(X, y=None, sample_weight=None)[source]

Fit SOM method using GPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.

Returns

selfobject

Returns a fitted instance of self.

_lazy_fit_predict_cpu(X, y=None, sample_weight=None)[source]

Fit SOM and select the winner neurons for the input using Dask with CPUs only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

y{array-like, sparse matrix} of shape (n_samples).

This is just a placeholder to keep the compatibility with other fit_predict methods. SOM does not use labels to verify the input.

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit_predict methods. This is not used by SOM.

Returns

selfobject

Returns a fitted instance of self.

_lazy_fit_predict_gpu(X, y=None, sample_weight=None)[source]

Fit SOM and select the winner neurons for the input using Dask with GPUs only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

y{array-like, sparse matrix} of shape (n_samples).

This is just a placeholder to keep the compatibility with other fit_predict methods. SOM does not use labels to verify the input.

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit_predict methods. This is not used by SOM.

Returns

selfobject

Returns a fitted instance of self.

_fit_predict_cpu(X, y=None, sample_weight=None)[source]

Fit SOM and select the winner neurons for the input using CPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

y{array-like, sparse matrix} of shape (n_samples).

This is just a placeholder to keep the compatibility with other fit_predict methods. SOM does not use labels to verify the input.

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit_predict methods. This is not used by SOM.

Returns

selfobject

Returns a fitted instance of self.

_fit_predict_gpu(X, y=None, sample_weight=None)[source]

Fit SOM and select the winner neurons for the input using GPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

y{array-like, sparse matrix} of shape (n_samples).

This is just a placeholder to keep the compatibility with other fit_predict methods. SOM does not use labels to verify the input.

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit_predict methods. This is not used by SOM.

Returns

selfobject

Returns a fitted instance of self.

_lazy_predict_cpu(X, sample_weight=None)[source]

Predict the input using a fitted SOM using Dask with CPUs only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.

Returns

labelsndarray of shape (n_samples,)

Cluster labels. Noisy samples are given the label -1.

_lazy_predict_gpu(X, sample_weight=None)[source]

Predict the input using a fitted SOM using Dask with GPUs only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.

Returns

labelsndarray of shape (n_samples,)

Cluster labels. Noisy samples are given the label -1.

_predict_cpu(X, sample_weight=None)[source]

Predict the input using a fitted SOM using CPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.

Returns

labelsndarray of shape (n_samples,)

Cluster labels. Noisy samples are given the label -1.

_predict_gpu(X, sample_weight=None)[source]

Predict the input using a fitted SOM using GPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

sample_weightarray-like of shape (n_samples,), default=None

This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.

Returns

labelsndarray of shape (n_samples,)

Cluster labels. Noisy samples are given the label -1.

_lazy_quantization_error_cpu(X)[source]

Returns the quantization error computed as the average distance between each input sample and its best matching unit using Dask with CPUs only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

Returns

errorfloat

The quantization error of the trained SOM.
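
The definition above (average distance between each sample and its best matching unit) can be sketched in plain Python. This is an illustration under stated assumptions, not the dasf implementation: Euclidean distance is assumed, and the weights list stands in for the neuron weight vectors of a trained map.

```python
import math


def quantization_error(samples, weights):
    """Mean distance from each sample to its best matching unit (BMU)."""
    total = 0.0
    for sample in samples:
        # The BMU is the neuron whose weight vector is closest to the sample.
        total += min(math.dist(sample, w) for w in weights)
    return total / len(samples)


# Hypothetical neuron weights and input samples (illustrative values only).
weights = [[0.0, 0.0], [3.0, 4.0]]
samples = [[0.0, 1.0], [3.0, 4.0], [6.0, 8.0]]
error = quantization_error(samples, weights)
print(error)
```

A lower value means the neuron weights sit closer to the data on average.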

_lazy_quantization_error_gpu(X)[source]

Returns the quantization error computed as the average distance between each input sample and its best matching unit using Dask with GPUs only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

Returns

errorfloat

The quantization error of the trained SOM.

_quantization_error_cpu(X)[source]

Returns the quantization error computed as the average distance between each input sample and its best matching unit using CPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

Returns

errorfloat

The quantization error of the trained SOM.

_quantization_error_gpu(X)[source]

Returns the quantization error computed as the average distance between each input sample and its best matching unit using GPU only.

Parameters

X : {array-like, sparse matrix} of shape (n_samples, n_features).

Returns

errorfloat

The quantization error of the trained SOM.

quantization_error(X)[source]

Generic quantization_error function, dispatched according to the executor (for the SOM method only).

class dasf.ml.cluster.SpectralClustering(n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=None, n_components=None, persist_embedding=False, kmeans_params=None, verbose=False, **kwargs)[source]

Bases: dasf.ml.cluster.classifier.ClusterClassifier

Apply clustering to a projection of the normalized Laplacian.

In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex, or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster, such as when clusters are nested circles on the 2D plane.

If the affinity matrix is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.

When calling fit, an affinity matrix is constructed using either a kernel function such as the Gaussian (aka RBF) kernel with Euclidean distance d(X, X):

np.exp(-gamma * d(X,X) ** 2)

or a k-nearest neighbors connectivity matrix.

Alternatively, a user-provided affinity matrix can be specified by setting affinity='precomputed'.
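
As a rough illustration of the kernel expression above, the following sketch builds a dense RBF affinity matrix in plain Python. The helper name rbf_affinity and the sample points are hypothetical; real implementations operate on arrays and are far more efficient.

```python
import math


def rbf_affinity(X, gamma=1.0):
    """Dense affinity matrix with entries exp(-gamma * d(a, b) ** 2)."""
    return [[math.exp(-gamma * math.dist(a, b) ** 2) for b in X] for a in X]


# Three illustrative 2D points.
A = rbf_affinity([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]], gamma=0.5)
for row in A:
    print([round(value, 4) for value in row])
```

The diagonal is 1 because every point is at distance 0 from itself, and entries shrink toward 0 as points move apart.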

Read more in the User Guide.

Parameters

n_clustersint, default=8

The dimension of the projection subspace.

eigen_solver{‘arpack’, ‘lobpcg’, ‘amg’}, default=None

The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities. If None, then 'arpack' is used.

n_componentsint, default=n_clusters

Number of eigenvectors to use for the spectral embedding

random_stateint, RandomState instance, default=None

A pseudo random number generator used for the initialization of the lobpcg eigenvectors decomposition when eigen_solver='amg' and by the K-Means initialization. Use an int to make the randomness deterministic. See Glossary.

n_initint, default=10

Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia. Only used if assign_labels='kmeans'.

gammafloat, default=1.0

Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity='nearest_neighbors'.

affinitystr or callable, default=’rbf’
How to construct the affinity matrix.
  • ‘nearest_neighbors’: construct the affinity matrix by computing a graph of nearest neighbors.

  • ‘rbf’: construct the affinity matrix using a radial basis function (RBF) kernel.

  • ‘precomputed’: interpret X as a precomputed affinity matrix, where larger values indicate greater similarity between instances.

  • ‘precomputed_nearest_neighbors’: interpret X as a sparse graph of precomputed distances, and construct a binary affinity matrix from the n_neighbors nearest neighbors of each instance.

  • one of the kernels supported by pairwise_kernels().

Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.

n_neighborsint, default=10

Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity='rbf'.

eigen_tolfloat, default=0.0

Stopping criterion for eigendecomposition of the Laplacian matrix when eigen_solver='arpack'.

assign_labels{‘kmeans’, ‘discretize’}, default=’kmeans’

The strategy for assigning labels in the embedding space. There are two ways to assign labels after the Laplacian embedding. k-means is a popular choice, but it can be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.

degreefloat, default=3

Degree of the polynomial kernel. Ignored by other kernels.

coef0float, default=1

Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.

kernel_paramsdict of str to any, default=None

Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.

n_jobsint, default=None

The number of parallel jobs to run when affinity=’nearest_neighbors’ or affinity=’precomputed_nearest_neighbors’. The neighbors search will be done in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

verbosebool, default=False

Verbosity mode.

Added in version 0.24.

persist_embeddingbool

Whether to persist the intermediate n_samples x n_components array used for clustering.

kmeans_paramsdictionary of string to any, optional

Keyword arguments for the KMeans clustering used for the final clustering.

Examples

>>> from dasf.ml.cluster import SpectralClustering
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [1, 0],
...               [4, 7], [3, 5], [3, 6]])
>>> clustering = SpectralClustering(n_clusters=2,
...         assign_labels='discretize',
...         random_state=0).fit(X)
>>> clustering
SpectralClustering(assign_labels='discretize', n_clusters=2,
    random_state=0)

For further information, see:
  • https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering
  • https://ml.dask.org/modules/generated/dask_ml.cluster.SpectralClustering.html

Notes

A distance matrix for which 0 indicates identical elements and high values indicate very dissimilar elements can be transformed into an affinity / similarity matrix that is well-suited for the algorithm by applying the Gaussian (aka RBF, heat) kernel:

np.exp(- dist_matrix ** 2 / (2. * delta ** 2))

where delta is a free parameter representing the width of the Gaussian kernel.
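
The transformation above can be sketched for a small precomputed distance matrix. The helper name distance_to_affinity and the matrix values are hypothetical. Comparing this formula with the kernel expression given earlier, delta corresponds to gamma = 1 / (2 * delta ** 2).

```python
import math


def distance_to_affinity(dist_matrix, delta):
    """Apply the Gaussian (heat) kernel element-wise to a distance matrix."""
    return [
        [math.exp(-(d ** 2) / (2.0 * delta ** 2)) for d in row]
        for row in dist_matrix
    ]


# Illustrative symmetric distance matrix with zeros on the diagonal.
dist_matrix = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
affinity = distance_to_affinity(dist_matrix, delta=2.0)
print(affinity[0])
```

The result could then be passed to fit with affinity='precomputed', since larger values now indicate greater similarity.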

An alternative is to take a symmetric version of the k-nearest neighbors connectivity matrix of the points.

If the pyamg package is installed, it is used: this greatly speeds up computation.

Constructor of the class SpectralClustering.

n_clusters
eigen_solver
random_state
n_init
gamma
affinity
n_neighbors
eigen_tol
assign_labels
degree
coef0
kernel_params
n_jobs
n_components
persist_embedding
kmeans_params
verbose
__sc_cpu
__sc_mcpu
_fit_cpu(X, y=None, sample_weight=None)[source]

Perform spectral clustering from features, or affinity matrix using CPU only.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, similarities / affinities between instances if affinity='precomputed', or distances between instances if affinity='precomputed_nearest_neighbors'. If a sparse matrix is provided in a format other than csr_matrix, csc_matrix, or coo_matrix, it will be converted into a sparse csr_matrix.

yIgnored

Not used, present here for API consistency by convention.

Returns

selfobject

A fitted instance of the estimator.

_lazy_fit_predict_cpu(X, y=None, sample_weight=None)[source]

Perform spectral clustering on X and return cluster labels using Dask with CPU only.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, similarities / affinities between instances if affinity='precomputed', or distances between instances if affinity='precomputed_nearest_neighbors'. If a sparse matrix is provided in a format other than csr_matrix, csc_matrix, or coo_matrix, it will be converted into a sparse csr_matrix.

yIgnored

Not used, present here for API consistency by convention.

Returns

labelsndarray of shape (n_samples,)

Cluster labels.

_fit_predict_cpu(X, y=None, sample_weight=None)[source]

Perform spectral clustering on X and return cluster labels using CPU only.

Parameters

X{array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)

Training instances to cluster, similarities / affinities between instances if affinity='precomputed', or distances between instances if affinity='precomputed_nearest_neighbors'. If a sparse matrix is provided in a format other than csr_matrix, csc_matrix, or coo_matrix, it will be converted into a sparse csr_matrix.

yIgnored

Not used, present here for API consistency by convention.

Returns

labelsndarray of shape (n_samples,)

Cluster labels.