dasf.ml.cluster
Init module for Clustering ML algorithms.
Submodules
Classes
- Agglomerative Clustering.
- Perform DBSCAN clustering from vector array or distance matrix.
- Perform HDBSCAN clustering from vector array or distance matrix.
- K-Means clustering.
- Initializes a Self-Organizing Map.
- Apply clustering to a projection of the normalized Laplacian.
Package Contents
- class dasf.ml.cluster.AgglomerativeClustering(n_clusters=2, metric='euclidean', connectivity=None, linkage='single', memory=None, compute_full_tree='auto', distance_threshold=None, compute_distances=False, handle=None, verbose=False, n_neighbors=10, output_type=None, **kwargs)[source]
Bases:
dasf.ml.cluster.classifier.ClusterClassifier
Agglomerative Clustering
Recursively merges the pair of clusters that minimally increases a given linkage distance.
Read more in the User Guide.
Parameters
- n_clusters : int or None, default=2
The number of clusters to find. It must be None if distance_threshold is not None.
- metric : str or callable, default=”euclidean”
Metric used to compute the linkage. Can be “euclidean”, “l1”, “l2”, “manhattan”, “cosine”, or “precomputed”. If linkage is “ward”, only “euclidean” is accepted. If “precomputed”, a distance matrix is needed as input for the fit method.
- memory : str or object with the joblib.Memory interface, default=None
Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.
- connectivity : array-like or callable, default=None
Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data. This can be a connectivity matrix itself or a callable that transforms the data into a connectivity matrix, such as one derived from kneighbors_graph. Default is None, i.e., the hierarchical clustering algorithm is unstructured.
- compute_full_tree : ‘auto’ or bool, default=’auto’
Stop the construction of the tree early at n_clusters. This is useful to decrease computation time if the number of clusters is not small compared to the number of samples. This option is useful only when specifying a connectivity matrix. Note also that when varying the number of clusters and using caching, it may be advantageous to compute the full tree. It must be True if distance_threshold is not None. By default compute_full_tree is “auto”, which is equivalent to True when distance_threshold is not None or when n_clusters is less than the maximum of 100 and 0.02 * n_samples. Otherwise, “auto” is equivalent to False.
- linkage : {‘ward’, ‘complete’, ‘average’, ‘single’}, default=’single’
Which linkage criterion to use. The linkage criterion determines which distance to use between sets of observations. The algorithm will merge the pairs of clusters that minimize this criterion.
‘ward’ minimizes the variance of the clusters being merged.
‘average’ uses the average of the distances of each observation of the two sets.
‘complete’ or ‘maximum’ linkage uses the maximum distances between all observations of the two sets.
‘single’ uses the minimum of the distances between all observations of the two sets.
Added in version 0.20: Added the ‘single’ option
- distance_threshold : float, default=None
The linkage distance threshold above which clusters will not be merged. If not None, n_clusters must be None and compute_full_tree must be True.
Added in version 0.21.
- compute_distances : bool, default=False
Computes distances between clusters even if distance_threshold is not used. This can be used to make dendrogram visualization, but introduces a computational and memory overhead.
Added in version 0.24.
- n_neighbors : int, default=10
The number of neighbors to compute when connectivity = “knn”.
- output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
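The interplay between n_clusters, distance_threshold and compute_full_tree described above can be summarized in a few lines. The following is only a sketch of the documented rules: `resolve_full_tree` is a hypothetical helper, not part of dasf, and the "exactly one must be set" check is an assumption that mirrors scikit-learn's behaviour.

```python
def resolve_full_tree(n_clusters, distance_threshold, compute_full_tree, n_samples):
    """Return the effective compute_full_tree flag (hypothetical helper)."""
    # Assumption: exactly one of the two must be set, as scikit-learn requires.
    if (n_clusters is None) == (distance_threshold is None):
        raise ValueError("set exactly one of n_clusters and distance_threshold")
    if compute_full_tree == "auto":
        if distance_threshold is not None:
            return True
        # "auto" builds the full tree only when n_clusters is small.
        return n_clusters < max(100, 0.02 * n_samples)
    if distance_threshold is not None and not compute_full_tree:
        raise ValueError("compute_full_tree must be True with distance_threshold")
    return bool(compute_full_tree)


print(resolve_full_tree(2, None, "auto", 1000))    # True: 2 < max(100, 20)
print(resolve_full_tree(150, None, "auto", 1000))  # False: 150 >= 100
```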
Examples
>>> from dasf.ml.cluster import AgglomerativeClustering
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clustering = AgglomerativeClustering().fit(X)
>>> clustering
AgglomerativeClustering()
For further information see:
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
- https://docs.rapids.ai/api/cuml/stable/api.html#agglomerative-clustering
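To make the linkage criteria concrete, here is a naive pure-Python sketch of bottom-up merging with the ‘single’, ‘complete’ and ‘average’ rules (‘ward’ is left out). It is an illustration of the technique only, not the scikit-learn or cuML code that the class delegates to; `linkage_distance` and `agglomerate` are hypothetical names.

```python
import math
from itertools import combinations


def linkage_distance(a, b, linkage="single"):
    """Distance between two clusters (lists of points) under a linkage rule."""
    dists = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(dists)
    if linkage == "complete":
        return max(dists)
    if linkage == "average":
        return sum(dists) / len(dists)
    raise ValueError(f"unsupported linkage: {linkage!r}")


def agglomerate(points, n_clusters=2, linkage="single"):
    """Naive bottom-up clustering; returns one label per input point."""
    clusters = [[i] for i in range(len(points))]  # each cluster holds point indices
    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(
                [points[k] for k in clusters[ij[0]]],
                [points[k] for k in clusters[ij[1]]],
                linkage,
            ),
        )
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for k in members:
            labels[k] = label
    return labels


X = [(1, 2), (1, 4), (1, 0), (4, 2), (4, 4), (4, 0)]
labels = agglomerate(X, n_clusters=2, linkage="single")
print(labels)
```

The sketch recomputes every pairwise linkage distance at each step, so it is only suitable for tiny inputs; production implementations maintain the merge tree incrementally.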
Constructor of the class AgglomerativeClustering.
- n_clusters
- metric
- connectivity
- linkage
- memory
- compute_full_tree
- distance_threshold
- compute_distances
- handle
- verbose
- n_neighbors
- output_type
- __agg_cluster_cpu
- _fit_cpu(X, y=None, convert_dtype=True)[source]
Fit without validation using CPU only.
Parameters
- X : ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'.
Returns
- self : object
Returns the fitted instance.
- _fit_gpu(X, y=None, convert_dtype=True)[source]
Fit without validation using GPU only.
Parameters
- X : ndarray of shape (n_samples, n_features) or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'.
Returns
- self : object
Returns the fitted instance.
- _fit_predict_cpu(X, y=None)[source]
Fit and return the result of each sample’s clustering assignment using CPU only.
In addition to fitting, this method also returns the result of the clustering assignment for each sample in the training set.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'.
- y : Ignored
Not used, present here for API consistency by convention.
Returns
- labels : ndarray of shape (n_samples,)
Cluster labels.
- _fit_predict_gpu(X, y=None)[source]
Fit and return the result of each sample’s clustering assignment using GPU only.
In addition to fitting, this method also returns the result of the clustering assignment for each sample in the training set.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'.
- y : Ignored
Not used, present here for API consistency by convention.
Returns
- labels : ndarray of shape (n_samples,)
Cluster labels.
- class dasf.ml.cluster.DBSCAN(eps=0.5, leaf_size=40, metric='euclidean', min_samples=5, p=None, output_type=None, calc_core_sample_indices=True, verbose=False, **kwargs)[source]
Bases:
dasf.ml.cluster.classifier.ClusterClassifier
Perform DBSCAN clustering from vector array or distance matrix.
DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
Read more in the User Guide.
Parameters
- eps : float, default=0.5
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function.
- min_samples : int, default=5
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
- metric : string or callable, default=’euclidean’
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by sklearn.metrics.pairwise_distances() for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors for DBSCAN.
Added in version 0.17: metric precomputed to accept precomputed sparse matrix.
- metric_params : dict, default=None
Additional keyword arguments for the metric function.
Added in version 0.19.
- algorithm : {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’
The algorithm to be used by the NearestNeighbors module to compute pointwise distances and find nearest neighbors. See NearestNeighbors module documentation for details.
- leaf_size : int, default=40
Leaf size passed to BallTree or cKDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.
- p : float, default=None
The power of the Minkowski metric to be used to calculate distance between points. If None, then p=2 (equivalent to the Euclidean distance).
- n_jobs : int, default=None
The number of parallel jobs to run. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
- calc_core_sample_indices : bool, optional, default=True
Indicates whether the indices of the core samples should be calculated. If the attribute core_sample_indices_ will not be used, setting this to False avoids unnecessary kernel launches.
Examples
>>> from dasf.ml.cluster import DBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = DBSCAN(eps=3, min_samples=2).fit(X)
>>> clustering
DBSCAN(eps=3, min_samples=2)
For further information see:
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
- https://docs.rapids.ai/api/cuml/stable/api.html#dbscan
- https://docs.rapids.ai/api/cuml/stable/api.html#dbscan-clustering
See Also
- OPTICS : A similar clustering at multiple values of eps. Our implementation is optimized for memory usage.
References
Ester, M., H. P. Kriegel, J. Sander, and X. Xu, “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226-231. 1996
Schubert, E., Sander, J., Ester, M., Kriegel, H. P., & Xu, X. (2017). DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3), 19.
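The roles of eps and min_samples are easiest to see in a naive implementation. The sketch below is a pure-Python O(n^2) illustration of the algorithm, assuming Euclidean distance; it is not the scikit-learn or cuML code used by this class, and `dbscan` is a hypothetical function name. It uses the same data as the example above.

```python
import math


def dbscan(points, eps=0.5, min_samples=5):
    """Naive O(n^2) DBSCAN; returns labels, with -1 marking noise."""
    n = len(points)
    # eps-neighborhoods include the point itself, as min_samples counts it.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unlabelled core point.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels


X = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
labels = dbscan(X, eps=3, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

The isolated point (25, 80) has no neighbor within eps, so it is never reached from a core point and keeps the noise label -1.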
Constructor of the class DBSCAN.
- eps
- leaf_size
- metric
- min_samples
- p
- output_type
- calc_core_sample_indices
- verbose
- __dbscan_cpu
- _lazy_fit_gpu(X, y=None, out_dtype='int32')[source]
Perform DBSCAN clustering from features, or distance matrix using Dask with GPUs only (from CuML).
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
Returns
- self : object
Returns a fitted instance of self.
- _fit_cpu(X, y=None, sample_weight=None)[source]
Perform DBSCAN clustering from features, or distance matrix using CPU only.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
Returns
- self : object
Returns a fitted instance of self.
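The effect of sample_weight on the core-sample test can be sketched as follows: a sample is core when the total weight inside its eps-neighborhood, itself included, reaches min_samples. This is an illustration under that assumption, using Euclidean distance; `weighted_core_mask` is a hypothetical helper, not a dasf function.

```python
import math


def weighted_core_mask(points, weights, eps, min_samples):
    """Mark samples whose eps-neighborhood carries total weight >= min_samples."""
    mask = []
    for p in points:
        total = sum(w for q, w in zip(points, weights) if math.dist(p, q) <= eps)
        mask.append(total >= min_samples)
    return mask


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
# The isolated third sample has weight 3, so it is a core sample by itself.
heavy = weighted_core_mask(points, [1, 1, 3], eps=0.5, min_samples=3)
# The first sample's own weight (2) would suffice, but its neighbor's
# negative weight pulls the neighborhood total down to 1.
inhibited = weighted_core_mask(points, [2, -1, 3], eps=0.5, min_samples=2)
print(heavy, inhibited)
```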
- _fit_gpu(X, y=None, out_dtype='int32')[source]
Perform DBSCAN clustering from features, or distance matrix using GPU only (from CuML).
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
Returns
- self : object
Returns a fitted instance of self.
- _lazy_fit_predict_gpu(X, y=None, out_dtype='int32')[source]
Compute clusters from a data or distance matrix and predict labels using Dask and GPUs (from CuML).
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
Returns
- labels : ndarray of shape (n_samples,)
Cluster labels. Noisy samples are given the label -1.
- _fit_predict_cpu(X, y=None, sample_weight=None)[source]
Compute clusters from a data or distance matrix and predict labels using CPU only.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
Returns
- labels : ndarray of shape (n_samples,)
Cluster labels. Noisy samples are given the label -1.
- _fit_predict_gpu(X, y=None, out_dtype='int32')[source]
Compute clusters from a data or distance matrix and predict labels using GPU only (from CuML).
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or (n_samples, n_samples)
Training instances to cluster, or distances between instances if metric='precomputed'. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
Weight of each sample, such that a sample with a weight of at least min_samples is by itself a core sample; a sample with a negative weight may inhibit its eps-neighbor from being core. Note that weights are absolute, and default to 1.
Returns
- labels : ndarray of shape (n_samples,)
Cluster labels. Noisy samples are given the label -1.
- class dasf.ml.cluster.HDBSCAN(alpha=1.0, gen_min_span_tree=False, leaf_size=40, metric='euclidean', min_cluster_size=5, min_samples=None, p=None, algorithm='auto', approx_min_span_tree=True, core_dist_n_jobs=4, cluster_selection_method='eom', allow_single_cluster=False, prediction_data=False, match_reference_implementation=False, connectivity='knn', output_type=None, verbose=0, **kwargs)[source]
Bases:
dasf.ml.cluster.classifier.ClusterClassifier
Perform HDBSCAN clustering from vector array or distance matrix.
HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.
Parameters
- min_cluster_size : int, optional (default=5)
The minimum size of clusters; single linkage splits that contain fewer points than this will be considered points “falling out” of a cluster rather than a cluster splitting into two new clusters.
- min_samples : int, optional (default=None)
The number of samples in a neighbourhood for a point to be considered a core point.
- metric : string or callable, optional (default=’euclidean’)
The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is “precomputed”, X is assumed to be a distance matrix and must be square.
- p : int, optional (default=None)
p value to use if using the minkowski metric.
- alpha : float, optional (default=1.0)
A distance scaling parameter as used in robust single linkage. See [3] for more information.
- cluster_selection_epsilon : float, optional (default=0.0)
A distance threshold. Clusters below this value will be merged. See [5] for more information.
- algorithm : string, optional (default=’best’)
Exactly which algorithm to use; hdbscan has variants specialised for different characteristics of the data. By default this is set to best, which chooses the “best” algorithm given the nature of the data. You can force other options if you believe you know better. Options are:
best
generic
prims_kdtree
prims_balltree
boruvka_kdtree
boruvka_balltree
- leaf_size : int, optional (default=40)
If using a space tree algorithm (kdtree or balltree), the number of points in a leaf node of the tree. This does not alter the resulting clustering, but may have an effect on the runtime of the algorithm.
- memory : instance of joblib.Memory or string, optional
Used to cache the output of the computation of the tree. By default, no caching is done. If a string is given, it is the path to the caching directory.
- approx_min_span_tree : bool, optional (default=True)
Whether to accept an only approximate minimum spanning tree. For some algorithms this can provide a significant speedup, but the resulting clustering may be of marginally lower quality. If you are willing to sacrifice speed for correctness you may want to explore this; in general this should be left at the default True.
- gen_min_span_tree : bool, optional (default=False)
Whether to generate the minimum spanning tree with regard to mutual reachability distance for later analysis.
- core_dist_n_jobs : int, optional (default=4)
Number of parallel jobs to run in core distance computations (if supported by the specific algorithm). For core_dist_n_jobs below -1, (n_cpus + 1 + core_dist_n_jobs) are used.
- cluster_selection_method : string, optional (default=’eom’)
The method used to select clusters from the condensed tree. The standard approach for HDBSCAN* is to use an Excess of Mass algorithm to find the most persistent clusters. Alternatively you can instead select the clusters at the leaves of the tree; this provides the most fine grained and homogeneous clusters. Options are:
eom
leaf
- allow_single_cluster : bool, optional (default=False)
By default HDBSCAN* will not produce a single cluster. Setting this to True will override this and allow single cluster results in the case that you feel this is a valid result for your dataset.
- prediction_data : bool, optional (default=False)
Whether to generate extra cached data for predicting labels or membership vectors for new unseen points later. If you wish to persist the clustering object for later re-use you probably want to set this to True.
- match_reference_implementation : bool, optional (default=False)
There exist some interpretational differences between this HDBSCAN* implementation and the original authors’ reference implementation in Java. This can result in very minor differences in clustering results. Setting this flag to True will, at some performance cost, ensure that the clustering results match the reference implementation.
- connectivity : {‘pairwise’, ‘knn’}, default=’knn’
The type of connectivity matrix to compute.
‘pairwise’ will compute the entire fully-connected graph of pairwise distances between each set of points. This is the fastest to compute and can be very fast for smaller datasets but requires O(n^2) space.
‘knn’ will sparsify the fully-connected connectivity matrix to save memory and enable much larger inputs. “n_neighbors” will control the amount of memory used and the graph will be connected automatically in the event “n_neighbors” was not large enough to connect it.
- output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
Examples
>>> from dasf.ml.cluster import HDBSCAN
>>> import numpy as np
>>> X = np.array([[1, 2], [2, 2], [2, 3],
...               [8, 7], [8, 8], [25, 80]])
>>> clustering = HDBSCAN(min_cluster_size=30, min_samples=2).fit(X)
>>> clustering
HDBSCAN(min_cluster_size=30, min_samples=2)
For further information see:
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
- https://docs.rapids.ai/api/cuml/stable/api.html#dbscan
- https://docs.rapids.ai/api/cuml/stable/api.html#dbscan-clustering
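The mutual reachability distance mentioned under gen_min_span_tree is the maximum of the two points' core distances and their actual distance. The sketch below illustrates that definition in pure Python with Euclidean distance. It assumes that a point counts as its own first neighbor when taking the min_samples-th nearest neighbor (implementations differ on this convention), and the function names are hypothetical.

```python
import math


def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest neighbor.

    Assumption: the point itself is counted as the first neighbor.
    """
    out = []
    for p in points:
        dists = sorted(math.dist(p, q) for q in points)  # dists[0] is 0.0 (self)
        out.append(dists[min_samples - 1])
    return out


def mutual_reachability(points, min_samples):
    """Dense matrix of mutual reachability distances."""
    core = core_distances(points, min_samples)
    n = len(points)
    return [
        [
            0.0 if i == j else max(core[i], core[j], math.dist(points[i], points[j]))
            for j in range(n)
        ]
        for i in range(n)
    ]


X = [(0, 0), (0, 1), (0, 3), (10, 0)]
core = core_distances(X, min_samples=2)
mreach = mutual_reachability(X, min_samples=2)
print(core)          # [1.0, 1.0, 2.0, 10.0]
print(mreach[0][3])  # 10.0: dominated by the outlier's core distance
```

A dense matrix like this needs O(n^2) space, which is the trade-off the ‘pairwise’ connectivity option describes; the ‘knn’ option keeps only a sparse neighbor graph instead.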
References
Constructor of the class HDBSCAN.
- alpha
- gen_min_span_tree
- leaf_size
- metric
- min_cluster_size
- min_samples
- p
- algorithm
- approx_min_span_tree
- core_dist_n_jobs
- cluster_selection_method
- allow_single_cluster
- prediction_data
- match_reference_implementation
- connectivity
- output_type
- verbose
- __hdbscan_cpu
- _fit_cpu(X, y=None)[source]
Perform HDBSCAN clustering from features or distance matrix using CPU only.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
Returns
- self : object
Fitted estimator.
- _fit_gpu(X, y=None, convert_dtype=True)[source]
Perform HDBSCAN clustering from features or distance matrix using GPU only (from CuML).
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
Returns
- self : object
Fitted estimator.
- _fit_predict_cpu(X, y=None)[source]
Performs clustering on X and returns cluster labels using only CPU.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
Returns
- y : ndarray of shape (n_samples,)
Cluster labels.
- _fit_predict_gpu(X, y=None)[source]
Performs clustering on X and returns cluster labels using only GPU (from CuML).
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features), or array-like of shape (n_samples, n_samples)
A feature array, or array of distances between samples if metric='precomputed'.
Returns
- y : ndarray of shape (n_samples,)
Cluster labels.
- class dasf.ml.cluster.KMeans(n_clusters=8, init=None, n_init=None, max_iter=300, tol=0.0001, verbose=0, random_state=None, copy_x=True, algorithm='lloyd', oversampling_factor=2.0, n_jobs=1, init_max_iter=None, max_samples_per_batch=32768, precompute_distances='auto', output_type=None, **kwargs)[source]
Bases:
dasf.ml.cluster.classifier.ClusterClassifier
K-Means clustering.
Read more in the User Guide.
Parameters
- n_clusters : int, default=8
The number of clusters to form as well as the number of centroids to generate.
- init : {‘k-means++’, ‘random’}, callable or array-like of shape (n_clusters, n_features), default=’k-means++’
Method for initialization:
‘k-means++’: selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.
‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.
If an array is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.
If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.
- n_init : int, default=10
Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
- max_iter : int, default=300
Maximum number of iterations of the k-means algorithm for a single run.
- tol : float, default=1e-4
Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.
- precompute_distances : {‘auto’, True, False}, default=’auto’
Precompute distances (faster but takes more memory).
‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision. IMPORTANT: This is used only in Dask ML version.
True : always precompute distances.
False : never precompute distances.
- verbose : int, default=0
Verbosity mode.
- random_state : int, RandomState instance or None, default=None
Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
- copy_x : bool, default=True
When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False. If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.
- n_jobs : int, default=1
The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main cython loop which assigns each sample to its closest center. IMPORTANT: This is used only in Dask ML version. None or -1 means using all processors.
- init_max_iter : int, default=None
Number of iterations for init step.
- algorithm : {“lloyd”, “elkan”}, default=”lloyd”
K-means algorithm to use. The classical EM-style algorithm is “lloyd”. The “elkan” variation can be more efficient on some datasets with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).
Changed in version 0.18: Added Elkan algorithm
- oversampling_factor : int, default=2
The number of points to sample in scalable k-means++ initialization for potential centroids. Increasing this value can lead to better initial centroids at the cost of memory. The total number of centroids sampled in scalable k-means++ is oversampling_factor * n_clusters * 8.
- max_samples_per_batch : int, default=32768
The number of data samples to use for batches of the pairwise distance computation. This computation is done during both fit and predict. The default should suit most cases. The total number of elements in the batched pairwise distance computation is max_samples_per_batch * n_clusters. It might become necessary to lower this number when n_clusters becomes prohibitively large.
- output_type : {‘input’, ‘cudf’, ‘cupy’, ‘numpy’, ‘numba’}, default=None
Variable to control output type of the results and attributes of the estimator. If None, it’ll inherit the output type set at the module level, cuml.global_settings.output_type. See Output Data Type Configuration for more info.
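Several of the parameters above come with size formulas. As a quick way to evaluate them for a given problem, the sketch below simply restates those formulas in code; `kmeans_sizing` is a hypothetical helper and does not call into dasf, Dask ML or cuML.

```python
def kmeans_sizing(n_samples, n_clusters, oversampling_factor=2.0,
                  max_samples_per_batch=32768):
    """Evaluate the size formulas quoted in the parameter descriptions."""
    return {
        # 'auto' skips precomputation above 12 million sample-cluster pairs.
        "precompute_distances": n_samples * n_clusters <= 12_000_000,
        # Candidate centroids drawn by scalable k-means++.
        "sampled_centroids": int(oversampling_factor * n_clusters * 8),
        # Elements in one batched pairwise-distance computation.
        "batch_elements": max_samples_per_batch * n_clusters,
    }


sizing = kmeans_sizing(n_samples=1_000_000, n_clusters=8)
print(sizing)
```

With one million samples and 8 clusters, the 'auto' rule still precomputes distances (8 million pairs), 128 candidate centroids are sampled, and each batch holds 262144 distance elements.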
See Also
- MiniBatchKMeans : Alternative online implementation that does incremental updates of the center positions using mini-batches. For large scale learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than the default batch implementation.
Notes
The k-means problem is solved using either Lloyd’s or Elkan’s algorithm.
The average complexity is given by O(k n T), where k is the number of clusters, n is the number of samples and T is the number of iterations.
The worst case complexity is given by O(n^(k+2/p)) with n = n_samples, p = n_features. (D. Arthur and S. Vassilvitskii, ‘How slow is the k-means method?’ SoCG2006)
In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it can fall into local minima. That’s why it can be useful to restart it several times.
If the algorithm stops before fully converging (because of tol or max_iter), labels_ and cluster_centers_ will not be consistent, i.e. the cluster_centers_ will not be the means of the points in each cluster. Also, the estimator will reassign labels_ after the last iteration to make labels_ consistent with predict on the training set.
Examples
>>> from dasf.ml.cluster import KMeans
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [10, 2], [10, 4], [10, 0]])
>>> kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
>>> kmeans.predict([[0, 0], [12, 3]])
array([1, 0], dtype=int32)
For further information see:
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
- https://ml.dask.org/modules/generated/dask_ml.cluster.KMeans.html
- https://docs.rapids.ai/api/cuml/stable/api.html#k-means-clustering
- https://docs.rapids.ai/api/cuml/stable/api.html#cuml.dask.cluster.KMeans
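The Notes above mention local minima and restarts. The following pure-Python sketch of Lloyd's algorithm shows both, on the same data as the example. It is an illustration only, not the scikit-learn, Dask ML or cuML code behind this class: the function names are hypothetical, and tol is applied here as an absolute threshold on the Frobenius norm of the center shift, which is a simplification.

```python
import math
import random


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def lloyd(points, init, max_iter=300, tol=1e-4):
    """One run of Lloyd's algorithm from explicit initial centers."""
    centers = [tuple(map(float, c)) for c in init]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous center.
            new_centers.append(
                tuple(sum(x) / len(members) for x in zip(*members)) if members else old
            )
        # Frobenius norm of the change in centers between two iterations.
        shift = math.sqrt(sum(math.dist(a, b) ** 2 for a, b in zip(centers, new_centers)))
        centers = new_centers
        if shift <= tol:
            break
    labels = assign(points, centers)  # keep labels consistent with final centers
    inertia = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return centers, labels, inertia


def kmeans(points, n_clusters, n_init=10, seed=0):
    """Restart Lloyd's algorithm n_init times and keep the lowest-inertia run."""
    rng = random.Random(seed)
    runs = [lloyd(points, rng.sample(points, n_clusters)) for _ in range(n_init)]
    return min(runs, key=lambda run: run[2])


X = [(1, 2), (1, 4), (1, 0), (10, 2), (10, 4), (10, 0)]
good = lloyd(X, init=[(1, 4), (10, 0)])  # converges to the left/right split
poor = lloyd(X, init=[(1, 4), (1, 0)])   # stalls in a worse local minimum
best = kmeans(X, n_clusters=2)
print(good[2], poor[2], best[2])
```

Starting from two points on the same side, the run settles on a top/bottom split with much higher inertia than the left/right split, which is why several restarts with different seeds are worthwhile.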
Constructor of the class KMeans.
- n_clusters
- random_state
- max_iter
- init
- n_init
- tol
- verbose
- copy_x
- algorithm
- oversampling_factor
- n_jobs
- init_max_iter
- max_samples_per_batch
- precompute_distances
- output_type
- __kmeans_cpu
- __kmeans_mcpu
- _lazy_fit_cpu(X, y=None, sample_weight=None)[source]
Compute Dask k-means clustering.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : Ignored
Not used, present here for API consistency by convention.
Returns
- self : object
Fitted estimator.
- _lazy_fit_gpu(X, y=None, sample_weight=None)[source]
Compute Dask CuML k-means clustering.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- self : object
Fitted estimator.
- _fit_cpu(X, y=None, sample_weight=None)[source]
Compute Scikit Learn k-means clustering.
Parameters
- X : {array-like, sparse matrix} of shape (n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.
- y : Ignored
Not used, present here for API consistency by convention.
- sample_weight : array-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- self : object
Fitted estimator.
- _fit_gpu(X, y=None, sample_weight=None)[source]
Compute CuML k-means clustering.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
Training instances to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous. If a sparse matrix is passed, a copy will be made if it's not in CSR format.
- yIgnored
Not used, present here for API consistency by convention.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- selfobject
Fitted estimator.
- _lazy_fit_predict_cpu(X, y=None, sample_weight=None)[source]
Compute cluster centers and predict cluster index for each sample using Dask ML.
Convenience method; equivalent to calling fit(X) followed by predict(X).
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to transform.
- yIgnored
Not used, present here for API consistency by convention.
- sample_weightIgnored
Not used, present here for API consistency by convention.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
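To illustrate what "equivalent to calling fit(X) followed by predict(X)" means, here is a small self-contained sketch of Lloyd's k-means in NumPy. It is not the dasf, Scikit Learn, or Dask ML implementation: the function names kmeans_fit, kmeans_predict and kmeans_fit_predict are hypothetical, and the initialization (first n_clusters samples) is a simplifying assumption chosen to keep the example deterministic.

```python
import numpy as np

def kmeans_predict(X, centers):
    # Label each sample with the index of its nearest center.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans_fit(X, n_clusters, max_iter=100, tol=1e-4):
    # Assumption: naive initialization from the first n_clusters samples.
    centers = X[:n_clusters].astype(float).copy()
    for _ in range(max_iter):
        labels = kmeans_predict(X, centers)
        # Move each center to the mean of its members (keep it if empty).
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(n_clusters)
        ])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift <= tol:
            break
    return centers

def kmeans_fit_predict(X, n_clusters):
    # The convenience form: fit, then predict on the same data.
    return kmeans_predict(X, kmeans_fit(X, n_clusters))

X = np.array([[1, 1], [2, 1], [1, 0], [4, 7], [3, 5], [3, 6]])
centers = kmeans_fit(X, n_clusters=2)
labels_two_step = kmeans_predict(X, centers)
labels_one_step = kmeans_fit_predict(X, n_clusters=2)
print(labels_two_step, labels_one_step)
```

Both paths yield the same labels because the one-step form simply chains the two calls.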
- _lazy_fit_predict_gpu(X, y=None, sample_weight=None)[source]
Compute cluster centers and predict cluster index for each sample using Dask CuML.
Convenience method; equivalent to calling fit(X) followed by predict(X).
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to transform.
- yIgnored
Not used, present here for API consistency by convention.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _fit_predict_cpu(X, y=None, sample_weight=None)[source]
Compute cluster centers and predict cluster index for each sample using Scikit Learn.
Convenience method; equivalent to calling fit(X) followed by predict(X).
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to transform.
- yIgnored
Not used, present here for API consistency by convention.
- sample_weightIgnored
Not used, present here for API consistency by convention.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _fit_predict_gpu(X, y=None, sample_weight=None)[source]
Compute cluster centers and predict cluster index for each sample using CuML.
Convenience method; equivalent to calling fit(X) followed by predict(X).
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to transform.
- yIgnored
Not used, present here for API consistency by convention.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _lazy_predict_cpu(X, sample_weight=None)[source]
Predict the closest cluster each sample in X belongs to using Dask ML.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to predict.
- sample_weightIgnored
Not used, present here for API consistency by convention.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
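The code book terminology can be made concrete with a short NumPy sketch. The values below are invented for illustration; code_book stands in for a fitted cluster_centers_ array, and the lookup shown is the generic nearest-code rule rather than the code used by any of the backends.

```python
import numpy as np

# Stand-in for cluster_centers_: one row per code.
code_book = np.array([[0.0, 0.0], [10.0, 10.0]])
X = np.array([[1.0, 2.0], [9.0, 8.0], [0.5, 0.5]])

# Squared Euclidean distance from every sample to every code.
sq_dist = ((X[:, None, :] - code_book[None, :, :]) ** 2).sum(axis=2)

# predict-style output: index of the closest code for each sample.
labels = sq_dist.argmin(axis=1)

# Vector quantization: replace each sample by its code.
quantized = code_book[labels]
print(labels)
print(quantized)
```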
- _lazy_predict_gpu(X, sample_weight=None)[source]
Predict the closest cluster each sample in X belongs to using Dask CuML.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to predict.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _predict_cpu(X, sample_weight=None)[source]
Predict the closest cluster each sample in X belongs to using Scikit Learn.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to predict.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _predict_gpu(X, sample_weight=None)[source]
Predict the closest cluster each sample in X belongs to using CuML.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to predict.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _lazy_predict2_cpu(X, sample_weight=None)[source]
A block predict using Scikit Learn variant but for Dask.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to predict.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _lazy_predict2_gpu(X, sample_weight=None)[source]
A block predict using CuML variant but for Dask.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to predict.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _predict2_cpu(X, sample_weight=None, compat=True)[source]
A block predict using Scikit Learn variant as a placeholder.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to predict.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
- compatbool, default=True
predict2 has no single CPU/GPU implementation. If compat is True, the original predict method is used instead; otherwise, an exception is raised.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- _predict2_gpu(X, sample_weight=None, compat=True)[source]
A block predict using CuML variant as a placeholder.
In the vector quantization literature, cluster_centers_ is called the code book and each value returned by predict is the index of the closest code in the code book.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features)
New data to predict.
- sample_weightarray-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.
- compatbool, default=True
predict2 has no single CPU/GPU implementation. If compat is True, the original predict method is used instead; otherwise, an exception is raised.
Returns
- labelsndarray of shape (n_samples,)
Index of the cluster each sample belongs to.
- class dasf.ml.cluster.SOM(x, y, input_len, num_epochs=100, sigma=0, sigmaN=1, learning_rate=0.5, learning_rateN=0.01, decay_function='exponential', neighborhood_function='gaussian', std_coeff=0.5, topology='rectangular', activation_distance='euclidean', random_seed=None, n_parallel=0, compact_support=False, **kwargs)[source]
Bases:
dasf.ml.cluster.classifier.ClusterClassifier
Initializes a Self-Organizing Map (SOM).
A rule of thumb to set the size of the grid for a dimensionality reduction task is that it should contain 5*sqrt(N) neurons where N is the number of samples in the dataset to analyze.
E.g. if your dataset has 150 samples, 5*sqrt(150) ≈ 61.24, hence an 8-by-8 map (64 neurons) should perform well.
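The rule of thumb can be written as a small helper. This is only a sketch of the arithmetic described above; the function name som_grid_side is hypothetical, and choosing a square grid rounded up to the next integer side is an assumption made for the example.

```python
import math

def som_grid_side(n_samples):
    # Target neuron count from the 5*sqrt(N) rule of thumb.
    n_neurons = 5 * math.sqrt(n_samples)
    # Smallest square grid that holds at least that many neurons.
    return math.ceil(math.sqrt(n_neurons))

side = som_grid_side(150)
print(side, side * side)
```

The resulting side can then be passed as both x and y when constructing the SOM.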
Parameters
- xint
x dimension of the SOM.
- yint
y dimension of the SOM.
- input_lenint
Number of the elements of the vectors in input.
- sigmafloat, default=min(x,y)/2
Spread of the neighborhood function, needs to be adequate to the dimensions of the map.
- sigmaNfloat, default=1
Spread of the neighborhood function at last iteration.
- learning_ratefloat, default=0.5
Initial learning rate.
- learning_rateNfloat, default=0.01
Final learning rate.
- decay_functionstring, default=’exponential’
Function that reduces learning_rate and sigma at each iteration. Possible values: ‘exponential’, ‘linear’, ‘asymptotic’
- neighborhood_functionstring, default=’gaussian’
Function that weights the neighborhood of a position in the map. Possible values: ‘gaussian’, ‘mexican_hat’, ‘bubble’, ‘triangle’
- topologystring, default=’rectangular’
Topology of the map. Possible values: ‘rectangular’, ‘hexagonal’
- activation_distancestring, default=’euclidean’
Distance used to activate the map. Possible values: ‘euclidean’, ‘cosine’, ‘manhattan’
- random_seedint, default=None
Random seed to use.
- n_paralleluint, default=#max_CUDA_threads or 500*#CPUcores
Number of samples to be processed at a time. Setting a too low value may drastically lower performance due to under-utilization, setting a too high value increases memory usage without granting any significant performance benefit.
- xpnumpy or cupy, default=cupy if can be imported else numpy
Use numpy (CPU) or cupy (GPU) for computations.
- std_coeff: float, default=0.5
Used to calculate the Gaussian exponent denominator: d = 2*std_coeff**2*sigma**2
- compact_support: bool, default=False
Set the neighborhood function to 0 beyond the neighborhood radius sigma.
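The roles of sigma, std_coeff and compact_support can be sketched as follows. This is an illustrative NumPy version built from the parameter descriptions above, not the library's code: the function name gaussian_neighborhood is hypothetical, grid_dist is assumed to be the distance on the map between a neuron and the winning neuron, and treating a distance exactly equal to sigma as inside the support is an assumption.

```python
import numpy as np

def gaussian_neighborhood(grid_dist, sigma, std_coeff=0.5, compact_support=False):
    # Denominator of the Gaussian exponent, as given for std_coeff.
    d = 2 * std_coeff**2 * sigma**2
    h = np.exp(-grid_dist**2 / d)
    if compact_support:
        # Zero the weight for neurons farther than sigma from the winner.
        h = np.where(grid_dist > sigma, 0.0, h)
    return h

dist = np.array([0.0, 1.0, 2.0, 3.0])
h_full = gaussian_neighborhood(dist, sigma=2.0)
h_cut = gaussian_neighborhood(dist, sigma=2.0, compact_support=True)
print(h_full)
print(h_cut)
```

With compact_support enabled, only neurons within the radius receive an update, which keeps the neighborhood strictly local.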
Examples
>>> from dasf.ml.cluster import SOM
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [1, 0],
...               [4, 7], [3, 5], [3, 6]])
>>> som = SOM(x=3, y=2, input_len=2,
...           num_epochs=100).fit(X)
>>> som
SOM(x=3, y=2, input_len=2, num_epochs=100)
Constructor of the class SOM.
- x
- y
- input_len
- num_epochs
- sigma
- sigmaN
- learning_rate
- learning_rateN
- decay_function
- neighborhood_function
- std_coeff
- topology
- activation_distance
- random_seed
- n_parallel
- compact_support
- __som_cpu
- __som_mcpu
- _lazy_fit_cpu(X, y=None, sample_weight=None)[source]
Fit SOM method using Dask with CPUs only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.
Returns
- selfobject
Returns a fitted instance of self.
- _lazy_fit_gpu(X, y=None, sample_weight=None)[source]
Fit SOM method using Dask with GPUs only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.
Returns
- selfobject
Returns a fitted instance of self.
- _fit_cpu(X, y=None, sample_weight=None)[source]
Fit SOM method using CPU only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.
Returns
- selfobject
Returns a fitted instance of self.
- _fit_gpu(X, y=None, sample_weight=None)[source]
Fit SOM method using GPU only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.
Returns
- selfobject
Returns a fitted instance of self.
- _lazy_fit_predict_cpu(X, y=None, sample_weight=None)[source]
Fit SOM and select the winner neurons for the input using Dask with CPUs only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- y{array-like, sparse matrix} of shape (n_samples).
This is just a placeholder to keep the compatibility with other fit_predict methods. SOM does not use labels to verify the input.
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit_predict methods. This is not used by SOM.
Returns
- selfobject
Returns a fitted instance of self.
- _lazy_fit_predict_gpu(X, y=None, sample_weight=None)[source]
Fit SOM and select the winner neurons for the input using Dask with GPUs only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- y{array-like, sparse matrix} of shape (n_samples).
This is just a placeholder to keep the compatibility with other fit_predict methods. SOM does not use labels to verify the input.
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit_predict methods. This is not used by SOM.
Returns
- selfobject
Returns a fitted instance of self.
- _fit_predict_cpu(X, y=None, sample_weight=None)[source]
Fit SOM and select the winner neurons for the input using CPU only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- y{array-like, sparse matrix} of shape (n_samples).
This is just a placeholder to keep the compatibility with other fit_predict methods. SOM does not use labels to verify the input.
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit_predict methods. This is not used by SOM.
Returns
- selfobject
Returns a fitted instance of self.
- _fit_predict_gpu(X, y=None, sample_weight=None)[source]
Fit SOM and select the winner neurons for the input using GPU only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- y{array-like, sparse matrix} of shape (n_samples).
This is just a placeholder to keep the compatibility with other fit_predict methods. SOM does not use labels to verify the input.
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit_predict methods. This is not used by SOM.
Returns
- selfobject
Returns a fitted instance of self.
- _lazy_predict_cpu(X, sample_weight=None)[source]
Predict the input using a fitted SOM using Dask with CPUs only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.
Returns
- labelsndarray of shape (n_samples,)
Cluster labels. Noisy samples are given the label -1.
- _lazy_predict_gpu(X, sample_weight=None)[source]
Predict the input using a fitted SOM using Dask with GPUs only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.
Returns
- labelsndarray of shape (n_samples,)
Cluster labels. Noisy samples are given the label -1.
- _predict_cpu(X, sample_weight=None)[source]
Predict the input using a fitted SOM using CPU only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.
Returns
- labelsndarray of shape (n_samples,)
Cluster labels. Noisy samples are given the label -1.
- _predict_gpu(X, sample_weight=None)[source]
Predict the input using a fitted SOM using GPU only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
- sample_weightarray-like of shape (n_samples,), default=None
This is just a placeholder to keep the compatibility with other fit methods. This is not used by SOM.
Returns
- labelsndarray of shape (n_samples,)
Cluster labels. Noisy samples are given the label -1.
- _lazy_quantization_error_cpu(X)[source]
Returns the quantization error computed as the average distance between each input sample and its best matching unit using Dask with CPUs only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
Returns
- errorfloat
The quantization error of the trained SOM.
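The definition of the quantization error (average distance between each sample and its best matching unit) can be sketched in NumPy. The function name and the weights layout (x, y, input_len) are assumptions made for this example, and Euclidean distance is assumed as the activation distance.

```python
import numpy as np

def quantization_error(X, weights):
    # Flatten the (x, y, input_len) map into one row per neuron.
    codes = weights.reshape(-1, weights.shape[-1])
    # Distance from every sample to every neuron.
    dist = np.linalg.norm(X[:, None, :] - codes[None, :, :], axis=2)
    # Best matching unit distance per sample, averaged over samples.
    return dist.min(axis=1).mean()

# Hypothetical 2-by-2 map: three neurons at the origin, one at (3, 4).
weights = np.zeros((2, 2, 2))
weights[1, 1] = [3.0, 4.0]
X = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 0.0]])

error = quantization_error(X, weights)
print(error)
```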
- _lazy_quantization_error_gpu(X)[source]
Returns the quantization error computed as the average distance between each input sample and its best matching unit using Dask with GPUs only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
Returns
- errorfloat
The quantization error of the trained SOM.
- _quantization_error_cpu(X)[source]
Returns the quantization error computed as the average distance between each input sample and its best matching unit using CPU only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
Returns
- errorfloat
The quantization error of the trained SOM.
- _quantization_error_gpu(X)[source]
Returns the quantization error computed as the average distance between each input sample and its best matching unit using GPU only.
Parameters
X : {array-like, sparse matrix} of shape (n_samples, n_features).
Returns
- errorfloat
The quantization error of the trained SOM.
- class dasf.ml.cluster.SpectralClustering(n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=None, n_components=None, persist_embedding=False, kmeans_params=None, verbose=False, **kwargs)[source]
Bases:
dasf.ml.cluster.classifier.ClusterClassifier
Apply clustering to a projection of the normalized Laplacian.
In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex, or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster, such as when clusters are nested circles on the 2D plane.
If the affinity matrix is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.
When calling fit, an affinity matrix is constructed using either a kernel function such as the Gaussian (aka RBF) kernel with Euclidean distance d(X, X):
np.exp(-gamma * d(X,X) ** 2)
or a k-nearest neighbors connectivity matrix.
Alternatively, a user-provided affinity matrix can be specified by setting affinity='precomputed'.
Read more in the User Guide.
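The RBF affinity described above can be computed directly with NumPy. This is a minimal sketch of the formula np.exp(-gamma * d(X,X) ** 2), where d is the pairwise Euclidean distance; the function name rbf_affinity is hypothetical and the data are the same toy points used in the Examples.

```python
import numpy as np

def rbf_affinity(X, gamma=1.0):
    # Pairwise Euclidean distances d(X, X).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Gaussian (RBF) kernel applied to the distances.
    return np.exp(-gamma * dist ** 2)

X = np.array([[1, 1], [2, 1], [1, 0], [4, 7], [3, 5], [3, 6]], dtype=float)
A = rbf_affinity(X, gamma=1.0)
print(A.shape)
print(A[0, 1])  # affinity between two points at distance 1
```

A matrix built this way is symmetric with ones on the diagonal, which is the shape of input expected when affinity='precomputed' is used.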
Parameters
- n_clustersint, default=8
The dimension of the projection subspace.
- eigen_solver{‘arpack’, ‘lobpcg’, ‘amg’}, default=None
The eigenvalue decomposition strategy to use. AMG requires pyamg to be installed. It can be faster on very large, sparse problems, but may also lead to instabilities. If None, then 'arpack' is used.
- n_componentsint, default=n_clusters
Number of eigenvectors to use for the spectral embedding.
- random_stateint, RandomState instance, default=None
A pseudo random number generator used for the initialization of the lobpcg eigenvectors decomposition when eigen_solver='amg' and by the K-Means initialization. Use an int to make the randomness deterministic. See Glossary.
- n_initint, default=10
Number of times the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia. Only used if assign_labels='kmeans'.
- gammafloat, default=1.0
Kernel coefficient for rbf, poly, sigmoid, laplacian and chi2 kernels. Ignored for affinity='nearest_neighbors'.
- affinitystr or callable, default=’rbf’
How to construct the affinity matrix:
‘nearest_neighbors’: construct the affinity matrix by computing a graph of nearest neighbors.
‘rbf’: construct the affinity matrix using a radial basis function (RBF) kernel.
‘precomputed’: interpret X as a precomputed affinity matrix, where larger values indicate greater similarity between instances.
‘precomputed_nearest_neighbors’: interpret X as a sparse graph of precomputed distances, and construct a binary affinity matrix from the n_neighbors nearest neighbors of each instance.
One of the kernels supported by pairwise_kernels().
Only kernels that produce similarity scores (non-negative values that increase with similarity) should be used. This property is not checked by the clustering algorithm.
- n_neighborsint, default=10
Number of neighbors to use when constructing the affinity matrix using the nearest neighbors method. Ignored for affinity='rbf'.
- eigen_tolfloat, default=0.0
Stopping criterion for eigendecomposition of the Laplacian matrix when eigen_solver='arpack'.
- assign_labels{‘kmeans’, ‘discretize’}, default=’kmeans’
The strategy for assigning labels in the embedding space. There are two ways to assign labels after the Laplacian embedding. k-means is a popular choice, but it can be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.
- degreefloat, default=3
Degree of the polynomial kernel. Ignored by other kernels.
- coef0float, default=1
Zero coefficient for polynomial and sigmoid kernels. Ignored by other kernels.
- kernel_paramsdict of str to any, default=None
Parameters (keyword arguments) and values for kernel passed as callable object. Ignored by other kernels.
- n_jobsint, default=None
The number of parallel jobs to run when affinity=’nearest_neighbors’ or affinity=’precomputed_nearest_neighbors’. The neighbors search will be done in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
- verbosebool, default=False
Verbosity mode.
Added in version 0.24.
- persist_embeddingbool
Whether to persist the intermediate n_samples x n_components array used for clustering.
- kmeans_paramsdictionary of string to any, optional
Keyword arguments for the KMeans clustering used for the final clustering.
Examples
>>> from dasf.ml.cluster import SpectralClustering
>>> import numpy as np
>>> X = np.array([[1, 1], [2, 1], [1, 0],
...               [4, 7], [3, 5], [3, 6]])
>>> clustering = SpectralClustering(n_clusters=2,
...         assign_labels='discretize',
...         random_state=0).fit(X)
>>> clustering
SpectralClustering(assign_labels='discretize', n_clusters=2, random_state=0)
For further information see:
- https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering
- https://ml.dask.org/modules/generated/dask_ml.cluster.SpectralClustering.html
Notes
A distance matrix for which 0 indicates identical elements and high values indicate very dissimilar elements can be transformed into an affinity / similarity matrix that is well-suited for the algorithm by applying the Gaussian (aka RBF, heat) kernel:
np.exp(- dist_matrix ** 2 / (2. * delta ** 2))
where delta is a free parameter representing the width of the Gaussian kernel.
An alternative is to take a symmetric version of the k-nearest neighbors connectivity matrix of the points.
If the pyamg package is installed, it is used: this greatly speeds up computation.
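Both options mentioned in these Notes can be sketched with NumPy. The helper names are hypothetical, the distance matrix is a made-up example for three points on a line, and the k-nearest neighbors helper assumes no duplicate points (so that each point is its own closest entry). Symmetrizing with 0.5 * (C + C.T) is one common choice, shown here as an assumption rather than as the method used by any backend.

```python
import numpy as np

def distance_to_affinity(dist_matrix, delta=1.0):
    # Gaussian (heat) kernel: 0 distance -> affinity 1, large distance -> ~0.
    return np.exp(-dist_matrix ** 2 / (2.0 * delta ** 2))

def symmetric_knn_connectivity(dist_matrix, n_neighbors):
    n = dist_matrix.shape[0]
    # Closest other points per row; column 0 of the sort is the point itself.
    order = np.argsort(dist_matrix, axis=1)[:, 1:n_neighbors + 1]
    conn = np.zeros((n, n))
    conn[np.repeat(np.arange(n), n_neighbors), order.ravel()] = 1.0
    # k-NN is not symmetric in general, so average with the transpose.
    return 0.5 * (conn + conn.T)

# Pairwise distances for points at positions 0, 1 and 3 on a line.
dist_matrix = np.array([[0.0, 1.0, 3.0],
                        [1.0, 0.0, 2.0],
                        [3.0, 2.0, 0.0]])

affinity = distance_to_affinity(dist_matrix, delta=1.0)
knn = symmetric_knn_connectivity(dist_matrix, n_neighbors=1)
print(affinity)
print(knn)
```

Note that the delta form matches the gamma form of the RBF kernel when gamma = 1 / (2 * delta ** 2).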
References
Normalized cuts and image segmentation, 2000 Jianbo Shi, Jitendra Malik http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.160.2324
A Tutorial on Spectral Clustering, 2007 Ulrike von Luxburg http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.165.9323
Multiclass spectral clustering, 2003 Stella X. Yu, Jianbo Shi https://www1.icsi.berkeley.edu/~stellayu/publication/doc/2003kwayICCV.pdf
Constructor of the class SpectralClustering.
- n_clusters
- eigen_solver
- random_state
- n_init
- gamma
- affinity
- n_neighbors
- eigen_tol
- assign_labels
- degree
- coef0
- kernel_params
- n_jobs
- n_components
- persist_embedding
- kmeans_params
- verbose
- __sc_cpu
- __sc_mcpu
- _fit_cpu(X, y=None, sample_weight=None)[source]
Perform spectral clustering from features, or affinity matrix using CPU only.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
Training instances to cluster, similarities / affinities between instances if affinity='precomputed', or distances between instances if affinity='precomputed_nearest_neighbors'. If a sparse matrix is provided in a format other than csr_matrix, csc_matrix, or coo_matrix, it will be converted into a sparse csr_matrix.
- yIgnored
Not used, present here for API consistency by convention.
Returns
- selfobject
A fitted instance of the estimator.
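As background for what fitting involves, the projection onto the normalized Laplacian can be sketched with NumPy. This is a textbook-style illustration, not the code run by dasf or its backends: the function name spectral_embedding is reused here only as a local helper, a dense eigensolver replaces the arpack/lobpcg/amg options, and the final label assignment is reduced to a sign test that only works for two clusters.

```python
import numpy as np

def spectral_embedding(affinity, n_components):
    degree = affinity.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}.
    normalized = d_inv_sqrt[:, None] * affinity * d_inv_sqrt[None, :]
    laplacian = np.eye(affinity.shape[0]) - normalized
    # eigh returns eigenvalues in ascending order; keep the smallest ones.
    _, eigvecs = np.linalg.eigh(laplacian)
    return eigvecs[:, :n_components]

# Toy affinity: two groups of three, strong within and weak across.
within = np.full((3, 3), 0.9)
np.fill_diagonal(within, 1.0)
cross = np.full((3, 3), 0.01)
affinity = np.block([[within, cross], [cross, within]])

embedding = spectral_embedding(affinity, n_components=2)
# Two-cluster shortcut: split on the sign of the second eigenvector.
labels = (embedding[:, 1] > 0).astype(int)
print(labels)
```

For more than two clusters, the rows of the embedding would instead be clustered with k-means or discretization, as selected by assign_labels.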
- _lazy_fit_predict_cpu(X, y=None, sample_weight=None)[source]
Perform spectral clustering on X and return cluster labels using Dask with CPU only.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
Training instances to cluster, similarities / affinities between instances if affinity='precomputed', or distances between instances if affinity='precomputed_nearest_neighbors'. If a sparse matrix is provided in a format other than csr_matrix, csc_matrix, or coo_matrix, it will be converted into a sparse csr_matrix.
- yIgnored
Not used, present here for API consistency by convention.
Returns
- labelsndarray of shape (n_samples,)
Cluster labels.
- _fit_predict_cpu(X, y=None, sample_weight=None)[source]
Perform spectral clustering on X and return cluster labels using CPU only.
Parameters
- X{array-like, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_samples)
Training instances to cluster, similarities / affinities between instances if affinity='precomputed', or distances between instances if affinity='precomputed_nearest_neighbors'. If a sparse matrix is provided in a format other than csr_matrix, csc_matrix, or coo_matrix, it will be converted into a sparse csr_matrix.
- yIgnored
Not used, present here for API consistency by convention.
Returns
- labelsndarray of shape (n_samples,)
Cluster labels.