{ "cells": [ { "cell_type": "markdown", "id": "104e064c", "metadata": {}, "source": [ "### Tutorial 8 - HDBSCAN Differences between CPU and GPU versions\n", "\n", "This tutorial is a basic example of the differences we should expect when we run two models that were implemented for different device targets.\n", "\n", "We will use DASF to demonstrate how to compare two models using different samples of data.\n", "\n", "First, let's create a collection of 2-D samples (because they are easy to visualize using simple techniques). Feel free to test it using other dimensions as an exercise." ] }, { "cell_type": "code", "execution_count": 1, "id": "6575d458", "metadata": {}, "outputs": [], "source": [ "from dasf.datasets import make_blobs\n", "\n", "# There is no make_moons for GPU, so we need to use the one from sklearn.\n", "# Note that this import also rebinds make_blobs to the sklearn version.\n", "from sklearn.datasets import make_blobs, make_moons\n", "\n", "\n", "class Params:\n", "    \"\"\" Convert a dict into an object. \"\"\"\n", "    def __init__(self, **kwargs):\n", "        self.__dict__.update(kwargs)\n", "\n", "def generate_blobs_collection():\n", "    blobs = []\n", "\n", "    X, y, c = make_blobs(n_samples=1000,\n", "                         centers=3,\n", "                         n_features=2,\n", "                         return_centers=True,\n", "                         random_state=42)\n", "\n", "    blobs.append({'X': X, 'y': y, 'centroids': c})\n", "\n", "    X, y, c = make_blobs(n_samples=1000,\n", "                         return_centers=True,\n", "                         random_state=30)\n", "\n", "    blobs.append({'X': X, 'y': y, 'centroids': c})\n", "\n", "    X, y, c = make_blobs(n_samples=4000,\n", "                         centers=[(-0.75, 2.25),\n", "                                  (1.0, 2.0),\n", "                                  (1.0, 1.0),\n", "                                  (2.0, -0.5),\n", "                                  (-1.0, -1.0),\n", "                                  (0.0, 0.0)],\n", "                         cluster_std=0.5,\n", "                         return_centers=True,\n", "                         random_state=12)\n", "\n", "    blobs.append({'X': X, 'y': y, 'centroids': c})\n", "\n", "    X, y = make_moons(n_samples=3000,\n", "                      noise=0.1,\n", "                      random_state=42)\n", "\n", "    blobs.append({'X': X, 'y': y, 'centroids': []})\n", "\n", "    X, y, c = make_blobs(n_samples=2000,\n", "                         n_features=2,\n", "                         center_box=(100, 200),\n", " 
return_centers=True,\n", "                         cluster_std=15)\n", "\n", "    blobs.append({'X': X, 'y': y, 'centroids': c})\n", "\n", "    params = []\n", "\n", "    for i in range(len(blobs)):\n", "        for size in [5, 10, 50]:\n", "            blobs[i]['min_cluster_size'] = size\n", "            params.append(Params(**blobs[i]))\n", "\n", "            blobs[i]['min_samples'] = 1\n", "            params.append(Params(**blobs[i]))\n", "            del blobs[i]['min_samples']\n", "\n", "    return params" ] }, { "cell_type": "markdown", "id": "7e43e244", "metadata": {}, "source": [ "Even if we fix the seed, the CPU and GPU versions can assign different class labels to the same clusters. The function below is defined just to guarantee that the classes match. It is a kind of brute-force function that builds a minimal map between classes.\n", "\n", "The heuristic converts the labels of one prediction to negative values and tries to match them against the non-negative labels of the other prediction. If a negative number still remains in the converted prediction, it means that there is no match for that label, so it is reverted to its original value." ] }, { "cell_type": "code", "execution_count": 2, "id": "41cc8550", "metadata": {}, "outputs": [], "source": [ "def match_classes_heuristic(y1, y2):\n", "    # Convert y2 to negative to avoid extra checks:\n", "    # labels 0, 1, 2, ... become -2, -3, -4, ... and noise (-1) stays -1.\n", "    y2 = (y2 * -1) - 2\n", "\n", "    # Register conversions\n", "    conversion_dict = {}\n", "\n", "    for i in range(len(y1)):\n", "        # Try to match positive numbers from y1\n", "        # to negative numbers from y2.\n", "        if y1[i] >= 0 and y2[i] < -1:\n", "            conversion_dict[y2[i]] = y1[i]\n", "            y2[y2 == y2[i]] = y1[i]\n", "\n", "    # Convert any values that were not converted yet\n", "    for t2, t1 in conversion_dict.items():\n", "        y2[y2 == t2] = t1\n", "\n", "    for i in range(len(y2)):\n", "        # Restore the original value of whatever did not match.\n", "        if y2[i] < 0:\n", "            y2[i] = (y2[i] + 2) * -1\n", "\n", "    return y1, y2" ] }, { "cell_type": "markdown", "id": "45f9aa0f", "metadata": {}, "source": [ "Now, we can create a loop that iterates through the samples and the parameters we introduced in function 
`generate_blobs_collection()` to test two versions of HDBSCAN.\n", "\n", "For this experiment, we don't need to create a pipeline: the experiment is relatively simple and HDBSCAN does not work with Dask." ] }, { "cell_type": "code", "execution_count": 3, "id": "cbbfc34a", "metadata": {}, "outputs": [], "source": [ "from dasf.ml.cluster import HDBSCAN\n", "\n", "params = generate_blobs_collection()\n", "\n", "# If you would like to, you can set up your figure or plot library here.\n", "# Below, we will use a Plotly code sample. Feel free to use it and install\n", "# the required libraries with `pip3 install plotly kaleido`.\n", "\n", "# (INSERT CODE 1 HERE)\n", "\n", "for i in range(0, len(params)):\n", "    if hasattr(params[i], 'min_samples'):\n", "        cpu_model = HDBSCAN(min_cluster_size=params[i].min_cluster_size, min_samples=params[i].min_samples)\n", "    else:\n", "        cpu_model = HDBSCAN(min_cluster_size=params[i].min_cluster_size)\n", "\n", "    cpu_pred = cpu_model._fit_predict_cpu(X=params[i].X)\n", "\n", "    if hasattr(params[i], 'min_samples'):\n", "        gpu_model = HDBSCAN(min_cluster_size=params[i].min_cluster_size, min_samples=params[i].min_samples)\n", "    else:\n", "        gpu_model = HDBSCAN(min_cluster_size=params[i].min_cluster_size)\n", "\n", "    gpu_pred = gpu_model._fit_predict_gpu(X=params[i].X)\n", "\n", "    cpu_pred, gpu_pred = match_classes_heuristic(cpu_pred, gpu_pred)\n", "\n", "    # Here you can add some subplot content to show the samples and predictions.\n", "\n", "    # (INSERT CODE 2 HERE)" ] }, { "cell_type": "markdown", "id": "1addbf17", "metadata": {}, "source": [ "The results of the previous task can be seen in the image below. For dense distributions at least, some differences are clearly visible even when `min_samples` is left at its default. 
It is even worse when `min_samples=1`.\n", "\n", "![HDBSCAN results](imgs/tutorial_8_results.jpg \"HDBSCAN Results\")" ] }, { "cell_type": "markdown", "id": "cf134f56", "metadata": {}, "source": [ "#### Plot Data (Optional)\n", "\n", "If you want to use Plotly to plot the same results, you can insert the following code at placeholders 1 and 2. Part 1 can look like the code below.\n", "\n", "```python\n", "import numpy as np\n", "import plotly.express as px\n", "import plotly.graph_objects as go\n", "from plotly.subplots import make_subplots\n", "\n", "fig = make_subplots(rows=len(params), cols=4, subplot_titles=(\"Original\", \"CPU\", \"GPU\", \"Diff\"))\n", "\n", "fig.update_layout(height=5000, width=800, title_text=\"HDBSCAN Differences\")\n", "```\n", "\n", "The second part, which fills in the subplots, can look like the code below.\n", "\n", "```python\n", "    if hasattr(params[i], 'min_samples'):\n", "        yaxis_min_samples = str(params[i].min_samples)\n", "    else:\n", "        yaxis_min_samples = \"default\"\n", "\n", "    # Column 1: original samples colored by their ground-truth labels.\n", "    fig.add_trace(\n", "        go.Scatter(\n", "            x=params[i].X[:, 0],\n", "            y=params[i].X[:, 1],\n", "            mode=\"markers\",\n", "            marker=dict(color=params[i].y),\n", "            showlegend=False,\n", "        ),\n", "        row=i + 1, col=1\n", "    )\n", "\n", "    fig.update_yaxes(title_text=f\"({params[i].min_cluster_size} - {yaxis_min_samples})\", row=i + 1, col=1)\n", "\n", "    n_colors = np.max([np.max(cpu_pred), np.max(gpu_pred)]) + 1\n", "    color_scale = px.colors.sample_colorscale(\"turbo\", [n/(n_colors) for n in range(n_colors)])\n", "\n", "    # Noise points (negative labels) are drawn in grey.\n", "    cpu_pred_colors = ['rgb(192,192,192)' if x < 0 else color_scale[x] for x in cpu_pred]\n", "\n", "    # Column 2: CPU predictions.\n", "    fig.add_trace(\n", "        go.Scatter(\n", "            x=params[i].X[:, 0],\n", "            y=params[i].X[:, 1],\n", "            mode=\"markers\",\n", "            marker=dict(color=cpu_pred_colors),\n", "            showlegend=False\n", "        ),\n", "        row=i + 1, col=2\n", "    )\n", "\n", "    gpu_pred_colors = ['rgb(192,192,192)' if x < 0 else color_scale[x] for x in gpu_pred]\n", "\n", "    # Column 3: GPU predictions.\n", "    fig.add_trace(\n", "        go.Scatter(\n", "            x=params[i].X[:, 0],\n", "            y=params[i].X[:, 1],\n", "            mode=\"markers\",\n", "            marker=dict(color=gpu_pred_colors),\n", "            showlegend=False\n", "        ),\n", " 
row=i + 1, col=3\n", "    )\n", "\n", "    # Points where CPU and GPU labels differ are drawn in red.\n", "    diff_pred_colors = ['rgb(192,192,192)' if x == 0 else 'rgb(210,5,5)' for x in cpu_pred - gpu_pred]\n", "\n", "    fig.add_trace(\n", "        go.Scatter(\n", "            x=params[i].X[:, 0],\n", "            y=params[i].X[:, 1],\n", "            mode=\"markers\",\n", "            marker=dict(color=diff_pred_colors),\n", "            showlegend=False\n", "        ),\n", "        row=i + 1, col=4\n", "    )\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "43be2bac", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" }, "test_requirements": { "single_gpu": true } }, "nbformat": 4, "nbformat_minor": 5 }