Tutorial 2 - How to extend DASF Datasets

In this tutorial, we will show you how to extend DASF datasets so that they can be loaded dynamically on every architecture.

For this scenario, we will use the DASF DatasetArray class to show how you can create such a dataset from a simple NPY file.

To start, the first step is to create and save a simple NPY file for the dataset to load.

[1]:
### Serialize a simple array
import numpy as np

data = np.random.random((20, 20, 20))

np.save("data.npy", data)
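
Before handing the file to a dataset, it can help to confirm that it round-trips correctly. The snippet below is a minimal sketch using only NumPy; the temporary path is an assumption made here so that the check does not overwrite the tutorial's data.npy.

```python
import os
import tempfile

import numpy as np

# Write to a temporary directory so this check does not clobber "data.npy".
path = os.path.join(tempfile.mkdtemp(), "data.npy")

data = np.random.random((20, 20, 20))
np.save(path, data)

# Read the file back and compare it with the array that was saved.
restored = np.load(path)
print(restored.shape, restored.dtype)
print(np.array_equal(restored, data))
```

If the shapes and values match, the file is ready to be used as the root of an array dataset.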

Once we have the file saved, we can create our own array dataset.

[2]:
from dasf.datasets import DatasetArray

dataset = DatasetArray(name="My Saved NPY", root="data.npy")

At this point, our dataset is not loaded yet. To load the data from the NPY file, we need to call the load method. This object has the same dynamic generator as in the previous tutorial. Here we are using an ipykernel with a GPU, so we expect the dataset to load a CuPy array. Let’s see if this is true.

[3]:
dataset.load()

Once it is loaded, we can slice the dataset and check the type of each slice.

[4]:
type(dataset[:2, :2, :2])
[4]:
cupy._core.core.ndarray

What should I do if I’m using a GPU but I want to load a NumPy array?

All datasets have a protected load method for each platform. The code detects which platform you are on and binds the public method to the corresponding protected method.

In other words, if you call load in a GPU environment, as we are doing here, you are in fact executing the protected method called _load_gpu.

So, to load NumPy arrays, all you need to do is call _load_cpu directly.

[5]:
dataset._load_cpu()

type(dataset[:2, :2, :2])
[5]:
numpy.ndarray
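
To make the binding idea concrete, here is a minimal sketch of how a public load could be bound to a platform-specific protected method. It is an illustration only and not DASF's actual implementation: ToyArrayDataset and gpu_available are hypothetical names introduced here, and the GPU check simply assumes that CuPy is importable and reports at least one device.

```python
import os
import tempfile

import numpy as np


def gpu_available():
    """Hypothetical check: True only if CuPy imports and sees a device."""
    try:
        import cupy as cp
        return cp.cuda.runtime.getDeviceCount() > 0
    except Exception:
        return False


class ToyArrayDataset:
    """Hypothetical stand-in for an array dataset; not the DASF class."""

    def __init__(self, root):
        self._root = root
        self._data = None
        # Bind the public name to the protected method for this platform.
        self.load = self._load_gpu if gpu_available() else self._load_cpu

    def _load_cpu(self):
        # Always produces a NumPy array, whatever the platform.
        self._data = np.load(self._root)
        return self._data

    def _load_gpu(self):
        # Imported lazily so that CPU-only machines never need CuPy.
        import cupy as cp
        self._data = cp.load(self._root)
        return self._data

    def __getitem__(self, idx):
        return self._data[idx]


# Usage: save a small array, then force the CPU path explicitly.
path = os.path.join(tempfile.mkdtemp(), "toy.npy")
np.save(path, np.random.random((20, 20, 20)))

toy = ToyArrayDataset(path)
toy._load_cpu()
print(type(toy[:2, :2, :2]))
```

Calling the protected method bypasses the automatic choice, which is the same reason _load_cpu yields a NumPy array in the cell above even though a GPU is present.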

If you need to handle a Dask array in a distributed (clustered) environment, you can use the protected lazy methods named _lazy_*.

For datasets, the corresponding load methods are _lazy_load_cpu and _lazy_load_gpu. Both return a Dask array, but with different metadata.

Let’s see what it looks like.

[6]:
dataset._lazy_load_cpu()

type(dataset[:2, :2, :2])
[6]:
dask.array.core.Array

Now see what the internal (meta) array of this Dask dataset looks like.

[7]:
type(dataset[:2, :2, :2]._meta)
[7]:
numpy.ndarray
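
The same behaviour can be reproduced with plain Dask, outside of DASF. The snippet below is a minimal sketch that assumes dask is installed; the chunk size is an arbitrary choice made here, and _meta is a private Dask attribute, so treat it as something to inspect rather than rely on.

```python
import dask.array as da
import numpy as np

data = np.random.random((20, 20, 20))

# Wrap the in-memory array in a Dask array split into 10x10x10 chunks.
lazy = da.from_array(data, chunks=(10, 10, 10))

# Slicing stays lazy: the result is still a Dask array.
block = lazy[:2, :2, :2]
print(type(block))

# _meta is a small placeholder array whose type matches the chunk type.
print(type(block._meta))

# compute() materializes the slice as a concrete NumPy array.
result = block.compute()
print(type(result), result.shape)
```

With GPU-backed chunks, the meta array would be expected to be a CuPy array instead, which is presumably the metadata difference between the two lazy load methods.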