minerva.data.data_modules.har

Classes

MultiModalHARSeriesDataModule

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.

UserActivityFolderDataModule

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.

Functions

parse_num_workers([num_workers])

Parse the num_workers parameter. If None, use all cores.

parse_transforms(transforms)

Parse the transforms parameter to a dictionary with the split name as key and a list of transforms as value.

Module Contents

class minerva.data.data_modules.har.MultiModalHARSeriesDataModule(data_path, feature_prefixes=('accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z'), label='standard activity code', features_as_channels=True, transforms=None, cast_to='float32', batch_size=1, num_workers=None, data_percentage=1.0, use_train_as_validation=False, use_val_with_train=False, map_labels=None, drop_last=True, n_domains_per_sample=None, samples_per_class=None, seed=None, predict_split='test', shuffle_train=True)[source]

Bases: lightning.LightningDataModule

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.

Example:

import lightning as L
import torch
import torch.utils.data as data
from lightning.pytorch.demos.boring_classes import RandomDataset

class MyDataModule(L.LightningDataModule):
    def prepare_data(self):
        # download, IO, etc. Useful with shared filesystems
        # only called on 1 GPU/TPU in distributed
        ...

    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        dataset = RandomDataset(1, 100)
        self.train, self.val, self.test = data.random_split(
            dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
        )

    def train_dataloader(self):
        return data.DataLoader(self.train)

    def val_dataloader(self):
        return data.DataLoader(self.val)

    def test_dataloader(self):
        return data.DataLoader(self.test)

    def on_exception(self, exception):
        # clean up state after the trainer faced an exception
        ...

    def teardown(self, stage):
        # clean up state after the trainer stops, delete files...
        # called on every process in DDP
        ...

Define the dataloaders for train, validation and test splits for HAR datasets. This dataset assumes that the data is in a single CSV file with series of data. Each row is a single sample that can be composed of multiple modalities (series). Each column is a feature of some series, with the prefix indicating the series. The suffix may indicate the time step. For instance, if we have two series, accel-x and accel-y, the data will look something like:

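+-----------+-----------+-----------+-----------+-------+
| accel-x-0 | accel-x-1 | accel-y-0 | accel-y-1 | class |
+-----------+-----------+-----------+-----------+-------+
| 0.502123  | 0.02123   | 0.502123  | 0.502123  | 0     |
+-----------+-----------+-----------+-----------+-------+
| 0.6820123 | 0.02123   | 0.502123  | 0.502123  | 1     |
+-----------+-----------+-----------+-----------+-------+
| 0.498217  | 0.00001   | 1.414141  | 3.141592  | 2     |
+-----------+-----------+-----------+-----------+-------+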

The feature_prefixes parameter is used to select the columns that will be used as features. For instance, if we want to use only the accel-x series, we can set feature_prefixes=["accel-x"]. If we want to use both accel-x and accel-y, we can set feature_prefixes=["accel-x", "accel-y"]. If None is passed, all columns will be used as features, except the label column. The label column is specified by the label parameter.

The dataset will return a 2-element tuple with the data and the label, if the label parameter is specified, otherwise return only the data.

If features_as_channels is True, the data will be returned as a vector of shape (C, T), where C is the number of channels (features) and T is the number of time steps. Else, the data will be returned as a vector of shape T*C (a single vector with all the features).
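As a rough illustration of the layout described above, the sketch below groups CSV columns by prefix and builds either a (C, T) nested list or a flat vector of length T*C. It uses only the standard library and is not Minerva's implementation: the helper name `row_to_sample` is hypothetical, and the channel-by-channel ordering of the flattened vector is an assumption.

```python
import csv
import io

# Toy CSV in the layout described above (two series, two time steps each).
CSV_TEXT = """accel-x-0,accel-x-1,accel-y-0,accel-y-1,class
0.502123,0.02123,0.502123,0.502123,0
0.6820123,0.02123,0.502123,0.502123,1
0.498217,0.00001,1.414141,3.141592,2
"""


def row_to_sample(row, feature_prefixes, label=None, features_as_channels=True):
    """Hypothetical helper: turn one CSV row (a dict) into data or (data, label)."""
    channels = []
    for prefix in feature_prefixes:
        # Select the columns of this series and order them by time-step suffix.
        cols = [c for c in row if c.startswith(prefix + "-")]
        cols.sort(key=lambda c: int(c.rsplit("-", 1)[1]))
        channels.append([float(row[c]) for c in cols])
    if features_as_channels:
        data = channels  # nested list of shape (C, T)
    else:
        # Assumption: channels are concatenated one after another.
        data = [value for channel in channels for value in channel]
    if label is None:
        return data
    return data, int(row[label])


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
data, target = row_to_sample(rows[2], ["accel-x", "accel-y"], label="class")
flat = row_to_sample(rows[2], ["accel-x", "accel-y"], features_as_channels=False)
print(data, target)
print(flat)
```

Passing `label=None` returns only the data, mirroring the behaviour described for the dataset when no label column is given.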

Parameters

data_path: PathLike

The path to the folder with “train.csv”, “validation.csv” and “test.csv” files inside it.

feature_prefixes: Union[str, List[str]], optional

The prefixes of the column names in the dataframe that will be used as features. If None, all columns except the label will be used as features.

label: str, optional

The name of the column that will be used as label.

features_as_channels: bool, optional

If True, the data will be returned as a vector of shape (C, T), else the data will be returned as a vector of shape T*C.

cast_to: str, optional

Cast the numpy data to the specified type

transforms: Union[List[Callable], Dict[str, List[Callable]]], optional

This could be:

  • None: No transforms will be applied.

  • List[Callable]: A list of transforms that will be applied to the data. The same transforms will be applied to all splits.

  • Dict[str, List[Callable]]: A dictionary with the split name as key and a list of transforms as value. The split name must be one of: “train”, “validation”, “test” or “predict”.

batch_size: int, optional

The size of the batch.

num_workers: int, optional

Number of workers to load data. If None, then use all cores.

data_percentage: float, optional

The percentage of the data that will be used. This is useful to create a small dataset.

use_train_as_validation: bool, optional

If True, the train dataset will be used as validation dataset.

use_val_with_train: bool, optional

If True, the validation and train sets will be concatenated in order to create a large train set. By default, this is False.

map_labels: Dict[int, int], optional

A dictionary to map the labels to a new label. The key is the original label and the value is the new label.

drop_last: bool, optional

Drop the last batch if it is not complete.

n_domains_per_sample: int, optional

This is only useful when using multiple domains (data_path). It allows creating batches with the same number of samples from multiple domains. If None, all datasets are simply concatenated and sampled in a non-stratified way. By default, None.

samples_per_class: int, optional

If not None, use this number of samples per class for the train split. This will override the data_percentage parameter.

seed: int, optional

Seed for sampling the dataset. If None, no seed is set.

predict_split: str

The name of the split to use for prediction. This will be used to load the dataset for prediction. By default, this is “test”.

shuffle_train: bool

If True, the train dataset will be shuffled.

Notes

  • If data_percentage is set to a value less than 1.0, a random subset of the dataset will be used, containing approximately the specified percentage of the total data. This sampling is not stratified.

  • If samples_per_class is specified, the train split will contain an equal number of samples for each class, as defined by this parameter. This option is mutually exclusive with data_percentage; both cannot be used at the same time.

  • The seed parameter controls the randomness of sampling: If seed is set (i.e., an integer), sampling becomes deterministic, ensuring the same subset is selected on each run. This improves reproducibility and supports cumulative sampling: for example, progressively increasing samples_per_class will retain consistency across runs by sampling the same initial elements. If seed is None, sampling is non-deterministic, and different subsets may be chosen each time.

Raises

ValueError

If samples_per_class and data_percentage are both set.

__repr__()[source]
Return type:

str

__str__()[source]

Return a string representation of the datasets that are set up.

Returns:

A string representation of the datasets that are set up.

_get_loader(split_name, shuffle)[source]

Get a dataloader for the given split.

Parameters

split_name: str

The name of the split. This must be one of: “train”, “validation”, “test” or “predict”.

shuffle: bool

Shuffle the data or not.

Returns

DataLoader

A dataloader for the given split.

Parameters:
  • split_name (str)

  • shuffle (bool)

Return type:

torch.utils.data.DataLoader

_load_dataset(split_name)[source]

Create a MultiModalSeriesCSVDataset dataset with the given split.

Parameters

split_name: str

The name of the split. This must be one of: “train”, “validation”, “test” or “predict”.

Returns

MultiModalSeriesCSVDataset

A MultiModalSeriesCSVDataset dataset with the given split.

Parameters:

split_name (str)

Return type:

Tuple[Union[minerva.data.datasets.series_dataset.MultiModalSeriesCSVDataset, torch.utils.data.ConcatDataset], List[int]]

_sample_dataset(dataset)[source]

Sample the dataset based on the specified parameters.

If samples_per_class is specified, a subset will be created containing the specified number of samples for each class. If data_percentage is specified, a random subset of the dataset will be created containing approximately the specified percentage of the total data. If neither is specified, the entire dataset will be returned.

Note

The seed parameter controls the randomness of sampling: If seed is set (i.e., an integer), sampling becomes deterministic, ensuring the same subset is selected on each run and allowing for cumulative sampling (e.g., progressively increasing samples_per_class will retain consistency across runs by sampling the same initial elements). If seed is None, sampling is non-deterministic, and different subsets may be chosen each time.

Parameters

dataset: Dataset

A map-like dataset to sample from. This should be a-

Returns

Dataset

A sampled dataset.

Raises

ValueError

If samples_per_class is specified and a class has fewer samples than the specified number.
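The sampling rules above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's actual code: the helper name `sample_indices` is hypothetical, a data_percentage different from 1.0 is treated as "set", and the cumulative behaviour is obtained by shuffling each class once and taking a prefix.

```python
import random
from collections import defaultdict


def sample_indices(labels, data_percentage=1.0, samples_per_class=None, seed=None):
    """Hypothetical helper returning the sorted indices of the sampled subset."""
    if samples_per_class is not None and data_percentage != 1.0:
        raise ValueError("samples_per_class and data_percentage are mutually exclusive")
    rng = random.Random(seed)  # seed=None: a different subset may be drawn each run

    if samples_per_class is not None:
        by_class = defaultdict(list)
        for index, label in enumerate(labels):
            by_class[label].append(index)
        chosen = []
        for label in sorted(by_class):
            indices = by_class[label]
            if len(indices) < samples_per_class:
                raise ValueError(f"class {label} has only {len(indices)} samples")
            # Shuffle once, then take a prefix: with a fixed seed, a larger
            # samples_per_class extends the smaller selection.
            rng.shuffle(indices)
            chosen.extend(indices[:samples_per_class])
        return sorted(chosen)

    if data_percentage < 1.0:
        # Non-stratified: class proportions are not preserved.
        count = int(len(labels) * data_percentage)
        return sorted(rng.sample(range(len(labels)), count))

    return list(range(len(labels)))


labels = [0, 0, 0, 0, 1, 1, 1, 1]
small = sample_indices(labels, samples_per_class=1, seed=0)
large = sample_indices(labels, samples_per_class=3, seed=0)
print(set(small) <= set(large))
```

With the same seed, every index chosen for the smaller samples_per_class is also present in the larger selection, which is the cumulative property the note describes.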

batch_size = 1
cast_to = 'float32'
data_path
data_percentage = 1.0
datasets
drop_last = True
feature_prefixes = ('accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z')
features_as_channels = True
label = 'standard activity code'
map_labels = None
n_domains_per_sample = None
num_workers
predict_dataloader()[source]

An iterable or collection of iterables specifying prediction samples.

For more information about multiple dataloaders, see this section.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Return:

A torch.utils.data.DataLoader or a sequence of them specifying prediction samples.

Return type:

torch.utils.data.DataLoader

predict_split = 'test'
rng
samples_per_class = None
seed = None
setup(stage)[source]

Assign the datasets to the corresponding split. self.datasets will be a dictionary with the split name as key and the dataset as value.

Parameters

stage: str

The stage of the setup. This could be:

  • “fit”: Load the train and validation datasets

  • “test”: Load the test dataset

  • “predict”: Load the predict dataset

Raises

ValueError

If the stage is not one of: “fit”, “test” or “predict”

Parameters:

stage (str)
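The stage handling described for setup() can be pictured as a small dispatch table. The sketch below is a guess at the shape of that logic, not the method's source: `splits_for_stage` is a hypothetical helper, and the way it combines predict_split and use_train_as_validation is an assumption based on the parameter descriptions.

```python
def splits_for_stage(stage, predict_split="test", use_train_as_validation=False):
    """Hypothetical helper: map each dataset key to the split it is loaded from."""
    if stage == "fit":
        # Assumption: use_train_as_validation reuses the train split for validation.
        validation_source = "train" if use_train_as_validation else "validation"
        return {"train": "train", "validation": validation_source}
    if stage == "test":
        return {"test": "test"}
    if stage == "predict":
        # predict_split names the split that backs the prediction dataset.
        return {"predict": predict_split}
    raise ValueError(f"Invalid stage {stage!r}; expected 'fit', 'test' or 'predict'")


print(splits_for_stage("fit"))
print(splits_for_stage("predict", predict_split="validation"))
```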

shuffle_train = True
test_dataloader()[source]

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see this section.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note:

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Return type:

torch.utils.data.DataLoader

train_dataloader()[source]

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Return type:

torch.utils.data.DataLoader

transforms
use_train_as_validation = False
use_val_with_train = False
val_dataloader()[source]

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

  • fit()

  • validate()

  • prepare_data()

  • setup()

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note:

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Return type:

torch.utils.data.DataLoader

Parameters:
  • data_path (minerva.utils.typing.PathLike | List[minerva.utils.typing.PathLike])

  • feature_prefixes (List[str])

  • label (str)

  • features_as_channels (bool)

  • transforms (Optional[Union[List[Callable], Dict[str, List[Callable]]]])

  • cast_to (str)

  • batch_size (int)

  • num_workers (Optional[int])

  • data_percentage (float)

  • use_train_as_validation (bool)

  • use_val_with_train (bool)

  • map_labels (Optional[Dict[int, int]])

  • drop_last (bool)

  • n_domains_per_sample (Optional[int])

  • samples_per_class (Optional[int])

  • seed (Optional[int])

  • predict_split (str)

  • shuffle_train (bool)

class minerva.data.data_modules.har.UserActivityFolderDataModule(data_path, features=('accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z'), label='standard activity code', pad=False, transforms=None, cast_to='float32', batch_size=1, num_workers=None)[source]

Bases: lightning.LightningDataModule

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.

Example:

import lightning as L
import torch
import torch.utils.data as data
from lightning.pytorch.demos.boring_classes import RandomDataset

class MyDataModule(L.LightningDataModule):
    def prepare_data(self):
        # download, IO, etc. Useful with shared filesystems
        # only called on 1 GPU/TPU in distributed
        ...

    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        dataset = RandomDataset(1, 100)
        self.train, self.val, self.test = data.random_split(
            dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
        )

    def train_dataloader(self):
        return data.DataLoader(self.train)

    def val_dataloader(self):
        return data.DataLoader(self.val)

    def test_dataloader(self):
        return data.DataLoader(self.test)

    def on_exception(self, exception):
        # clean up state after the trainer faced an exception
        ...

    def teardown(self, stage):
        # clean up state after the trainer stops, delete files...
        # called on every process in DDP
        ...

Define the dataloaders for train, validation and test splits for HAR datasets. This is a wrapper around the SeriesFolderCSVDataset dataset class, which assumes that the data is in a folder with multiple CSV files. Each CSV file is a single sample that can be composed of multiple time steps (rows). Each column is a feature of the sample.

For instance, if we have two samples, user-1.csv and user-2.csv, the directory structure will look something like:

data_path
├── user-1.csv
└── user-2.csv

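And the data will look something like:

  • user-1.csv:

    +----------+----------+-------+
    | accel-x  | accel-y  | class |
    +----------+----------+-------+
    | 0.502123 | 0.02123  | 1     |
    +----------+----------+-------+
    | 0.682012 | 0.02123  | 1     |
    +----------+----------+-------+
    | 0.498217 | 0.00001  | 1     |
    +----------+----------+-------+

  • user-2.csv:

    +----------+----------+-------+
    | accel-x  | accel-y  | class |
    +----------+----------+-------+
    | 0.502123 | 0.02123  | 0     |
    +----------+----------+-------+
    | 0.682012 | 0.02123  | 0     |
    +----------+----------+-------+
    | 0.498217 | 0.00001  | 0     |
    +----------+----------+-------+
    | 3.141592 | 1.414141 | 0     |
    +----------+----------+-------+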

The features parameter is used to select the columns that will be used as features. For instance, if we want to use only the accel-x column, we can set features=["accel-x"]. If we want to use both accel-x and accel-y, we can set features=["accel-x", "accel-y"].

The label column is specified by the label parameter. Note that we have one label per time-step and not a single label per sample.

The dataset will return a 2-element tuple with the data and the label, if the label parameter is specified, otherwise return only the data.
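To make the "one label per time step" point concrete, the sketch below parses a CSV in the layout of user-2.csv into feature series and a label series of the same length. It is a standard-library illustration only: `read_sample` is a hypothetical helper, and returning the data as one list per feature column is an assumption about the layout, not something the class documentation states.

```python
import csv
import io

# Contents in the layout of user-2.csv: four time steps, two features, one label column.
USER_2_CSV = """accel-x,accel-y,class
0.502123,0.02123,0
0.682012,0.02123,0
0.498217,0.00001,0
3.141592,1.414141,0
"""


def read_sample(text, features, label=None):
    """Hypothetical helper: parse one sample file into data or (data, labels)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    # Assumption: one inner list per selected feature, one value per time step.
    data = [[float(row[name]) for row in rows] for name in features]
    if label is None:
        return data
    # One label per time step (row), not one label per file.
    labels = [int(row[label]) for row in rows]
    return data, labels


data, labels = read_sample(USER_2_CSV, ["accel-x", "accel-y"], label="class")
print(len(data), len(data[0]), labels)
```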

Parameters

data_path: PathLike

The location of the directory with CSV files.

features: List[str]

A list with column names that will be used as features. If None, all columns except the label will be used as features.

pad: bool, optional

If True, the data will be padded to the length of the longest sample. Note that padding will be applied after the transforms, and also to the labels if specified.
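A minimal sketch of padding to the longest sample follows. The documentation does not say how the padding is filled or on which side it is added, so right-padding with a constant value is purely an assumption here, and `pad_to_longest` is a hypothetical helper rather than part of Minerva.

```python
def pad_to_longest(samples, pad_value=0.0):
    """Hypothetical helper: right-pad every series so all samples share one length."""
    longest = max(len(series) for sample in samples for series in sample)
    return [
        [series + [pad_value] * (longest - len(series)) for series in sample]
        for sample in samples
    ]


# Two samples with different numbers of time steps (3 and 4).
user_1 = [[0.502123, 0.682012, 0.498217], [0.02123, 0.02123, 0.00001]]
user_2 = [[0.502123, 0.682012, 0.498217, 3.141592], [0.02123, 0.02123, 0.00001, 1.414141]]
padded = pad_to_longest([user_1, user_2])

# Labels are per time step, so they are padded the same way (one series per sample).
padded_labels = pad_to_longest([[[1, 1, 1]], [[0, 0, 0, 0]]], pad_value=0)
print(padded[0][0], padded_labels[0][0])
```

Padding labels with a value that is also a valid class (0 above) makes padded steps indistinguishable from real ones, so a dedicated sentinel may be preferable in practice.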

label: str, optional

Specify the name of the column with the label of the data

transforms: Union[List[Callable], Dict[str, List[Callable]]], optional

This could be:

  • None: No transforms will be applied.

  • List[Callable]: A list of transforms that will be applied to the data. The same transforms will be applied to all splits.

  • Dict[str, List[Callable]]: A dictionary with the split name as key and a list of transforms as value. The split name must be one of: “train”, “validation”, “test” or “predict”.

cast_to: str, optional

Cast the numpy data to the specified type

batch_size: int, optional

The size of the batch.

num_workers: int, optional

Number of workers to load data. If None, then use all cores.

__repr__()[source]
Return type:

str

__str__()[source]

Return a string representation of the datasets that are set up.

Returns:

A string representation of the datasets that are set up.

_get_loader(split_name, shuffle)[source]

Get a dataloader for the given split.

Parameters

split_name: str

The name of the split. This must be one of: “train”, “validation”, “test” or “predict”.

shuffle: bool

Shuffle the data or not.

Returns

DataLoader

A dataloader for the given split.

Parameters:
  • split_name (str)

  • shuffle (bool)

Return type:

torch.utils.data.DataLoader

_load_dataset(split_name)[source]

Create a SeriesFolderCSVDataset dataset with the given split.

Parameters

split_name: str

Name of the split (train, validation or test). This will be used to load the corresponding CSV file.

Returns

SeriesFolderCSVDataset

The dataset with the given split.

Parameters:

split_name (str)

Return type:

minerva.data.datasets.series_dataset.SeriesFolderCSVDataset

batch_size = 1
cast_to = 'float32'
data_path
datasets
features = ('accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z')
label = 'standard activity code'
num_workers
pad = False
predict_dataloader()[source]

An iterable or collection of iterables specifying prediction samples.

For more information about multiple dataloaders, see this section.

It’s recommended that all data downloads and preparation happen in prepare_data().

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Return:

A torch.utils.data.DataLoader or a sequence of them specifying prediction samples.

Return type:

torch.utils.data.DataLoader

setup(stage)[source]

Assign the datasets to the corresponding split. self.datasets will be a dictionary with the split name as key and the dataset as value.

Parameters

stage: str

The stage of the setup. This could be:

  • “fit”: Load the train and validation datasets

  • “test”: Load the test dataset

  • “predict”: Load the predict dataset

Raises

ValueError

If the stage is not one of: “fit”, “test” or “predict”

Parameters:

stage (str)

test_dataloader()[source]

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see this section.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note:

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

Return type:

torch.utils.data.DataLoader

train_dataloader()[source]

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Return type:

torch.utils.data.DataLoader

transforms
val_dataloader()[source]

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

  • fit()

  • validate()

  • prepare_data()

  • setup()

Note:

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note:

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

Return type:

torch.utils.data.DataLoader

Parameters:
  • data_path (minerva.utils.typing.PathLike)

  • features (List[str])

  • label (str)

  • pad (bool)

  • transforms (Optional[Union[List[Callable], Dict[str, List[Callable]]]])

  • cast_to (str)

  • batch_size (int)

  • num_workers (Optional[int])

minerva.data.data_modules.har.parse_num_workers(num_workers=None)[source]

Parse the num_workers parameter. If None, use all cores.

Parameters

num_workers: int

Number of workers to load data. If None, then use all cores

Returns

int

Number of workers to load data.

Parameters:

num_workers (Optional[int])

Return type:

int
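The documented behaviour (None means "use all cores") can be approximated in a few lines. This is a stand-in written for illustration, not the source of parse_num_workers: the name `resolve_num_workers` is hypothetical, and both the use of os.cpu_count() and the fallback to 1 are assumptions.

```python
import os


def resolve_num_workers(num_workers=None):
    """Hypothetical stand-in: None means 'use every core reported by the OS'."""
    if num_workers is None:
        # os.cpu_count() may return None when the count cannot be determined.
        return os.cpu_count() or 1
    return num_workers


print(resolve_num_workers())   # number of cores on this machine
print(resolve_num_workers(4))  # explicit values pass through unchanged
```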

minerva.data.data_modules.har.parse_transforms(transforms)[source]

Parse the transforms parameter to a dictionary with the split name as key and a list of transforms as value.

Parameters

transforms: Union[List[Callable], Dict[str, List[Callable]]]

This could be:

  • None: No transforms will be applied.

  • List[Callable]: A list of transforms that will be applied to the data. The same transforms will be applied to all splits.

  • Dict[str, List[Callable]]: A dictionary with the split name as key and a list of transforms as value. The split name must be one of: “train”, “validation”, “test” or “predict”.

Returns

Dict[str, List[Callable]]

A dictionary with the split name as key and a list of transforms as value.

Parameters:

transforms (Union[List[Callable], Dict[str, List[Callable]]])

Return type:

Dict[str, List[Callable]]
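The three accepted forms of the transforms argument can be normalized into a per-split dictionary roughly as follows. The sketch is not the body of parse_transforms: `normalize_transforms` is a hypothetical name, and representing "no transforms" as an empty list, filling missing splits with empty lists, and rejecting unknown split names are all assumptions.

```python
SPLITS = ("train", "validation", "test", "predict")


def normalize_transforms(transforms=None):
    """Hypothetical helper: return a dict with one list of transforms per split."""
    if transforms is None:
        # Assumption: "no transforms" is represented by an empty list per split.
        return {split: [] for split in SPLITS}
    if isinstance(transforms, dict):
        unknown = set(transforms) - set(SPLITS)
        if unknown:
            raise ValueError(f"Unknown split names: {sorted(unknown)}")
        # Assumption: splits missing from the dict get no transforms.
        return {split: list(transforms.get(split, [])) for split in SPLITS}
    # A plain list is shared by every split.
    return {split: list(transforms) for split in SPLITS}


def double(x):
    """Toy transform used only for this example."""
    return [2 * value for value in x]


shared = normalize_transforms([double])
train_only = normalize_transforms({"train": [double]})
print(sorted(shared), len(train_only["test"]))
```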