minerva.data.data_modules.har¶
Classes¶
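- MultiModalHARSeriesDataModule: A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.
- UserActivityFolderDataModule: A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.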
Functions¶
- parse_num_workers: Parse the num_workers parameter. If None, use all cores.
- parse_transforms: Parse the transforms parameter to a dictionary with the split name as key and a list of transforms as value.
Module Contents¶
- class minerva.data.data_modules.har.MultiModalHARSeriesDataModule(data_path, feature_prefixes=('accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z'), label='standard activity code', features_as_channels=True, transforms=None, cast_to='float32', batch_size=1, num_workers=None, data_percentage=1.0, use_train_as_validation=False, use_val_with_train=False, map_labels=None, drop_last=True, n_domains_per_sample=None, samples_per_class=None, seed=None, predict_split='test', shuffle_train=True)[source]¶
Bases: lightning.LightningDataModule
A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.
Example:
    import lightning as L
    import torch
    import torch.utils.data as data
    from lightning.pytorch.demos.boring_classes import RandomDataset

    class MyDataModule(L.LightningDataModule):
        def prepare_data(self):
            # download, IO, etc. Useful with shared filesystems
            # only called on 1 GPU/TPU in distributed
            ...

        def setup(self, stage):
            # make assignments here (val/train/test split)
            # called on every process in DDP
            dataset = RandomDataset(1, 100)
            self.train, self.val, self.test = data.random_split(
                dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
            )

        def train_dataloader(self):
            return data.DataLoader(self.train)

        def val_dataloader(self):
            return data.DataLoader(self.val)

        def test_dataloader(self):
            return data.DataLoader(self.test)

        def on_exception(self, exception):
            # clean up state after the trainer faced an exception
            ...

        def teardown(self, stage):
            # clean up state after the trainer stops, delete files...
            # called on every process in DDP
            ...
Define the dataloaders for train, validation and test splits for HAR datasets. This data module assumes that the data of each split is in a single CSV file with series of data. Each row is a single sample that can be composed of multiple modalities (series). Each column is a feature of some series, with the prefix indicating the series. The suffix may indicate the time step. For instance, if we have two series, accel-x and accel-y, the data will look something like:
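| accel-x-0 | accel-x-1 | accel-y-0 | accel-y-1 | class |
|-----------|-----------|-----------|-----------|-------|
| 0.502123  | 0.02123   | 0.502123  | 0.502123  | 0     |
| 0.6820123 | 0.02123   | 0.502123  | 0.502123  | 1     |
| 0.498217  | 0.00001   | 1.414141  | 3.141592  | 2     |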
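To make this layout concrete, the sketch below groups the columns of one row by prefix and returns them either as C lists of T time steps or as one flat vector of length T*C. It uses plain Python lists. The helper `row_to_sample`, its argument names, and the ordering of the flattened vector are assumptions made for illustration; this is not the library's implementation.

```python
def row_to_sample(row, feature_prefixes, features_as_channels=True):
    """Group one CSV row (a dict of column -> value) by column prefix.

    Hypothetical helper: assumes columns are named '<prefix>-<time step>'.
    """
    channels = []
    for prefix in feature_prefixes:
        # Collect the columns of this series, ordered by their time-step suffix.
        cols = [c for c in row if c.startswith(prefix + "-")]
        cols.sort(key=lambda c: int(c[len(prefix) + 1:]))
        channels.append([row[c] for c in cols])
    if features_as_channels:
        return channels  # C lists of T values, i.e. shape (C, T)
    # Assumption: the flat form concatenates the series one after another.
    return [value for channel in channels for value in channel]


# First row of the table above.
row = {
    "accel-x-0": 0.502123, "accel-x-1": 0.02123,
    "accel-y-0": 0.502123, "accel-y-1": 0.502123,
    "class": 0,
}
as_channels = row_to_sample(row, ["accel-x", "accel-y"])
flat = row_to_sample(row, ["accel-x", "accel-y"], features_as_channels=False)
```

Here `as_channels` holds two series of two time steps each, while `flat` holds the same four values in a single list.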
The feature_prefixes parameter is used to select the columns that will be used as features. For instance, if we want to use only the accel-x series, we can set feature_prefixes=["accel-x"]. If we want to use both accel-x and accel-y, we can set feature_prefixes=["accel-x", "accel-y"]. If None is passed, all columns will be used as features, except the label column. The label column is specified by the label parameter.
The dataset will return a 2-element tuple with the data and the label, if the label parameter is specified, otherwise return only the data.
If features_as_channels is True, the data will be returned as a vector of shape (C, T), where C is the number of channels (features) and T is the number of time steps. Else, the data will be returned as a vector of shape T*C (a single vector with all the features).
Parameters¶
- data_path: PathLike or List[PathLike]
The path to the folder with “train.csv”, “validation.csv” and “test.csv” files inside it. A list of such folders can be passed to load multiple domains (see n_domains_per_sample).
- feature_prefixes: Union[str, List[str]], optional
The prefix of the column names in the dataframe that will be used to become features. If None, all columns except the label will be used as features.
- label: str, optional
The name of the column that will be used as label.
- features_as_channels: bool, optional
If True, the data will be returned as a vector of shape (C, T), else the data will be returned as a vector of shape T*C.
- cast_to: str, optional
Cast the numpy data to the specified type.
- transforms: Union[List[Callable], Dict[str, List[Callable]]], optional
This could be:
  - None: No transforms will be applied.
  - List[Callable]: A list of transforms that will be applied to the data. The same transforms will be applied to all splits.
  - Dict[str, List[Callable]]: A dictionary with the split name as key and a list of transforms as value. The split name must be one of: “train”, “validation”, “test” or “predict”.
- batch_size: int, optional
The size of the batch.
- num_workers: int, optional
Number of workers to load data. If None, then use all cores.
- data_percentage: float, optional
The fraction of the data that will be used (the default, 1.0, uses all of it). This is useful to create a small dataset.
- use_train_as_validation: bool, optional
If True, the train dataset will be used as validation dataset.
- use_val_with_train: bool, optional
If True, the validation and train sets will be concatenated in order to create a large train set. By default, this is False.
- map_labels: Dict[int, int], optional
A dictionary to map the labels to a new label. The key is the original label and the value is the new label.
- drop_last: bool, optional
Drop the last batch if it is not complete.
- n_domains_per_sample: int, optional
This is only useful when using multiple domains (that is, when data_path is a list). It allows creating batches with the same number of samples from each domain. If None, all datasets are simply concatenated and sampled in a non-stratified way. By default, None.
- samples_per_class: int, optional
If not None, use this number of samples per class for the train split. This will override the data_percentage parameter.
- seed: Optional[int] = None
Seed for sampling the dataset. If None, no seed is set.
- predict_split: str
The name of the split to use for prediction. This will be used to load the dataset for prediction. By default, this is “test”.
- shuffle_train: bool
If True, the train dataset will be shuffled.
Notes¶
- If data_percentage is set to a value less than 1.0, a random subset of the dataset will be used, containing approximately the specified percentage of the total data. This sampling is not stratified.
- If samples_per_class is specified, the train split will contain an equal number of samples for each class, as defined by this parameter. This option is mutually exclusive with data_percentage; both cannot be used at the same time.
- The seed parameter controls the randomness of sampling. If seed is set (i.e., an integer), sampling becomes deterministic, ensuring the same subset is selected on each run. This improves reproducibility and supports cumulative sampling: for example, progressively increasing samples_per_class will retain consistency across runs by sampling the same initial elements. If seed is None, sampling is non-deterministic, and different subsets may be chosen each time.
Raises¶
- ValueError
If samples_per_class and data_percentage are both set.
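As a usage illustration, the sketch below wraps the constructor documented above in a small factory function. The argument values (two accelerometer series, a batch size of 32, half of the data) are arbitrary choices for the example, and the function name `build_har_datamodule` is hypothetical. The import is placed inside the function only so that the snippet can be loaded without Minerva installed.

```python
def build_har_datamodule(data_path):
    """Build a MultiModalHARSeriesDataModule with example settings."""
    # Imported lazily so that this sketch loads even without Minerva installed.
    from minerva.data.data_modules.har import MultiModalHARSeriesDataModule

    return MultiModalHARSeriesDataModule(
        data_path=data_path,  # folder with train.csv, validation.csv, test.csv
        feature_prefixes=["accel-x", "accel-y"],
        label="standard activity code",
        features_as_channels=True,  # samples shaped (C, T)
        batch_size=32,
        num_workers=0,
        data_percentage=0.5,  # use roughly half of the data
        seed=42,  # make the subset selection deterministic
    )
```

A typical call sequence would then be `dm = build_har_datamodule(some_folder)`, followed by `dm.setup("fit")` and `dm.train_dataloader()`, where `some_folder` stands for a dataset directory of your own.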
- __str__()[source]¶
Return a string representation of the datasets that are set up.
- Returns:
A string representation of the datasets that are set up.
- _get_loader(split_name, shuffle)[source]¶
Get a dataloader for the given split.
Parameters¶
- split_name: str
The name of the split. This must be one of: “train”, “validation”, “test” or “predict”.
- shuffle: bool
Shuffle the data or not.
Returns¶
- DataLoader
A dataloader for the given split.
- Parameters:
split_name (str)
shuffle (bool)
- Return type:
torch.utils.data.DataLoader
- _load_dataset(split_name)[source]¶
Create a MultiModalSeriesCSVDataset dataset with the given split.
Parameters¶
- split_name: str
The name of the split. This must be one of: “train”, “validation”, “test” or “predict”.
Returns¶
- MultiModalSeriesCSVDataset
A MultiModalSeriesCSVDataset dataset with the given split.
- Parameters:
split_name (str)
- Return type:
Tuple[Union[minerva.data.datasets.series_dataset.MultiModalSeriesCSVDataset, torch.utils.data.ConcatDataset], List[int]]
- _sample_dataset(dataset)[source]¶
Sample the dataset based on the specified parameters.
If samples_per_class is specified, a subset will be created containing the specified number of samples for each class. If data_percentage is specified, a random subset of the dataset will be created containing approximately the specified percentage of the total data. If neither is specified, the entire dataset will be returned.
Note¶
The seed parameter controls the randomness of sampling: If seed is set (i.e., an integer), sampling becomes deterministic, ensuring the same subset is selected on each run and allowing for cumulative sampling (e.g., progressively increasing samples_per_class will retain consistency across runs by sampling the same initial elements). If seed is None, sampling is non-deterministic, and different subsets may be chosen each time.
Parameters¶
- dataset: Dataset
A map-like dataset to sample from. This should be a-
Returns¶
- Dataset
A sampled dataset.
Raises¶
- ValueError
If samples_per_class is specified and a class has fewer samples than the specified number.
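The sampling rules described above can be sketched with the standard library alone. The function below works on a list of labels and returns the selected indices. Its name, the use of `random.Random`, and the choice to treat `data_percentage < 1.0` as "set" are assumptions for illustration, not the library's actual code.

```python
import random
from collections import defaultdict


def sample_indices(labels, samples_per_class=None, data_percentage=1.0, seed=None):
    """Return the sorted indices of a sampled subset (illustrative only)."""
    if samples_per_class is not None and data_percentage < 1.0:
        raise ValueError("samples_per_class and data_percentage are mutually exclusive")
    rng = random.Random(seed)  # seed=None gives non-deterministic sampling
    indices = list(range(len(labels)))
    if samples_per_class is not None:
        by_class = defaultdict(list)
        for index, label in enumerate(labels):
            by_class[label].append(index)
        chosen = []
        for label in sorted(by_class):
            candidates = by_class[label]
            if len(candidates) < samples_per_class:
                raise ValueError(f"class {label!r} has only {len(candidates)} samples")
            # Shuffle once, then take a prefix: a larger request with the
            # same seed extends the smaller selection instead of replacing it.
            rng.shuffle(candidates)
            chosen.extend(candidates[:samples_per_class])
        return sorted(chosen)
    if data_percentage < 1.0:
        # Non-stratified: classes are ignored when picking the subset.
        rng.shuffle(indices)
        return sorted(indices[: int(len(indices) * data_percentage)])
    return indices


labels = [0, 0, 0, 0, 1, 1, 1, 1]
small = sample_indices(labels, samples_per_class=2, seed=7)
large = sample_indices(labels, samples_per_class=3, seed=7)
```

Because each class is shuffled with the same seeded generator before a prefix is taken, every index in `small` also appears in `large`, which is the cumulative behaviour the note describes.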
- batch_size = 1¶
- cast_to = 'float32'¶
- data_path¶
- data_percentage = 1.0¶
- datasets¶
- drop_last = True¶
- feature_prefixes = ('accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z')¶
- features_as_channels = True¶
- label = 'standard activity code'¶
- map_labels = None¶
- n_domains_per_sample = None¶
- num_workers¶
- predict_dataloader()[source]¶
An iterable or collection of iterables specifying prediction samples.
For more information about multiple dataloaders, see this section.
It’s recommended that all data downloads and preparation happen in prepare_data().
- predict()
- prepare_data()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Return:
A torch.utils.data.DataLoader or a sequence of them specifying prediction samples.
- Return type:
torch.utils.data.DataLoader
- predict_split = 'test'¶
- rng¶
- samples_per_class = None¶
- seed = None¶
- setup(stage)[source]¶
Assign the datasets to the corresponding split.
self.datasets will be a dictionary with the split name as key and the dataset as value.
Parameters¶
- stage: str
The stage of the setup. This could be:
  - “fit”: Load the train and validation datasets
  - “test”: Load the test dataset
  - “predict”: Load the predict dataset
Raises¶
- ValueError
If the stage is not one of: “fit”, “test” or “predict”
- Parameters:
stage (str)
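The stage handling described for setup can be pictured as a small lookup from stage name to the splits that get loaded. The following is a minimal sketch of that mapping; the function name `splits_for_stage` is hypothetical and the real method also builds the datasets themselves.

```python
def splits_for_stage(stage):
    """Return the split names that a given setup stage would load (sketch)."""
    stage_to_splits = {
        "fit": ["train", "validation"],
        "test": ["test"],
        "predict": ["predict"],
    }
    if stage not in stage_to_splits:
        raise ValueError(
            f"Invalid stage {stage!r}; expected 'fit', 'test' or 'predict'"
        )
    return stage_to_splits[stage]
```

Note that, per the predict_split parameter documented above, the data behind the “predict” split comes from the split named by predict_split (“test” by default).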
- shuffle_train = True¶
- test_dataloader()[source]¶
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see this section.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
- test()
- prepare_data()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Note:
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- Return type:
torch.utils.data.DataLoader
- train_dataloader()[source]¶
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs (lightning.pytorch.trainer.trainer.Trainer) to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
- fit()
- prepare_data()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Return type:
torch.utils.data.DataLoader
- transforms¶
- use_train_as_validation = False¶
- use_val_with_train = False¶
- val_dataloader()[source]¶
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs (lightning.pytorch.trainer.trainer.Trainer) to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
- fit()
- validate()
- prepare_data()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Note:
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- Return type:
torch.utils.data.DataLoader
- Parameters:
data_path (minerva.utils.typing.PathLike | List[minerva.utils.typing.PathLike])
feature_prefixes (List[str])
label (str)
features_as_channels (bool)
transforms (Optional[Union[List[Callable], Dict[str, List[Callable]]]])
cast_to (str)
batch_size (int)
num_workers (Optional[int])
data_percentage (float)
use_train_as_validation (bool)
use_val_with_train (bool)
map_labels (Optional[Dict[int, int]])
drop_last (bool)
n_domains_per_sample (Optional[int])
samples_per_class (Optional[int])
seed (Optional[int])
predict_split (str)
shuffle_train (bool)
- class minerva.data.data_modules.har.UserActivityFolderDataModule(data_path, features=('accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z'), label='standard activity code', pad=False, transforms=None, cast_to='float32', batch_size=1, num_workers=None)[source]¶
Bases: lightning.LightningDataModule
A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.
Define the dataloaders for train, validation and test splits for HAR datasets. The data must follow the folder structure described below. This class is a wrapper around the SeriesFolderCSVDataset dataset class. The SeriesFolderCSVDataset class assumes that the data is in a folder with multiple CSV files. Each CSV file is a single sample that can be composed of multiple time steps (rows). Each column is a feature of the sample.
For instance, if we have two samples, user-1.csv and user-2.csv, the directory structure will look something like:
    data_path
    ├── user-1.csv
    └── user-2.csv
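And the data will look something like:
- user-1.csv:

| accel-x  | accel-y | class |
|----------|---------|-------|
| 0.502123 | 0.02123 | 1     |
| 0.682012 | 0.02123 | 1     |
| 0.498217 | 0.00001 | 1     |

- user-2.csv:

| accel-x  | accel-y  | class |
|----------|----------|-------|
| 0.502123 | 0.02123  | 0     |
| 0.682012 | 0.02123  | 0     |
| 0.498217 | 0.00001  | 0     |
| 3.141592 | 1.414141 | 0     |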
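To illustrate the one-file-per-sample layout, the sketch below writes the two example files to a temporary folder and reads them back with the standard csv module. The helper `load_folder` is hypothetical and the (features, time steps) arrangement of the returned lists is an assumption; it only mirrors the description here and is not the SeriesFolderCSVDataset implementation.

```python
import csv
import tempfile
from pathlib import Path


def load_folder(data_path, features, label=None):
    """Read every CSV in data_path as one sample (illustrative only)."""
    samples = []
    for csv_file in sorted(Path(data_path).glob("*.csv")):
        with open(csv_file, newline="") as handle:
            rows = list(csv.DictReader(handle))
        # One list per feature column, each with one value per time step.
        data = [[float(r[name]) for r in rows] for name in features]
        if label is None:
            samples.append(data)
        else:
            # One label per time step, not one label per file.
            samples.append((data, [int(r[label]) for r in rows]))
    return samples


with tempfile.TemporaryDirectory() as folder:
    Path(folder, "user-1.csv").write_text(
        "accel-x,accel-y,class\n"
        "0.502123,0.02123,1\n0.682012,0.02123,1\n0.498217,0.00001,1\n"
    )
    Path(folder, "user-2.csv").write_text(
        "accel-x,accel-y,class\n"
        "0.502123,0.02123,0\n0.682012,0.02123,0\n"
        "0.498217,0.00001,0\n3.141592,1.414141,0\n"
    )
    samples = load_folder(folder, ["accel-x", "accel-y"], label="class")
```

The two samples have different lengths (three and four time steps), which is the situation the pad option is meant to handle.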
The features parameter is used to select the columns that will be used as features. For instance, if we want to use only the accel-x column, we can set features=["accel-x"]. If we want to use both accel-x and accel-y, we can set features=["accel-x", "accel-y"].
The label column is specified by the label parameter. Note that we have one label per time-step and not a single label per sample.
The dataset will return a 2-element tuple with the data and the label, if the label parameter is specified, otherwise return only the data.
Parameters¶
- data_path: PathLike
The location of the directory with CSV files.
- features: List[str]
A list with column names that will be used as features. If None, all columns except the label will be used as features.
- pad: bool, optional
If True, the data will be padded to the length of the longest sample. Note that padding will be applied after the transforms, and also to the labels if specified.
- label: str, optional
Specify the name of the column with the label of the data.
- transforms: Union[List[Callable], Dict[str, List[Callable]]], optional
This could be:
  - None: No transforms will be applied.
  - List[Callable]: A list of transforms that will be applied to the data. The same transforms will be applied to all splits.
  - Dict[str, List[Callable]]: A dictionary with the split name as key and a list of transforms as value. The split name must be one of: “train”, “validation”, “test” or “predict”.
- cast_to: str, optional
Cast the numpy data to the specified type.
- batch_size: int, optional
The size of the batch.
- num_workers: int, optional
Number of workers to load data. If None, then use all cores.
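The effect of the pad option can be pictured with a short sketch. It pads each sequence on the right up to the length of the longest one. The pad value of 0 and the right-side padding are assumptions made for this example; the library may pad differently.

```python
def pad_to_longest(sequences, pad_value=0):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (longest - len(seq)) for seq in sequences]


# Per-time-step labels of two samples with 3 and 4 time steps.
labels = [[1, 1, 1], [0, 0, 0, 0]]
padded_labels = pad_to_longest(labels)
```

After padding, both label sequences have four entries, so the samples can be stacked into one batch.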
- __str__()[source]¶
Return a string representation of the datasets that are set up.
- Returns:
A string representation of the datasets that are set up.
- _get_loader(split_name, shuffle)[source]¶
Get a dataloader for the given split.
Parameters¶
- split_name: str
The name of the split. This must be one of: “train”, “validation”, “test” or “predict”.
- shuffle: bool
Shuffle the data or not.
Returns¶
- DataLoader
A dataloader for the given split.
- Parameters:
split_name (str)
shuffle (bool)
- Return type:
torch.utils.data.DataLoader
- _load_dataset(split_name)[source]¶
Create a SeriesFolderCSVDataset dataset with the given split.
Parameters¶
- split_name: str
Name of the split (train, validation or test). This will be used to load the corresponding CSV file.
Returns¶
- SeriesFolderCSVDataset
The dataset with the given split.
- Parameters:
split_name (str)
- Return type:
- batch_size = 1¶
- cast_to = 'float32'¶
- data_path¶
- datasets¶
- features = ('accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z')¶
- label = 'standard activity code'¶
- num_workers¶
- pad = False¶
- predict_dataloader()[source]¶
An iterable or collection of iterables specifying prediction samples.
For more information about multiple dataloaders, see this section.
It’s recommended that all data downloads and preparation happen in prepare_data().
- predict()
- prepare_data()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Return:
A torch.utils.data.DataLoader or a sequence of them specifying prediction samples.
- Return type:
torch.utils.data.DataLoader
- setup(stage)[source]¶
Assign the datasets to the corresponding split.
self.datasets will be a dictionary with the split name as key and the dataset as value.
Parameters¶
- stage: str
The stage of the setup. This could be:
  - “fit”: Load the train and validation datasets
  - “test”: Load the test dataset
  - “predict”: Load the predict dataset
Raises¶
- ValueError
If the stage is not one of: “fit”, “test” or “predict”
- Parameters:
stage (str)
- test_dataloader()[source]¶
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see this section.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
- test()
- prepare_data()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Note:
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- Return type:
torch.utils.data.DataLoader
- train_dataloader()[source]¶
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs (lightning.pytorch.trainer.trainer.Trainer) to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
- fit()
- prepare_data()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Return type:
torch.utils.data.DataLoader
- transforms¶
- val_dataloader()[source]¶
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see this section.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs (lightning.pytorch.trainer.trainer.Trainer) to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
- fit()
- validate()
- prepare_data()
- Note:
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- Note:
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- Return type:
torch.utils.data.DataLoader
- Parameters:
data_path (minerva.utils.typing.PathLike)
features (List[str])
label (str)
pad (bool)
transforms (Optional[Union[List[Callable], Dict[str, List[Callable]]]])
cast_to (str)
batch_size (int)
num_workers (Optional[int])
- minerva.data.data_modules.har.parse_num_workers(num_workers=None)[source]¶
Parse the num_workers parameter. If None, use all cores.
Parameters¶
- num_workers: int, optional
Number of workers to load data. If None, then use all cores
Returns¶
- int
Number of workers to load data.
- Parameters:
num_workers (Optional[int])
- Return type:
int
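A minimal sketch of this behaviour is shown below. It assumes that "all cores" means the value reported by `os.cpu_count()`; the function name carries a `_sketch` suffix to make clear that it is an illustration and not the library's code.

```python
import os


def parse_num_workers_sketch(num_workers=None):
    """Return num_workers, falling back to the CPU count when it is None."""
    if num_workers is None:
        # os.cpu_count() can return None on some platforms; fall back to 1.
        return os.cpu_count() or 1
    return num_workers
```

An explicit value, including 0 (load data in the main process), is returned unchanged.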
- minerva.data.data_modules.har.parse_transforms(transforms)[source]¶
Parse the transforms parameter to a dictionary with the split name as key and a list of transforms as value.
Parameters¶
- transforms: Union[List[Callable], Dict[str, List[Callable]]]
This could be:
  - None: No transforms will be applied.
  - List[Callable]: A list of transforms that will be applied to the data. The same transforms will be applied to all splits.
  - Dict[str, List[Callable]]: A dictionary with the split name as key and a list of transforms as value. The split name must be one of: “train”, “validation”, “test” or “predict”.
Returns¶
- Dict[str, List[Callable]]
A dictionary with the split name as key and a list of transforms as value.
- Parameters:
transforms (Union[List[Callable], Dict[str, List[Callable]]])
- Return type:
Dict[str, List[Callable]]
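The three accepted input forms can be normalised as in the sketch below. It is an illustration under stated assumptions: `None` and missing splits are represented by empty lists, and unknown split names raise a ValueError. The real function may represent these cases differently.

```python
SPLITS = ("train", "validation", "test", "predict")


def parse_transforms_sketch(transforms=None):
    """Normalise transforms to a dict keyed by split name (illustrative)."""
    if transforms is None:
        # Assumption: "no transforms" is represented by an empty list per split.
        return {split: [] for split in SPLITS}
    if isinstance(transforms, dict):
        unknown = set(transforms) - set(SPLITS)
        if unknown:
            raise ValueError(f"Unknown split names: {sorted(unknown)}")
        # Assumption: splits missing from the dict get no transforms.
        return {split: list(transforms.get(split, [])) for split in SPLITS}
    # A plain list is shared by every split.
    return {split: list(transforms) for split in SPLITS}


def double(x):
    return x * 2


shared = parse_transforms_sketch([double])
train_only = parse_transforms_sketch({"train": [double]})
```

With a list, every split receives the same transforms; with a dictionary, only the named splits do.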