minerva.data.datasets.series_dataset¶
Classes¶
MultiModalSeriesCSVDataset | An abstract class representing a Dataset.
SeriesFolderCSVDataset | An abstract class representing a Dataset.
Module Contents¶
- class minerva.data.datasets.series_dataset.MultiModalSeriesCSVDataset(data_path, feature_prefixes=None, label=None, features_as_channels=True, cast_to='float32', transforms=None, map_labels=None)[source]¶
Bases: torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up the loading of batched samples. This method accepts a list of indices of the samples in a batch and returns a list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
This dataset assumes that the data is in a single CSV file with series of data. Each row is a single sample that can be composed of multiple modalities (series). Each column is a feature of some series, with the prefix indicating the series. The suffix may indicate the time step. For instance, if we have two series, accel-x and accel-y, the data will look something like:
accel-x-0  accel-x-1  accel-y-0  accel-y-1  class
0.502123   0.02123    0.502123   0.502123   0
0.6820123  0.02123    0.502123   0.502123   1
0.498217   0.00001    1.414141   3.141592   2
The feature_prefixes parameter is used to select the columns that will be used as features. For instance, if we want to use only the accel-x series, we can set feature_prefixes=["accel-x"]. If we want to use both accel-x and accel-y, we can set feature_prefixes=["accel-x", "accel-y"]. If None is passed, all columns will be used as features, except the label column. The label column is specified by the label parameter.
The dataset will return a 2-element tuple with the data and the label if the label parameter is specified; otherwise it returns only the data.
If features_as_channels is True, the data will be returned as an array of shape (C, T), where C is the number of channels (features) and T is the number of time steps. Otherwise, the data will be returned as a vector of shape T*C (a single vector with all the features).
Parameters¶
- data_path: Union[Path, str]
The location of the CSV file
- feature_prefixes: Union[str, List[str]], optional
The prefix of the column names in the dataframe that will be used as features. If None, all columns except the label will be used as features.
- label: str, optional
The name of the column that will be used as label
- features_as_channels: bool, optional
If True, the data will be returned as an array of shape (C, T), else the data will be returned as a vector of shape T*C.
- cast_to: str, optional
Cast the numpy data to the specified type
- transforms: Optional[List[Callable]], optional
A list of transforms that will be applied to each sample individually. Each transform must be a callable that receives a numpy array and returns a numpy array. The transforms will be applied in the order they are specified.
- map_labels: Optional[Dict[int, int]], optional
A dictionary to map the labels to a different set of labels. The keys are the original labels and the values are the new labels.
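The `transforms` and `map_labels` descriptions above can be pictured with plain NumPy. This is a minimal sketch and not the library's code: `apply_transforms`, `remap_labels`, `add_one` and `square` are hypothetical names introduced only for illustration.

```python
import numpy as np


def apply_transforms(sample, transforms):
    """Apply each callable to the sample, in list order (hypothetical helper)."""
    for transform in transforms:
        sample = transform(sample)
    return sample


def remap_labels(labels, mapping):
    """Replace each original label with mapping[label] (hypothetical helper)."""
    return np.array([mapping[int(label)] for label in labels])


# Two illustrative transforms: each receives and returns a numpy array.
def add_one(x):
    return x + 1.0


def square(x):
    return x * x


sample = np.array([1.0, 2.0, 3.0], dtype="float32")

# The order of the list matters: (x + 1)^2 differs from x^2 + 1.
forward = apply_transforms(sample, [add_one, square])
backward = apply_transforms(sample, [square, add_one])

# Merge original classes 1 and 2 into a single class 1.
labels = np.array([0, 1, 2])
remapped = remap_labels(labels, {0: 0, 1: 1, 2: 1})
```

Here the keys of the dictionary are the labels found in the file and the values are the labels the model should see, matching the `map_labels` description.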
Examples¶
# Using the data from the example above, and features_as_channels=False
>>> data_path = "data.csv"
>>> dataset = MultiModalSeriesCSVDataset(
...     data_path, feature_prefixes=["accel-x", "accel-y"], label="class",
...     features_as_channels=False,
... )
>>> data, label = dataset[0]
>>> data.shape
(4,)
# Using the data from the example above, and features_as_channels=True
>>> dataset = MultiModalSeriesCSVDataset(
...     data_path, feature_prefixes=["accel-x", "accel-y"], label="class",
...     features_as_channels=True,
... )
>>> data, label = dataset[0]
>>> data.shape
(2, 2)
# And the dataset length
>>> len(dataset)
3
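To make the two output layouts concrete without depending on the library, the sketch below selects columns by prefix from the example table and builds both shapes by hand. It is an illustration under assumptions, not the class's implementation: in particular, the ordering of the flattened vector (series by series) is assumed here and is not stated in the documentation above.

```python
import csv
import io

import numpy as np

# The example table from above, written inline so the sketch is self-contained.
CSV_TEXT = """accel-x-0,accel-x-1,accel-y-0,accel-y-1,class
0.502123,0.02123,0.502123,0.502123,0
0.6820123,0.02123,0.502123,0.502123,1
0.498217,0.00001,1.414141,3.141592,2
"""

rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
prefixes = ["accel-x", "accel-y"]

# One group of column names per prefix, kept in header order.
columns_per_prefix = [
    [name for name in rows[0] if name.startswith(prefix)] for prefix in prefixes
]

# First row as (C, T): one row per series, one column per time step.
first = rows[0]
as_channels = np.array(
    [[float(first[name]) for name in names] for names in columns_per_prefix],
    dtype="float32",
)

# First row as a single flat vector of length T*C (ordering is an assumption).
flat = as_channels.reshape(-1)

label = int(first["class"])
```

With two prefixes and two time steps each, `as_channels` has C=2 and T=2, and `flat` holds the same four values in one dimension.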
- __getitem__(index)[source]¶
- Parameters:
index (int)
- Return type:
Union[Tuple[numpy.ndarray, numpy.ndarray], numpy.ndarray]
- _load_data()[source]¶
Load data from the CSV file
Returns¶
- Tuple[np.ndarray, Optional[np.ndarray]]
A 2-element tuple with the data and the labels. The second element is None if the label is not specified.
- Return type:
Tuple[numpy.ndarray, Optional[numpy.ndarray]]
- cast_to = 'float32'¶
- data_path¶
- feature_prefixes = None¶
- features_as_channels = True¶
- label = None¶
- map_labels = None¶
- transforms = None¶
- Parameters:
data_path (Union[pathlib.Path, str])
feature_prefixes (Optional[Union[str, List[str]]])
label (Optional[str])
features_as_channels (bool)
cast_to (str)
transforms (Optional[Union[minerva.transforms.transform._Transform, List[minerva.transforms.transform._Transform]]])
map_labels (Optional[Dict[int, int]])
- class minerva.data.datasets.series_dataset.SeriesFolderCSVDataset(data_path, features=None, label=None, pad=False, cast_to='float32', transforms=None, lazy=False)[source]¶
Bases: torch.utils.data.Dataset
An abstract class representing a Dataset.
All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__() to speed up the loading of batched samples. This method accepts a list of indices of the samples in a batch and returns a list of samples.
Note
DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
This dataset assumes that the data is in a folder with multiple CSV files. Each CSV file is a single sample that can be composed of multiple time steps (rows). Each column is a feature of the sample.
For instance, if we have two samples, sample-1.csv and sample-2.csv, the directory structure will look something like:
data_path
├── sample-1.csv
└── sample-2.csv
And the data will look something like:
- sample-1.csv:
accel-x   accel-y  class
0.502123  0.02123  1
0.682012  0.02123  1
0.498217  0.00001  1
- sample-2.csv:
accel-x   accel-y   class
0.502123  0.02123   0
0.682012  0.02123   0
0.498217  0.00001   0
3.141592  1.414141  0
The features parameter is used to select the columns that will be used as features. For instance, if we want to use only the accel-x column, we can set features=["accel-x"]. If we want to use both accel-x and accel-y, we can set features=["accel-x", "accel-y"].
The label column is specified by the label parameter. Note that we have one label per time step and not a single label per sample.
The dataset will return a 2-element tuple with the data and the label if the label parameter is specified; otherwise it returns only the data.
Notes¶
- Samples may have a different number of time steps. Use pad to pad the data to the length of the longest sample.
Examples¶
# Using the data from the example above
>>> data_dir = "train_folder"
>>> dataset = SeriesFolderCSVDataset(
...     data_dir, features=["accel-x", "accel-y"], label="class"
... )
>>> data, label = dataset[0]
>>> data.shape
(2, 3)
>>> label.shape
(3,)
>>> data, label = dataset[1]
>>> data.shape
(2, 4)
>>> label.shape
(4,)
Parameters¶
- data_path: str
The location of the directory with CSV files
- features: List[str], optional
A list with column names that will be used as features. If None, all columns except the label will be used as features.
- pad: bool, optional
If True, the data will be padded to the length of the longest sample. Note that padding is applied after the transforms, and also to the labels if a label is specified.
- label: str, optional
Specify the name of the column with the label of the data
- cast_to: str, optional
Cast the numpy data to the specified type
- transforms: Optional[List[Callable]], optional
A list of transforms that will be applied to each sample individually. Each transform must be a callable that receives a numpy array and returns a numpy array. The transforms will be applied in the order they are specified.
- lazy: bool, optional
If True, the data will be loaded lazily (i.e. the CSV files will be read only when needed)
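One common way to get the behavior described for `lazy` is to load a sample on first access and keep it in a cache. The sketch below shows that general technique only; it is not taken from the library, and `LazySamples`, `loader` and `fake_loader` are hypothetical names.

```python
class LazySamples:
    """Load each sample on first access and keep it for later (illustrative)."""

    def __init__(self, keys, loader):
        self._keys = list(keys)
        self._loader = loader  # callable that turns a key into a sample
        self._cache = {}  # index -> already loaded sample

    def __len__(self):
        return len(self._keys)

    def __getitem__(self, index):
        # Only call the loader the first time an index is requested.
        if index not in self._cache:
            self._cache[index] = self._loader(self._keys[index])
        return self._cache[index]


# Record loader calls to show that each sample is read only once.
calls = []


def fake_loader(key):
    calls.append(key)
    return key.upper()


lazy_samples = LazySamples(["a.csv", "b.csv"], fake_loader)
first = lazy_samples[1]
again = lazy_samples[1]
```

The trade-off is the usual one: eager loading pays the full reading cost up front, while lazy loading spreads it over the first pass through the data.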
- __getitem__(idx)[source]¶
Get a single sample from the dataset
Parameters¶
- idx: int
The index of the sample
Returns¶
- Union[Tuple[np.ndarray, np.ndarray], np.ndarray]
A 2-element tuple with the data and the label if the label is specified, otherwise only the data.
- Parameters:
idx (int)
- Return type:
Union[Tuple[numpy.ndarray, numpy.ndarray], numpy.ndarray]
- _cache¶
- _files¶
- _get_longest_sample_size()[source]¶
Return the size of the longest sample in the dataset
Returns¶
- int
The size of the longest sample in the dataset
- Return type:
int
- _longest_sample_size = 0¶
- _pad_data(data)[source]¶
Pad the data to the length of the longest sample. In summary, this function makes the data cyclic.
Parameters¶
- data: np.ndarray
The data to pad
Returns¶
- np.ndarray
The padded data
- Parameters:
data (numpy.ndarray)
- Return type:
numpy.ndarray
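The description of `_pad_data` says the padding makes the data cyclic. One reading of that, assumed here and not confirmed by the text above, is that a short sample is repeated from its start along the time axis until it reaches the target length. `pad_cyclic` below is a hypothetical stand-alone function that sketches this reading.

```python
import numpy as np


def pad_cyclic(data, target_length):
    """Repeat data along its last axis until it has target_length steps."""
    length = data.shape[-1]
    if length >= target_length:
        return data
    repeats = -(-target_length // length)  # ceiling division
    tiled = np.concatenate([data] * repeats, axis=-1)
    # Cut the repeated data back to exactly target_length time steps.
    return tiled[..., :target_length]


sample = np.array([[1, 2, 3], [4, 5, 6]])  # shape (C=2, T=3)
padded = pad_cyclic(sample, 5)

# The same function works on a 1-D array such as per-time-step labels.
padded_labels = pad_cyclic(np.array([7, 8]), 5)
```

Padding along the last axis keeps the channel dimension untouched, so a (C, T) sample becomes (C, target_length).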
- _read_all_csv()[source]¶
Read all the CSV files in the data directory
Returns¶
- List[Tuple[np.ndarray, Optional[np.ndarray]]]
A list of 2-element tuples with the data and the label. If the label is not specified, the second element of each tuple is None.
- Return type:
List[Tuple[numpy.ndarray, Optional[numpy.ndarray]]]
- _read_csv(path)[source]¶
Read a single CSV file (a single sample)
Parameters¶
- path: Path
The path to the CSV file
Returns¶
- Tuple[np.ndarray, Optional[np.ndarray]]
A 2-element tuple with the data and the label. If the label is not specified, the second element is None.
- Parameters:
path (pathlib.Path)
- Return type:
Tuple[numpy.ndarray, Optional[numpy.ndarray]]
- _scan_data()[source]¶
List the CSV files in the data directory
Returns¶
- List[Path]
List of CSV files
- Return type:
List[pathlib.Path]
- cast_to = 'float32'¶
- data_path¶
- features = None¶
- label = None¶
- pad = False¶
- transforms = None¶
- Parameters:
data_path (Union[pathlib.Path, str])
features (Optional[Union[str, List[str]]])
label (Optional[str])
pad (bool)
cast_to (str)
transforms (Optional[Union[minerva.transforms.transform._Transform, List[minerva.transforms.transform._Transform]]])
lazy (bool)