minerva.data.datasets.har_xu_23

Classes

HarDataset

Dataset class for human activity recognition (HAR) data.

TNCDataset

Dataset for the TNC (Temporal Neighborhood Coding) task on time series data.

Module Contents

class minerva.data.datasets.har_xu_23.HarDataset(data_path, annotate, feature_column_prefixes=['accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z'], target_column='standard activity code', flatten=False)[source]

Bases: torch.utils.data.Dataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader. Subclasses could also optionally implement __getitems__(), for speedup batched samples loading. This method accepts list of indices of samples of batch and returns list of samples.

Note

DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

Dataset class for human activity recognition (HAR) data.

Loads and prepares data from .npy files and returns features and labels.

Parameters

data_pathPathLike

Path to the directory containing dataset files. The directory should contain the following files:

  • train_data_subseq.npy

  • train_labels_subseq.npy

  • val_data.npy

  • val_labels_subseq.npy

  • test_data.npy

  • test_labels_subseq.npy

These files should correspond to data segmented into subsequences and their labels.
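To make the expected layout concrete, the following sketch writes placeholder arrays under the documented file names and loads one split back with NumPy. The `SPLIT_FILES` mapping and the `load_split` helper are hypothetical illustrations, not part of minerva, and the array shapes are placeholders.

```python
import tempfile
from pathlib import Path

import numpy as np

# File names exactly as listed above; the mapping itself is an assumption
# about how `annotate` could select a split, not minerva's actual code.
SPLIT_FILES = {
    "train": ("train_data_subseq.npy", "train_labels_subseq.npy"),
    "val": ("val_data.npy", "val_labels_subseq.npy"),
    "test": ("test_data.npy", "test_labels_subseq.npy"),
}


def load_split(data_path, annotate):
    """Hypothetical helper: load the (data, labels) arrays for one split."""
    data_file, labels_file = SPLIT_FILES[annotate]
    root = Path(data_path)
    return np.load(root / data_file), np.load(root / labels_file)


# Write placeholder arrays so the sketch runs end to end.
tmp_dir = tempfile.mkdtemp()
for data_file, labels_file in SPLIT_FILES.values():
    np.save(Path(tmp_dir) / data_file, np.zeros((10, 128, 6), dtype=np.float32))
    np.save(Path(tmp_dir) / labels_file, np.zeros(10, dtype=np.int64))

data, labels = load_split(tmp_dir, "train")
print(data.shape, labels.shape)  # (10, 128, 6) (10,)
```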

annotatestr

Annotation type, indicating which subset of the data to load ('train', 'val', or 'test').

feature_column_prefixesList[str], optional

List of prefixes for feature columns. Defaults to: ["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"].

target_columnstr, optional

Name of the column for the target variable. Defaults to 'standard activity code'.

flattenbool, optional

If True, flattens the input data. Defaults to False.

Attributes

datanumpy.ndarray

Array of features with shape (num_samples, num_timesteps, num_features).

  • num_samples: Total number of samples in the dataset.

  • num_timesteps: Length of each subsequence (e.g., 128).

  • num_features: Number of features per timestep (e.g., 6 for accelerometer and gyroscope data).

labelsnumpy.ndarray

Array of labels with shape (num_samples,).

  • num_samples: Total number of samples in the dataset.

Methods

__len__() -> int

Returns the number of samples in the dataset.

__getitem__(idx: int) -> Tuple[torch.Tensor, int]

Retrieves a sample from the dataset.

  • Features shape: [num_timesteps, num_features] if flatten is False, otherwise [num_timesteps * num_features].

  • Label shape: Scalar.
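The effect of flatten on the feature shape can be illustrated with a small NumPy-only sketch. The `get_sample` function is a hypothetical stand-in for `__getitem__` and the arrays are placeholders; the real class returns a torch.Tensor rather than a NumPy array.

```python
import numpy as np

# Placeholder arrays with the documented layout:
# (num_samples, num_timesteps, num_features) and (num_samples,).
data = np.zeros((4, 128, 6), dtype=np.float32)
labels = np.array([0, 1, 2, 3])


def get_sample(idx, flatten=False):
    """Hypothetical stand-in for __getitem__, using NumPy only."""
    features = data[idx]
    if flatten:
        # Collapse (num_timesteps, num_features) into a single axis.
        features = features.reshape(-1)
    return features, labels[idx]


features, label = get_sample(0)
flat_features, _ = get_sample(0, flatten=True)
print(features.shape)       # (128, 6)
print(flat_features.shape)  # (768,)
```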

Examples

>>> from minerva.data.datasets.har_xu_23 import HarDataset
>>> dataset = HarDataset(data_path="/path/to/data", annotate="train")
>>> len(dataset)
3178
>>> sample = dataset[0]
>>> features, label = sample
>>> features.shape
torch.Size([128, 6])
>>> label
tensor(4)

__getitem__(idx)[source]

Get a sample from the dataset.

Parameters

idxint

Index of the sample to retrieve.

Returns

Tuple[torch.Tensor, int]

Tuple containing the features and the target label.

Parameters:

idx (int)

Return type:

Tuple[torch.Tensor, int]

__len__()[source]
annotate
data
data_path
feature_column_prefixes = ['accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z']
flatten = False
labels
target_column = 'standard activity code'
Parameters:
  • data_path (minerva.utils.typing.PathLike)

  • annotate (str)

  • feature_column_prefixes (List[str])

  • target_column (str)

  • flatten (bool)

class minerva.data.datasets.har_xu_23.TNCDataset(x, mc_sample_size=5, window_size=128, epsilon=3, adf=True)[source]

Bases: torch.utils.data.Dataset

This TNCDataset class is designed to handle time series data for the TNC (Temporal Neighborhood Coding) task. It includes methods to load data, find close neighbors using ADF testing or cosine similarity, and find distant non-neighbors. The dataset returns a tuple of the central window, close neighbors, and distant non-neighbors for each sample.

The input time series x should have the shape (n_samples, n_channels, n_timesteps).

The __getitem__ method returns:

  • central_window: (n_channels, window_size)

  • close_neighbors: (mc_sample_size, n_channels, window_size)

  • non_neighbors: (mc_sample_size, n_channels, window_size)
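The shapes above follow from slicing fixed-size windows out of one sample of shape (n_channels, n_timesteps) and stacking them. A minimal sketch, assuming windows are centred on a time step; the `window_at` helper and the chosen time steps are hypothetical and only illustrate the shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_timesteps = 6, 1000
window_size = 128

series = rng.standard_normal((n_channels, n_timesteps))  # one sample of x


def window_at(series, t, window_size):
    """Hypothetical helper: slice a window centred on time step t."""
    half = window_size // 2
    return series[:, t - half : t + half]


t = 500
central_window = window_at(series, t, window_size)

# Stack five windows taken at arbitrary nearby time steps (mc_sample_size = 5).
neighbor_steps = [400, 450, 520, 560, 600]
close_neighbors = np.stack([window_at(series, s, window_size) for s in neighbor_steps])

print(central_window.shape)   # (6, 128)
print(close_neighbors.shape)  # (5, 6, 128)
```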

Parameters

xnp.ndarray

The time series data of shape (n_samples, n_channels, n_timesteps).

mc_sample_sizeint, optional

Number of neighboring and non-neighboring windows drawn per data sample. Defaults to 5.

window_sizeint, optional

The size of the window to be used for each sample. Defaults to 128.

epsilonint, optional

Controls the “spread” of neighboring windows. Higher values lead to more diverse neighbors within a larger search radius around the center window. Defaults to 3.

adfbool, optional

A flag indicating whether to use ADF (Augmented Dickey-Fuller) testing for finding neighbors. Defaults to True.

Neighbor Selection

The selection of neighbors and non-neighbors is crucial for TNC. Here’s how it’s done:

  1. Finding Close Neighbors:
    • ADF (Augmented Dickey-Fuller) Testing:
      • The ADF test checks the stationarity of the time series segments.

      • For each time window of size w_t (ranging from window_size to 4 * window_size), the ADF test is applied to determine the p-value.

      • The average p-value across all channels is calculated.

      • The neighborhood size epsilon is determined based on the p-values. If all p-values are below the threshold (0.01), epsilon is set to the length of corr; otherwise, it is set to the first index where the p-value exceeds 0.01.

      • The delta is then set to 5 * epsilon * window_size.

      • Neighboring time steps are generated by adding a random value from a normal distribution scaled by epsilon * window_size to the current time step t.

      • These time steps are adjusted to ensure they are within valid bounds.

    • Cosine Similarity:
      • If ADF is not used, cosine similarity is employed to find close neighbors.

      • The target window (current segment) is flattened, and its cosine similarity with all other windows of the same size in the time series is calculated.

      • The top mc_sample_size windows with the highest cosine similarity are selected as neighbors.

      • The selected time steps are adjusted to ensure they are within valid bounds.

  2. Finding Distant Non-Neighbors:
    • The method _find_non_neighours generates non-neighbors by selecting time steps far from the current time step t.

    • Depending on whether t is in the first or second half of the time series, the non-neighbor time steps are selected to be either before or after the delta range.

    • A fallback mechanism ensures at least one non-neighbor segment is returned, even if the primary selection fails.
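The ADF branch of the procedure above can be sketched with NumPy once the averaged p-values are available. This is an illustration under stated assumptions, not minerva's code: `choose_epsilon` and `sample_neighbor_steps` are hypothetical names, the p-values are placeholders rather than real ADF results, the "first index" is taken as 0-based, and the clipping bounds are a guess at what "valid bounds" means.

```python
import numpy as np


def choose_epsilon(p_values, threshold=0.01):
    """Pick epsilon from averaged ADF p-values, one per candidate window length.

    Follows the rule described above: the number of p-values if none exceed
    the threshold, else the first (0-based) index that does.
    """
    p_values = np.asarray(p_values)
    above = np.flatnonzero(p_values > threshold)
    return len(p_values) if above.size == 0 else int(above[0])


def sample_neighbor_steps(t, epsilon, window_size, n_timesteps, mc_sample_size, rng):
    """Draw neighbouring time steps around t and clip them to valid bounds."""
    offsets = rng.standard_normal(mc_sample_size) * epsilon * window_size
    steps = (t + offsets).astype(int)
    half = window_size // 2
    # Keep a full window inside the series on both sides (bounds are assumed).
    return np.clip(steps, half, n_timesteps - half)


rng = np.random.default_rng(0)
window_size, n_timesteps = 128, 5000

epsilon = choose_epsilon([0.001, 0.004, 0.03, 0.2])  # placeholder p-values
delta = 5 * epsilon * window_size
steps = sample_neighbor_steps(2500, epsilon, window_size, n_timesteps, 5, rng)

print(epsilon, delta)  # 2 1280
print(steps)
```

In practice the p-values would come from an ADF implementation such as `statsmodels.tsa.stattools.adfuller`, averaged across channels for each candidate window length.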

Example Usage

```python
# Example configuration
import numpy as np
from minerva.data.datasets.har_xu_23 import TNCDataset

data = np.random.randn(100, 6, 1000)  # (samples, channels, timesteps)

# Instantiate the dataset (the values shown are the documented defaults)
tnc_dataset = TNCDataset(
    x=data,
    mc_sample_size=5,
    window_size=128,
    epsilon=3,
    adf=True,
)

# Retrieve a sample from the dataset
central_window, close_neighbors, non_neighbors = tnc_dataset[0]

print("Central Window Shape:", central_window.shape)    # (n_channels, window_size)
print("Close Neighbors Shape:", close_neighbors.shape)  # (mc_sample_size, n_channels, window_size)
print("Non-Neighbors Shape:", non_neighbors.shape)      # (mc_sample_size, n_channels, window_size)
```

T
__getitem__(ind)[source]

Returns a sample from the dataset.

Parameters

indint

The index of the sample to retrieve.

Returns

tuple

A tuple containing the central window, close neighbors, and distant non-neighbors.

__len__()[source]

Returns the number of samples in the dataset.

Returns

int

The number of samples in the dataset.

_find_neighours(x, t)[source]

Finds close neighbors for a given time step.

Parameters

xnp.ndarray

The time series data for a single sample.

tint

The current time step.

Returns

np.ndarray

An array of close neighbors.
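When adf is False, neighbors are chosen by cosine similarity. The sketch below shows one way to rank windows this way in NumPy. The `cosine_neighbors` function is hypothetical; indexing windows by their start position, using a stride of 1, and excluding the target window itself are assumptions made for the illustration.

```python
import numpy as np


def cosine_neighbors(series, t, window_size, mc_sample_size):
    """Return start indices of the windows most similar to the one starting at t.

    series has shape (n_channels, n_timesteps).
    """
    n_timesteps = series.shape[1]
    starts = np.arange(n_timesteps - window_size + 1)
    # Flatten every candidate window into one row vector.
    windows = np.stack([series[:, s : s + window_size].reshape(-1) for s in starts])
    target = windows[t]

    norms = np.linalg.norm(windows, axis=1) * np.linalg.norm(target)
    similarity = windows @ target / np.maximum(norms, 1e-12)
    similarity[t] = -np.inf  # exclude the target window itself

    # Indices of the mc_sample_size highest similarities, best first.
    return starts[np.argsort(similarity)[::-1][:mc_sample_size]]


rng = np.random.default_rng(0)
series = rng.standard_normal((6, 400))
neighbor_starts = cosine_neighbors(series, t=100, window_size=32, mc_sample_size=5)
print(neighbor_starts.shape)  # (5,)
```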

_find_non_neighours(x, t)[source]

Finds distant non-neighbors for a given time step.

Parameters

xnp.ndarray

The time series data for a single sample.

tint

The current time step.

Returns

np.ndarray

An array of distant non-neighbors.
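A minimal sketch of how distant time steps might be drawn, assuming delta = 5 * epsilon * window_size as described under Neighbor Selection. The `sample_non_neighbor_steps` function and its exact bounds are hypothetical; the `max`/`min` guards only loosely play the role of the fallback mentioned there, keeping the sampling range non-empty when the delta range covers most of the series.

```python
import numpy as np


def sample_non_neighbor_steps(t, delta, window_size, n_timesteps, mc_sample_size, rng):
    """Draw time steps roughly delta or more away from t (bounds are assumptions)."""
    half = window_size // 2
    if t > n_timesteps / 2:
        # t is in the second half: sample before the delta range.
        low, high = half, max(t - delta + 1, half + 1)
    else:
        # t is in the first half: sample after the delta range.
        low, high = min(t + delta, n_timesteps - window_size - 1), n_timesteps - half
    # rng.integers draws from [low, high), so high is exclusive.
    return rng.integers(low, high, size=mc_sample_size)


rng = np.random.default_rng(0)
window_size, n_timesteps, delta = 128, 5000, 1280

late = sample_non_neighbor_steps(4000, delta, window_size, n_timesteps, 5, rng)
early = sample_non_neighbor_steps(500, delta, window_size, n_timesteps, 5, rng)
print(late)   # all at or before 4000 - 1280
print(early)  # all at or after 500 + 1280
```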

adf = True
mc_sample_size = 5
time_series
window_size = 128
Parameters:
  • x (numpy.array)

  • mc_sample_size (int)

  • window_size (int)

  • adf (bool)