minerva.data.datasets.har_xu_23
===============================

.. py:module:: minerva.data.datasets.har_xu_23


Classes
-------

.. autoapisummary::

   minerva.data.datasets.har_xu_23.HarDataset
   minerva.data.datasets.har_xu_23.TNCDataset


Module Contents
---------------

.. py:class:: HarDataset(data_path, annotate, feature_column_prefixes = ['accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z'], target_column = 'standard activity code', flatten = False)

   Bases: :py:obj:`torch.utils.data.Dataset`


   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass
   it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
   data sample for a given key. Subclasses could also optionally overwrite
   :meth:`__len__`, which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options
   of :class:`~torch.utils.data.DataLoader`. Subclasses could also
   optionally implement :meth:`__getitems__`, for speedup batched samples
   loading. This method accepts list of indices of samples of batch and returns
   list of samples.

   .. note::
     :class:`~torch.utils.data.DataLoader` by default constructs an index
     sampler that yields integral indices.  To make it work with a map-style
     dataset with non-integral indices/keys, a custom sampler must be provided.

   Dataset class for human activity recognition (HAR) data.

   Loads and prepares data from `.npy` files and returns features and labels.

   Parameters
   ----------
   data_path : PathLike
       Path to the directory containing the dataset files. The directory should contain the following files:

       - train_data_subseq.npy
       - train_labels_subseq.npy
       - val_data.npy
       - val_labels_subseq.npy
       - test_data.npy
       - test_labels_subseq.npy

       These files should correspond to data segmented into subsequences and their labels.
   annotate : str
       Annotation type, indicating which subset of the data to load ('train', 'val', or 'test').
   feature_column_prefixes : List[str], optional
       List of prefixes for feature columns. Defaults to:
       ["accel-x", "accel-y", "accel-z", "gyro-x", "gyro-y", "gyro-z"].
   target_column : str, optional
       Name of the column for the target variable. Defaults to 'standard activity code'.
   flatten : bool, optional
       If True, flattens the input data. Defaults to False.

   Attributes
   ----------
   data : numpy.ndarray
       Array of features with shape (num_samples, num_timesteps, num_features).

       - num_samples: Total number of samples in the dataset.
       - num_timesteps: Length of each subsequence (e.g., 128).
       - num_features: Number of features per timestep (e.g., 6 for accelerometer and gyroscope data).
   labels : numpy.ndarray
       Array of labels with shape (num_samples,).

       - num_samples: Total number of samples in the dataset.

   Methods
   -------
   __len__() -> int
       Returns the number of samples in the dataset.
   __getitem__(idx: int) -> Tuple[torch.Tensor, int]
       Retrieves a sample from the dataset.

       - Features shape: [num_timesteps, num_features] if `flatten` is False, otherwise [num_timesteps * num_features].
       - Label shape: Scalar.

   Examples
   --------
   >>> from minerva.data.datasets.har_xu_23 import HarDataset
   >>> dataset = HarDataset(data_path="/path/to/data", annotate="train")
   >>> len(dataset)
   3178
   >>> sample = dataset[0]
   >>> features, label = sample
   >>> features.shape
   torch.Size([128, 6])
   >>> label
   tensor(4)
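
   The effect of `flatten` on a single sample can be illustrated without the
   dataset files. The sketch below is plain Python rather than the class
   itself: the nested list stands in for one `(num_timesteps, num_features)`
   window, and the row-major flattening order is an assumption, not something
   confirmed from the implementation.

   ```python
   # One stand-in sample: 128 timesteps x 6 features (placeholder values).
   num_timesteps, num_features = 128, 6
   sample = [
       [float(t * num_features + f) for f in range(num_features)]
       for t in range(num_timesteps)
   ]

   # Assumed row-major flattening: all features of timestep 0, then timestep 1, ...
   flat = [value for row in sample for value in row]

   print(len(sample), len(sample[0]))  # 128 6
   print(len(flat))                    # 768
   ```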


   .. py:method:: __getitem__(idx)

      Get a sample from the dataset.

      Parameters
      ----------
      idx : int
          Index of the sample to retrieve.

      Returns
      -------
      Tuple[torch.Tensor, int]
          Tuple containing the features and the target label.


   .. py:method:: __len__()


   .. py:attribute:: annotate


   .. py:attribute:: data


   .. py:attribute:: data_path


   .. py:attribute:: feature_column_prefixes
      :value: ['accel-x', 'accel-y', 'accel-z', 'gyro-x', 'gyro-y', 'gyro-z']


   .. py:attribute:: flatten
      :value: False


   .. py:attribute:: labels


   .. py:attribute:: target_column
      :value: 'standard activity code'


.. py:class:: TNCDataset(x, mc_sample_size = 5, window_size = 128, epsilon = 3, adf = True)

   Bases: :py:obj:`torch.utils.data.Dataset`


   Dataset that handles time series data for the Temporal Neighborhood
   Coding (TNC) task. It includes methods to load data, find close neighbors
   using either ADF testing or cosine similarity, and find distant
   non-neighbors. For each sample, the dataset returns a tuple of the
   central window, the close neighbors, and the distant non-neighbors.

   The input `x` (stored as `time_series`) should have the shape
   (n_samples, n_channels, n_timesteps).

   The `__getitem__` method returns:

   - `central_window`: (n_channels, window_size)
   - `close_neighbors`: (mc_sample_size, n_channels, window_size)
   - `non_neighbors`: (mc_sample_size, n_channels, window_size)

   Parameters
   ----------
   x : np.ndarray
       The time series data of shape (n_samples, n_channels, n_timesteps).
   mc_sample_size : int, optional
       Number of neighboring windows and number of non-neighboring windows
       drawn per data sample. Defaults to 5.
   window_size : int, optional
       The size of the window to be used for each sample. Defaults to 128.
   epsilon : int, optional
       Controls the "spread" of neighboring windows. Higher values lead to
       more diverse neighbors within a larger search radius around the
       center window. Defaults to 3.
   adf : bool, optional
       A flag indicating whether to use ADF (Augmented Dickey-Fuller)
       testing for finding neighbors. Defaults to True.

   Neighbor Selection
   ------------------
   The selection of neighbors and non-neighbors is central to TNC. It works
   as follows:

   1. **Finding Close Neighbors**:

      - **ADF (Augmented Dickey-Fuller) Testing**:

        - The ADF test checks the stationarity of the time series
          segments.
        - For each time window of size `w_t` (ranging from `window_size`
          to `4 * window_size`), the ADF test is applied to determine
          the p-value.
        - The average p-value across all channels is calculated.
        - The neighborhood size `epsilon` is determined from the
          p-values. If all p-values are below the threshold (0.01),
          `epsilon` is set to the length of `corr` (the list of averaged
          p-values); otherwise, it is set to the first index where the
          p-value exceeds 0.01.
        - `delta` is then set to `5 * epsilon * window_size`.
        - Neighboring time steps are generated by adding a random value
          from a normal distribution scaled by `epsilon * window_size`
          to the current time step `t`.
        - These time steps are adjusted to ensure they are within valid
          bounds.

      - **Cosine Similarity**:

        - If ADF is not used, cosine similarity is employed to find
          close neighbors.
        - The target window (current segment) is flattened, and its
          cosine similarity with all other windows of the same size
          in the time series is calculated.
        - The top `mc_sample_size` windows with the highest cosine
          similarity are selected as neighbors.
        - The selected time steps are adjusted to ensure they are
          within valid bounds.

   2. **Finding Distant Non-Neighbors**:

      - The method `_find_non_neighours` generates non-neighbors by
        selecting time steps far from the current time step `t`.
      - Depending on whether `t` is in the first or second half of the
        time series, the non-neighbor time steps are selected to be
        either before or after the `delta` range.
      - A fallback mechanism ensures at least one non-neighbor segment is
        returned, even if the primary selection fails.
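
   The ADF-driven steps and the non-neighbor rule described above can be
   sketched in plain Python. This is an illustration under stated
   assumptions, not the class's implementation: the helper names are
   hypothetical, the averaged p-values are taken as given (no ADF test is
   run here), the index is treated as 0-based, `window_size` is assumed
   even, and the fallback simply returns one window center at the far edge
   of the series.

   ```python
   import random


   def epsilon_from_p_values(p_values, threshold=0.01):
       """Hypothetical helper: neighborhood size from averaged ADF p-values."""
       for index, p_value in enumerate(p_values):
           if p_value > threshold:
               # First window size whose p-value exceeds the threshold.
               # Treating this index as 0-based is an assumption.
               return index
       # Every p-value stayed at or below the threshold.
       return len(p_values)


   def neighbor_centers(t, epsilon, window_size, series_len, count, rng):
       """Hypothetical helper: Gaussian-perturbed centers clamped to valid bounds."""
       half = window_size // 2
       centers = []
       for _ in range(count):
           center = int(t + rng.gauss(0.0, 1.0) * epsilon * window_size)
           # Keep the window [center - half, center + half) inside the series.
           centers.append(max(half, min(center, series_len - half)))
       return centers


   def non_neighbor_centers(t, delta, window_size, series_len, count, rng):
       """Hypothetical helper: centers at least `delta` away from `t`."""
       half = window_size // 2
       if t > series_len / 2:
           low, high = half, t - delta  # second half: look before the delta range
           edge = half
       else:
           low, high = t + delta, series_len - half  # first half: look after it
           edge = series_len - half
       if low > high:
           # Assumed fallback: no room outside the delta range, return one edge center.
           return [edge]
       return [rng.randint(low, high) for _ in range(count)]


   rng = random.Random(0)
   window_size, series_len, t = 128, 1000, 200
   p_values = [0.001, 0.2, 0.3]  # made-up averaged p-values, one per w_t
   epsilon = epsilon_from_p_values(p_values)
   delta = 5 * epsilon * window_size
   close = neighbor_centers(t, epsilon, window_size, series_len, 5, rng)
   distant = non_neighbor_centers(t, delta, window_size, series_len, 5, rng)
   print(epsilon, delta)  # 1 640
   print(close, distant)
   ```

   With these made-up values the close centers scatter around `t = 200`,
   while the distant centers are drawn from the far end of the series, at
   least `delta` steps away.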

   Example Usage
   -------------
   ```python
   from minerva.data.datasets.har_xu_23 import TNCDataset
   import numpy as np

   # Example configuration (these values match the constructor defaults)
   mc_sample_size = 5
   window_size = 128
   epsilon = 3
   adf = True

   data = np.random.randn(100, 6, 1000)  # (n_samples, n_channels, n_timesteps)

   # Instantiate the dataset
   tnc_dataset = TNCDataset(
       x=data,
       mc_sample_size=mc_sample_size,
       window_size=window_size,
       epsilon=epsilon,
       adf=adf,
   )

   # Retrieve a sample from the dataset
   central_window, close_neighbors, non_neighbors = tnc_dataset[0]

   print("Central Window Shape:", central_window.shape)  # (n_channels, window_size)
   print("Close Neighbors Shape:", close_neighbors.shape)  # (mc_sample_size, n_channels, window_size)
   print("Non-Neighbors Shape:", non_neighbors.shape)  # (mc_sample_size, n_channels, window_size)
   ```
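
   When `adf` is False, the docstring describes a cosine-similarity search
   for close neighbors. The following is a minimal plain-Python sketch of
   that idea, not the class's code: the helper names are hypothetical, the
   series is a list of channels, and excluding the target window from its
   own candidates is an assumption.

   ```python
   import math


   def cosine(a, b):
       """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
       dot = sum(x * y for x, y in zip(a, b))
       norm_a = math.sqrt(sum(x * x for x in a))
       norm_b = math.sqrt(sum(y * y for y in b))
       return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


   def top_k_similar_starts(series, start, window_size, k):
       """Hypothetical helper: start indices of the k windows most similar to the target."""

       def flat_window(s):
           # Flatten the window channel by channel into one vector.
           return [v for channel in series for v in channel[s:s + window_size]]

       target = flat_window(start)
       n_timesteps = len(series[0])
       scored = [
           (cosine(target, flat_window(s)), s)
           for s in range(n_timesteps - window_size + 1)
           if s != start  # assumption: the target is not its own neighbor
       ]
       scored.sort(reverse=True)
       return [s for _, s in scored[:k]]


   # One channel in which the pattern 1, 2, 3, 4 occurs at index 0 and again at index 8.
   series = [[1, 2, 3, 4, 4, 1, 0, 2, 1, 2, 3, 4, 0, 3, 1, 1]]
   top = top_k_similar_starts(series, start=0, window_size=4, k=2)
   print(top)  # [8, 1]
   ```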


   .. py:attribute:: T


   .. py:method:: __getitem__(ind)

      Returns a sample from the dataset.

      Parameters
      ----------
      ind : int
          The index of the sample to retrieve.

      Returns
      -------
      tuple
          A tuple containing the central window, close neighbors, and distant non-neighbors.


   .. py:method:: __len__()

      Returns the number of samples in the dataset.

      Returns
      -------
      int
          The number of samples in the dataset.


   .. py:method:: _find_neighours(x, t)

      Finds close neighbors for a given time step.

      Parameters
      ----------
      x : np.ndarray
          The time series data for a single sample.
      t : int
          The current time step.

      Returns
      -------
      np.ndarray
          An array of close neighbors.


   .. py:method:: _find_non_neighours(x, t)

      Finds distant non-neighbors for a given time step.

      Parameters
      ----------
      x : np.ndarray
          The time series data for a single sample.
      t : int
          The current time step.

      Returns
      -------
      np.ndarray
          An array of distant non-neighbors.


   .. py:attribute:: adf
      :value: True


   .. py:attribute:: mc_sample_size
      :value: 5


   .. py:attribute:: time_series


   .. py:attribute:: window_size
      :value: 128