minerva.data.data_modules.har_xu_23
===================================

.. py:module:: minerva.data.data_modules.har_xu_23


Classes
-------

.. autoapisummary::

   minerva.data.data_modules.har_xu_23.HarDataModule


Module Contents
---------------

.. py:class:: HarDataModule(processed_data_dir, batch_size = 16, mc_sample_size = 5, epsilon = 3, adf = True, window_size = 128, use_train_as_val = False, num_workers = 8, use_val_with_train = False)

   Bases: :py:obj:`lightning.LightningDataModule`


   A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is
   consistent data splits, data preparation and transforms across models.

   Example::

       import lightning as L
       import torch
       import torch.utils.data as data
       from lightning.pytorch.demos.boring_classes import RandomDataset

       class MyDataModule(L.LightningDataModule):
           def prepare_data(self):
               # download, IO, etc. Useful with shared filesystems
               # only called on 1 GPU/TPU in distributed
               ...

           def setup(self, stage):
               # make assignments here (val/train/test split)
               # called on every process in DDP
               dataset = RandomDataset(1, 100)
               self.train, self.val, self.test = data.random_split(
                   dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
               )

           def train_dataloader(self):
               return data.DataLoader(self.train)

           def val_dataloader(self):
               return data.DataLoader(self.val)

           def test_dataloader(self):
               return data.DataLoader(self.test)

           def on_exception(self, exception):
               # clean up state after the trainer faced an exception
               ...

           def teardown(self, stage):
               # clean up state after the trainer stops, delete files...
               # called on every process in DDP
               ...


   This DataModule handles the loading and preparation of data for
   training, validation, and testing. The data is expected to be stored
   in 3 numpy (.npy) files named `train_data.npy`, `val_data.npy`, and
   `test_data.npy`. They are NumPy arrays storing the concatenated
   accelerometer and gyroscope data.

   These NumPy arrays must have the shape (n_samples, n_timesteps,
   n_channels). They are produced at a specific window size by a separate
   data processing script, available at
   https://github.com/maxxu05/rebar/blob/main/data/process/har_processdata.py

   The original files have the following exact shapes:

   - `train_data.npy`: `(41, 15038, 6)`
   - `val_data.npy`: `(9, 15038, 6)`
   - `test_data.npy`: `(9, 15038, 6)`
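
   As a quick sanity check, the expected layout can be verified before the
   arrays are handed to the DataModule. The sketch below is illustrative
   only: ``check_layout`` is a hypothetical helper that is not part of
   minerva, and a small in-memory array stands in for the real files::

       import numpy as np

       def check_layout(array, n_channels=6):
           """Return True if array looks like (n_samples, n_timesteps, n_channels)."""
           return array.ndim == 3 and array.shape[-1] == n_channels

       # Stand-in for np.load("train_data.npy"); the real array is (41, 15038, 6).
       dummy_train = np.zeros((4, 200, 6), dtype=np.float32)

       # The on-disk layout keeps channels in the last axis, so this passes.
       layout_ok = check_layout(dummy_train)

       # A channels-first array, shape (4, 6, 200), fails the same check.
       transposed_ok = check_layout(dummy_train.transpose(0, 2, 1))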

   The processing script handles the preprocessing and organization of the
   dataset. It processes the raw accelerometer and gyroscope data for each
   participant, filtering out sequences shorter than a set threshold.
   The data is then split into training, validation, and test sets, which
   are saved as NumPy arrays along with corresponding participant names.

   For the dataloader, the arrays loaded from the .npy files are transposed
   into the shape (n_samples, n_channels, n_timesteps) and passed to the
   TNCDataset.
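
   The transposition described above can be sketched with plain NumPy. This
   is a minimal illustration on dummy data, assuming the transposition is a
   simple swap of the last two axes; it is not the module's actual code::

       import numpy as np

       # Dummy data in the on-disk layout: (n_samples, n_timesteps, n_channels).
       data = np.arange(2 * 5 * 6, dtype=np.float32).reshape(2, 5, 6)

       # Swap the last two axes to get (n_samples, n_channels, n_timesteps).
       channels_first = np.transpose(data, (0, 2, 1))

       print(channels_first.shape)  # (2, 6, 5)

   After the swap, ``channels_first[i, c, t]`` holds the same value as
   ``data[i, t, c]``, so no samples are reordered or lost.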

   Parameters
   ----------
   processed_data_dir : PathLike
       Path to the directory where the processed .npy files are stored.
       This directory must contain 3 files, named train_data.npy,
       val_data.npy, and test_data.npy.
   batch_size : int, optional
       The batch size to use for the DataLoader. Defaults to 16.
   mc_sample_size : int, optional
       This value determines how many neighboring and non-neighboring
       windows are used per data sample. Defaults to 5.
   epsilon : int, optional
       This parameter controls the "spread" of neighboring windows.
       Defaults to 3.
   adf : bool, optional
       Flag indicating whether to use ADF (Augmented Dickey-Fuller)
       testing for finding neighbors. Defaults to True.
   window_size : int, optional
       The size of the windows to be used for each sample in the TNC
       dataset. Defaults to 128.
   use_val_with_train : bool, optional
       If True, the validation and train sets will be concatenated in
       order to create a larger train set. Defaults to False.
   num_workers : int, optional
       The number of worker processes to use for the DataLoader.
       Defaults to 8.


   .. py:attribute:: adf
      :value: True


   .. py:attribute:: batch_size
      :value: 16


   .. py:attribute:: epsilon
      :value: 3


   .. py:attribute:: har_test


   .. py:attribute:: har_train


   .. py:attribute:: har_val


   .. py:attribute:: mc_sample_size
      :value: 5


   .. py:attribute:: num_workers
      :value: 8


   .. py:attribute:: processed_data_dir


   .. py:method:: test_dataloader()

      Returns the DataLoader for the test dataset.

      Returns
      -------
      DataLoader
          DataLoader for the test dataset.


   .. py:method:: train_dataloader()

      Returns the DataLoader for the training dataset.

      Returns
      -------
      DataLoader
          DataLoader for the training dataset.


   .. py:attribute:: use_val_with_train
      :value: False


   .. py:method:: val_dataloader()

      Returns the DataLoader for the validation dataset.

      Returns
      -------
      DataLoader
          DataLoader for the validation dataset.


   .. py:attribute:: window_size
      :value: 128