minerva.data.datasets.series_dataset
====================================

.. py:module:: minerva.data.datasets.series_dataset


Classes
-------

.. autoapisummary::

   minerva.data.datasets.series_dataset.MultiModalSeriesCSVDataset
   minerva.data.datasets.series_dataset.SeriesFolderCSVDataset


Module Contents
---------------

.. py:class:: MultiModalSeriesCSVDataset(data_path, feature_prefixes = None, label = None, features_as_channels = True, cast_to = 'float32', transforms = None, map_labels = None)

   Bases: :py:obj:`torch.utils.data.Dataset`


   An abstract class representing a :class:`Dataset`.

   All datasets that represent a map from keys to data samples should subclass
   it. All subclasses should overwrite :meth:`__getitem__`, supporting fetching a
   data sample for a given key. Subclasses could also optionally overwrite
   :meth:`__len__`, which is expected to return the size of the dataset by many
   :class:`~torch.utils.data.Sampler` implementations and the default options
   of :class:`~torch.utils.data.DataLoader`. Subclasses could also
   optionally implement :meth:`__getitems__`, for speedup batched samples
   loading. This method accepts list of indices of samples of batch and returns
   list of samples.

   .. note::
     :class:`~torch.utils.data.DataLoader` by default constructs an index
     sampler that yields integral indices.  To make it work with a map-style
     dataset with non-integral indices/keys, a custom sampler must be provided.

   This dataset assumes that the data is in a single CSV file with
   series of data. Each row is a single sample that can be composed of
   multiple modalities (series). Each column is a feature of some series,
   with the prefix indicating the series. The suffix may indicate the
   time step. For instance, if we have two series, accel-x and accel-y,
   the data will look something like:

   +-----------+-----------+-----------+-----------+--------+
   | accel-x-0 | accel-x-1 | accel-y-0 | accel-y-1 |  class |
   +-----------+-----------+-----------+-----------+--------+
   | 0.502123  | 0.02123   | 0.502123  | 0.502123  |  0     |
   | 0.6820123 | 0.02123   | 0.502123  | 0.502123  |  1     |
   | 0.498217  | 0.00001   | 1.414141  | 3.141592  |  2     |
   +-----------+-----------+-----------+-----------+--------+

   The ``feature_prefixes`` parameter is used to select the columns that
   will be used as features. For instance, if we want to use only the
   accel-x series, we can set ``feature_prefixes=["accel-x"]``. If we want
   to use both accel-x and accel-y, we can set
   ``feature_prefixes=["accel-x", "accel-y"]``. If None is passed, all
   columns will be used as features, except the label column.
   The label column is specified by the ``label`` parameter.

   The dataset will return a 2-element tuple with the data and the label,
   if the ``label`` parameter is specified, otherwise return only the data.

   If ``features_as_channels`` is ``True``, the data will be returned as an
   array of shape ``(C, T)``, where ``C`` is the number of channels (features)
   and ``T`` is the number of time steps. Otherwise, the data will be returned
   as a flat vector of length ``T*C`` (a single vector with all the features).
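
   The prefix selection and the two output layouts can be sketched with
   pandas and NumPy. This is only an illustration of the behaviour described
   above, not the class's actual implementation: matching columns with
   ``str.startswith`` and the variable names used here are assumptions.

   .. code-block:: python

      import io

      import numpy as np
      import pandas as pd

      # The example table from above, inlined as CSV text.
      csv_text = """accel-x-0,accel-x-1,accel-y-0,accel-y-1,class
      0.502123,0.02123,0.502123,0.502123,0
      0.6820123,0.02123,0.502123,0.502123,1
      0.498217,0.00001,1.414141,3.141592,2
      """.replace("      ", "")
      df = pd.read_csv(io.StringIO(csv_text))

      prefixes = ["accel-x", "accel-y"]
      # One (N, T) block per prefix, keeping the column order of the file.
      # Assumption: a column belongs to a series if its name starts with
      # the prefix.
      blocks = [
          df[[c for c in df.columns if c.startswith(p)]].to_numpy(dtype="float32")
          for p in prefixes
      ]

      as_channels = np.stack(blocks, axis=1)  # shape (N, C, T)
      flat = np.concatenate(blocks, axis=1)   # shape (N, C * T)

      print(as_channels[0].shape)  # (2, 2)
      print(flat[0].shape)         # (4,)

   With the channel layout, each row of a sample is one series; with the flat
   layout, the series are laid end to end in the order of ``prefixes``.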

   Parameters
   ----------
   data_path : Union[Path, str]
       The location of the CSV file
   feature_prefixes : Union[str, List[str]], optional
       The prefix of the column names in the dataframe that will be used
       to become features. If None, all columns except the label will be
       used as features.
   label : str, optional
       The name of the column that will be used as label
   features_as_channels : bool, optional
       If True, the data will be returned as an array of shape (C, T),
       else the data will be returned as a flat vector of length T*C.
   cast_to : str, optional
       Cast the numpy data to the specified type
   transforms : Optional[List[Callable]], optional
       A list of transforms that will be applied to each sample
       individually. Each transform must be a callable that receives a
       numpy array and returns a numpy array. The transforms will be
       applied in the order they are specified.
   map_labels : Optional[Dict[int, int]], optional
       A dictionary to map the labels to a different set of labels. The
       keys are the original labels and the values are the new labels.

   Examples
   --------
   Using the data from the example above, with ``features_as_channels=False``:

   >>> data_path = "data.csv"
   >>> dataset = MultiModalSeriesCSVDataset(
   ...     data_path,
   ...     feature_prefixes=["accel-x", "accel-y"],
   ...     label="class",
   ...     features_as_channels=False,
   ... )
   >>> data, label = dataset[0]
   >>> data.shape
   (4,)

   Using the data from the example above, with ``features_as_channels=True``:

   >>> dataset = MultiModalSeriesCSVDataset(
   ...     data_path,
   ...     feature_prefixes=["accel-x", "accel-y"],
   ...     label="class",
   ...     features_as_channels=True,
   ... )
   >>> data, label = dataset[0]
   >>> data.shape
   (2, 2)

   And the dataset length:

   >>> len(dataset)
   3
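
   The ``transforms`` and ``map_labels`` parameters can be illustrated in
   isolation. The sketch below is a hedged approximation of the documented
   behaviour: ``apply_transforms``, ``center`` and ``scale_by_two`` are
   hypothetical helpers, and it assumes every label appears as a key in
   ``map_labels``.

   .. code-block:: python

      import numpy as np


      def apply_transforms(sample, transforms=None):
          """Apply each callable in the order given (hypothetical helper)."""
          for transform in transforms or []:
              sample = transform(sample)
          return sample


      def center(x):
          return x - x.mean()


      def scale_by_two(x):
          return x * 2.0


      sample = np.array([1.0, 2.0, 3.0], dtype="float32")
      out = apply_transforms(sample, [center, scale_by_two])
      print(out.tolist())  # [-2.0, 0.0, 2.0]

      # Keys are the original labels, values are the new labels.
      map_labels = {0: 10, 1: 11, 2: 12}
      labels = np.array([0, 1, 2, 1])
      remapped = np.array([map_labels[int(y)] for y in labels])
      print(remapped.tolist())  # [10, 11, 12, 11]

   Because the transforms run in order, swapping ``center`` and
   ``scale_by_two`` would generally give a different intermediate result, so
   the list order matters.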


   .. py:method:: __getitem__(index)


   .. py:method:: __len__()


   .. py:method:: __repr__()


   .. py:method:: __str__()


   .. py:method:: _load_data()

      Load data from the CSV file

      Returns
      -------
      Tuple[np.ndarray, Optional[np.ndarray]]
          A 2-element tuple with the data and the labels. The second element
          is None if the label is not specified.


   .. py:attribute:: cast_to
      :value: 'float32'


   .. py:attribute:: data_path


   .. py:attribute:: feature_prefixes
      :value: None


   .. py:attribute:: features_as_channels
      :value: True


   .. py:attribute:: label
      :value: None


   .. py:attribute:: map_labels
      :value: None


   .. py:attribute:: transforms
      :value: None


.. py:class:: SeriesFolderCSVDataset(data_path, features = None, label = None, pad = False, cast_to = 'float32', transforms = None, lazy = False)

   Bases: :py:obj:`torch.utils.data.Dataset`


   This dataset assumes that the data is in a folder with multiple CSV
   files. Each CSV file is a single sample that can be composed of
   multiple time steps (rows). Each column is a feature of the sample.

   For instance, if we have two samples, sample-1.csv and sample-2.csv,
   the directory structure will look something like::

       data_path
       ├── sample-1.csv
       └── sample-2.csv

   And the data will look something like:

   - sample-1.csv:

     +----------+----------+-------+
     | accel-x  | accel-y  | class |
     +----------+----------+-------+
     | 0.502123 | 0.02123  | 1     |
     | 0.682012 | 0.02123  | 1     |
     | 0.498217 | 0.00001  | 1     |
     +----------+----------+-------+

   - sample-2.csv:

     +----------+----------+-------+
     | accel-x  | accel-y  | class |
     +----------+----------+-------+
     | 0.502123 | 0.02123  | 0     |
     | 0.682012 | 0.02123  | 0     |
     | 0.498217 | 0.00001  | 0     |
     | 3.141592 | 1.414141 | 0     |
     +----------+----------+-------+

   The ``features`` parameter is used to select the columns that will be
   used as features. For instance, if we want to use only the accel-x
   column, we can set ``features=["accel-x"]``. If we want to use both
   accel-x and accel-y, we can set ``features=["accel-x", "accel-y"]``.

   The label column is specified by the ``label`` parameter. Note that we
   have one label per time-step and not a single label per sample.

   The dataset will return a 2-element tuple with the data and the label,
   if the ``label`` parameter is specified, otherwise return only the data.

   Notes
   -----
   - Samples may have different numbers of time steps. Use ``pad`` to pad
     the data to the length of the longest sample.

   Examples
   --------
   Using the data from the example above:

   >>> data_dir = "train_folder"
   >>> dataset = SeriesFolderCSVDataset(
   ...     data_dir,
   ...     features=["accel-x", "accel-y"],
   ...     label="class",
   ... )
   >>> data, label = dataset[0]
   >>> data.shape
   (2, 3)
   >>> label.shape
   (3,)
   >>> data, label = dataset[1]
   >>> data.shape
   (2, 4)
   >>> label.shape
   (4,)
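
   The shapes in the example above can be reproduced with a small standalone
   sketch. It is not the class's implementation: it assumes files are visited
   in sorted name order and that each file is transposed into a ``(C, T)``
   array, which matches the shapes shown but is otherwise a guess.

   .. code-block:: python

      import tempfile
      from pathlib import Path

      import pandas as pd

      with tempfile.TemporaryDirectory() as tmp:
          root = Path(tmp)
          # Two samples with 3 and 4 time steps, as in the tables above.
          (root / "sample-1.csv").write_text(
              "accel-x,accel-y,class\n"
              "0.502123,0.02123,1\n"
              "0.682012,0.02123,1\n"
              "0.498217,0.00001,1\n"
          )
          (root / "sample-2.csv").write_text(
              "accel-x,accel-y,class\n"
              "0.502123,0.02123,0\n"
              "0.682012,0.02123,0\n"
              "0.498217,0.00001,0\n"
              "3.141592,1.414141,0\n"
          )

          samples = []
          for path in sorted(root.glob("*.csv")):  # assumed ordering
              df = pd.read_csv(path)
              # Transpose so that features come first: (C, T).
              data = df[["accel-x", "accel-y"]].to_numpy(dtype="float32").T
              label = df["class"].to_numpy()  # one label per time step
              samples.append((data, label))

      shapes = [(d.shape, l.shape) for d, l in samples]
      print(shapes)  # [((2, 3), (3,)), ((2, 4), (4,))]

   Since the two samples have different lengths, they cannot be stacked into
   one batch array without padding, which is what the ``pad`` option is for.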

   Parameters
   ----------
   data_path : str
       The location of the directory with CSV files
   features : List[str], optional
       A list with column names that will be used as features. If None,
       all columns except the label will be used as features.
   pad : bool, optional
       If True, the data will be padded to the length of the longest
       sample. Note that padding will be applied after the transforms,
       and also to the labels if specified.
   label : str, optional
       Specify the name of the column with the label of the data
   cast_to : str, optional
       Cast the numpy data to the specified type
   transforms : Optional[List[Callable]], optional
       A list of transforms that will be applied to each sample
       individually. Each transform must be a callable that receives a
       numpy array and returns a numpy array. The transforms will be
       applied in the order they are specified.
   lazy : bool, optional
       If True, the data will be loaded lazily (i.e. the CSV files will be
       read only when needed)
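
   The difference between eager and lazy loading can be sketched with a small
   cache. ``LazySamples`` and ``fake_read`` are hypothetical names used only
   for illustration, and the caching strategy is an assumption rather than a
   description of how this class stores its samples.

   .. code-block:: python

      from typing import Callable, Dict, List


      class LazySamples:
          """Hypothetical container: loads item i on first access, then caches it."""

          def __init__(
              self, keys: List[str], load: Callable[[str], object], lazy: bool = False
          ):
              self._keys = keys
              self._load = load
              # Eager mode fills the cache up front; lazy mode starts empty.
              self._cache: Dict[int, object] = (
                  {} if lazy else {i: load(k) for i, k in enumerate(keys)}
              )

          def __len__(self) -> int:
              return len(self._keys)

          def __getitem__(self, idx: int):
              if idx not in self._cache:
                  self._cache[idx] = self._load(self._keys[idx])
              return self._cache[idx]


      calls = []


      def fake_read(name):
          """Stand-in for reading a CSV file; records each call."""
          calls.append(name)
          return name.upper()


      lazy_samples = LazySamples(["a.csv", "b.csv"], fake_read, lazy=True)
      print(calls)            # [] (nothing has been read yet)
      print(lazy_samples[1])  # B.CSV
      print(lazy_samples[1])  # B.CSV, served from the cache
      print(calls)            # ['b.csv']

   Lazy loading trades a faster start-up for a read on the first access of
   each sample, which may matter when the folder holds many files.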


   .. py:method:: __getitem__(idx)

      Get a single sample from the dataset

      Parameters
      ----------
      idx : int
          The index of the sample

      Returns
      -------
      Union[Tuple[np.ndarray, np.ndarray], np.ndarray]
          A 2-element tuple with the data and the label if the label is
          specified, otherwise only the data.


   .. py:method:: __len__()


   .. py:method:: __repr__()


   .. py:method:: __str__()


   .. py:attribute:: _cache


   .. py:method:: _disable_fix_length()

      Decorator to disable fix_length when calling a function


   .. py:attribute:: _files


   .. py:method:: _get_longest_sample_size()

      Return the size of the longest sample in the dataset

      Returns
      -------
      int
          The size of the longest sample in the dataset


   .. py:attribute:: _longest_sample_size
      :value: 0


   .. py:method:: _pad_data(data)

      Pad the data to the length of the longest sample. In summary, this
      function makes the data cyclic.

      Parameters
      ----------
      data : np.ndarray
          The data to pad

      Returns
      -------
      np.ndarray
          The padded data
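
      One way to read "makes the data cyclic" is wrap-around padding, where
      the sample is repeated from its start until it reaches the target
      length. The sketch below illustrates that reading only; ``pad_cyclic``
      is a hypothetical helper, and padding along the last (time) axis is an
      assumption.

      .. code-block:: python

         import numpy as np


         def pad_cyclic(data, target_len):
             """Repeat `data` along its last axis, then cut to `target_len` steps."""
             reps = -(-target_len // data.shape[-1])  # ceiling division
             # For an integer `reps`, np.tile repeats along the last axis.
             return np.tile(data, reps)[..., :target_len]


         data = np.array([[1, 2, 3], [4, 5, 6]])  # shape (C=2, T=3)
         label = np.array([7, 8, 9])              # one label per time step

         print(pad_cyclic(data, 5).tolist())   # [[1, 2, 3, 1, 2], [4, 5, 6, 4, 5]]
         print(pad_cyclic(label, 5).tolist())  # [7, 8, 9, 7, 8]

      The same helper works for the data and for per-time-step labels, since
      both are padded along the time axis.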


   .. py:method:: _read_all_csv()

      Read all the CSV files in the data directory

      Returns
      -------
      List[Tuple[np.ndarray, Optional[np.ndarray]]]
          A list of 2-element tuples with the data and the label. If the
          label is not specified, the second element of each tuple is None.


   .. py:method:: _read_csv(path)

      Read a single CSV file (a single sample)

      Parameters
      ----------
      path : Path
          The path to the CSV file

      Returns
      -------
      Tuple[np.ndarray, Optional[np.ndarray]]
          A 2-element tuple with the data and the label. If the label is not
          specified, the second element is None.


   .. py:method:: _scan_data()

      List the CSV files in the data directory

      Returns
      -------
      List[Path]
          List of CSV files


   .. py:attribute:: cast_to
      :value: 'float32'


   .. py:attribute:: data_path


   .. py:attribute:: features
      :value: None


   .. py:attribute:: label
      :value: None


   .. py:attribute:: pad
      :value: False


   .. py:attribute:: transforms
      :value: None