minerva.data.readers.tabular_reader

Classes

TabularReader

Base class for readers. Readers define an ordered collection of data and provide methods to access it.

Module Contents

class minerva.data.readers.tabular_reader.TabularReader(df, columns_to_select, cast_to=None, data_shape=None)

Bases: minerva.data.readers.reader._Reader

Base class for readers. Readers define an ordered collection of data and provide methods to access it. This class primarily handles:

  1. Definition of data structure and storage.

  2. Reading data from the source.

Access is handled by the __getitem__ and __len__ methods, which should be implemented by a subclass. Readers usually return a single item at a time, which can be a single image, a single label, etc.

Reader that selects columns from a DataFrame and returns them as a NumPy array. The DataFrame is indexed by row number, and each row is treated as one sample. Thus, the __getitem__ method returns the selected columns of the row at the specified index as a NumPy array.

Parameters

df : pd.DataFrame

The DataFrame to select the columns from. The DataFrame should have the columns that are specified in the columns_to_select parameter.

columns_to_select : Union[str, list[str]]

A string or a list of strings used to select the columns from the DataFrame. The string can be a regular expression pattern or a column name. The columns that match the pattern will be selected.

cast_to : str, optional

Cast the selected columns to the specified data type. If None, the data type of the columns will not be changed. (default is None)

data_shape : tuple[int, ...], optional

The shape of the data to be returned. If None, the data will be returned as a 1D array. If provided, the data will be reshaped to the specified shape. (default is None)
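The parameters above can be illustrated with a minimal sketch. SketchTabularReader is a hypothetical stand-in written for this example, not the library's implementation; in particular, matching column names with re.match is an assumption, as is the sample DataFrame. It only mirrors the documented behavior: select columns, optionally cast, optionally reshape.

```python
import re

import numpy as np
import pandas as pd


class SketchTabularReader:
    """Hypothetical stand-in that mirrors the documented behavior."""

    def __init__(self, df, columns_to_select, cast_to=None, data_shape=None):
        self.df = df
        # Accept either a single string or a list of strings.
        if isinstance(columns_to_select, str):
            columns_to_select = [columns_to_select]
        self.columns_to_select = columns_to_select
        self.cast_to = cast_to
        self.data_shape = data_shape

    def __getitem__(self, index):
        # Assumption: each entry is treated as a regular expression and
        # matched against the start of every column name with re.match.
        selected = [
            col
            for pattern in self.columns_to_select
            for col in self.df.columns
            if re.match(pattern, col)
        ]
        row = self.df.iloc[index][selected].to_numpy()
        if self.cast_to is not None:
            row = row.astype(self.cast_to)
        if self.data_shape is not None:
            row = row.reshape(self.data_shape)
        return row

    def __len__(self):
        # One sample per DataFrame row.
        return len(self.df)


# Illustrative data: four sensor columns and one label column.
df = pd.DataFrame(
    {
        "accel_x": [0.1, 0.2, 0.3],
        "accel_y": [1.1, 1.2, 1.3],
        "gyro_x": [5.0, 6.0, 7.0],
        "gyro_y": [8.0, 9.0, 10.0],
        "label": [0, 1, 0],
    }
)

# Four matching columns, cast to float32 and reshaped from (4,) to (2, 2).
reader = SketchTabularReader(
    df, ["accel_.*", "gyro_.*"], cast_to="float32", data_shape=(2, 2)
)
sample = reader[1]

# A single column name; without data_shape the result stays 1D.
label_reader = SketchTabularReader(df, "label", cast_to="int64")
label = label_reader[1]
```

In this sketch, a row taken from a DataFrame with mixed integer and float columns is upcast to a common dtype by pandas before the columns are selected, so passing cast_to explicitly is the safer way to get a predictable dtype for label-like columns.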

__getitem__(index)

Return the columns of the DataFrame at the specified row index as a NumPy array. The columns are selected based on self.columns_to_select.

Parameters

index : int

The index of the row from which the selected columns are read.

Returns

np.ndarray

The selected columns from the row as a NumPy array.
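Because an entry in columns_to_select can be either a column name or a regular expression, a plain name may select more columns than expected. The sketch below uses only the standard library to show this. The helper select_columns and the column names are hypothetical, and the use of re.match (which anchors at the start of the name but not at the end) is an assumption about how matching is done, not something this page states.

```python
import re


def select_columns(columns, patterns):
    """Return the column names matched by each pattern, in pattern order."""
    if isinstance(patterns, str):
        patterns = [patterns]
    selected = []
    for pattern in patterns:
        # re.match anchors at the start of the string only (assumed behavior).
        selected.extend(col for col in columns if re.match(pattern, col))
    return selected


columns = ["acc", "acc_x", "acc_y", "gyro_x", "label"]

# A regular expression that picks the two axis columns.
by_regex = select_columns(columns, r"acc_[xy]")

# A plain name acts as a prefix pattern, so it also picks acc_x and acc_y.
by_name = select_columns(columns, "acc")

# Adding an end anchor restricts the selection to the exact name.
exact = select_columns(columns, r"acc$")
```

Under this assumption, anchoring a name with $ is a simple way to request exactly one column when other column names share its prefix.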

__len__()

Return the number of samples in the DataFrame. The number of samples is equal to the number of rows in the DataFrame.

Returns

int

The number of samples in the DataFrame.
