minerva.models.nets.classic_ml_pipeline

Classes

ClassicMLModel

A PyTorch Lightning module that wraps a classic ML model (e.g. a scikit-learn model)

SklearnPipeline

A sequence of data transformers with an optional final predictor.

Module Contents

class minerva.models.nets.classic_ml_pipeline.ClassicMLModel(head, backbone=None, use_only_train_data=False, test_metrics=None, sklearn_model_save_path=None, flatten=True, adapter=None, predict_proba=True)[source]

Bases: lightning.LightningModule

A PyTorch Lightning module that wraps a classic ML model (e.g. a scikit-learn model) and uses it as the head of a neural network. The backbone of the network is frozen, and the head is trained on the features extracted by the backbone. More complex models that do not follow this pipeline should not inherit from this class.

Initialize the model with the backbone and head. The backbone is frozen and the head is trained on the features extracted by the backbone. The head should implement the BaseEstimator interface. The model can be trained using only the training data, or using both the training and validation data. The test metrics are used to evaluate the model during testing and are logged with the Lightning logger at the end of each epoch.

Parameters

head : BaseEstimator

The head model. Usually, a scikit-learn model, like a classifier or regressor that implements the predict and fit methods.

backbone : torch.nn.Module

The backbone model. When training only a classic ML model, the backbone can be the identity function, torch.nn.Identity.

use_only_train_data : bool, optional

If True, the model will be trained using only the training data. If False, it will be trained using both training and validation data, concatenated. By default False.

test_metrics : Dict[str, Metric], optional

The metrics to be used during testing, by default None

sklearn_model_save_path : str, optional

The path to save the sklearn model weights, by default None

flatten : bool, optional

If True, the input data will be flattened before being passed to the model. By default True.

adapter : Callable[[torch.Tensor], torch.Tensor], optional

An adapter applied to the backbone output before it is passed to the head. By default None.

predict_proba : bool, optional

If True, the head's predict_proba method is used for prediction; otherwise, predict is used. By default True.

Parameters:
  • head (Union[sklearn.base.BaseEstimator, sklearn.pipeline.Pipeline])

  • backbone (Union[torch.nn.Module, minerva.models.loaders.LoadableModule])

  • use_only_train_data (bool)

  • test_metrics (Optional[Dict[str, torchmetrics.Metric]])

  • sklearn_model_save_path (Optional[str])

  • flatten (bool)

  • adapter (Optional[Callable[[torch.Tensor], torch.Tensor]])

  • predict_proba (bool)
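A minimal construction sketch, assuming the package is importable; the LogisticRegression head and the identity backbone are illustrative choices, not defaults:

import torch.nn as nn
from sklearn.linear_model import LogisticRegression

from minerva.models.nets.classic_ml_pipeline import ClassicMLModel

# With an identity backbone, the head alone is trained on the (flattened) inputs.
model = ClassicMLModel(
    head=LogisticRegression(),
    backbone=nn.Identity(),
    flatten=True,
    predict_proba=True,
)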

adapter = None
backbone = None
configure_optimizers()[source]

Choose what optimizers and learning-rate schedulers to use in your optimization. Normally you’d need one. But in the case of GANs or similar you might have multiple. Optimization with multiple optimizers only works in the manual optimization mode.

Return:

Any of the following options.

  • Single optimizer.

  • List or Tuple of optimizers.

  • Two lists - The first list has multiple optimizers, and the second has multiple LR schedulers (or multiple lr_scheduler_config).

  • Dictionary, with an "optimizer" key, and (optionally) a "lr_scheduler" key whose value is a single LR scheduler or lr_scheduler_config.

  • None - Fit will run without any optimizer.

The lr_scheduler_config is a dictionary which contains the scheduler and its associated configuration. The default configuration is shown below.

lr_scheduler_config = {
    # REQUIRED: The scheduler instance
    "scheduler": lr_scheduler,
    # The unit of the scheduler's step size, could also be 'step'.
    # 'epoch' updates the scheduler on epoch end whereas 'step'
    # updates it after an optimizer update.
    "interval": "epoch",
    # How many epochs/steps should pass between calls to
    # `scheduler.step()`. 1 corresponds to updating the learning
    # rate after every epoch/step.
    "frequency": 1,
    # Metric to monitor for schedulers like `ReduceLROnPlateau`
    "monitor": "val_loss",
    # If set to `True`, will enforce that the value specified in 'monitor'
    # is available when the scheduler is updated, thus stopping
    # training if not found. If set to `False`, it will only produce a warning
    "strict": True,
    # If using the `LearningRateMonitor` callback to monitor the
    # learning rate progress, this keyword can be used to specify
    # a custom logged name
    "name": None,
}

When there are schedulers in which the .step() method is conditioned on a value, such as the torch.optim.lr_scheduler.ReduceLROnPlateau scheduler, Lightning requires that the lr_scheduler_config contains the keyword "monitor" set to the metric name that the scheduler should be conditioned on.

Metrics can be made available to monitor by simply logging it using self.log('metric_to_track', metric_val) in your LightningModule.

Note:

Some things to know:

  • Lightning calls .backward() and .step() automatically in case of automatic optimization.

  • If a learning rate scheduler is specified in configure_optimizers() with key "interval" (default “epoch”) in the scheduler configuration, Lightning will call the scheduler’s .step() method automatically in case of automatic optimization.

  • If you use 16-bit precision (precision=16), Lightning will automatically handle the optimizer.

  • If you use torch.optim.LBFGS, Lightning handles the closure function automatically for you.

  • If you use multiple optimizers, you will have to switch to ‘manual optimization’ mode and step them yourself.

  • If you need to control how often the optimizer steps, override the optimizer_step() hook.
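The dictionary form above can be illustrated with a generic LightningModule. This is only a sketch of the contract, not the implementation used by ClassicMLModel (whose scikit-learn head is fitted outside of gradient descent); the linear layer and learning rate are illustrative:

import torch
import lightning as L

class ExampleModule(L.LightningModule):
    def __init__(self):
        super().__init__()
        # Placeholder model so there are parameters to optimize.
        self.layer = torch.nn.Linear(16, 2)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")
        return {
            "optimizer": optimizer,
            "lr_scheduler": {
                "scheduler": scheduler,
                "monitor": "val_loss",  # required for ReduceLROnPlateau
                "interval": "epoch",
                "frequency": 1,
            },
        }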

flatten = True
forward(x)[source]

Forward pass of the model. Extracts features from the backbone and predicts the target using the head.

Parameters

x : torch.Tensor

The input data.

Returns

torch.Tensor

The predicted target.
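A hedged usage sketch, reusing the model constructed in the sketch above; the input shape is illustrative and the head must already have been fitted (e.g. after a training epoch):

import torch

x = torch.randn(8, 16)  # batch of 8 samples with 16 features
y_hat = model(x)        # backbone features -> optional flatten/adapter -> head prediction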

head
on_train_epoch_end()[source]

At the end of the first training epoch, the collected training data (and, unless use_only_train_data is True, the concatenated validation data) is flattened and used to fit the head.

predict_proba = True
predict_step(batch, batch_idx, dataloader_idx=None)[source]

Predict step of the model.

sklearn_model_save_path = None
tensor1
test_metrics = None
test_step(batch, batch_idx)[source]

Test step of the model.

Parameters:
  • batch (torch.Tensor)

  • batch_idx (int)

train_data = []
train_y = []
training_step(batch, batch_index)[source]

Training step of the model. Collects all the training batches into one variable and logs a dummy loss to keep track of the training process.

use_only_train_data = False
val_data = []
val_y = []
validation_step(batch, batch_index)[source]

Validation step of the model. Collects all the validation batches into one variable and logs a dummy loss to keep track of the validation process.
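A sketch of driving this flow with a Lightning Trainer; the dataloaders and the single-epoch setting are assumptions for illustration:

import lightning as L

# train_loader and val_loader are assumed to be existing PyTorch DataLoaders.
trainer = L.Trainer(max_epochs=1)  # the head is fitted at the end of the first epoch
trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)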

class minerva.models.nets.classic_ml_pipeline.SklearnPipeline(steps, *, memory=None, verbose=False, **kwargs)[source]

Bases: sklearn.pipeline.Pipeline

A sequence of data transformers with an optional final predictor.

Pipeline allows you to sequentially apply a list of transformers to preprocess the data and, if desired, conclude the sequence with a final predictor for predictive modeling.

Intermediate steps of the pipeline must be transformers, that is, they must implement fit and transform methods. The final estimator only needs to implement fit. The transformers in the pipeline can be cached using the memory argument.

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a ‘__’, as in the example below. A step’s estimator may be replaced entirely by setting the parameter with its name to another estimator, or a transformer removed by setting it to ‘passthrough’ or None.

For an example use case of Pipeline combined with GridSearchCV, refer to sphx_glr_auto_examples_compose_plot_compare_reduction.py. The example sphx_glr_auto_examples_compose_plot_digits_pipe.py shows how to grid search on a pipeline using ‘__’ as a separator in the parameter names.

Read more in the User Guide.

Added in version 0.5.

Parameters

steps : list of tuples

List of (name of step, estimator) tuples that are to be chained in sequential order. To be compatible with the scikit-learn API, all steps must define fit. All non-last steps must also define transform. See Combining Estimators for more details.

transform_input : list of str, default=None

The names of the metadata parameters that should be transformed by the pipeline before passing it to the step consuming it.

This enables input arguments to fit (other than X) to be transformed by the steps of the pipeline up to the step that requires them. The requirement is defined via metadata routing. For instance, this can be used to pass a validation set through the pipeline.

You can only set this if metadata routing is enabled, which you can enable using sklearn.set_config(enable_metadata_routing=True).

Added in version 1.6.

memory : str or object with the joblib.Memory interface, default=None

Used to cache the fitted transformers of the pipeline. The last step will never be cached, even if it is a transformer. By default, no caching is performed. If a string is given, it is the path to the caching directory. Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. Use the attribute named_steps or steps to inspect estimators within the pipeline. Caching the transformers is advantageous when fitting is time consuming. See sphx_glr_auto_examples_neighbors_plot_caching_nearest_neighbors.py for an example on how to enable caching.

verbose : bool, default=False

If True, the time elapsed while fitting each step will be printed as it is completed.

Attributes

named_steps : Bunch

Dictionary-like object, with the following attributes. Read-only attribute to access any step parameter by user given name. Keys are step names and values are steps parameters.

classes_ : ndarray of shape (n_classes,)

The class labels. Only exists if the last step of the pipeline is a classifier.

n_features_in_ : int

Number of features seen during fit. Only defined if the underlying first estimator in steps exposes such an attribute when fit.

Added in version 0.24.

feature_names_in_ : ndarray of shape (n_features_in_,)

Names of features seen during fit. Only defined if the underlying estimator exposes such an attribute when fit.

Added in version 1.0.

See Also

make_pipeline : Convenience function for simplified pipeline construction.

Examples

>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=0)
>>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
>>> # The pipeline can be used as any other estimator
>>> # and avoids leaking the test set into the train set
>>> pipe.fit(X_train, y_train).score(X_test, y_test)
0.88
>>> # An estimator's parameter can be set using '__' syntax
>>> pipe.set_params(svc__C=10).fit(X_train, y_train).score(X_test, y_test)
0.76
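A further sketch, not part of the original example: enabling transformer caching via the memory parameter; the temporary directory is illustrative.

>>> import tempfile
>>> cached_pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())],
...                        memory=tempfile.mkdtemp())
>>> _ = cached_pipe.fit(X_train, y_train)  # fitted transformers are cached on disk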
static _load_class(step_config)[source]

Loads a class from a YAML configuration dictionary and returns an instance of it.

Parameters:
  • steps (list)

  • memory (str)

  • verbose (bool)