minerva.models.nets.vit

Classes

_Encoder

Transformer model encoder for sequence-to-sequence translation.

_VisionTransformerBackbone

Vision Transformer as per https://arxiv.org/abs/2010.11929.

Module Contents

class minerva.models.nets.vit._Encoder(seq_length, num_layers, num_heads, hidden_dim, mlp_dim, dropout, attention_dropout, aux_output=False, aux_output_layers=None, norm_layer=partial(nn.LayerNorm, eps=1e-06))

Bases: torch.nn.Module

Transformer model encoder for sequence-to-sequence translation.

Parameters:
  • seq_length (int)

  • num_layers (int)

  • num_heads (int)

  • hidden_dim (int)

  • mlp_dim (int)

  • dropout (float)

  • attention_dropout (float)

  • aux_output (bool)

  • aux_output_layers (List[int] | None)

  • norm_layer (Callable[..., torch.nn.Module])

forward(input)

Forward pass of the Transformer encoder.

Parameters:

input (torch.Tensor)
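
A minimal usage sketch for this internal class. It assumes, as in torchvision's encoder, that the input is a token sequence of shape (batch, seq_length, hidden_dim) and that a learned positional embedding is added internally; the configuration values below are illustrative, not defaults:

from functools import partial

import torch
from torch import nn

from minerva.models.nets.vit import _Encoder

# Illustrative ViT-B-like configuration; all values are assumptions.
encoder = _Encoder(
    seq_length=197,  # e.g. 196 patch tokens + 1 class token
    num_layers=12,
    num_heads=12,
    hidden_dim=768,
    mlp_dim=3072,
    dropout=0.0,
    attention_dropout=0.0,
    norm_layer=partial(nn.LayerNorm, eps=1e-6),
)

tokens = torch.randn(2, 197, 768)  # (batch, seq_length, hidden_dim)
out = encoder(tokens)              # expected shape: (2, 197, 768)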

class minerva.models.nets.vit._VisionTransformerBackbone(image_size, patch_size, num_layers, num_heads, hidden_dim, mlp_dim, dropout=0.0, attention_dropout=0.0, num_classes=1000, aux_output=False, aux_output_layers=None, norm_layer=partial(nn.LayerNorm, eps=1e-06), conv_stem_configs=None)

Bases: torch.nn.Module

Vision Transformer as per https://arxiv.org/abs/2010.11929.

Initializes a Vision Transformer (ViT) model.

Parameters

image_size : int or tuple[int, int]

The size of the input image. If an int is provided, the image is assumed to be square. If a tuple of ints is provided, it represents the height and width of the image.

patch_size : int

The size of each patch in the image.

num_layers : int

The number of transformer layers in the model.

num_heads : int

The number of attention heads in the transformer layers.

hidden_dim : int

The dimensionality of the hidden layers in the transformer.

mlp_dim : int

The dimensionality of the feed-forward MLP layers in the transformer.

dropout : float, optional

The dropout rate to apply. Defaults to 0.0.

attention_dropout : float, optional

The dropout rate to apply to the attention weights. Defaults to 0.0.

num_classes : int, optional

The number of output classes. Defaults to 1000.

aux_output : bool, optional

Whether to also return auxiliary outputs from intermediate encoder layers. Defaults to False.

aux_output_layers : List[int] | None, optional

The indices of the encoder layers whose activations are returned as auxiliary outputs when aux_output is True. Defaults to None.

norm_layer : Callable[..., torch.nn.Module], optional

The normalization layer to use. Defaults to nn.LayerNorm with epsilon=1e-6.

conv_stem_configs : List[ConvStemConfig], optional

The configuration for the convolutional stem layers. If provided, the input image is processed by these convolutional layers before being passed to the transformer. Defaults to None. See the sketch after this parameter list.
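
A sketch of a plausible conv_stem_configs value. ConvStemConfig comes from torchvision.models.vision_transformer; the stage sizes below are assumptions, chosen so that four stride-2 convolutions downsample by a factor of 16, matching patch_size=16:

from torchvision.models.vision_transformer import ConvStemConfig

# Hypothetical four-stage convolutional stem. out_channels, kernel_size,
# and stride are ConvStemConfig fields; norm and activation layers keep
# their defaults (BatchNorm2d and ReLU).
conv_stem_configs = [
    ConvStemConfig(out_channels=48, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=96, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=192, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=384, kernel_size=3, stride=2),
]
# Pass as conv_stem_configs=conv_stem_configs when constructing
# _VisionTransformerBackbone with patch_size=16.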

_process_input(x)

Process the input tensor and return the reshaped tensor and dimensions.

Args:

x (torch.Tensor): The batch of input images, of shape (n, c, h, w).

Returns:

tuple[torch.Tensor, int, int]: The patchified tensor of shape (n, n_h * n_w, hidden_dim), together with the number of patch rows (n_h) and patch columns (n_w).

Parameters:

x (torch.Tensor)

Return type:

tuple[torch.Tensor, int, int]
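
The shape bookkeeping behind _process_input can be illustrated with torchvision-style patchification. This is a sketch of the assumed behaviour, not the method's actual implementation:

import torch
from torch import nn

# Hypothetical walk-through of ViT patchification, illustrating the
# values _process_input is documented to return.
n, c, h, w = 2, 3, 224, 224                   # a batch of two RGB images
patch_size, hidden_dim = 16, 768

n_h, n_w = h // patch_size, w // patch_size   # 14 x 14 patch grid
seq_length = n_h * n_w                        # 196 patch tokens

x = torch.randn(n, c, h, w)
# A conv whose kernel and stride equal patch_size embeds each patch:
proj = nn.Conv2d(c, hidden_dim, kernel_size=patch_size, stride=patch_size)
tokens = proj(x).flatten(2).transpose(1, 2)   # (n, seq_length, hidden_dim)
print(tokens.shape, n_h, n_w)                 # torch.Size([2, 196, 768]) 14 14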

forward(x)

Forward pass of the Vision Transformer Backbone.

Args:

x (torch.Tensor): The batch of input images.

Returns:

torch.Tensor: The output tensor.

Parameters:

x (torch.Tensor)

Parameters:
  • image_size (int | tuple[int, int])

  • patch_size (int)

  • num_layers (int)

  • num_heads (int)

  • hidden_dim (int)

  • mlp_dim (int)

  • dropout (float)

  • attention_dropout (float)

  • num_classes (int)

  • aux_output (bool)

  • aux_output_layers (List[int] | None)

  • norm_layer (Callable[..., torch.nn.Module])

  • conv_stem_configs (Optional[List[torchvision.models.vision_transformer.ConvStemConfig]])
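
A minimal end-to-end sketch of the backbone (an internal class; the ViT-B/16-like values are illustrative assumptions, not module defaults):

import torch

from minerva.models.nets.vit import _VisionTransformerBackbone

# Illustrative ViT-B/16-like configuration; values are assumptions,
# not defaults prescribed by this module.
backbone = _VisionTransformerBackbone(
    image_size=224,
    patch_size=16,
    num_layers=12,
    num_heads=12,
    hidden_dim=768,
    mlp_dim=3072,
    num_classes=1000,
)

images = torch.randn(2, 3, 224, 224)  # (batch, channels, height, width)
logits = backbone(images)             # expected shape: (2, 1000)

With aux_output=True and aux_output_layers set, activations from the configured intermediate layers are also exposed; the exact return structure is defined by the implementation.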