minerva.models.nets.vit

Classes

_Encoder

Transformer model encoder for sequence-to-sequence translation.

_VisionTransformerBackbone

Vision Transformer as per https://arxiv.org/abs/2010.11929.

Module Contents

class minerva.models.nets.vit._Encoder(seq_length, num_layers, num_heads, hidden_dim, mlp_dim, dropout, attention_dropout, aux_output=False, aux_output_layers=None, norm_layer=partial(nn.LayerNorm, eps=1e-06))

Bases: torch.nn.Module

Transformer model encoder for sequence-to-sequence translation.

Parameters:
  • seq_length (int)

  • num_layers (int)

  • num_heads (int)

  • hidden_dim (int)

  • mlp_dim (int)

  • dropout (float)

  • attention_dropout (float)

  • aux_output (bool)

  • aux_output_layers (List[int] | None)

  • norm_layer (Callable[..., torch.nn.Module])

forward(input)

Forward pass of the Transformer encoder.

Parameters:

input (torch.Tensor)
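
A minimal usage sketch for this internal class. It assumes, as in torchvision's encoder, that the input is a token sequence of shape (batch, seq_length, hidden_dim) and that a learned positional embedding is added internally; the configuration values below are illustrative, not defaults:

from functools import partial

import torch
from torch import nn

from minerva.models.nets.vit import _Encoder

# Illustrative ViT-B-like configuration; all values are assumptions.
encoder = _Encoder(
    seq_length=197,  # e.g. 196 patch tokens + 1 class token
    num_layers=12,
    num_heads=12,
    hidden_dim=768,
    mlp_dim=3072,
    dropout=0.0,
    attention_dropout=0.0,
    norm_layer=partial(nn.LayerNorm, eps=1e-6),
)

tokens = torch.randn(2, 197, 768)  # (batch, seq_length, hidden_dim)
out = encoder(tokens)              # expected shape: (2, 197, 768)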

class minerva.models.nets.vit._VisionTransformerBackbone(image_size, patch_size, num_layers, num_heads, hidden_dim, mlp_dim, dropout=0.0, attention_dropout=0.0, num_classes=1000, aux_output=False, aux_output_layers=None, norm_layer=partial(nn.LayerNorm, eps=1e-06), conv_stem_configs=None)

Bases: torch.nn.Module

Vision Transformer as per https://arxiv.org/abs/2010.11929.

Initializes a Vision Transformer (ViT) model.

Parameters

image_size : int or tuple[int, int]

The size of the input image. If an int is provided, the image is assumed to be square. If a tuple of ints is provided, it represents the height and width of the image.

patch_size : int

The size of each patch in the image.

num_layers : int

The number of transformer layers in the model.

num_heads : int

The number of attention heads in the transformer layers.

hidden_dim : int

The dimensionality of the hidden layers in the transformer.

mlp_dim : int

The dimensionality of the feed-forward MLP layers in the transformer.

dropout : float, optional

The dropout rate to apply. Defaults to 0.0.

attention_dropout : float, optional

The dropout rate to apply to the attention weights. Defaults to 0.0.

num_classes : int, optional

The number of output classes. Defaults to 1000.

aux_output : bool, optional

Whether to also return auxiliary outputs from intermediate encoder layers. Defaults to False.

aux_output_layers : List[int] | None, optional

The indices of the encoder layers whose activations are returned as auxiliary outputs when aux_output is True. Defaults to None.

norm_layer : Callable[..., torch.nn.Module], optional

The normalization layer to use. Defaults to nn.LayerNorm with epsilon=1e-6.

conv_stem_configs : List[ConvStemConfig], optional

The configuration for the convolutional stem layers. If provided, the input image is processed by these convolutional layers before being passed to the transformer. Defaults to None. See the sketch after this parameter list.
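
A sketch of a plausible conv_stem_configs value. ConvStemConfig comes from torchvision.models.vision_transformer; the stage sizes below are assumptions, chosen so that four stride-2 convolutions downsample by a factor of 16, matching patch_size=16:

from torchvision.models.vision_transformer import ConvStemConfig

# Hypothetical four-stage convolutional stem. out_channels, kernel_size,
# and stride are ConvStemConfig fields; norm and activation layers keep
# their defaults (BatchNorm2d and ReLU).
conv_stem_configs = [
    ConvStemConfig(out_channels=48, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=96, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=192, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=384, kernel_size=3, stride=2),
]
# Pass as conv_stem_configs=conv_stem_configs when constructing
# _VisionTransformerBackbone with patch_size=16.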

_process_input(x)

Process the input tensor and return the reshaped tensor and dimensions.

Args:

x (torch.Tensor): The batch of input images, of shape (n, c, h, w).

Returns:

tuple[torch.Tensor, int, int]: The patchified tensor of shape (n, n_h * n_w, hidden_dim), together with the number of patch rows (n_h) and patch columns (n_w).

Parameters:

x (torch.Tensor)

Return type:

tuple[torch.Tensor, int, int]
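
The shape bookkeeping behind _process_input can be illustrated with torchvision-style patchification. This is a sketch of the assumed behaviour, not the method's actual implementation:

import torch
from torch import nn

# Hypothetical walk-through of ViT patchification, illustrating the
# values _process_input is documented to return.
n, c, h, w = 2, 3, 224, 224                   # a batch of two RGB images
patch_size, hidden_dim = 16, 768

n_h, n_w = h // patch_size, w // patch_size   # 14 x 14 patch grid
seq_length = n_h * n_w                        # 196 patch tokens

x = torch.randn(n, c, h, w)
# A conv whose kernel and stride equal patch_size embeds each patch:
proj = nn.Conv2d(c, hidden_dim, kernel_size=patch_size, stride=patch_size)
tokens = proj(x).flatten(2).transpose(1, 2)   # (n, seq_length, hidden_dim)
print(tokens.shape, n_h, n_w)                 # torch.Size([2, 196, 768]) 14 14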

forward(x)

Forward pass of the Vision Transformer Backbone.

Args:

x (torch.Tensor): The batch of input images.

Returns:

torch.Tensor: The output tensor.

Parameters:

x (torch.Tensor)

Parameters:
  • image_size (int | tuple[int, int])

  • patch_size (int)

  • num_layers (int)

  • num_heads (int)

  • hidden_dim (int)

  • mlp_dim (int)

  • dropout (float)

  • attention_dropout (float)

  • num_classes (int)

  • aux_output (bool)

  • aux_output_layers (List[int] | None)

  • norm_layer (Callable[..., torch.nn.Module])

  • conv_stem_configs (Optional[List[torchvision.models.vision_transformer.ConvStemConfig]])
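
A minimal end-to-end sketch of the backbone (an internal class; the ViT-B/16-like values are illustrative assumptions, not module defaults):

import torch

from minerva.models.nets.vit import _VisionTransformerBackbone

# Illustrative ViT-B/16-like configuration; values are assumptions,
# not defaults prescribed by this module.
backbone = _VisionTransformerBackbone(
    image_size=224,
    patch_size=16,
    num_layers=12,
    num_heads=12,
    hidden_dim=768,
    mlp_dim=3072,
    num_classes=1000,
)

images = torch.randn(2, 3, 224, 224)  # (batch, channels, height, width)
logits = backbone(images)             # expected shape: (2, 1000)

With aux_output=True and aux_output_layers set, activations from the configured intermediate layers are also exposed; the exact return structure is defined by the implementation.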