minerva.models.nets.vit
Classes
- _Encoder: Transformer Model Encoder for sequence to sequence translation.
- _VisionTransformerBackbone: Vision Transformer as per https://arxiv.org/abs/2010.11929.
Module Contents
- class minerva.models.nets.vit._Encoder(seq_length, num_layers, num_heads, hidden_dim, mlp_dim, dropout, attention_dropout, aux_output=False, aux_output_layers=None, norm_layer=partial(nn.LayerNorm, eps=1e-06))
Bases:
torch.nn.Module
Transformer Model Encoder for sequence to sequence translation.
- Parameters:
seq_length (int)
num_layers (int)
num_heads (int)
hidden_dim (int)
mlp_dim (int)
dropout (float)
attention_dropout (float)
aux_output (bool)
aux_output_layers (List[int] | None)
norm_layer (Callable[..., torch.nn.Module])
- forward(input)
- Parameters:
input (torch.Tensor)
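A minimal usage sketch of _Encoder, assuming (as in torchvision's ViT encoder) that it consumes an already-embedded token sequence of shape (batch, seq_length, hidden_dim); all configuration values below are illustrative, not library defaults:

import torch
from minerva.models.nets.vit import _Encoder

# Illustrative ViT-B-style configuration (values are assumptions, not library defaults).
encoder = _Encoder(
    seq_length=197,        # e.g. 14 * 14 patches + 1 class token for 224x224 images with 16x16 patches
    num_layers=12,
    num_heads=12,
    hidden_dim=768,
    mlp_dim=3072,
    dropout=0.0,
    attention_dropout=0.0,
)

tokens = torch.randn(2, 197, 768)  # assumed input: embedded tokens (batch, seq_length, hidden_dim)
out = encoder(tokens)
print(out.shape)  # expected torch.Size([2, 197, 768]) if the encoder preserves the sequence shape

With aux_output=True and aux_output_layers set, forward presumably also returns the outputs of the selected intermediate layers; check the implementation for the exact return structure.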
- class minerva.models.nets.vit._VisionTransformerBackbone(image_size, patch_size, num_layers, num_heads, hidden_dim, mlp_dim, dropout=0.0, attention_dropout=0.0, num_classes=1000, aux_output=False, aux_output_layers=None, norm_layer=partial(nn.LayerNorm, eps=1e-06), conv_stem_configs=None)
Bases:
torch.nn.Module
Vision Transformer as per https://arxiv.org/abs/2010.11929.
Initializes a Vision Transformer (ViT) model.
Parameters
- image_size : int or tuple[int, int]
The size of the input image. If an int is provided, it is assumed to be a square image. If a tuple of ints is provided, it represents the height and width of the image.
- patch_size : int
The size of each patch in the image.
- num_layers : int
The number of transformer layers in the model.
- num_heads : int
The number of attention heads in the transformer layers.
- hidden_dim : int
The dimensionality of the hidden layers in the transformer.
- mlp_dim : int
The dimensionality of the feed-forward MLP layers in the transformer.
- dropout : float, optional
The dropout rate to apply. Defaults to 0.0.
- attention_dropout : float, optional
The dropout rate to apply to the attention weights. Defaults to 0.0.
- num_classes : int, optional
The number of output classes. Defaults to 1000.
- norm_layer : Callable[..., torch.nn.Module], optional
The normalization layer to use. Defaults to nn.LayerNorm with epsilon=1e-6.
- conv_stem_configs : List[ConvStemConfig], optional
The configuration for the convolutional stem layers. If provided, the input image will be processed by these convolutional layers before being passed to the transformer. Defaults to None. See the sketch after this parameter list.
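As a hedged illustration of the conv_stem_configs parameter, the sketch below builds a small convolutional stem using torchvision's ConvStemConfig (the type listed in the class parameter list at the end of this entry); the channel widths, kernel sizes, and strides are arbitrary example choices:

from torchvision.models.vision_transformer import ConvStemConfig

# ConvStemConfig is a NamedTuple; out_channels, kernel_size, and stride are required,
# and the norm/activation layers fall back to torchvision's defaults.
conv_stem_configs = [
    ConvStemConfig(out_channels=64, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=128, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=256, kernel_size=3, stride=2),
    ConvStemConfig(out_channels=512, kernel_size=3, stride=2),
]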
- _process_input(x)
Process the input tensor and return the reshaped tensor and dimensions.
- Args:
x (torch.Tensor): The input tensor.
- Returns:
tuple[torch.Tensor, int, int]: The reshaped tensor, number of rows, and number of columns.
- Parameters:
x (torch.Tensor)
- Return type:
tuple[torch.Tensor, int, int]
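For reference, a short sketch of the assumed shape bookkeeping behind _process_input (patterned on torchvision's ViT, where each patch_size x patch_size patch is projected to a hidden_dim-dimensional token):

# Assumed contract: x of shape (N, C, H, W) is turned into a patch-token tensor
# of shape (N, n_h * n_w, hidden_dim), returned together with n_h and n_w.
image_size, patch_size, hidden_dim = 224, 16, 768  # illustrative values
n_h = image_size // patch_size   # 14 rows of patches
n_w = image_size // patch_size   # 14 columns of patches
num_patches = n_h * n_w          # 196 patch tokens per image
print(n_h, n_w, num_patches)     # 14 14 196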
- forward(x)
Forward pass of the Vision Transformer Backbone.
- Args:
x (torch.Tensor): The input tensor.
- Returns:
torch.Tensor: The output tensor.
- Parameters:
x (torch.Tensor)
- Parameters:
image_size (int | tuple[int, int])
patch_size (int)
num_layers (int)
num_heads (int)
hidden_dim (int)
mlp_dim (int)
dropout (float)
attention_dropout (float)
num_classes (int)
aux_output (bool)
aux_output_layers (List[int] | None)
norm_layer (Callable[..., torch.nn.Module])
conv_stem_configs (Optional[List[torchvision.models.vision_transformer.ConvStemConfig]])
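Putting it together, a minimal usage sketch of the backbone, assuming it accepts an image batch in (N, C, H, W) layout; the configuration mirrors a ViT-B/16-style setup and is illustrative rather than a library default:

import torch
from minerva.models.nets.vit import _VisionTransformerBackbone

backbone = _VisionTransformerBackbone(
    image_size=224,
    patch_size=16,
    num_layers=12,
    num_heads=12,
    hidden_dim=768,
    mlp_dim=3072,
    dropout=0.0,
    attention_dropout=0.0,
    num_classes=1000,
)

images = torch.randn(2, 3, 224, 224)  # assumed (N, C, H, W) input
output = backbone(images)
print(output.shape)  # exact shape depends on whether a classification head is applied

When aux_output=True and aux_output_layers is given, the forward pass presumably also returns the selected intermediate encoder outputs; verify the exact return structure against the implementation before relying on it.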