minerva.models.nets.image.vit_local.vit¶
Classes¶
Attention | Multi-head self-attention module.
Block | Transformer block module.
LayerScale | LayerScale module.
ParallelScalingBlock | Parallel Scaling Vision Transformer block.
ParallelThingsBlock | Parallel Things Vision Transformer block.
ResPostBlock | Residual Post-Norm Transformer block.
VisionTransformer | Vision Transformer (ViT)
Functions¶
get_init_weights_vit |
global_pool_nlc |
init_weights_vit_jax | ViT weight initialization, matching JAX (Flax) impl
init_weights_vit_moco | ViT weight initialization, matching moco-v3 impl minus fixed PatchEmbed
init_weights_vit_timm | ViT weight initialization, original timm impl (for reproducibility)
resize_pos_embed | Rescale the grid of position embeddings when loading from state_dict.
Module Contents¶
- class minerva.models.nets.image.vit_local.vit.Attention(dim, num_heads=8, qkv_bias=False, qk_norm=False, proj_bias=True, attn_drop=0.0, proj_drop=0.0, norm_layer=nn.LayerNorm)[source]¶
Bases: torch.nn.Module
Multi-head self-attention module.
This class implements the standard multi-head attention mechanism used in Transformer architectures. It supports both standard and fused attention implementations for improved performance when available.
Initialize the Attention module.
Parameters¶
- dimint
Total dimension of the input and output features.
- num_headsint, default=8
Number of attention heads.
- qkv_biasbool, default=False
If True, add a bias term to the query, key, and value projections.
- qk_normbool, default=False
If True, apply normalization to query and key tensors.
- proj_biasbool, default=True
If True, include bias in the output projection layer.
- attn_dropfloat, default=0.0
Dropout rate applied to the attention weights.
- proj_dropfloat, default=0.0
Dropout rate applied after the output projection.
- norm_layerType[nn.Module], default=nn.LayerNorm
Normalization layer type applied to query and key vectors when qk_norm=True.
- attn_drop¶
- forward(x)[source]¶
Forward pass of the multi-head attention mechanism.
Parameters¶
- xtorch.Tensor
Input tensor of shape (B, N, C), where B is the batch size, N is the sequence length, and C is the feature dimension.
Returns¶
- torch.Tensor
Output tensor of the same shape as input (B, N, C), containing the attended feature representations.
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
- fused_attn: torch.jit.Final[bool]¶
- head_dim¶
- k_norm¶
- num_heads = 8¶
- proj¶
- proj_drop¶
- q_norm¶
- qkv¶
- scale¶
- Parameters:
dim (int)
num_heads (int)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
attn_drop (float)
proj_drop (float)
norm_layer (Type[torch.nn.Module])
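As a rough illustration of the mechanism described above, the following dependency-free sketch runs scaled dot-product attention for a single head on plain Python lists instead of tensors. It is not the module's implementation: the helper names are hypothetical, and both the split head_dim = dim // num_heads and the scaling factor head_dim ** -0.5 are assumed from the standard Transformer formulation rather than taken from the source.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v):
    """Scaled dot-product attention for one head.

    q, k, v are lists of N token vectors, each of length head_dim.
    Hypothetical helper, not part of the documented module.
    """
    head_dim = len(q[0])
    scale = head_dim ** -0.5  # assumed standard scaling
    out = []
    for q_i in q:
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        weights = softmax(scores)  # one weight per key token, summing to 1
        out.append([
            sum(w * v_j[d] for w, v_j in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out


# With dim=768 and num_heads=12, each head would see 64 features (assumed split).
dim, num_heads = 768, 12
head_dim = dim // num_heads

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N=3 tokens, head_dim=2
attended = single_head_attention(tokens, tokens, tokens)
```

The output keeps one vector per input token, which mirrors the (B, N, C) in, (B, N, C) out contract documented for forward.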
- class minerva.models.nets.image.vit_local.vit.Block(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]¶
Bases: torch.nn.Module
Transformer block module.
Initialize the Transformer block.
Parameters¶
- dimint
Embedding dimension of the input and output features.
- num_headsint
Number of attention heads in the self-attention layer.
- mlp_ratiofloat, default=4.0
Expansion ratio for the hidden dimension in the MLP layer.
- qkv_biasbool, default=False
If True, add bias to the query, key, and value projections.
- qk_normbool, default=False
If True, apply normalization to query and key tensors.
- proj_biasbool, default=True
If True, include bias in the projection layers.
- proj_dropfloat, default=0.0
Dropout rate applied to the output of the attention and MLP layers.
- attn_dropfloat, default=0.0
Dropout rate applied to the attention weights.
- init_valuesfloat, optional
If specified, enables LayerScale with this initial scaling value.
- drop_pathfloat, default=0.0
Stochastic depth rate; set > 0 to apply DropPath regularization.
- act_layerType[nn.Module], default=nn.GELU
Activation function used in the MLP layer.
- norm_layerType[nn.Module], default=nn.LayerNorm
Normalization layer type applied before attention and MLP.
- mlp_layerType[nn.Module], default=Mlp
Module type used for the feed-forward network.
- attn¶
- drop_path1¶
- drop_path2¶
- ls1¶
- ls2¶
- mlp¶
- norm1¶
- norm2¶
- Parameters:
dim (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
proj_drop (float)
attn_drop (float)
init_values (Optional[float])
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])
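The norm_layer description above says normalization is applied before attention and before the MLP, which suggests a pre-norm residual layout. The sketch below traces that data flow with toy scalar functions. The helper name, the scale1/scale2 arguments standing in for LayerScale, and the int(dim * mlp_ratio) hidden width are all assumptions for illustration; DropPath is omitted, as it would act as an identity outside training.

```python
def pre_norm_block(x, norm1, attn, norm2, mlp, scale1=1.0, scale2=1.0):
    """Pre-norm residual update (hypothetical helper).

    Each branch normalizes its input first, then its (optionally scaled)
    output is added back onto the residual stream.
    """
    x = x + scale1 * attn(norm1(x))
    x = x + scale2 * mlp(norm2(x))
    return x


# Toy stand-ins so the data flow can be traced by hand.
def identity(t):
    return t


def double(t):
    return 2.0 * t


def add_one(t):
    return t + 1.0


y = pre_norm_block(1.0, norm1=identity, attn=double, norm2=identity, mlp=add_one)

# Hidden width of the MLP for dim=768, mlp_ratio=4.0 (assumed int(dim * mlp_ratio)).
mlp_hidden = int(768 * 4.0)
```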
- class minerva.models.nets.image.vit_local.vit.LayerScale(dim, init_values=1e-05, inplace=False)[source]¶
Bases: torch.nn.Module
LayerScale module.
Initialize the LayerScale module.
Parameters¶
- dimint
Number of feature dimensions (channels) to scale.
- init_valuesfloat, default=1e-5
Initial value for the learnable scaling parameter.
- inplacebool, default=False
If True, performs the scaling operation in-place to save memory.
- forward(x)[source]¶
Forward pass applying per-channel scaling to the input tensor.
Parameters¶
- xtorch.Tensor
Input tensor of shape (B, N, C) or (B, C, H, W), depending on context.
Returns¶
- torch.Tensor
Scaled tensor of the same shape as the input.
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
- gamma¶
- inplace = False¶
- Parameters:
dim (int)
init_values (float)
inplace (bool)
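The per-channel scaling described for forward can be sketched without tensors as follows. This is a minimal illustration on nested lists, assuming channels sit on the last axis and that the learnable scale starts as a vector filled with init_values; layer_scale is a hypothetical helper, not the module itself.

```python
def layer_scale(x, gamma):
    """Multiply each channel of every token by its own scale factor.

    x is a list of tokens, each a list of C floats; gamma has length C.
    Hypothetical helper standing in for the module's forward pass.
    """
    return [[value * g for value, g in zip(token, gamma)] for token in x]


dim, init_values = 3, 1e-5
gamma = [init_values] * dim              # assumed initial state of the learnable scale
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]   # N=2 tokens, C=3 channels
scaled = layer_scale(x, gamma)
```

With a small init_values, each residual branch contributes very little at the start of training, which is the usual motivation for this kind of scaling.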
- class minerva.models.nets.image.vit_local.vit.ParallelScalingBlock(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=None)[source]¶
Bases: torch.nn.Module
Parallel Scaling Vision Transformer block.
This module implements a parallel Transformer block that computes the multi-head self-attention and MLP branches concurrently and then combines their outputs. The design follows the architecture from “Scaling Vision Transformers to 22 Billion Parameters” (https://arxiv.org/abs/2302.05442).
The block includes LayerScale for stable deep scaling, optional DropPath for stochastic depth regularization, and supports fused attention when available for performance efficiency.
Initialize the ParallelScalingBlock.
Parameters¶
- dimint
Embedding dimension of the input and output features.
- num_headsint
Number of attention heads in the multi-head self-attention layer.
- mlp_ratiofloat, default=4.0
Expansion ratio for the hidden dimension in the MLP branch.
- qkv_biasbool, default=False
If True, add bias to the query, key, and value projections.
- qk_normbool, default=False
If True, apply normalization to the query and key tensors.
- proj_biasbool, default=True
If True, include bias in the output projection layers.
- proj_dropfloat, default=0.0
Dropout rate applied after the projection layers.
- attn_dropfloat, default=0.0
Dropout rate applied to the attention weights.
- init_valuesfloat, optional
If specified, enables LayerScale with this initialization value.
- drop_pathfloat, default=0.0
Stochastic depth rate; set > 0 to apply DropPath regularization.
- act_layerType[nn.Module], default=nn.GELU
Activation function used in the MLP branch.
- norm_layerType[nn.Module], default=nn.LayerNorm
Normalization layer applied before the parallel branches.
- mlp_layerType[nn.Module], optional
Optional custom MLP implementation; defaults to a standard linear MLP.
- attn_drop¶
- attn_out_proj¶
- drop_path¶
- forward(x)[source]¶
Forward pass of the Parallel Scaling Transformer block.
Parameters¶
- xtorch.Tensor
Input tensor of shape (B, N, C), where B is batch size, N is sequence length, and C is embedding dimension.
Returns¶
- torch.Tensor
Output tensor of shape (B, N, C), containing the updated feature representations.
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
- fused_attn: torch.jit.Final[bool]¶
- head_dim¶
- in_norm¶
- in_proj¶
- in_split¶
- k_norm¶
- ls¶
- mlp_act¶
- mlp_drop¶
- mlp_out_proj¶
- num_heads¶
- q_norm¶
- scale¶
- Parameters:
dim (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
proj_drop (float)
attn_drop (float)
init_values (Optional[float])
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Optional[Type[torch.nn.Module]])
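The in_proj and in_split attributes listed above suggest that a single fused projection feeds both branches and is then cut into chunks. The sketch below shows only that bookkeeping on a plain list. The chunk order (MLP hidden features first, then query, key and value) and the int(dim * mlp_ratio) hidden width are assumptions, and both helper names are hypothetical.

```python
def fused_projection_split(dim, mlp_ratio=4.0):
    """Sizes of the chunks a single fused input projection would be split into.

    Assumed layout: one MLP hidden chunk followed by query, key and value.
    """
    mlp_hidden = int(dim * mlp_ratio)
    return [mlp_hidden, dim, dim, dim]


def split_features(features, sizes):
    """Cut a flat feature list into consecutive chunks of the given sizes."""
    chunks, start = [], 0
    for size in sizes:
        chunks.append(features[start:start + size])
        start += size
    return chunks


sizes = fused_projection_split(dim=8, mlp_ratio=4.0)   # [32, 8, 8, 8]
projected = list(range(sum(sizes)))                    # stand-in for one token's projected features
mlp_hidden, q, k, v = split_features(projected, sizes)
```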
- class minerva.models.nets.image.vit_local.vit.ParallelThingsBlock(dim, num_heads, num_parallel=2, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, init_values=None, proj_drop=0.0, attn_drop=0.0, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]¶
Bases: torch.nn.Module
Parallel Things Vision Transformer block.
This module implements a Transformer block that processes the input through multiple parallel attention layers followed by multiple parallel MLP layers. The outputs of each parallel branch are summed together, enabling a richer representation and improved learning capacity.
The design follows the architecture from “Three Things Everyone Should Know About Vision Transformers” (https://arxiv.org/abs/2203.09795).
Initialize the ParallelThingsBlock.
Parameters¶
- dimint
Embedding dimension of the input and output features.
- num_headsint
Number of attention heads in each attention branch.
- num_parallelint, default=2
Number of parallel attention and MLP branches.
- mlp_ratiofloat, default=4.0
Expansion ratio for the hidden dimension in the MLP layers.
- qkv_biasbool, default=False
If True, add bias to the query, key, and value projections.
- qk_normbool, default=False
If True, apply normalization to query and key tensors.
- proj_biasbool, default=True
If True, include bias in the projection layers.
- init_valuesfloat, optional
If specified, enables LayerScale with this initialization value.
- proj_dropfloat, default=0.0
Dropout rate applied to the output of the projection layers.
- attn_dropfloat, default=0.0
Dropout rate applied to the attention weights.
- drop_pathfloat, default=0.0
Stochastic depth rate; set > 0 to apply DropPath regularization.
- act_layerType[nn.Module], default=nn.GELU
Activation function used in the MLP layers.
- norm_layerType[nn.Module], default=nn.LayerNorm
Normalization layer type applied in each sub-block.
- mlp_layerType[nn.Module], default=Mlp
Module type used for the feed-forward MLP networks.
- attns¶
- ffns¶
- forward(x)[source]¶
Forward pass of the ParallelThingsBlock.
Parameters¶
- xtorch.Tensor
Input tensor of shape (B, N, C).
Returns¶
- torch.Tensor
Output tensor of the same shape (B, N, C), representing the combined outputs from the parallel attention and MLP branches.
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
- num_parallel = 2¶
- Parameters:
dim (int)
num_heads (int)
num_parallel (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
init_values (Optional[float])
proj_drop (float)
attn_drop (float)
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])
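The summation over parallel branches described above can be traced with a small sketch. Here each branch is a plain callable on a scalar; in the real block each branch would presumably bundle its own normalization, attention or MLP, and optional LayerScale and DropPath. The helper and toy branch names are hypothetical.

```python
def parallel_things_block(x, attn_branches, ffn_branches):
    """Sum parallel attention branches, then parallel MLP branches.

    Each stage adds the summed branch outputs onto the residual stream.
    Hypothetical helper; branches are plain callables here.
    """
    x = x + sum(branch(x) for branch in attn_branches)
    x = x + sum(branch(x) for branch in ffn_branches)
    return x


def halve(t):
    return 0.5 * t


def negate(t):
    return -t


# num_parallel=2: two attention stand-ins and two MLP stand-ins.
y = parallel_things_block(2.0, attn_branches=[halve, halve], ffn_branches=[halve, negate])
```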
- class minerva.models.nets.image.vit_local.vit.ResPostBlock(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]¶
Bases: torch.nn.Module
Residual Post-Norm Transformer block.
Initialize the ResPostBlock. The constructor takes the same arguments, with the same defaults, as Block.
- Parameters:
dim (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
proj_drop (float)
attn_drop (float)
init_values (Optional[float])
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])
- attn¶
- drop_path1¶
- drop_path2¶
- forward(x)[source]¶
Forward pass of the Residual Post-Norm Transformer block.
The input tensor passes through attention and MLP sublayers, each followed by normalization and residual connections. DropPath is optionally applied for regularization.
Parameters¶
- xtorch.Tensor
Input tensor of shape (B, N, C), where B is batch size, N is sequence length, and C is embedding dimension.
Returns¶
- torch.Tensor
Output tensor of the same shape (B, N, C), representing the transformed features.
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
- init_values = None¶
- mlp¶
- norm1¶
- norm2¶
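The post-norm ordering described for forward (sublayer, then normalization, then residual add) can be sketched as below. The toy clamp function merely stands in for a normalization layer so that the effect is visible on scalars; all names are hypothetical and DropPath is left out.

```python
def res_post_block(x, attn, norm1, mlp, norm2):
    """Post-norm residual update (hypothetical helper).

    Normalization is applied to each branch output before the residual add.
    """
    x = x + norm1(attn(x))
    x = x + norm2(mlp(x))
    return x


def clamp_unit(t):
    """Toy 'normalization' that bounds a value to [-1, 1]."""
    return max(-1.0, min(1.0, t))


def times_ten(t):
    return 10.0 * t


y = res_post_block(0.5, attn=times_ten, norm1=clamp_unit, mlp=times_ten, norm2=clamp_unit)
```

Because the branch output is normalized before it is added, each sublayer can shift the residual stream by at most the range the normalization allows, here one unit per sublayer.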
- class minerva.models.nets.image.vit_local.vit.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, qk_norm=False, proj_bias=True, init_values=None, class_token=True, pos_embed='learn', no_embed_class=False, reg_tokens=0, pre_norm=False, dynamic_img_size=False, dynamic_img_pad=False, pos_drop_rate=0.0, patch_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, weight_init='', fix_init=False, embed_norm_layer=None, norm_layer=None, act_layer=None, block_fn=Block, mlp_layer=Mlp)[source]¶
Bases: torch.nn.Module
Vision Transformer (ViT)
A PyTorch implementation of the Vision Transformer architecture from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929).
This model divides an input image into fixed-size patches, embeds them, adds positional information, and processes them through a sequence of Transformer encoder blocks to learn global image representations.
Initialize the Vision Transformer model.
Parameters¶
- img_sizeint or tuple of int, default=224
Input image size (height, width).
- patch_sizeint or tuple of int, default=16
Size of each image patch.
- in_chansint, default=3
Number of input channels (e.g., 3 for RGB images).
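- num_classesint, default=1000
Number of output classes for classification. This argument does not appear in the constructor signature shown above, and forward returns encoded token features rather than class logits.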
- embed_dimint, default=768
Dimension of the patch embeddings.
- depthint, default=12
Number of Transformer encoder blocks.
- num_headsint, default=12
Number of attention heads per block.
- mlp_ratiofloat, default=4.0
Expansion ratio for the MLP hidden dimension.
- qkv_biasbool, default=True
If True, include bias in the query, key, and value projections.
- qk_normbool, default=False
If True, apply normalization to query and key vectors.
- proj_biasbool, default=True
If True, include bias in projection layers.
- init_valuesfloat, optional
Initial value for LayerScale; if None, LayerScale is disabled.
- class_tokenbool, default=True
If True, use a learnable class token.
- pos_embed{'', 'none', 'learn'}, default='learn'
Type of positional embedding; 'learn' enables learnable embeddings.
- no_embed_classbool, default=False
If True, exclude class and reg tokens from position embedding.
- reg_tokensint, default=0
Number of register tokens.
- pre_normbool, default=False
If True, apply normalization before Transformer blocks.
- dynamic_img_sizebool, default=False
If True, enables dynamic image resizing during inference.
- dynamic_img_padbool, default=False
If True, apply padding to dynamically sized images.
- drop_ratefloat, default=0.0
Dropout rate applied globally. This argument does not appear in the constructor signature shown above.
- pos_drop_ratefloat, default=0.0
Dropout rate applied to positional embeddings.
- patch_drop_ratefloat, default=0.0
Probability of randomly dropping patch tokens during training.
- proj_drop_ratefloat, default=0.0
Dropout rate applied to projection layers.
- attn_drop_ratefloat, default=0.0
Dropout rate applied to attention weights.
- drop_path_ratefloat, default=0.0
Stochastic depth drop rate across layers.
- weight_init{'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''
Weight initialization strategy.
- fix_initbool, default=False
If True, rescales initialization following original ViT heuristics.
- embed_norm_layernn.Module, optional
Normalization layer applied to embeddings.
- norm_layernn.Module, optional
Normalization layer applied to Transformer blocks.
- act_layernn.Module, optional
Activation function used in MLP layers.
- block_fnnn.Module, default=Block
Type of Transformer block used.
- mlp_layernn.Module, default=Mlp
Type of MLP module used in each block.
- blocks¶
- cls_token¶
- dynamic_img_size: torch.jit.Final[bool]¶
- feature_info¶
- forward(x)[source]¶
Forward pass of the Vision Transformer.
Parameters¶
- xtorch.Tensor
Input image tensor of shape (B, C, H, W), where B is batch size, C is number of channels, and H, W are image dimensions.
Returns¶
- torch.Tensor
Encoded features of shape (B, N, D), where N is the number of patches (plus any prefix tokens) and D is the embedding dimension.
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
- grad_checkpointing = False¶
- has_class_token = True¶
- in_channels = 3¶
- no_embed_class = False¶
- norm_pre¶
- num_prefix_tokens = 1¶
- num_reg_tokens = 0¶
- patch_embed¶
- patch_size¶
- pos_drop¶
- reg_token¶
- set_input_size(img_size=None, patch_size=None)[source]¶
Update the input image resolution and patch size.
Parameters¶
- img_sizetuple of int, optional
New input resolution; if None, the current resolution is kept.
- patch_sizetuple of int, optional
New patch size; if None, the existing patch size is kept.
- Parameters:
img_size (Optional[Tuple[int, int]])
patch_size (Optional[Tuple[int, int]])
- Parameters:
img_size (Union[int, Tuple[int, int]])
patch_size (Union[int, Tuple[int, int]])
in_chans (int)
embed_dim (int)
depth (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
init_values (Optional[float])
class_token (bool)
pos_embed (str)
no_embed_class (bool)
reg_tokens (int)
pre_norm (bool)
dynamic_img_size (bool)
dynamic_img_pad (bool)
pos_drop_rate (float)
patch_drop_rate (float)
proj_drop_rate (float)
attn_drop_rate (float)
drop_path_rate (float)
weight_init (Literal['skip', 'jax', 'jax_nlhb', 'moco', ''])
fix_init (bool)
embed_norm_layer (Optional[timm.layers.LayerType])
norm_layer (Optional[timm.layers.LayerType])
act_layer (Optional[timm.layers.LayerType])
block_fn (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])
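To make the (B, N, D) output shape of forward concrete, the arithmetic can be sketched as follows. The helper is hypothetical and assumes a square image whose side is divisible by the patch size, with the prefix made up of one optional class token plus any register tokens.

```python
def vit_output_shape(batch, img_size=224, patch_size=16, embed_dim=768,
                     class_token=True, reg_tokens=0):
    """Expected (B, N, D) shape of the encoded features (hypothetical helper).

    Assumes a square image whose side is divisible by the patch size.
    """
    grid = img_size // patch_size
    num_patches = grid * grid
    num_prefix_tokens = (1 if class_token else 0) + reg_tokens
    return (batch, num_patches + num_prefix_tokens, embed_dim)


default_shape = vit_output_shape(batch=2)
with_registers = vit_output_shape(batch=2, reg_tokens=4)
```

With the default arguments this gives a 14 x 14 patch grid plus one class token, so N is 197.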
- minerva.models.nets.image.vit_local.vit.get_init_weights_vit(mode='jax', head_bias=0.0)[source]¶
- Parameters:
mode (str)
head_bias (float)
- Return type:
Callable
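Since this function returns a Callable, one plausible reading is that it selects one of the init_weights_vit_* functions by mode. The sketch below shows that kind of dispatch with stub initializers. The substring matching rule, the fallback, and every name in the block are assumptions for illustration, not the documented behaviour.

```python
from functools import partial


def init_jax(module, name="", head_bias=0.0):
    """Stub initializer; reports which scheme ran and with what head bias."""
    return ("jax", head_bias)


def init_moco(module, name=""):
    return ("moco", None)


def init_timm(module, name=""):
    return ("timm", None)


def pick_init(mode="jax", head_bias=0.0):
    """Return an initializer callable for the given mode (assumed dispatch)."""
    if "jax" in mode:
        # Bind head_bias now so the result has the plain (module, name) call shape.
        return partial(init_jax, head_bias=head_bias)
    if "moco" in mode:
        return init_moco
    return init_timm


init_fn = pick_init("jax_nlhb", head_bias=-1.0)
result = init_fn(module=None, name="head")
```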
- minerva.models.nets.image.vit_local.vit.global_pool_nlc(x, pool_type='token', num_prefix_tokens=1, reduce_include_prefix=False)[source]¶
- Parameters:
x (torch.Tensor)
pool_type (str)
num_prefix_tokens (int)
reduce_include_prefix (bool)
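No description is given for this function, so the following is only a guess at what the parameter names imply: 'token' pooling picks the first (class) token, while average pooling reduces over the tokens, optionally skipping the prefix. The sketch works on one sample as a nested list and covers just those two modes; pool_tokens is a hypothetical helper.

```python
def pool_tokens(tokens, pool_type="token", num_prefix_tokens=1,
                reduce_include_prefix=False):
    """Pool one sample's (N, C) token list into a single C-vector.

    Hypothetical helper covering only the 'token' and 'avg' modes.
    """
    if pool_type == "token":
        return tokens[0]                      # first (class) token
    if pool_type == "avg":
        kept = tokens if reduce_include_prefix else tokens[num_prefix_tokens:]
        channels = len(kept[0])
        return [sum(t[c] for t in kept) / len(kept) for c in range(channels)]
    raise ValueError(f"unsupported pool_type: {pool_type!r}")


sample = [[9.0, 9.0], [1.0, 2.0], [3.0, 4.0]]   # class token + 2 patch tokens
cls_vec = pool_tokens(sample, "token")
avg_vec = pool_tokens(sample, "avg")
avg_all = pool_tokens(sample, "avg", reduce_include_prefix=True)
```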
- minerva.models.nets.image.vit_local.vit.init_weights_vit_jax(module, name='', head_bias=0.0)[source]¶
ViT weight initialization, matching JAX (Flax) impl
- Parameters:
module (torch.nn.Module)
name (str)
head_bias (float)
- Return type:
None
- minerva.models.nets.image.vit_local.vit.init_weights_vit_moco(module, name='')[source]¶
ViT weight initialization, matching moco-v3 impl minus fixed PatchEmbed
- Parameters:
module (torch.nn.Module)
name (str)
- Return type:
None
- minerva.models.nets.image.vit_local.vit.init_weights_vit_timm(module, name='')[source]¶
ViT weight initialization, original timm impl (for reproducibility)
- Parameters:
module (torch.nn.Module)
name (str)
- Return type:
None
- minerva.models.nets.image.vit_local.vit.resize_pos_embed(posemb, posemb_new, num_prefix_tokens=1, gs_new=(), interpolation='bicubic', antialias=False)[source]¶
Rescale the grid of position embeddings when loading from state_dict.
DEPRECATED: this function is being deprecated in favour of resample_abs_pos_embed.
- Parameters:
posemb (torch.Tensor)
posemb_new (torch.Tensor)
num_prefix_tokens (int)
gs_new (Tuple[int, int])
interpolation (str)
antialias (bool)
- Return type:
torch.Tensor
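The bookkeeping behind rescaling a position-embedding grid can be sketched without tensors. The helper below only works out the old and new grid sizes; the interpolation step itself (bicubic by default, per the signature) is omitted. It assumes the non-prefix tokens form a square grid and that an empty gs_new means the new grid is derived from the new token count; grid_sizes is a hypothetical name.

```python
import math


def grid_sizes(num_tokens_old, num_tokens_new, num_prefix_tokens=1, gs_new=()):
    """Work out old and new grid sizes for a position-embedding table.

    Hypothetical helper: only the bookkeeping is shown, not the interpolation.
    Assumes the non-prefix tokens form a square grid.
    """
    side_old = math.isqrt(num_tokens_old - num_prefix_tokens)
    if side_old * side_old != num_tokens_old - num_prefix_tokens:
        raise ValueError("old grid tokens do not form a square")
    if not gs_new:
        side_new = math.isqrt(num_tokens_new - num_prefix_tokens)
        gs_new = (side_new, side_new)
    return (side_old, side_old), tuple(gs_new)


# 224px input with 16px patches: 14x14 grid plus 1 class token = 197 tokens.
# 384px input with 16px patches: 24x24 grid plus 1 class token = 577 tokens.
old_grid, new_grid = grid_sizes(197, 577, num_prefix_tokens=1)
```

Prefix tokens are excluded from the grid, so only the patch positions would be interpolated while the class token's embedding is carried over unchanged.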