minerva.models.nets.image.vit_local.vit¶

Classes¶

`Attention`	Multi-head self-attention module.
`Block`	Transformer block module.
`LayerScale`	LayerScale module.
`ParallelScalingBlock`	Parallel Scaling Vision Transformer block.
`ParallelThingsBlock`	Parallel Things Vision Transformer block.
`ResPostBlock`	Residual Post-Norm Transformer block.
`VisionTransformer`	Vision Transformer (ViT)

Functions¶

`get_init_weights_vit`([mode, head_bias])
`global_pool_nlc`(x[, pool_type, num_prefix_tokens, ...])
`init_weights_vit_jax`(module[, name, head_bias])	ViT weight initialization, matching JAX (Flax) impl
`init_weights_vit_moco`(module[, name])	ViT weight initialization, matching moco-v3 impl minus fixed PatchEmbed
`init_weights_vit_timm`(module[, name])	ViT weight initialization, original timm impl (for reproducibility)
`resize_pos_embed`(posemb, posemb_new[, ...])	Rescale the grid of position embeddings when loading from state_dict.

Module Contents¶

class minerva.models.nets.image.vit_local.vit.Attention(dim, num_heads=8, qkv_bias=False, qk_norm=False, proj_bias=True, attn_drop=0.0, proj_drop=0.0, norm_layer=nn.LayerNorm)[source]¶

Bases: torch.nn.Module

Multi-head self-attention module.

This class implements the standard multi-head attention mechanism used in Transformer architectures. It supports both standard and fused attention implementations for improved performance when available.

Initialize the Attention module.

Parameters¶

dim (int): Total dimension of the input and output features.
num_heads (int, default=8): Number of attention heads.
qkv_bias (bool, default=False): If True, add a bias term to the query, key, and value projections.
qk_norm (bool, default=False): If True, apply normalization to query and key tensors.
proj_bias (bool, default=True): If True, include bias in the output projection layer.
attn_drop (float, default=0.0): Dropout rate applied to the attention weights.
proj_drop (float, default=0.0): Dropout rate applied after the output projection.
norm_layer (Type[nn.Module], default=nn.LayerNorm): Normalization layer type applied to query and key vectors when qk_norm=True.

attn_drop¶

forward(x)[source]¶

Forward pass of the multi-head attention mechanism.

Parameters¶

x (torch.Tensor): Input tensor of shape (B, N, C), where B is the batch size, N is the sequence length, and C is the feature dimension.

Returns¶

torch.Tensor: Output tensor of the same shape as input (B, N, C), containing the attended feature representations.
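
The attention computation described above can be sketched for a single head in plain Python. This is a framework-free illustration, not this module's implementation: it assumes the usual formulation in which each head receives `dim // num_heads` channels and scores are scaled by `head_dim ** -0.5`. The helper names `split_head_dim` and `single_head_attention` are hypothetical.

```python
import math


def split_head_dim(dim, num_heads):
    """Per-head channel count, assuming dim is split evenly across heads."""
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    return dim // num_heads


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v):
    """Scaled dot-product attention for one head.

    q, k, v are lists of N token vectors, each of length head_dim.
    Returns N output vectors with the same length as the value vectors.
    """
    head_dim = len(q[0])
    scale = head_dim ** -0.5  # assumed scaling: 1 / sqrt(head_dim)
    out = []
    for q_i in q:
        # Similarity of this query with every key, then normalize to weights.
        scores = [scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k]
        weights = softmax(scores)
        # Weighted average of the value vectors, channel by channel.
        out.append([
            sum(w * v_j[d] for w, v_j in zip(weights, v))
            for d in range(len(v[0]))
        ])
    return out
```

A full multi-head layer would run this once per head on its slice of the channels, concatenate the results, and apply the output projection and dropout.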

Parameters: x (torch.Tensor)
Return type: torch.Tensor

fused_attn: torch.jit.Final[bool]¶

head_dim¶

k_norm¶

num_heads = 8¶

proj¶

proj_drop¶

q_norm¶

qkv¶

scale¶

Parameters:

dim (int)
num_heads (int)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
attn_drop (float)
proj_drop (float)
norm_layer (Type[torch.nn.Module])

class minerva.models.nets.image.vit_local.vit.Block(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]¶

Bases: torch.nn.Module

Transformer block module.

Initialize the Transformer block.

Parameters¶

dim (int): Embedding dimension of the input and output features.
num_heads (int): Number of attention heads in the self-attention layer.
mlp_ratio (float, default=4.0): Expansion ratio for the hidden dimension in the MLP layer.
qkv_bias (bool, default=False): If True, add bias to the query, key, and value projections.
qk_norm (bool, default=False): If True, apply normalization to query and key tensors.
proj_bias (bool, default=True): If True, include bias in the projection layers.
proj_drop (float, default=0.0): Dropout rate applied to the output of the attention and MLP layers.
attn_drop (float, default=0.0): Dropout rate applied to the attention weights.
init_values (float, optional): If specified, enables LayerScale with this initial scaling value.
drop_path (float, default=0.0): Stochastic depth rate; set > 0 to apply DropPath regularization.
act_layer (Type[nn.Module], default=nn.GELU): Activation function used in the MLP layer.
norm_layer (Type[nn.Module], default=nn.LayerNorm): Normalization layer type applied before attention and MLP.
mlp_layer (Type[nn.Module], default=Mlp): Module type used for the feed-forward network.

attn¶

drop_path1¶

drop_path2¶

forward(x)[source]¶

Parameters: x (torch.Tensor)
Return type: torch.Tensor

ls1¶

ls2¶

mlp¶

norm1¶

norm2¶

Parameters:

dim (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
proj_drop (float)
attn_drop (float)
init_values (Optional[float])
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])
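
The attribute names listed for `Block` (`norm1`, `attn`, `ls1`, `drop_path1`, `norm2`, `mlp`, `ls2`, `drop_path2`) suggest a pre-norm residual layout, consistent with `norm_layer` being applied before attention and MLP. A minimal sketch of that data flow on a single token vector follows; the exact ordering inside this module is an assumption, and `pre_norm_block`, `identity`, and `double` are hypothetical names used only for illustration.

```python
def pre_norm_block(x, norm1, attn, ls1, norm2, mlp, ls2):
    """Pre-norm residual block on one token vector (list of floats).

    Each argument after x is a callable mapping a vector to a vector.
    DropPath is omitted, which corresponds to drop_path=0.0 or eval mode.
    """
    branch = ls1(attn(norm1(x)))            # normalize, attend, scale
    x = [a + b for a, b in zip(x, branch)]  # first residual connection
    branch = ls2(mlp(norm2(x)))             # normalize, feed-forward, scale
    x = [a + b for a, b in zip(x, branch)]  # second residual connection
    return x


def identity(v):
    """Stand-in for a normalization or LayerScale layer that does nothing."""
    return list(v)


def double(v):
    """Stand-in sublayer that doubles its input."""
    return [2.0 * a for a in v]


# Toy usage: both sublayers double their input, norms and scales are identity.
y = pre_norm_block([1.0, -1.0], identity, double, identity,
                   identity, double, identity)
```

With these stand-ins the first residual step gives `[3.0, -3.0]` and the second gives `[9.0, -9.0]`, which makes the two additive skip connections easy to trace by hand.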

class minerva.models.nets.image.vit_local.vit.LayerScale(dim, init_values=1e-05, inplace=False)[source]¶

Bases: torch.nn.Module

LayerScale module.

Initialize the LayerScale module.

Parameters¶

dim (int): Number of feature dimensions (channels) to scale.
init_values (float, default=1e-5): Initial value for the learnable scaling parameter.
inplace (bool, default=False): If True, performs the scaling operation in-place to save memory.

forward(x)[source]¶

Forward pass applying per-channel scaling to the input tensor.

Parameters¶

x (torch.Tensor): Input tensor of shape (B, N, C) or (B, C, H, W), depending on context.

Returns¶

torch.Tensor: Scaled tensor of the same shape as the input.
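
The per-channel scaling can be illustrated without any framework. The sketch below assumes `gamma` is a vector of length `dim`, initialised to `init_values` and multiplied along the channel (last) dimension of a channels-last input; `make_gamma` and `layer_scale` are hypothetical helper names.

```python
def make_gamma(dim, init_values=1e-5):
    """Per-channel scale vector, initialised to a constant (learnable in practice)."""
    return [init_values] * dim


def layer_scale(tokens, gamma):
    """Multiply every token vector elementwise by gamma (channels last)."""
    return [[g * c for g, c in zip(gamma, token)] for token in tokens]


# Two tokens with three channels each, scaled by 0.5 per channel.
gamma = make_gamma(dim=3, init_values=0.5)
scaled = layer_scale([[2.0, 4.0, 6.0], [1.0, 1.0, 1.0]], gamma)
```

With the small default of 1e-5, a residual branch starts out contributing almost nothing, so each block begins close to the identity mapping.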

Parameters: x (torch.Tensor)
Return type: torch.Tensor

gamma¶

inplace = False¶

Parameters:

dim (int)
init_values (float)
inplace (bool)

class minerva.models.nets.image.vit_local.vit.ParallelScalingBlock(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=None)[source]¶

Bases: torch.nn.Module

Parallel Scaling Vision Transformer block.

This module implements a parallel Transformer block that computes the multi-head self-attention and MLP branches concurrently and then combines their outputs. The design follows the architecture from “Scaling Vision Transformers to 22 Billion Parameters” (https://arxiv.org/abs/2302.05442).

The block supports optional LayerScale (enabled when init_values is set), optional DropPath for stochastic depth regularization, and fused attention when available.

Initialize the ParallelScalingBlock.

Parameters¶

dim (int): Embedding dimension of the input and output features.
num_heads (int): Number of attention heads in the multi-head self-attention layer.
mlp_ratio (float, default=4.0): Expansion ratio for the hidden dimension in the MLP branch.
qkv_bias (bool, default=False): If True, add bias to the query, key, and value projections.
qk_norm (bool, default=False): If True, apply normalization to the query and key tensors.
proj_bias (bool, default=True): If True, include bias in the output projection layers.
proj_drop (float, default=0.0): Dropout rate applied after the projection layers.
attn_drop (float, default=0.0): Dropout rate applied to the attention weights.
init_values (float, optional): If specified, enables LayerScale with this initialization value.
drop_path (float, default=0.0): Stochastic depth rate; set > 0 to apply DropPath regularization.
act_layer (Type[nn.Module], default=nn.GELU): Activation function used in the MLP branch.
norm_layer (Type[nn.Module], default=nn.LayerNorm): Normalization layer applied before the parallel branches.
mlp_layer (Type[nn.Module], optional): Optional custom MLP implementation; defaults to a standard linear MLP.

attn_drop¶

attn_out_proj¶

drop_path¶

forward(x)[source]¶

Forward pass of the Parallel Scaling Transformer block.

Parameters¶

x (torch.Tensor): Input tensor of shape (B, N, C), where B is batch size, N is sequence length, and C is embedding dimension.

Returns¶

torch.Tensor: Output tensor of shape (B, N, C), containing the updated feature representations.
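
The parallel layout can be sketched in plain Python. The attributes `in_proj` and `in_split` suggest one fused input projection whose output is split into an MLP hidden part plus query, key, and value; the split widths used here (`[mlp_hidden_dim, dim, dim, dim]`) and the way the branches are combined are assumptions, and all function names are hypothetical.

```python
def in_split_sizes(dim, mlp_ratio=4.0):
    """Assumed widths of the fused input projection: [mlp_hidden, q, k, v]."""
    mlp_hidden_dim = int(mlp_ratio * dim)
    return [mlp_hidden_dim, dim, dim, dim]


def split(vector, sizes):
    """Split a flat list into consecutive chunks of the given sizes."""
    chunks, start = [], 0
    for size in sizes:
        chunks.append(vector[start:start + size])
        start += size
    return chunks


def parallel_block(x, norm, attn_branch, mlp_branch, scale=1.0):
    """Compute x + scale * (attn_branch(norm(x)) + mlp_branch(norm(x))).

    A single shared normalization feeds both branches; `scale` stands in
    for LayerScale, and DropPath is omitted.
    """
    h = norm(x)
    a, m = attn_branch(h), mlp_branch(h)
    return [xi + scale * (ai + mi) for xi, ai, mi in zip(x, a, m)]
```

For `dim=768` and `mlp_ratio=4.0` the assumed fused projection would produce 3072 + 3 * 768 = 5376 output channels per token, computed in one matrix multiply instead of two.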

Parameters: x (torch.Tensor)
Return type: torch.Tensor

fused_attn: torch.jit.Final[bool]¶

head_dim¶

in_norm¶

in_proj¶

in_split¶

k_norm¶

ls¶

mlp_act¶

mlp_drop¶

mlp_out_proj¶

num_heads¶

q_norm¶

scale¶

Parameters:

dim (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
proj_drop (float)
attn_drop (float)
init_values (Optional[float])
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Optional[Type[torch.nn.Module]])

class minerva.models.nets.image.vit_local.vit.ParallelThingsBlock(dim, num_heads, num_parallel=2, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, init_values=None, proj_drop=0.0, attn_drop=0.0, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]¶

Bases: torch.nn.Module

Parallel Things Vision Transformer block.

This module implements a Transformer block that processes the input through multiple parallel attention layers followed by multiple parallel MLP layers. The outputs of each parallel branch are summed together, enabling a richer representation and improved learning capacity.

The design follows the architecture from “Three Things Everyone Should Know About Vision Transformers” (https://arxiv.org/abs/2203.09795).

Initialize the ParallelThingsBlock.

Parameters¶

dim (int): Embedding dimension of the input and output features.
num_heads (int): Number of attention heads in each attention branch.
num_parallel (int, default=2): Number of parallel attention and MLP branches.
mlp_ratio (float, default=4.0): Expansion ratio for the hidden dimension in the MLP layers.
qkv_bias (bool, default=False): If True, add bias to the query, key, and value projections.
qk_norm (bool, default=False): If True, apply normalization to query and key tensors.
proj_bias (bool, default=True): If True, include bias in the projection layers.
init_values (float, optional): If specified, enables LayerScale with this initialization value.
proj_drop (float, default=0.0): Dropout rate applied to the output of the projection layers.
attn_drop (float, default=0.0): Dropout rate applied to the attention weights.
drop_path (float, default=0.0): Stochastic depth rate; set > 0 to apply DropPath regularization.
act_layer (Type[nn.Module], default=nn.GELU): Activation function used in the MLP layers.
norm_layer (Type[nn.Module], default=nn.LayerNorm): Normalization layer type applied in each sub-block.
mlp_layer (Type[nn.Module], default=Mlp): Module type used for the feed-forward MLP networks.

_forward(x)[source]¶

Parameters: x (torch.Tensor)
Return type: torch.Tensor

_forward_jit(x)[source]¶

Parameters: x (torch.Tensor)
Return type: torch.Tensor

attns¶

ffns¶

forward(x)[source]¶

Forward pass of the ParallelThingsBlock.

Parameters¶

x (torch.Tensor): Input tensor of shape (B, N, C).

Returns¶

torch.Tensor: Output tensor of the same shape (B, N, C), representing the combined outputs from the parallel attention and MLP branches.
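
The two-stage summation can be sketched on a single token vector. The sketch assumes each stage adds the summed branch outputs back onto the input as a residual, and that every branch callable already contains its own normalization; `parallel_things_block` is a hypothetical name.

```python
def parallel_things_block(x, attn_branches, ffn_branches):
    """Sum parallel attention branches, then sum parallel MLP branches.

    x is one token vector (list of floats). Each branch maps a vector to a
    vector. LayerScale and DropPath are omitted for brevity.
    """
    # Stage 1: every attention branch sees the same input; outputs are summed.
    attn_sum = [sum(vals) for vals in zip(*(branch(x) for branch in attn_branches))]
    x = [a + b for a, b in zip(x, attn_sum)]
    # Stage 2: every MLP branch sees the updated x; outputs are summed again.
    ffn_sum = [sum(vals) for vals in zip(*(branch(x) for branch in ffn_branches))]
    x = [a + b for a, b in zip(x, ffn_sum)]
    return x
```

With `num_parallel=2`, each list would hold two branches, so a block is wider rather than deeper for a similar parameter count.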

Parameters: x (torch.Tensor)
Return type: torch.Tensor

num_parallel = 2¶

Parameters:

dim (int)
num_heads (int)
num_parallel (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
init_values (Optional[float])
proj_drop (float)
attn_drop (float)
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])

class minerva.models.nets.image.vit_local.vit.ResPostBlock(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]¶

Bases: torch.nn.Module

Residual Post-Norm Transformer block.

Initialize the ResPostBlock.

Parameters:

dim (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
proj_drop (float)
attn_drop (float)
init_values (Optional[float])
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])

attn¶

drop_path1¶

drop_path2¶

forward(x)[source]¶

Forward pass of the Residual Post-Norm Transformer block.

The input tensor passes through attention and MLP sublayers, each followed by normalization and residual connections. DropPath is optionally applied for regularization.

Parameters¶

x (torch.Tensor): Input tensor of shape (B, N, C), where B is batch size, N is sequence length, and C is embedding dimension.

Returns¶

torch.Tensor: Output tensor of the same shape (B, N, C), representing the transformed features.
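
The post-norm ordering described above, where each sublayer output is normalized before it is added back to the input, can be sketched as follows. This is an illustration under that reading, not the module's code: the normalization here has no learnable affine terms, DropPath is omitted, and `res_post_block` is a hypothetical name.

```python
import math


def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no affine terms)."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def res_post_block(x, attn, mlp, norm1=layer_norm, norm2=layer_norm):
    """Post-norm residual block: normalize each sublayer output, then add."""
    x = [a + b for a, b in zip(x, norm1(attn(x)))]  # attention sublayer
    x = [a + b for a, b in zip(x, norm2(mlp(x)))]   # MLP sublayer
    return x
```

Compared with a pre-norm block, the skip path carries the raw input while the branch contribution is always normalized before the addition.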

Parameters: x (torch.Tensor)
Return type: torch.Tensor

init_values = None¶

init_weights()[source]¶

Return type: None

mlp¶

norm1¶

norm2¶

class minerva.models.nets.image.vit_local.vit.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, qk_norm=False, proj_bias=True, init_values=None, class_token=True, pos_embed='learn', no_embed_class=False, reg_tokens=0, pre_norm=False, dynamic_img_size=False, dynamic_img_pad=False, pos_drop_rate=0.0, patch_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, weight_init='', fix_init=False, embed_norm_layer=None, norm_layer=None, act_layer=None, block_fn=Block, mlp_layer=Mlp)[source]¶

Bases: torch.nn.Module

Vision Transformer (ViT)

A PyTorch implementation of the Vision Transformer architecture from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929).

This model divides an input image into fixed-size patches, embeds them, adds positional information, and processes them through a sequence of Transformer encoder blocks to learn global image representations.

Initialize the Vision Transformer model.

Parameters¶

img_size (int or tuple of int, default=224): Input image size (height, width).
patch_size (int or tuple of int, default=16): Size of each image patch.
in_chans (int, default=3): Number of input channels (e.g., 3 for RGB images).
num_classes (int, default=1000): Number of output classes for classification. This parameter does not appear in the constructor signature above.
embed_dim (int, default=768): Dimension of the patch embeddings.
depth (int, default=12): Number of Transformer encoder blocks.
num_heads (int, default=12): Number of attention heads per block.
mlp_ratio (float, default=4.0): Expansion ratio for the MLP hidden dimension.
qkv_bias (bool, default=True): If True, include bias in the query, key, and value projections.
qk_norm (bool, default=False): If True, apply normalization to query and key vectors.
proj_bias (bool, default=True): If True, include bias in projection layers.
init_values (float, optional): Initial value for LayerScale; if None, LayerScale is disabled.
class_token (bool, default=True): If True, use a learnable class token.
pos_embed ({'', 'none', 'learn'}, default='learn'): Type of positional embedding; 'learn' enables learnable embeddings.
no_embed_class (bool, default=False): If True, exclude class and reg tokens from position embedding.
reg_tokens (int, default=0): Number of register tokens.
pre_norm (bool, default=False): If True, apply normalization before Transformer blocks.
dynamic_img_size (bool, default=False): If True, enables dynamic image resizing during inference.
dynamic_img_pad (bool, default=False): If True, apply padding to dynamically sized images.
drop_rate (float, default=0.0): Dropout rate applied globally. This parameter does not appear in the constructor signature above.
pos_drop_rate (float, default=0.0): Dropout rate applied to positional embeddings.
patch_drop_rate (float, default=0.0): Probability of randomly dropping patch tokens during training.
proj_drop_rate (float, default=0.0): Dropout rate applied to projection layers.
attn_drop_rate (float, default=0.0): Dropout rate applied to attention weights.
drop_path_rate (float, default=0.0): Stochastic depth drop rate across layers.
weight_init ({'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''): Weight initialization strategy.
fix_init (bool, default=False): If True, rescale the initialized weights via fix_init_weight().
embed_norm_layer (nn.Module, optional): Normalization layer applied to embeddings.
norm_layer (nn.Module, optional): Normalization layer applied to Transformer blocks.
act_layer (nn.Module, optional): Activation function used in MLP layers.
block_fn (nn.Module, default=Block): Type of Transformer block used.
mlp_layer (nn.Module, default=Mlp): Type of MLP module used in each block.

_init_weights(m)[source]¶

Parameters: m (torch.nn.Module)
Return type: None

_pos_embed(x)[source]¶

Parameters: x (torch.Tensor)
Return type: torch.Tensor

blocks¶

cls_token¶

dynamic_img_size: torch.jit.Final[bool]¶

feature_info¶

fix_init_weight()[source]¶

forward(x)[source]¶

Forward pass of the Vision Transformer.

Parameters¶

x (torch.Tensor): Input image tensor of shape (B, C, H, W), where B is batch size, C is number of channels, and H, W are image dimensions.

Returns¶

torch.Tensor: Encoded features of shape (B, N, D), where N is the number of patches (plus any prefix tokens) and D is the embedding dimension.
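
The sequence length N in the output shape follows from the image size, patch size, and prefix tokens. The arithmetic can be checked with a small helper; it assumes non-overlapping patches, an image size divisible by the patch size, and one prefix token for the class token plus one per `reg_tokens`. The names `to_2tuple` and `token_count` are hypothetical.

```python
def to_2tuple(value):
    """Accept an int or an (h, w) pair, as img_size and patch_size do."""
    return (value, value) if isinstance(value, int) else tuple(value)


def token_count(img_size=224, patch_size=16, class_token=True, reg_tokens=0):
    """Return (num_patches, num_prefix_tokens, N) for a ViT input."""
    img_h, img_w = to_2tuple(img_size)
    patch_h, patch_w = to_2tuple(patch_size)
    if img_h % patch_h or img_w % patch_w:
        raise ValueError("image size must be divisible by patch size")
    # Non-overlapping patches tile the image in a grid.
    num_patches = (img_h // patch_h) * (img_w // patch_w)
    num_prefix_tokens = (1 if class_token else 0) + reg_tokens
    return num_patches, num_prefix_tokens, num_patches + num_prefix_tokens


print(token_count())  # (196, 1, 197)
```

With the default arguments, a batch of shape (B, 3, 224, 224) would therefore yield features of shape (B, 197, 768), matching the `num_prefix_tokens = 1` attribute shown for this class.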

Parameters: x (torch.Tensor)
Return type: torch.Tensor

grad_checkpointing = False¶

has_class_token = True¶

in_channels = 3¶

init_weights(mode='')[source]¶

Parameters: mode (str)
Return type: None

no_embed_class = False¶

norm_pre¶

num_prefix_tokens = 1¶

num_reg_tokens = 0¶

patch_embed¶

patch_size¶

pos_drop¶

reg_token¶

set_input_size(img_size=None, patch_size=None)[source]¶

Update the input image resolution and patch size.

Args:

img_size: New input resolution; if None, the current resolution is used.
patch_size: New patch size; if None, the existing patch size is used.

Parameters:

img_size (Optional[Tuple[int, int]])
patch_size (Optional[Tuple[int, int]])

Parameters:

img_size (Union[int, Tuple[int, int]])
patch_size (Union[int, Tuple[int, int]])
in_chans (int)
embed_dim (int)
depth (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
init_values (Optional[float])
class_token (bool)
pos_embed (str)
no_embed_class (bool)
reg_tokens (int)
pre_norm (bool)
dynamic_img_size (bool)
dynamic_img_pad (bool)
pos_drop_rate (float)
patch_drop_rate (float)
proj_drop_rate (float)
attn_drop_rate (float)
drop_path_rate (float)
weight_init (Literal['skip', 'jax', 'jax_nlhb', 'moco', ''])
fix_init (bool)
embed_norm_layer (Optional[timm.layers.LayerType])
norm_layer (Optional[timm.layers.LayerType])
act_layer (Optional[timm.layers.LayerType])
block_fn (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])
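
The `drop_path_rate` parameter of `VisionTransformer` is described as a stochastic depth rate across layers. One common reading, used in timm-style ViT implementations, is a per-block rate that rises linearly from 0 at the first block to `drop_path_rate` at the last. Whether this module applies exactly this schedule is an assumption; `drop_path_schedule` is a hypothetical helper.

```python
def drop_path_schedule(drop_path_rate, depth):
    """Per-block DropPath rates rising linearly from 0 to drop_path_rate."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        return [0.0]
    step = drop_path_rate / (depth - 1)
    return [i * step for i in range(depth)]


# Twelve blocks with a maximum rate of 0.1: early blocks are rarely dropped.
rates = drop_path_schedule(0.1, 12)
```

Under this schedule, shallow blocks are almost always kept during training while deeper blocks are skipped more often.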

minerva.models.nets.image.vit_local.vit.get_init_weights_vit(mode='jax', head_bias=0.0)[source]¶

Parameters:

mode (str)
head_bias (float)

Return type:

Callable

minerva.models.nets.image.vit_local.vit.global_pool_nlc(x, pool_type='token', num_prefix_tokens=1, reduce_include_prefix=False)[source]¶

Parameters:

x (torch.Tensor)
pool_type (str)
num_prefix_tokens (int)
reduce_include_prefix (bool)
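
No description is given for `global_pool_nlc`, but its parameter names suggest pooling a (B, N, C) token sequence into one vector per sample. The sketch below shows one plausible reading for a single sample, written in plain Python. The semantics of each `pool_type` value are assumptions, and `global_pool_tokens` is a hypothetical stand-in, not this function.

```python
def global_pool_tokens(tokens, pool_type="token", num_prefix_tokens=1,
                       reduce_include_prefix=False):
    """Pool the N token vectors of one sample into a single vector.

    Assumed semantics: "token" returns the first token; "avg" and "max"
    reduce over tokens, skipping prefix tokens unless
    reduce_include_prefix is True; an empty pool_type returns the input.
    """
    if not pool_type:
        return tokens
    if pool_type == "token":
        return list(tokens[0])
    kept = tokens if reduce_include_prefix else tokens[num_prefix_tokens:]
    columns = list(zip(*kept))  # one tuple of values per channel
    if pool_type == "avg":
        return [sum(col) / len(col) for col in columns]
    if pool_type == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pool_type: {pool_type!r}")
```

Under this reading, `num_prefix_tokens` controls how many leading tokens (class and register tokens) are excluded from an average or max over patch tokens.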

minerva.models.nets.image.vit_local.vit.init_weights_vit_jax(module, name='', head_bias=0.0)[source]¶

ViT weight initialization, matching JAX (Flax) impl

Parameters:

module (torch.nn.Module)
name (str)
head_bias (float)

Return type:

None

minerva.models.nets.image.vit_local.vit.init_weights_vit_moco(module, name='')[source]¶

ViT weight initialization, matching moco-v3 impl minus fixed PatchEmbed

Parameters:

module (torch.nn.Module)
name (str)

Return type:

None

minerva.models.nets.image.vit_local.vit.init_weights_vit_timm(module, name='')[source]¶

ViT weight initialization, original timm impl (for reproducibility)

Parameters:

module (torch.nn.Module)
name (str)

Return type:

None

minerva.models.nets.image.vit_local.vit.resize_pos_embed(posemb, posemb_new, num_prefix_tokens=1, gs_new=(), interpolation='bicubic', antialias=False)[source]¶

Rescale the grid of position embeddings when loading from state_dict.

DEPRECATED: This function is being deprecated in favour of resample_abs_pos_embed.

Parameters:

posemb (torch.Tensor)
posemb_new (torch.Tensor)
num_prefix_tokens (int)
gs_new (Tuple[int, int])
interpolation (str)
antialias (bool)

Return type:

torch.Tensor
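
The bookkeeping behind rescaling a position-embedding grid can be sketched without tensors. The helper below only works out the old and new grid shapes; it assumes the stored grid is square, that prefix tokens are carried over unchanged, and that an empty `gs_new` means a square target grid. The interpolation itself is not shown, and `pos_embed_grids` is a hypothetical name.

```python
import math


def pos_embed_grids(num_tokens_old, num_tokens_new, num_prefix_tokens=1, gs_new=()):
    """Return (old_grid, new_grid) as (height, width) pairs.

    Only the patch-grid embeddings would be interpolated from old_grid to
    new_grid; the first num_prefix_tokens embeddings are left as they are.
    """
    old_patches = num_tokens_old - num_prefix_tokens
    side_old = math.isqrt(old_patches)
    if side_old * side_old != old_patches:
        raise ValueError("old position grid is not square")
    if not gs_new:  # fall back to a square grid for the new size
        side_new = math.isqrt(num_tokens_new - num_prefix_tokens)
        gs_new = (side_new, side_new)
    return (side_old, side_old), tuple(gs_new)
```

For example, moving a checkpoint trained at 224 x 224 with 16 x 16 patches (197 tokens) to 384 x 384 inputs (577 tokens) would resize a 14 x 14 grid to 24 x 24.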