minerva.models.nets.image.vit_local.vit

Classes

Attention

Multi-head self-attention module.

Block

Transformer block module.

LayerScale

LayerScale module.

ParallelScalingBlock

Parallel Scaling Vision Transformer block.

ParallelThingsBlock

Parallel Things Vision Transformer block.

ResPostBlock

Residual Post-Norm Transformer block.

VisionTransformer

Vision Transformer (ViT)

Functions

get_init_weights_vit([mode, head_bias])

global_pool_nlc(x[, pool_type, num_prefix_tokens, ...])

init_weights_vit_jax(module[, name, head_bias])

ViT weight initialization, matching JAX (Flax) impl

init_weights_vit_moco(module[, name])

ViT weight initialization, matching moco-v3 impl minus fixed PatchEmbed

init_weights_vit_timm(module[, name])

ViT weight initialization, original timm impl (for reproducibility)

resize_pos_embed(posemb, posemb_new[, ...])

Rescale the grid of position embeddings when loading from state_dict.

Module Contents

class minerva.models.nets.image.vit_local.vit.Attention(dim, num_heads=8, qkv_bias=False, qk_norm=False, proj_bias=True, attn_drop=0.0, proj_drop=0.0, norm_layer=nn.LayerNorm)[source]

Bases: torch.nn.Module

Multi-head self-attention module.

This class implements the standard multi-head attention mechanism used in Transformer architectures. It uses a fused attention implementation when one is available and otherwise falls back to the standard implementation.

Initialize the Attention module.

Parameters

dim : int

Total dimension of the input and output features.

num_heads : int, default=8

Number of attention heads.

qkv_bias : bool, default=False

If True, add a bias term to the query, key, and value projections.

qk_norm : bool, default=False

If True, apply normalization to query and key tensors.

proj_bias : bool, default=True

If True, include bias in the output projection layer.

attn_drop : float, default=0.0

Dropout rate applied to the attention weights.

proj_drop : float, default=0.0

Dropout rate applied after the output projection.

norm_layer : Type[nn.Module], default=nn.LayerNorm

Normalization layer type applied to query and key vectors when qk_norm=True.

attn_drop
forward(x)[source]

Forward pass of the multi-head attention mechanism.

Parameters

x : torch.Tensor

Input tensor of shape (B, N, C), where B is the batch size, N is the sequence length, and C is the feature dimension.

Returns

torch.Tensor

Output tensor of the same shape as input (B, N, C), containing the attended feature representations.

Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
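As an illustration of what the forward pass computes per head, here is a minimal plain-Python sketch of scaled dot-product attention. It assumes the conventional definitions head_dim = dim // num_heads and scale = head_dim ** -0.5; the helper names softmax and single_head_attention are hypothetical and are not part of this module, which operates on batched torch tensors instead of lists.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def single_head_attention(q, k, v, scale):
    """Scaled dot-product attention for one head.

    q, k and v are lists of N vectors, each of length head_dim.
    Returns the attended vectors and the (N x N) attention weights.
    """
    weights = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights.append(softmax(scores))
    out = [
        [sum(w * vj[d] for w, vj in zip(row, v)) for d in range(len(v[0]))]
        for row in weights
    ]
    return out, weights


dim, num_heads = 768, 8
head_dim = dim // num_heads   # assumed: features are split evenly across heads
scale = head_dim ** -0.5      # assumed: standard 1/sqrt(head_dim) scaling

# Toy input: N=3 tokens with head_dim=2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = single_head_attention(q, k, v, scale=2 ** -0.5)
```

Each row of weights sums to 1, and out keeps the (N, head_dim) shape of v, which is why the module's output shape matches its input shape once the heads are concatenated and projected.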

fused_attn: torch.jit.Final[bool]
head_dim
k_norm
num_heads = 8
proj
proj_drop
q_norm
qkv
scale
Parameters:
  • dim (int)

  • num_heads (int)

  • qkv_bias (bool)

  • qk_norm (bool)

  • proj_bias (bool)

  • attn_drop (float)

  • proj_drop (float)

  • norm_layer (Type[torch.nn.Module])

class minerva.models.nets.image.vit_local.vit.Block(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]

Bases: torch.nn.Module

Transformer block module.

Initialize the Transformer block.

Parameters

dim : int

Embedding dimension of the input and output features.

num_heads : int

Number of attention heads in the self-attention layer.

mlp_ratio : float, default=4.0

Expansion ratio for the hidden dimension in the MLP layer.

qkv_bias : bool, default=False

If True, add bias to the query, key, and value projections.

qk_norm : bool, default=False

If True, apply normalization to query and key tensors.

proj_bias : bool, default=True

If True, include bias in the projection layers.

proj_drop : float, default=0.0

Dropout rate applied to the output of the attention and MLP layers.

attn_drop : float, default=0.0

Dropout rate applied to the attention weights.

init_values : float, optional

If specified, enables LayerScale with this initial scaling value.

drop_path : float, default=0.0

Stochastic depth rate; set > 0 to apply DropPath regularization.

act_layer : Type[nn.Module], default=nn.GELU

Activation function used in the MLP layer.

norm_layer : Type[nn.Module], default=nn.LayerNorm

Normalization layer type applied before attention and MLP.

mlp_layer : Type[nn.Module], default=Mlp

Module type used for the feed-forward network.

attn
drop_path1
drop_path2
forward(x)[source]
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
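The forward method carries no description here, so the following plain-Python sketch shows the pre-norm residual pattern that the attribute names (norm1, attn, ls1, norm2, mlp, ls2) and the norm_layer description suggest. The exact ordering is an assumption based on the common ViT layout, DropPath is omitted because it acts as an identity at inference, and pre_norm_block and its stand-in sub-layers are hypothetical helpers.

```python
def add(a, b):
    """Element-wise sum of two equally long vectors (the residual connection)."""
    return [u + v for u, v in zip(a, b)]


def pre_norm_block(x, norm1, attn, ls1, norm2, mlp, ls2):
    """Pre-norm residual block: normalize, transform, scale, then add back."""
    x = add(x, ls1(attn(norm1(x))))   # attention sub-layer
    x = add(x, ls2(mlp(norm2(x))))    # feed-forward sub-layer
    return x


def identity(t):
    """Stand-in for a sub-layer that leaves its input unchanged."""
    return t


def half(t):
    """Stand-in for a LayerScale whose gamma is 0.5 in every channel."""
    return [0.5 * u for u in t]


y = pre_norm_block([1.0, 2.0], identity, identity, half, identity, identity, half)
```

With these stand-ins each sub-layer adds half of its input back onto the residual stream, so [1.0, 2.0] becomes [1.5, 3.0] after the first sub-layer and [2.25, 4.5] after the second.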

ls1
ls2
mlp
norm1
norm2
Parameters:
  • dim (int)

  • num_heads (int)

  • mlp_ratio (float)

  • qkv_bias (bool)

  • qk_norm (bool)

  • proj_bias (bool)

  • proj_drop (float)

  • attn_drop (float)

  • init_values (Optional[float])

  • drop_path (float)

  • act_layer (Type[torch.nn.Module])

  • norm_layer (Type[torch.nn.Module])

  • mlp_layer (Type[torch.nn.Module])

class minerva.models.nets.image.vit_local.vit.LayerScale(dim, init_values=1e-05, inplace=False)[source]

Bases: torch.nn.Module

LayerScale module.

Initialize the LayerScale module.

Parameters

dim : int

Number of feature dimensions (channels) to scale.

init_values : float, default=1e-5

Initial value for the learnable scaling parameter.

inplace : bool, default=False

If True, performs the scaling operation in-place to save memory.

forward(x)[source]

Forward pass applying per-channel scaling to the input tensor.

Parameters

x : torch.Tensor

Input tensor of shape (B, N, C) or (B, C, H, W), depending on context.

Returns

torch.Tensor

Scaled tensor of the same shape as the input.

Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
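Per-channel scaling can be pictured with a small plain-Python sketch. It assumes gamma holds one scale per channel, initialised to init_values, and that scaling multiplies along the last (channel) dimension; layer_scale is a hypothetical helper, not this module's API.

```python
def layer_scale(x, gamma):
    """Multiply each channel of every token by that channel's scale factor."""
    return [[value * g for value, g in zip(token, gamma)] for token in x]


dim, init_values = 4, 1e-5
gamma = [init_values] * dim   # assumed initial state: one learnable scale per channel

# One sequence of N=2 tokens with C=4 channels.
x = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
y = layer_scale(x, gamma)
```

Because gamma starts small, the scaled branch contributes almost nothing at initialisation, and the residual path dominates until training increases the scales.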

gamma
inplace = False
Parameters:
  • dim (int)

  • init_values (float)

  • inplace (bool)

class minerva.models.nets.image.vit_local.vit.ParallelScalingBlock(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=None)[source]

Bases: torch.nn.Module

Parallel Scaling Vision Transformer block.

This module implements a parallel Transformer block that computes the multi-head self-attention and MLP branches concurrently and then combines their outputs. The design follows the architecture from “Scaling Vision Transformers to 22 Billion Parameters” (https://arxiv.org/abs/2302.05442).

The block includes LayerScale for stable deep scaling, optional DropPath for stochastic depth regularization, and supports fused attention when available for performance efficiency.

Initialize the ParallelScalingBlock.

Parameters

dim : int

Embedding dimension of the input and output features.

num_heads : int

Number of attention heads in the multi-head self-attention layer.

mlp_ratio : float, default=4.0

Expansion ratio for the hidden dimension in the MLP branch.

qkv_bias : bool, default=False

If True, add bias to the query, key, and value projections.

qk_norm : bool, default=False

If True, apply normalization to the query and key tensors.

proj_bias : bool, default=True

If True, include bias in the output projection layers.

proj_drop : float, default=0.0

Dropout rate applied after the projection layers.

attn_drop : float, default=0.0

Dropout rate applied to the attention weights.

init_values : float, optional

If specified, enables LayerScale with this initialization value.

drop_path : float, default=0.0

Stochastic depth rate; set > 0 to apply DropPath regularization.

act_layer : Type[nn.Module], default=nn.GELU

Activation function used in the MLP branch.

norm_layer : Type[nn.Module], default=nn.LayerNorm

Normalization layer applied before the parallel branches.

mlp_layer : Type[nn.Module], optional

Optional custom MLP implementation; defaults to a standard linear MLP.

attn_drop
attn_out_proj
drop_path
forward(x)[source]

Forward pass of the Parallel Scaling Transformer block.

Parameters

x : torch.Tensor

Input tensor of shape (B, N, C), where B is batch size, N is sequence length, and C is embedding dimension.

Returns

torch.Tensor

Output tensor of shape (B, N, C), containing the updated feature representations.

Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
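The in_proj and in_split attributes suggest that a single fused projection feeds both branches. The sketch below shows how such a projection could be cut into an MLP chunk and query, key, and value chunks. The layout [mlp_hidden, q, k, v] mirrors the timm implementation and is an assumption about this module; parallel_split_sizes and split are hypothetical helpers.

```python
def parallel_split_sizes(dim, mlp_ratio=4.0):
    """Widths of the chunks cut from one fused input projection.

    Assumed layout: [mlp_hidden, q, k, v].
    """
    mlp_hidden_dim = int(mlp_ratio * dim)
    return [mlp_hidden_dim, dim, dim, dim]


def split(vector, sizes):
    """Cut a flat list into consecutive chunks of the given sizes."""
    chunks, start = [], 0
    for size in sizes:
        chunks.append(vector[start:start + size])
        start += size
    return chunks


sizes = parallel_split_sizes(dim=768)
fused_width = sum(sizes)   # output width of the single fused projection

# Toy split with dim=2 and mlp_ratio=2.0, i.e. widths [4, 2, 2, 2].
projected = list(range(10))
x_mlp, q, k, v = split(projected, parallel_split_sizes(2, 2.0))
```

Under these assumptions one matrix multiply produces the inputs for both branches, so the attention and MLP paths can proceed concurrently before their outputs are summed.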

fused_attn: torch.jit.Final[bool]
head_dim
in_norm
in_proj
in_split
k_norm
ls
mlp_act
mlp_drop
mlp_out_proj
num_heads
q_norm
scale
Parameters:
  • dim (int)

  • num_heads (int)

  • mlp_ratio (float)

  • qkv_bias (bool)

  • qk_norm (bool)

  • proj_bias (bool)

  • proj_drop (float)

  • attn_drop (float)

  • init_values (Optional[float])

  • drop_path (float)

  • act_layer (Type[torch.nn.Module])

  • norm_layer (Type[torch.nn.Module])

  • mlp_layer (Optional[Type[torch.nn.Module]])

class minerva.models.nets.image.vit_local.vit.ParallelThingsBlock(dim, num_heads, num_parallel=2, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, init_values=None, proj_drop=0.0, attn_drop=0.0, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]

Bases: torch.nn.Module

Parallel Things Vision Transformer block.

This module implements a Transformer block that processes the input through multiple parallel attention layers followed by multiple parallel MLP layers. Within each stage, the outputs of the parallel branches are summed before being passed on.

The design follows the architecture from “Three Things Everyone Should Know About Vision Transformers” (https://arxiv.org/abs/2203.09795).

Initialize the ParallelThingsBlock.

Parameters

dim : int

Embedding dimension of the input and output features.

num_heads : int

Number of attention heads in each attention branch.

num_parallel : int, default=2

Number of parallel attention and MLP branches.

mlp_ratio : float, default=4.0

Expansion ratio for the hidden dimension in the MLP layers.

qkv_bias : bool, default=False

If True, add bias to the query, key, and value projections.

qk_norm : bool, default=False

If True, apply normalization to query and key tensors.

proj_bias : bool, default=True

If True, include bias in the projection layers.

init_values : float, optional

If specified, enables LayerScale with this initialization value.

proj_drop : float, default=0.0

Dropout rate applied to the output of the projection layers.

attn_drop : float, default=0.0

Dropout rate applied to the attention weights.

drop_path : float, default=0.0

Stochastic depth rate; set > 0 to apply DropPath regularization.

act_layer : Type[nn.Module], default=nn.GELU

Activation function used in the MLP layers.

norm_layer : Type[nn.Module], default=nn.LayerNorm

Normalization layer type applied in each sub-block.

mlp_layer : Type[nn.Module], default=Mlp

Module type used for the feed-forward MLP networks.

_forward(x)[source]
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

_forward_jit(x)[source]
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

attns
ffns
forward(x)[source]

Forward pass of the ParallelThingsBlock.

Parameters

x : torch.Tensor

Input tensor of shape (B, N, C).

Returns

torch.Tensor

Output tensor of the same shape (B, N, C), representing the combined outputs from the parallel attention and MLP branches.

Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
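The two-stage summation described for this block can be sketched in plain Python. The residual additions around each stage are an assumption based on the usual Transformer layout, the branches are toy callables instead of real attention or MLP sub-blocks, and parallel_things_block and scale_by are hypothetical helpers.

```python
def parallel_things_block(x, attn_branches, ffn_branches):
    """Sum the parallel attention branches, then the parallel MLP branches."""
    attn_sum = [sum(vals) for vals in zip(*(branch(x) for branch in attn_branches))]
    x = [a + b for a, b in zip(x, attn_sum)]    # residual after the attention stage
    ffn_sum = [sum(vals) for vals in zip(*(branch(x) for branch in ffn_branches))]
    return [a + b for a, b in zip(x, ffn_sum)]  # residual after the MLP stage


def scale_by(factor):
    """Build a toy branch that multiplies its input by a constant."""
    return lambda t: [factor * u for u in t]


attns = [scale_by(1.0), scale_by(2.0)]   # num_parallel = 2 attention branches
ffns = [scale_by(0.5), scale_by(0.5)]    # num_parallel = 2 MLP branches
y = parallel_things_block([1.0, 2.0], attns, ffns)
```

Every branch in a stage sees the same input, which is what distinguishes this layout from stacking the same number of sub-layers sequentially.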

num_parallel = 2
Parameters:
  • dim (int)

  • num_heads (int)

  • num_parallel (int)

  • mlp_ratio (float)

  • qkv_bias (bool)

  • qk_norm (bool)

  • proj_bias (bool)

  • init_values (Optional[float])

  • proj_drop (float)

  • attn_drop (float)

  • drop_path (float)

  • act_layer (Type[torch.nn.Module])

  • norm_layer (Type[torch.nn.Module])

  • mlp_layer (Type[torch.nn.Module])

class minerva.models.nets.image.vit_local.vit.ResPostBlock(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)[source]

Bases: torch.nn.Module

Residual Post-Norm Transformer block.

Initialize the Residual Post-Norm Transformer block.

Parameters:
  • dim (int)

  • num_heads (int)

  • mlp_ratio (float)

  • qkv_bias (bool)

  • qk_norm (bool)

  • proj_bias (bool)

  • proj_drop (float)

  • attn_drop (float)

  • init_values (Optional[float])

  • drop_path (float)

  • act_layer (Type[torch.nn.Module])

  • norm_layer (Type[torch.nn.Module])

  • mlp_layer (Type[torch.nn.Module])

attn
drop_path1
drop_path2
forward(x)[source]

Forward pass of the Residual Post-Norm Transformer block.

The input tensor passes through attention and MLP sublayers, each followed by normalization and residual connections. DropPath is optionally applied for regularization.

Parameters

x : torch.Tensor

Input tensor of shape (B, N, C), where B is batch size, N is sequence length, and C is embedding dimension.

Returns

torch.Tensor

Output tensor of the same shape (B, N, C), representing the transformed features.

Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
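To make the post-norm ordering concrete, the sketch below normalizes the output of each sub-layer before adding it to the residual stream, as the forward description states. It is a plain-Python illustration with DropPath omitted; layer_norm (without learnable affine parameters) and res_post_block are hypothetical helpers, and the sub-layers are toy callables.

```python
import math


def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no affine parameters)."""
    mean = sum(v) / len(v)
    var = sum((u - mean) ** 2 for u in v) / len(v)
    return [(u - mean) / math.sqrt(var + eps) for u in v]


def res_post_block(x, attn, mlp):
    """Post-norm residual block: transform, normalize, then add back."""
    x = [a + b for a, b in zip(x, layer_norm(attn(x)))]   # attention sub-layer
    x = [a + b for a, b in zip(x, layer_norm(mlp(x)))]    # feed-forward sub-layer
    return x


def identity(t):
    """Stand-in for a sub-layer that leaves its input unchanged."""
    return t


y = res_post_block([1.0, 3.0], identity, identity)
```

In contrast to a pre-norm block, the residual stream itself is never normalized here; only the branch outputs are.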

init_values = None
init_weights()[source]
Return type:

None

mlp
norm1
norm2
class minerva.models.nets.image.vit_local.vit.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, qk_norm=False, proj_bias=True, init_values=None, class_token=True, pos_embed='learn', no_embed_class=False, reg_tokens=0, pre_norm=False, dynamic_img_size=False, dynamic_img_pad=False, pos_drop_rate=0.0, patch_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, weight_init='', fix_init=False, embed_norm_layer=None, norm_layer=None, act_layer=None, block_fn=Block, mlp_layer=Mlp)[source]

Bases: torch.nn.Module

Vision Transformer (ViT)

A PyTorch implementation of the Vision Transformer architecture from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929).

This model divides an input image into fixed-size patches, embeds them, adds positional information, and processes them through a sequence of Transformer encoder blocks to learn global image representations.

Initialize the Vision Transformer model.

Parameters

img_size : int or tuple of int, default=224

Input image size (height, width).

patch_size : int or tuple of int, default=16

Size of each image patch.

in_chans : int, default=3

Number of input channels (e.g., 3 for RGB images).

num_classes : int, default=1000

Number of output classes for classification. This parameter is not in the constructor signature above; forward returns encoded features rather than class logits.

embed_dim : int, default=768

Dimension of the patch embeddings.

depth : int, default=12

Number of Transformer encoder blocks.

num_heads : int, default=12

Number of attention heads per block.

mlp_ratio : float, default=4.0

Expansion ratio for the MLP hidden dimension.

qkv_bias : bool, default=True

If True, include bias in the query, key, and value projections.

qk_norm : bool, default=False

If True, apply normalization to query and key vectors.

proj_bias : bool, default=True

If True, include bias in projection layers.

init_values : float, optional

Initial value for LayerScale; if None, LayerScale is disabled.

class_token : bool, default=True

If True, use a learnable class token.

pos_embed : {'', 'none', 'learn'}, default='learn'

Type of positional embedding; 'learn' enables learnable embeddings.

no_embed_class : bool, default=False

If True, exclude class and reg tokens from position embedding.

reg_tokens : int, default=0

Number of register tokens.

pre_norm : bool, default=False

If True, apply a normalization layer (norm_pre) before the Transformer blocks.

dynamic_img_size : bool, default=False

If True, accept input sizes that differ from img_size.

dynamic_img_pad : bool, default=False

If True, apply padding to dynamically sized images.

drop_rate : float, default=0.0

Dropout rate applied globally. This parameter is not in the constructor signature above.

pos_drop_rate : float, default=0.0

Dropout rate applied to positional embeddings.

patch_drop_rate : float, default=0.0

Probability of randomly dropping patch tokens during training.

proj_drop_rate : float, default=0.0

Dropout rate applied to projection layers.

attn_drop_rate : float, default=0.0

Dropout rate applied to attention weights.

drop_path_rate : float, default=0.0

Stochastic depth drop rate across layers.

weight_init : {'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''

Weight initialization strategy.

fix_init : bool, default=False

If True, apply the weight rescaling implemented in fix_init_weight() after initialization.

embed_norm_layer : nn.Module, optional

Normalization layer applied to embeddings.

norm_layer : nn.Module, optional

Normalization layer applied to Transformer blocks.

act_layer : nn.Module, optional

Activation function used in MLP layers.

block_fn : nn.Module, default=Block

Type of Transformer block used.

mlp_layer : nn.Module, default=Mlp

Type of MLP module used in each block.
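How drop_path_rate is spread across the depth blocks is not spelled out here. A common convention, used in timm's ViT, is to raise the per-block rate linearly from 0 at the first block to drop_path_rate at the last. The sketch below illustrates that convention only; whether this module follows it is an assumption, and drop_path_schedule is a hypothetical helper.

```python
def drop_path_schedule(drop_path_rate, depth):
    """Per-block stochastic depth rates, rising linearly from 0 to drop_path_rate."""
    if depth == 1:
        return [0.0]
    return [drop_path_rate * i / (depth - 1) for i in range(depth)]


# One rate per Transformer block for the default depth of 12.
rates = drop_path_schedule(0.1, depth=12)
```

Under this schedule the early blocks are almost never skipped, while the deepest block is dropped with the full drop_path_rate during training.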

_init_weights(m)[source]
Parameters:

m (torch.nn.Module)

Return type:

None

_pos_embed(x)[source]
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

blocks
cls_token
dynamic_img_size: torch.jit.Final[bool]
feature_info
fix_init_weight()[source]
forward(x)[source]

Forward pass of the Vision Transformer.

Parameters

x : torch.Tensor

Input image tensor of shape (B, C, H, W), where B is batch size, C is number of channels, and H, W are image dimensions.

Returns

torch.Tensor

Encoded features of shape (B, N, D), where N is the number of patches (plus any prefix tokens) and D is the embedding dimension.

Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
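The sequence length N in the returned shape follows from the patch grid plus the prefix tokens. A small sketch of that arithmetic, assuming square images and patches, an image size divisible by the patch size, and prefix tokens made up of one optional class token plus reg_tokens; vit_sequence_length is a hypothetical helper.

```python
def vit_sequence_length(img_size=224, patch_size=16, class_token=True, reg_tokens=0):
    """Number of tokens N seen by the Transformer blocks for a square input."""
    if img_size % patch_size != 0:
        raise ValueError("img_size must be divisible by patch_size in this sketch")
    grid = img_size // patch_size               # patches per side
    num_patches = grid * grid
    num_prefix_tokens = (1 if class_token else 0) + reg_tokens
    return num_patches + num_prefix_tokens


n_default = vit_sequence_length()                     # 14 * 14 patches + 1 class token
n_with_registers = vit_sequence_length(reg_tokens=4)  # 4 extra prefix tokens
```

With the default arguments, an input of shape (B, 3, 224, 224) would therefore yield features of shape (B, 197, 768).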

grad_checkpointing = False
has_class_token = True
in_channels = 3
init_weights(mode='')[source]
Parameters:

mode (str)

Return type:

None

no_embed_class = False
norm_pre
num_prefix_tokens = 1
num_reg_tokens = 0
patch_embed
patch_size
pos_drop
reg_token
set_input_size(img_size=None, patch_size=None)[source]

Update the input image resolution and patch size.

Parameters

img_size : tuple of int, optional

New input resolution; if None, the current resolution is used.

patch_size : tuple of int, optional

New patch size; if None, the existing patch size is used.

Parameters:
  • img_size (Optional[Tuple[int, int]])

  • patch_size (Optional[Tuple[int, int]])

Parameters:
  • img_size (Union[int, Tuple[int, int]])

  • patch_size (Union[int, Tuple[int, int]])

  • in_chans (int)

  • embed_dim (int)

  • depth (int)

  • num_heads (int)

  • mlp_ratio (float)

  • qkv_bias (bool)

  • qk_norm (bool)

  • proj_bias (bool)

  • init_values (Optional[float])

  • class_token (bool)

  • pos_embed (str)

  • no_embed_class (bool)

  • reg_tokens (int)

  • pre_norm (bool)

  • dynamic_img_size (bool)

  • dynamic_img_pad (bool)

  • pos_drop_rate (float)

  • patch_drop_rate (float)

  • proj_drop_rate (float)

  • attn_drop_rate (float)

  • drop_path_rate (float)

  • weight_init (Literal['skip', 'jax', 'jax_nlhb', 'moco', ''])

  • fix_init (bool)

  • embed_norm_layer (Optional[timm.layers.LayerType])

  • norm_layer (Optional[timm.layers.LayerType])

  • act_layer (Optional[timm.layers.LayerType])

  • block_fn (Type[torch.nn.Module])

  • mlp_layer (Type[torch.nn.Module])

minerva.models.nets.image.vit_local.vit.get_init_weights_vit(mode='jax', head_bias=0.0)[source]
Parameters:
  • mode (str)

  • head_bias (float)

Return type:

Callable

minerva.models.nets.image.vit_local.vit.global_pool_nlc(x, pool_type='token', num_prefix_tokens=1, reduce_include_prefix=False)[source]
Parameters:
  • x (torch.Tensor)

  • pool_type (str)

  • num_prefix_tokens (int)

  • reduce_include_prefix (bool)
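This function has no description here. Judging by its parameter names, it reduces a (B, N, C) token sequence to one vector per sample. The sketch below shows two plausible modes for a single sample: 'token' returns the first (class) token, and 'avg' averages the tokens, skipping the prefix tokens unless reduce_include_prefix is set. These semantics follow the timm function of the same name and are an assumption about this module; pool_tokens is a hypothetical stand-in that works on plain lists.

```python
def pool_tokens(tokens, pool_type="token", num_prefix_tokens=1,
                reduce_include_prefix=False):
    """Reduce a sequence of token vectors (N x C) to one vector of length C."""
    if pool_type == "token":
        return tokens[0]                      # use the first (class) token
    if pool_type == "avg":
        body = tokens if reduce_include_prefix else tokens[num_prefix_tokens:]
        return [sum(column) / len(body) for column in zip(*body)]
    raise ValueError(f"unsupported pool_type in this sketch: {pool_type!r}")


# One sample: a class token followed by two patch tokens, C=2.
tokens = [[10.0, 10.0], [1.0, 2.0], [3.0, 4.0]]
cls = pool_tokens(tokens, "token")
avg = pool_tokens(tokens, "avg")
avg_all = pool_tokens(tokens, "avg", reduce_include_prefix=True)
```

Here cls is the class token itself, avg averages only the two patch tokens, and avg_all also includes the class token in the mean.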

minerva.models.nets.image.vit_local.vit.init_weights_vit_jax(module, name='', head_bias=0.0)[source]

ViT weight initialization, matching JAX (Flax) impl

Parameters:
  • module (torch.nn.Module)

  • name (str)

  • head_bias (float)

Return type:

None

minerva.models.nets.image.vit_local.vit.init_weights_vit_moco(module, name='')[source]

ViT weight initialization, matching moco-v3 impl minus fixed PatchEmbed

Parameters:
  • module (torch.nn.Module)

  • name (str)

Return type:

None

minerva.models.nets.image.vit_local.vit.init_weights_vit_timm(module, name='')[source]

ViT weight initialization, original timm impl (for reproducibility)

Parameters:
  • module (torch.nn.Module)

  • name (str)

Return type:

None

minerva.models.nets.image.vit_local.vit.resize_pos_embed(posemb, posemb_new, num_prefix_tokens=1, gs_new=(), interpolation='bicubic', antialias=False)[source]

Rescale the grid of position embeddings when loading from state_dict.

Deprecated: this function is being deprecated in favour of resample_abs_pos_embed.

Parameters:
  • posemb (torch.Tensor)

  • posemb_new (torch.Tensor)

  • num_prefix_tokens (int)

  • gs_new (Tuple[int, int])

  • interpolation (str)

  • antialias (bool)

Return type:

torch.Tensor
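Rescaling a position-embedding table first requires separating the prefix tokens from the patch grid and recovering the grid shape. The sketch below covers only that bookkeeping, assuming a square grid; the interpolation itself (bicubic by default, per the signature) is not reproduced, and pos_embed_layout is a hypothetical helper.

```python
import math


def pos_embed_layout(num_tokens, num_prefix_tokens=1):
    """Split a position-embedding length into prefix tokens and a square grid."""
    grid_tokens = num_tokens - num_prefix_tokens
    side = math.isqrt(grid_tokens)
    if side * side != grid_tokens:
        raise ValueError("grid tokens do not form a square grid")
    return num_prefix_tokens, (side, side)


# A checkpoint trained at 224 px with 16 px patches, loaded into a 384 px model.
old_layout = pos_embed_layout(197)   # 1 prefix token + 14 x 14 grid
new_layout = pos_embed_layout(577)   # 1 prefix token + 24 x 24 grid
```

In this example the 14 x 14 grid portion would be interpolated to 24 x 24, while the prefix-token embedding is carried over unchanged.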