minerva.models.nets.image.vit_local.vit
=======================================

.. py:module:: minerva.models.nets.image.vit_local.vit


Classes
-------

.. autoapisummary::

   minerva.models.nets.image.vit_local.vit.Attention
   minerva.models.nets.image.vit_local.vit.Block
   minerva.models.nets.image.vit_local.vit.LayerScale
   minerva.models.nets.image.vit_local.vit.ParallelScalingBlock
   minerva.models.nets.image.vit_local.vit.ParallelThingsBlock
   minerva.models.nets.image.vit_local.vit.ResPostBlock
   minerva.models.nets.image.vit_local.vit.VisionTransformer


Functions
---------

.. autoapisummary::

   minerva.models.nets.image.vit_local.vit.get_init_weights_vit
   minerva.models.nets.image.vit_local.vit.global_pool_nlc
   minerva.models.nets.image.vit_local.vit.init_weights_vit_jax
   minerva.models.nets.image.vit_local.vit.init_weights_vit_moco
   minerva.models.nets.image.vit_local.vit.init_weights_vit_timm
   minerva.models.nets.image.vit_local.vit.resize_pos_embed


Module Contents
---------------

.. py:class:: Attention(dim, num_heads = 8, qkv_bias = False, qk_norm = False, proj_bias = True, attn_drop = 0.0, proj_drop = 0.0, norm_layer = nn.LayerNorm)

   Bases: :py:obj:`torch.nn.Module`


   Multi-head self-attention module.

   This class implements the standard multi-head attention mechanism used in
   Transformer architectures. It supports both standard and fused attention
   implementations for improved performance when available.

   Initialize the Attention module.

   Parameters
   ----------
   dim : int
       Total dimension of the input and output features.
   num_heads : int, default=8
       Number of attention heads.
   qkv_bias : bool, default=False
       If True, add a bias term to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the output projection layer.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   proj_drop : float, default=0.0
       Dropout rate applied after the output projection.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer type applied to query and key vectors when ``qk_norm=True``.


   .. py:attribute:: attn_drop


   .. py:method:: forward(x)

      Forward pass of the multi-head attention mechanism.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C), where
          B is the batch size, N is the sequence length, and C is the feature dimension.

      Returns
      -------
      torch.Tensor
          Output tensor of the same shape as input (B, N, C),
          containing the attended feature representations.
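
      Examples
      --------
      A minimal sketch of the computation described above, written in plain
      PyTorch rather than with this class. The helper ``sketch_attention`` and
      its weight arguments are hypothetical names used only for illustration.
      It assumes ``head_dim = dim // num_heads`` and a ``head_dim ** -0.5``
      scale, and it uses ``F.scaled_dot_product_attention`` as one possible
      fused path to compare against the explicit softmax path.

      .. code-block:: python

         import torch
         import torch.nn.functional as F

         def sketch_attention(x, w_qkv, w_proj, num_heads, fused):
             """Illustrative multi-head self-attention on a (B, N, C) tensor."""
             B, N, C = x.shape
             head_dim = C // num_heads  # assumption: C is divisible by num_heads
             # Project to q, k, v and split heads: each is (B, num_heads, N, head_dim).
             qkv = (x @ w_qkv).reshape(B, N, 3, num_heads, head_dim)
             q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
             if fused:
                 # Fused kernel; its default scale is 1 / sqrt(head_dim).
                 out = F.scaled_dot_product_attention(q, k, v)
             else:
                 attn = (q * head_dim ** -0.5) @ k.transpose(-2, -1)
                 out = attn.softmax(dim=-1) @ v
             # Merge heads back to (B, N, C) and apply the output projection.
             return out.transpose(1, 2).reshape(B, N, C) @ w_proj

         torch.manual_seed(0)
         x = torch.randn(2, 5, 16)
         w_qkv = torch.randn(16, 48) * 0.1
         w_proj = torch.randn(16, 16) * 0.1
         y_fused = sketch_attention(x, w_qkv, w_proj, num_heads=4, fused=True)
         y_plain = sketch_attention(x, w_qkv, w_proj, num_heads=4, fused=False)

      Both paths should return a tensor with the input's shape and agree up to
      floating-point tolerance, which is why a fused path can be swapped in
      when it is available.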


   .. py:attribute:: fused_attn
      :type:  torch.jit.Final[bool]


   .. py:attribute:: head_dim


   .. py:attribute:: k_norm


   .. py:attribute:: num_heads
      :value: 8


   .. py:attribute:: proj


   .. py:attribute:: proj_drop


   .. py:attribute:: q_norm


   .. py:attribute:: qkv


   .. py:attribute:: scale


.. py:class:: Block(dim, num_heads, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, proj_drop = 0.0, attn_drop = 0.0, init_values = None, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`


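   Pre-norm Transformer block: a self-attention layer followed by an MLP, each
   preceded by normalization and wrapped in a residual connection.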

   Initialize the Transformer block.

   Parameters
   ----------
   dim : int
       Embedding dimension of the input and output features.
   num_heads : int
       Number of attention heads in the self-attention layer.
   mlp_ratio : float, default=4.0
       Expansion ratio for the hidden dimension in the MLP layer.
   qkv_bias : bool, default=False
       If True, add bias to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the projection layers.
   proj_drop : float, default=0.0
       Dropout rate applied to the output of the attention and MLP layers.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   init_values : float, optional
       If specified, enables LayerScale with this initial scaling value.
   drop_path : float, default=0.0
       Stochastic depth rate; set > 0 to apply DropPath regularization.
   act_layer : Type[nn.Module], default=nn.GELU
       Activation function used in the MLP layer.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer type applied before attention and MLP.
   mlp_layer : Type[nn.Module], default=Mlp
       Module type used for the feed-forward network.
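
   Examples
   --------
   A minimal sketch of the pre-norm layout suggested by the parameters above
   (normalization applied before attention and before the MLP). The class
   ``SketchPreNormBlock`` is hypothetical and is not this module's ``Block``:
   it uses ``nn.MultiheadAttention`` as a stand-in for the attention layer and
   leaves out LayerScale and DropPath, which is assumed to match the defaults
   ``init_values=None`` and ``drop_path=0.0``.

   .. code-block:: python

      import torch
      from torch import nn

      class SketchPreNormBlock(nn.Module):
          """Illustrative pre-norm block; not the library's Block."""

          def __init__(self, dim, num_heads, mlp_ratio=4.0):
              super().__init__()
              self.norm1 = nn.LayerNorm(dim)
              # Stand-in for the Attention class documented on this page.
              self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
              self.norm2 = nn.LayerNorm(dim)
              hidden = int(dim * mlp_ratio)
              self.mlp = nn.Sequential(
                  nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
              )

          def forward(self, x):
              # Normalize first, then add the branch output to the residual stream.
              h = self.norm1(x)
              x = x + self.attn(h, h, h, need_weights=False)[0]
              x = x + self.mlp(self.norm2(x))
              return x

      block = SketchPreNormBlock(dim=32, num_heads=4)
      out = block(torch.randn(2, 10, 32))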


   .. py:attribute:: attn


   .. py:attribute:: drop_path1


   .. py:attribute:: drop_path2


   .. py:method:: forward(x)


   .. py:attribute:: ls1


   .. py:attribute:: ls2


   .. py:attribute:: mlp


   .. py:attribute:: norm1


   .. py:attribute:: norm2


.. py:class:: LayerScale(dim, init_values = 1e-05, inplace = False)

   Bases: :py:obj:`torch.nn.Module`


   LayerScale module: multiplies each channel of the input by a learnable
   scaling parameter ``gamma``.

   Initialize the LayerScale module.

   Parameters
   ----------
   dim : int
       Number of feature dimensions (channels) to scale.
   init_values : float, default=1e-5
       Initial value for the learnable scaling parameter.
   inplace : bool, default=False
       If True, performs the scaling operation in-place to save memory.


   .. py:method:: forward(x)

      Forward pass applying per-channel scaling to the input tensor.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C) or (B, C, H, W), depending on context.

      Returns
      -------
      torch.Tensor
          Scaled tensor of the same shape as the input.
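
      Examples
      --------
      A minimal sketch of per-channel scaling in plain PyTorch, not a call into
      this class. It assumes ``gamma`` has shape ``(dim,)`` and that channels
      sit on the last axis, as in a ``(B, N, C)`` input; the variable names are
      illustrative.

      .. code-block:: python

         import torch

         dim = 4
         gamma = torch.full((dim,), 1e-5)  # one scale per channel (init_values=1e-5)
         x = torch.ones(2, 3, dim)         # (B, N, C): channels on the last axis

         scaled = x * gamma                # out-of-place: x is left untouched

         x_inplace = x.clone()
         x_inplace.mul_(gamma)             # in-place: overwrites the tensor's storage

      The in-place form avoids allocating a second tensor but overwrites its
      input, so it is only safe when the unscaled values are not needed again.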


   .. py:attribute:: gamma


   .. py:attribute:: inplace
      :value: False


.. py:class:: ParallelScalingBlock(dim, num_heads, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, proj_drop = 0.0, attn_drop = 0.0, init_values = None, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = None)

   Bases: :py:obj:`torch.nn.Module`


   Parallel Scaling Vision Transformer block.

   This module implements a parallel Transformer block that computes the
   multi-head self-attention and MLP branches concurrently and then combines
   their outputs. The design follows the architecture from
   "Scaling Vision Transformers to 22 Billion Parameters"
   (https://arxiv.org/abs/2302.05442).

   The block includes LayerScale for stable deep scaling, optional DropPath for
   stochastic depth regularization, and supports fused attention when available
   for performance efficiency.

   Initialize the ParallelScalingBlock.

   Parameters
   ----------
   dim : int
       Embedding dimension of the input and output features.
   num_heads : int
       Number of attention heads in the multi-head self-attention layer.
   mlp_ratio : float, default=4.0
       Expansion ratio for the hidden dimension in the MLP branch.
   qkv_bias : bool, default=False
       If True, add bias to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to the query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the output projection layers.
   proj_drop : float, default=0.0
       Dropout rate applied after the projection layers.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   init_values : float, optional
       If specified, enables LayerScale with this initialization value.
   drop_path : float, default=0.0
       Stochastic depth rate; set > 0 to apply DropPath regularization.
   act_layer : Type[nn.Module], default=nn.GELU
       Activation function used in the MLP branch.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer applied before the parallel branches.
   mlp_layer : Type[nn.Module], optional
       Optional custom MLP implementation; defaults to a standard linear MLP.


   .. py:attribute:: attn_drop


   .. py:attribute:: attn_out_proj


   .. py:attribute:: drop_path


   .. py:method:: forward(x)

      Forward pass of the Parallel Scaling Transformer block.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C), where
          B is batch size, N is sequence length, and C is embedding dimension.

      Returns
      -------
      torch.Tensor
          Output tensor of shape (B, N, C), containing the updated feature representations.
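
      Examples
      --------
      A minimal sketch of the parallel layout, in which one fused input
      projection feeds both branches and their outputs are summed. The class
      ``SketchParallelBlock`` is hypothetical. It borrows the attribute names
      ``in_norm``, ``in_proj``, ``in_split``, ``attn_out_proj`` and
      ``mlp_out_proj`` listed on this page, but how the real block wires them
      is an assumption here, and LayerScale, DropPath and dropout are omitted.

      .. code-block:: python

         import torch
         import torch.nn.functional as F
         from torch import nn

         class SketchParallelBlock(nn.Module):
             """Illustrative parallel attention + MLP block; not the library's class."""

             def __init__(self, dim, num_heads, mlp_ratio=4.0):
                 super().__init__()
                 self.num_heads = num_heads
                 mlp_hidden = int(dim * mlp_ratio)
                 # Sizes of the four chunks produced by the single fused projection.
                 self.in_split = [mlp_hidden, dim, dim, dim]
                 self.in_norm = nn.LayerNorm(dim)
                 self.in_proj = nn.Linear(dim, sum(self.in_split))
                 self.attn_out_proj = nn.Linear(dim, dim)
                 self.mlp_out_proj = nn.Linear(mlp_hidden, dim)

             def forward(self, x):
                 B, N, C = x.shape
                 y = self.in_proj(self.in_norm(x))
                 h, q, k, v = torch.split(y, self.in_split, dim=-1)
                 # Attention branch: split heads to (B, num_heads, N, head_dim).
                 q, k, v = (
                     t.reshape(B, N, self.num_heads, -1).transpose(1, 2)
                     for t in (q, k, v)
                 )
                 a = F.scaled_dot_product_attention(q, k, v)
                 a = a.transpose(1, 2).reshape(B, N, C)
                 # Both branches read the same normalized input; outputs are summed.
                 return x + self.attn_out_proj(a) + self.mlp_out_proj(F.gelu(h))

         block = SketchParallelBlock(dim=32, num_heads=4)
         out = block(torch.randn(2, 10, 32))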


   .. py:attribute:: fused_attn
      :type:  torch.jit.Final[bool]


   .. py:attribute:: head_dim


   .. py:attribute:: in_norm


   .. py:attribute:: in_proj


   .. py:attribute:: in_split


   .. py:attribute:: k_norm


   .. py:attribute:: ls


   .. py:attribute:: mlp_act


   .. py:attribute:: mlp_drop


   .. py:attribute:: mlp_out_proj


   .. py:attribute:: num_heads


   .. py:attribute:: q_norm


   .. py:attribute:: scale


.. py:class:: ParallelThingsBlock(dim, num_heads, num_parallel = 2, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, init_values = None, proj_drop = 0.0, attn_drop = 0.0, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`


   Parallel Things Vision Transformer block.

   This module implements a Transformer block that processes the input through
   multiple parallel attention layers followed by multiple parallel MLP layers.
   The outputs of each parallel branch are summed together, enabling a richer
   representation and improved learning capacity.

   The design follows the architecture from
   "Three Things Everyone Should Know About Vision Transformers"
   (https://arxiv.org/abs/2203.09795).

   Initialize the ParallelThingsBlock.

   Parameters
   ----------
   dim : int
       Embedding dimension of the input and output features.
   num_heads : int
       Number of attention heads in each attention branch.
   num_parallel : int, default=2
       Number of parallel attention and MLP branches.
   mlp_ratio : float, default=4.0
       Expansion ratio for the hidden dimension in the MLP layers.
   qkv_bias : bool, default=False
       If True, add bias to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the projection layers.
   init_values : float, optional
       If specified, enables LayerScale with this initialization value.
   proj_drop : float, default=0.0
       Dropout rate applied to the output of the projection layers.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   drop_path : float, default=0.0
       Stochastic depth rate; set > 0 to apply DropPath regularization.
   act_layer : Type[nn.Module], default=nn.GELU
       Activation function used in the MLP layers.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer type applied in each sub-block.
   mlp_layer : Type[nn.Module], default=Mlp
       Module type used for the feed-forward MLP networks.


   .. py:method:: _forward(x)


   .. py:method:: _forward_jit(x)


   .. py:attribute:: attns


   .. py:attribute:: ffns


   .. py:method:: forward(x)

      Forward pass of the ParallelThingsBlock.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C).

      Returns
      -------
      torch.Tensor
          Output tensor of the same shape (B, N, C), representing
          the combined outputs from the parallel attention and MLP branches.
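
      Examples
      --------
      A minimal sketch of summing parallel branches, assuming each branch
      normalizes its own input and that the attention stage runs before the
      MLP stage. The class ``SketchParallelBranches`` is hypothetical, uses
      ``nn.MultiheadAttention`` as a stand-in attention layer, and omits
      LayerScale and DropPath.

      .. code-block:: python

         import torch
         from torch import nn

         class SketchParallelBranches(nn.Module):
             """Illustrative block with summed parallel branches."""

             def __init__(self, dim, num_heads, num_parallel=2, mlp_ratio=4.0):
                 super().__init__()
                 hidden = int(dim * mlp_ratio)
                 self.norms = nn.ModuleList(
                     [nn.LayerNorm(dim) for _ in range(num_parallel)]
                 )
                 self.attns = nn.ModuleList(
                     [nn.MultiheadAttention(dim, num_heads, batch_first=True)
                      for _ in range(num_parallel)]
                 )
                 self.ffns = nn.ModuleList(
                     [nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                    nn.GELU(), nn.Linear(hidden, dim))
                      for _ in range(num_parallel)]
                 )

             def forward(self, x):
                 # Every attention branch sees the same input; outputs are summed.
                 attn_sum = 0
                 for norm, attn in zip(self.norms, self.attns):
                     h = norm(x)
                     attn_sum = attn_sum + attn(h, h, h, need_weights=False)[0]
                 x = x + attn_sum
                 # The MLP branches run on the updated tensor and are summed too.
                 return x + sum(ffn(x) for ffn in self.ffns)

         block = SketchParallelBranches(dim=32, num_heads=4, num_parallel=2)
         out = block(torch.randn(2, 10, 32))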


   .. py:attribute:: num_parallel
      :value: 2


.. py:class:: ResPostBlock(dim, num_heads, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, proj_drop = 0.0, attn_drop = 0.0, init_values = None, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`


   Residual Post-Norm Transformer block.

   Initialize the ResPostBlock.


   .. py:attribute:: attn


   .. py:attribute:: drop_path1


   .. py:attribute:: drop_path2


   .. py:method:: forward(x)

      Forward pass of the Residual Post-Norm Transformer block.

      The input tensor passes through attention and MLP sublayers, each followed
      by normalization and residual connections. DropPath is optionally applied
      for regularization.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C), where
          B is batch size, N is sequence length, and C is embedding dimension.

      Returns
      -------
      torch.Tensor
          Output tensor of the same shape (B, N, C), representing the transformed features.
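
      Examples
      --------
      A minimal sketch of the post-norm ordering described above, where
      normalization follows each sublayer and the result is added to the
      residual stream. The function ``sketch_res_post`` and the module-level
      layers are hypothetical stand-ins, not this class, and DropPath is
      omitted.

      .. code-block:: python

         import torch
         from torch import nn

         dim, num_heads = 32, 4
         # Stand-in attention and MLP sublayers.
         attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
         mlp = nn.Sequential(
             nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
         )
         norm1, norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

         def sketch_res_post(x):
             # Normalize each branch output, then add it to the residual stream.
             x = x + norm1(attn(x, x, x, need_weights=False)[0])
             x = x + norm2(mlp(x))
             return x

         out = sketch_res_post(torch.randn(2, 10, dim))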


   .. py:attribute:: init_values
      :value: None


   .. py:method:: init_weights()


   .. py:attribute:: mlp


   .. py:attribute:: norm1


   .. py:attribute:: norm2


.. py:class:: VisionTransformer(img_size = 224, patch_size = 16, in_chans = 3, embed_dim = 768, depth = 12, num_heads = 12, mlp_ratio = 4.0, qkv_bias = True, qk_norm = False, proj_bias = True, init_values = None, class_token = True, pos_embed = 'learn', no_embed_class = False, reg_tokens = 0, pre_norm = False, dynamic_img_size = False, dynamic_img_pad = False, pos_drop_rate = 0.0, patch_drop_rate = 0.0, proj_drop_rate = 0.0, attn_drop_rate = 0.0, drop_path_rate = 0.0, weight_init = '', fix_init = False, embed_norm_layer = None, norm_layer = None, act_layer = None, block_fn = Block, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`


   Vision Transformer (ViT).

   A PyTorch implementation of the Vision Transformer architecture from
   *"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"*
   (https://arxiv.org/abs/2010.11929).

   This model divides an input image into fixed-size patches, embeds them,
   adds positional information, and processes them through a sequence of
   Transformer encoder blocks to learn global image representations.

   Initialize the Vision Transformer model.

   Parameters
   ----------
   img_size : int or tuple of int, default=224
       Input image size (height, width).
   patch_size : int or tuple of int, default=16
       Size of each image patch.
   in_chans : int, default=3
       Number of input channels (e.g., 3 for RGB images).
   embed_dim : int, default=768
       Dimension of the patch embeddings.
   depth : int, default=12
       Number of Transformer encoder blocks.
   num_heads : int, default=12
       Number of attention heads per block.
   mlp_ratio : float, default=4.0
       Expansion ratio for the MLP hidden dimension.
   qkv_bias : bool, default=True
       If True, include bias in the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key vectors.
   proj_bias : bool, default=True
       If True, include bias in projection layers.
   init_values : float, optional
       Initial value for LayerScale; if None, LayerScale is disabled.
   class_token : bool, default=True
       If True, use a learnable class token.
   pos_embed : {'', 'none', 'learn'}, default='learn'
       Type of positional embedding; 'learn' enables learnable embeddings.
   no_embed_class : bool, default=False
       If True, exclude class and register tokens from position embedding.
   reg_tokens : int, default=0
       Number of register tokens.
   pre_norm : bool, default=False
       If True, apply normalization before Transformer blocks.
   dynamic_img_size : bool, default=False
       If True, enables dynamic image resizing during inference.
   dynamic_img_pad : bool, default=False
       If True, apply padding to dynamically sized images.
   pos_drop_rate : float, default=0.0
       Dropout rate applied to positional embeddings.
   patch_drop_rate : float, default=0.0
       Probability of randomly dropping patch tokens during training.
   proj_drop_rate : float, default=0.0
       Dropout rate applied to projection layers.
   attn_drop_rate : float, default=0.0
       Dropout rate applied to attention weights.
   drop_path_rate : float, default=0.0
       Stochastic depth drop rate across layers.
   weight_init : {'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''
       Weight initialization strategy.
   fix_init : bool, default=False
       If True, rescales initialization following original ViT heuristics.
   embed_norm_layer : Type[nn.Module], optional
       Normalization layer type applied to embeddings.
   norm_layer : Type[nn.Module], optional
       Normalization layer type used in the Transformer blocks.
   act_layer : Type[nn.Module], optional
       Activation function used in MLP layers.
   block_fn : Type[nn.Module], default=Block
       Type of Transformer block used.
   mlp_layer : Type[nn.Module], default=Mlp
       Type of MLP module used in each block.
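
   Examples
   --------
   A minimal sketch of how a single ``drop_path_rate`` can be spread across
   ``depth`` blocks. The linear ramp from 0 to ``drop_path_rate`` is a common
   stochastic-depth convention and is an assumption here; this page does not
   state which rule the class applies. ``sketch_drop_path_schedule`` is a
   hypothetical helper.

   .. code-block:: python

      def sketch_drop_path_schedule(drop_path_rate, depth):
          """Linearly spaced per-block rates from 0 to drop_path_rate (assumed rule)."""
          if depth == 1:
              return [0.0]
          return [drop_path_rate * i / (depth - 1) for i in range(depth)]

      # One rate per block: the first block is never dropped, the last most often.
      rates = sketch_drop_path_schedule(drop_path_rate=0.1, depth=12)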


   .. py:method:: _init_weights(m)


   .. py:method:: _pos_embed(x)


   .. py:attribute:: blocks


   .. py:attribute:: cls_token


   .. py:attribute:: dynamic_img_size
      :type:  torch.jit.Final[bool]


   .. py:attribute:: feature_info


   .. py:method:: fix_init_weight()


   .. py:method:: forward(x)

      Forward pass of the Vision Transformer.

      Parameters
      ----------
      x : torch.Tensor
          Input image tensor of shape (B, C, H, W), where
          B is batch size, C is number of channels, and H, W are image dimensions.

      Returns
      -------
      torch.Tensor
          Encoded features of shape (B, N, D), where
          N is the number of patches (plus any prefix tokens) and
          D is the embedding dimension.
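
      Examples
      --------
      A minimal sketch of how the sequence length ``N`` follows from the
      constructor arguments, assuming square images, non-overlapping patches,
      and prefix tokens made up of the class token plus any register tokens.
      ``sketch_num_tokens`` is a hypothetical helper, not part of this module.

      .. code-block:: python

         def sketch_num_tokens(img_size=224, patch_size=16, class_token=True,
                               reg_tokens=0):
             """Sequence length N for a square image split into square patches."""
             grid = img_size // patch_size  # patches per side
             num_prefix_tokens = (1 if class_token else 0) + reg_tokens
             return grid * grid + num_prefix_tokens

         n_default = sketch_num_tokens()                # 14 * 14 + 1 = 197
         n_with_regs = sketch_num_tokens(reg_tokens=4)  # 14 * 14 + 5 = 201

      Under these assumptions the default configuration would produce features
      of shape ``(B, 197, 768)``.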


   .. py:attribute:: grad_checkpointing
      :value: False


   .. py:attribute:: has_class_token
      :value: True


   .. py:attribute:: in_channels
      :value: 3


   .. py:method:: init_weights(mode = '')


   .. py:attribute:: no_embed_class
      :value: False


   .. py:attribute:: norm_pre


   .. py:attribute:: num_prefix_tokens
      :value: 1


   .. py:attribute:: num_reg_tokens
      :value: 0


   .. py:attribute:: patch_embed


   .. py:attribute:: patch_size


   .. py:attribute:: pos_drop


   .. py:attribute:: reg_token


   .. py:method:: set_input_size(img_size = None, patch_size = None)

      Update the input image resolution and patch size.

      Parameters
      ----------
      img_size : optional
          New input resolution. If None, the current resolution is kept.
      patch_size : optional
          New patch size. If None, the existing patch size is kept.


.. py:function:: get_init_weights_vit(mode = 'jax', head_bias = 0.0)

.. py:function:: global_pool_nlc(x, pool_type = 'token', num_prefix_tokens = 1, reduce_include_prefix = False)

.. py:function:: init_weights_vit_jax(module, name = '', head_bias = 0.0)

   ViT weight initialization matching the JAX (Flax) implementation.


.. py:function:: init_weights_vit_moco(module, name = '')

   ViT weight initialization matching the moco-v3 implementation, minus the fixed PatchEmbed.


.. py:function:: init_weights_vit_timm(module, name = '')

   ViT weight initialization from the original timm implementation (kept for reproducibility).


.. py:function:: resize_pos_embed(posemb, posemb_new, num_prefix_tokens = 1, gs_new = (), interpolation = 'bicubic', antialias = False)

   Rescale the grid of position embeddings when loading from state_dict.

   *DEPRECATED*: this function is deprecated in favour of ``resample_abs_pos_embed``.
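
   Examples
   --------
   A minimal sketch of the rescaling idea in plain PyTorch, not a call into
   this function. It assumes a square source grid, a position embedding of
   shape ``(1, num_prefix_tokens + gs * gs, D)``, and bicubic interpolation of
   the grid part only. ``sketch_resize_pos_embed`` is a hypothetical helper.

   .. code-block:: python

      import math
      import torch
      import torch.nn.functional as F

      def sketch_resize_pos_embed(posemb, gs_new, num_prefix_tokens=1):
          """Illustrative resize of a (1, prefix + gs*gs, D) position embedding."""
          prefix = posemb[:, :num_prefix_tokens]
          grid = posemb[:, num_prefix_tokens:]
          gs_old = int(math.sqrt(grid.shape[1]))  # assumption: square source grid
          # (1, gs*gs, D) -> (1, D, gs, gs) so the grid can be resized like an image.
          grid = grid.reshape(1, gs_old, gs_old, -1).permute(0, 3, 1, 2)
          grid = F.interpolate(grid, size=gs_new, mode="bicubic",
                               align_corners=False)
          grid = grid.permute(0, 2, 3, 1).reshape(1, gs_new[0] * gs_new[1], -1)
          # Prefix tokens have no grid position and are carried over unchanged.
          return torch.cat([prefix, grid], dim=1)

      old = torch.randn(1, 1 + 14 * 14, 64)
      new = sketch_resize_pos_embed(old, gs_new=(24, 24))

   Here a 14 x 14 grid with one prefix token is resized to 24 x 24, so the
   sequence length grows from 197 to 577 while the embedding width is kept.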