minerva.models.nets.image.vit_local
===================================

.. py:module:: minerva.models.nets.image.vit_local


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/minerva/models/nets/image/vit_local/patch_embed/index
   /autoapi/minerva/models/nets/image/vit_local/vit/index


Classes
-------

.. autoapisummary::

   minerva.models.nets.image.vit_local.Block
   minerva.models.nets.image.vit_local.PatchEmbed
   minerva.models.nets.image.vit_local.VisionTransformer


Package Contents
----------------

.. py:class:: Block(dim, num_heads, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, proj_drop = 0.0, attn_drop = 0.0, init_values = None, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`


   Transformer block module.

   Initialize the Transformer block.

   Parameters
   ----------
   dim : int
       Embedding dimension of the input and output features.
   num_heads : int
       Number of attention heads in the self-attention layer.
   mlp_ratio : float, default=4.0
       Expansion ratio for the hidden dimension in the MLP layer.
   qkv_bias : bool, default=False
       If True, add bias to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the projection layers.
   proj_drop : float, default=0.0
       Dropout rate applied to the output of the attention and MLP layers.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   init_values : float, optional
       If specified, enables LayerScale with this initial scaling value.
   drop_path : float, default=0.0
       Stochastic depth rate; set > 0 to apply DropPath regularization.
   act_layer : Type[nn.Module], default=nn.GELU
       Activation function used in the MLP layer.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer type applied before attention and MLP.
   mlp_layer : Type[nn.Module], default=Mlp
       Module type used for the feed-forward network.
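
   Examples
   --------
   A minimal sketch of the layer widths these arguments usually imply. It
   assumes the conventional relations ``head_dim = dim // num_heads`` and
   ``hidden_dim = int(dim * mlp_ratio)``, which this page does not state.
   ``block_sizes`` is a hypothetical helper, not part of ``minerva``.

   .. code-block:: python

      def block_sizes(dim, num_heads, mlp_ratio=4.0):
          """Return per-head width and MLP hidden width (assumed relations)."""
          if dim % num_heads != 0:
              raise ValueError("dim must be divisible by num_heads")
          return {
              "head_dim": dim // num_heads,            # width of each attention head
              "mlp_hidden_dim": int(dim * mlp_ratio),  # width of the MLP hidden layer
          }

      # With dim=768 and 12 heads: head_dim is 64 and mlp_hidden_dim is 3072.
      sizes = block_sizes(dim=768, num_heads=12)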


   .. py:attribute:: attn


   .. py:attribute:: drop_path1


   .. py:attribute:: drop_path2


   .. py:method:: forward(x)


   .. py:attribute:: ls1


   .. py:attribute:: ls2


   .. py:attribute:: mlp


   .. py:attribute:: norm1


   .. py:attribute:: norm2


.. py:class:: PatchEmbed(img_size = 224, patch_size = 16, in_chans = 3, embed_dim = 768, norm_layer = None, flatten = True, output_fmt = None, bias = True, strict_img_size = True, dynamic_img_pad = False)

   Bases: :py:obj:`torch.nn.Module`


   2D Image to Patch Embedding

   Initialize the PatchEmbed module.

   Parameters
   ----------
   img_size : int or Tuple[int, int], default=224
       Input image size. If None, image size will be inferred dynamically.
   patch_size : int or Tuple[int, int], default=16
       Size of each image patch.
   in_chans : int, default=3
       Number of input channels (e.g., 3 for RGB images).
   embed_dim : int, default=768
       Dimension of the output patch embeddings.
   norm_layer : Callable, optional
       Normalization layer applied to the output embeddings.
   flatten : bool, default=True
       If True, flattens patches into a sequence (N, L, C).
   output_fmt : str, optional
       Output tensor format. If specified, overrides ``flatten``.
   bias : bool, default=True
       Whether to include a bias term in the projection layer.
   strict_img_size : bool, default=True
       If True, enforces input images to match the specified size exactly.
   dynamic_img_pad : bool, default=False
       If True, applies dynamic padding for images not divisible by patch size.
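
   Examples
   --------
   The sketch below shows one way ``strict_img_size`` and ``dynamic_img_pad``
   could interact when an input arrives. The rules are an assumption modelled
   on common ViT patch-embedding code, not taken from this page:
   ``check_input_size`` is a hypothetical helper and the actual checks in
   ``PatchEmbed.forward`` may differ.

   .. code-block:: python

      def check_input_size(height, width, img_size=(224, 224), patch_size=(16, 16),
                           strict_img_size=True, dynamic_img_pad=False):
          """Raise ValueError if (height, width) would be rejected (assumed rules)."""
          if img_size is None:
              return  # no configured size, so there is nothing to validate
          if strict_img_size:
              # Strict mode: the input must match the configured size exactly.
              if (height, width) != tuple(img_size):
                  raise ValueError(f"expected {tuple(img_size)}, got {(height, width)}")
          elif not dynamic_img_pad:
              # Relaxed mode without padding: each side must be a patch multiple.
              if height % patch_size[0] or width % patch_size[1]:
                  raise ValueError("input size must be divisible by patch size")

      check_input_size(224, 224)                         # exact match passes
      check_input_size(240, 320, strict_img_size=False)  # multiples of 16 pass
      check_input_size(225, 230, strict_img_size=False,
                       dynamic_img_pad=True)             # padding absorbs the remainder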


   .. py:method:: _init_img_size(img_size)


   .. py:method:: dynamic_feat_size(img_size)

      Get the grid (feature) size for a given image size, taking dynamic padding into account.

      NOTE: Must be TorchScript-compatible, so fixed tuple indexing is used.
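
      A minimal sketch of the computation described above, assuming that
      padding rounds each side up to the next multiple of the patch size and
      that, without padding, a trailing partial patch is not counted. The
      standalone function and its ``patch_size`` argument are illustrative;
      the real method reads these values from the module.

      .. code-block:: python

         import math

         def dynamic_feat_size(img_size, patch_size=(16, 16), dynamic_img_pad=False):
             """Grid size (rows, cols) for an (H, W) image under assumed rounding."""
             if dynamic_img_pad:
                 # Fixed tuple indexing, one expression per axis.
                 return (math.ceil(img_size[0] / patch_size[0]),
                         math.ceil(img_size[1] / patch_size[1]))
             return (img_size[0] // patch_size[0], img_size[1] // patch_size[1])

         padded = dynamic_feat_size((225, 230), dynamic_img_pad=True)  # (15, 15)
         unpadded = dynamic_feat_size((225, 230))                      # (14, 14)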


   .. py:attribute:: dynamic_img_pad
      :type:  torch.jit.Final[bool]


   .. py:method:: feat_ratio(as_scalar=True)


   .. py:method:: forward(x)

      Forward pass that converts an input image into patch embeddings.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, C, H, W), where
          B is batch size, C is number of channels, and H, W are spatial dimensions.

      Returns
      -------
      torch.Tensor
          Patch embeddings tensor. The shape depends on the output format:

          - If ``flatten=True``: (B, num_patches, embed_dim).
          - If ``flatten=False`` and ``output_fmt='NCHW'``: (B, embed_dim, H_p, W_p).
          - For any other output format, the tensor is converted accordingly.
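
      Examples
      --------
      The two documented shapes can be worked out without running the model.
      This sketch assumes square patches, no padding, and an input whose sides
      are multiples of the patch size. ``patch_embed_output_shape`` is a
      hypothetical helper, not part of the module.

      .. code-block:: python

         def patch_embed_output_shape(batch, height, width, patch_size=16,
                                      embed_dim=768, flatten=True):
             """Return the expected output shape for a (B, C, H, W) input."""
             h_p, w_p = height // patch_size, width // patch_size  # patch grid
             if flatten:
                 return (batch, h_p * w_p, embed_dim)  # (B, num_patches, embed_dim)
             return (batch, embed_dim, h_p, w_p)       # 'NCHW' feature map

         seq_shape = patch_embed_output_shape(2, 224, 224)                 # (2, 196, 768)
         map_shape = patch_embed_output_shape(2, 224, 224, flatten=False)  # (2, 768, 14, 14)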


   .. py:attribute:: norm


   .. py:attribute:: output_fmt
      :type:  timm.layers.format.Format


   .. py:attribute:: patch_size


   .. py:attribute:: proj


   .. py:method:: set_input_size(img_size = None, patch_size = None)


   .. py:attribute:: strict_img_size
      :value: True


.. py:class:: VisionTransformer(img_size = 224, patch_size = 16, in_chans = 3, embed_dim = 768, depth = 12, num_heads = 12, mlp_ratio = 4.0, qkv_bias = True, qk_norm = False, proj_bias = True, init_values = None, class_token = True, pos_embed = 'learn', no_embed_class = False, reg_tokens = 0, pre_norm = False, dynamic_img_size = False, dynamic_img_pad = False, pos_drop_rate = 0.0, patch_drop_rate = 0.0, proj_drop_rate = 0.0, attn_drop_rate = 0.0, drop_path_rate = 0.0, weight_init = '', fix_init = False, embed_norm_layer = None, norm_layer = None, act_layer = None, block_fn = Block, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`


   Vision Transformer (ViT)

   A PyTorch implementation of the Vision Transformer architecture from
   *"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"*
   (https://arxiv.org/abs/2010.11929).

   This model divides an input image into fixed-size patches, embeds them,
   adds positional information, and processes them through a sequence of
   Transformer encoder blocks to learn global image representations.

   Initialize the Vision Transformer model.

   Parameters
   ----------
   img_size : int or tuple of int, default=224
       Input image size (height, width).
   patch_size : int or tuple of int, default=16
       Size of each image patch.
   in_chans : int, default=3
       Number of input channels (e.g., 3 for RGB images).
   embed_dim : int, default=768
       Dimension of the patch embeddings.
   depth : int, default=12
       Number of Transformer encoder blocks.
   num_heads : int, default=12
       Number of attention heads per block.
   mlp_ratio : float, default=4.0
       Expansion ratio for the MLP hidden dimension.
   qkv_bias : bool, default=True
       If True, include bias in the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key vectors.
   proj_bias : bool, default=True
       If True, include bias in projection layers.
   init_values : float, optional
       Initial value for LayerScale; if None, LayerScale is disabled.
   class_token : bool, default=True
       If True, use a learnable class token.
   pos_embed : {'', 'none', 'learn'}, default='learn'
       Type of positional embedding; 'learn' enables learnable embeddings.
   no_embed_class : bool, default=False
       If True, exclude the class and register tokens from the position embedding.
   reg_tokens : int, default=0
       Number of register tokens.
   pre_norm : bool, default=False
       If True, apply normalization before Transformer blocks.
   dynamic_img_size : bool, default=False
       If True, enables dynamic image resizing during inference.
   dynamic_img_pad : bool, default=False
       If True, apply padding to dynamically sized images.
   pos_drop_rate : float, default=0.0
       Dropout rate applied to positional embeddings.
   patch_drop_rate : float, default=0.0
       Probability of randomly dropping patch tokens during training.
   proj_drop_rate : float, default=0.0
       Dropout rate applied to projection layers.
   attn_drop_rate : float, default=0.0
       Dropout rate applied to attention weights.
   drop_path_rate : float, default=0.0
       Stochastic depth drop rate across layers.
   weight_init : {'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''
       Weight initialization strategy.
   fix_init : bool, default=False
       If True, rescales initialization following original ViT heuristics.
   embed_norm_layer : Type[nn.Module], optional
       Normalization layer type applied to the patch embeddings.
   norm_layer : Type[nn.Module], optional
       Normalization layer type used in the Transformer blocks.
   act_layer : Type[nn.Module], optional
       Activation function type used in the MLP layers.
   block_fn : Type[nn.Module], default=Block
       Transformer block type used to build the encoder.
   mlp_layer : Type[nn.Module], default=Mlp
       MLP module type used in each block.
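
   Examples
   --------
   This page does not say how ``drop_path_rate`` is distributed across the
   ``depth`` blocks. A common convention in ViT implementations, assumed here
   and not confirmed for this class, is a linear ramp from 0 at the first
   block to ``drop_path_rate`` at the last. ``drop_path_schedule`` is a
   hypothetical helper.

   .. code-block:: python

      def drop_path_schedule(drop_path_rate, depth):
          """Per-block stochastic depth rates under an assumed linear ramp."""
          if depth == 1:
              return [0.0]  # a single block sits at the start of the ramp
          return [drop_path_rate * i / (depth - 1) for i in range(depth)]

      # Twelve rates, starting at 0.0 and rising linearly toward 0.1.
      rates = drop_path_schedule(drop_path_rate=0.1, depth=12)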


   .. py:method:: _init_weights(m)


   .. py:method:: _pos_embed(x)


   .. py:attribute:: blocks


   .. py:attribute:: cls_token


   .. py:attribute:: dynamic_img_size
      :type:  torch.jit.Final[bool]


   .. py:attribute:: feature_info


   .. py:method:: fix_init_weight()


   .. py:method:: forward(x)

      Forward pass of the Vision Transformer.

      Parameters
      ----------
      x : torch.Tensor
          Input image tensor of shape (B, C, H, W), where
          B is batch size, C is number of channels, and H, W are image dimensions.

      Returns
      -------
      torch.Tensor
          Encoded features of shape (B, N, D), where
          N is the number of patches (plus any prefix tokens) and
          D is the embedding dimension.
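
      Examples
      --------
      A small sketch of how N in the returned shape can be derived. It assumes
      a square image and square patches, one prefix token for the class token,
      and one per register token. ``sequence_length`` is a hypothetical
      helper, not part of the class.

      .. code-block:: python

         def sequence_length(img_size=224, patch_size=16, class_token=True, reg_tokens=0):
             """Number of tokens N: patch tokens plus prefix tokens."""
             num_patches = (img_size // patch_size) ** 2
             num_prefix_tokens = (1 if class_token else 0) + reg_tokens
             return num_patches + num_prefix_tokens

         n_default = sequence_length()                # 196 patches + 1 class token = 197
         n_with_regs = sequence_length(reg_tokens=4)  # 196 patches + 5 prefix tokens = 201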


   .. py:attribute:: grad_checkpointing
      :value: False


   .. py:attribute:: has_class_token
      :value: True


   .. py:attribute:: in_channels
      :value: 3


   .. py:method:: init_weights(mode = '')


   .. py:attribute:: no_embed_class
      :value: False


   .. py:attribute:: norm_pre


   .. py:attribute:: num_prefix_tokens
      :value: 1


   .. py:attribute:: num_reg_tokens
      :value: 0


   .. py:attribute:: patch_embed


   .. py:attribute:: patch_size


   .. py:attribute:: pos_drop


   .. py:attribute:: reg_token


   .. py:method:: set_input_size(img_size = None, patch_size = None)

      Update the input image resolution and patch size.

      Parameters
      ----------
      img_size : int or tuple of int, optional
          New input resolution. If None, the current resolution is kept.
      patch_size : int or tuple of int, optional
          New patch size. If None, the existing patch size is kept.