minerva.models.nets.image.vit_local.vit
=======================================

.. py:module:: minerva.models.nets.image.vit_local.vit

Classes
-------

.. autoapisummary::

   minerva.models.nets.image.vit_local.vit.Attention
   minerva.models.nets.image.vit_local.vit.Block
   minerva.models.nets.image.vit_local.vit.LayerScale
   minerva.models.nets.image.vit_local.vit.ParallelScalingBlock
   minerva.models.nets.image.vit_local.vit.ParallelThingsBlock
   minerva.models.nets.image.vit_local.vit.ResPostBlock
   minerva.models.nets.image.vit_local.vit.VisionTransformer

Functions
---------

.. autoapisummary::

   minerva.models.nets.image.vit_local.vit.get_init_weights_vit
   minerva.models.nets.image.vit_local.vit.global_pool_nlc
   minerva.models.nets.image.vit_local.vit.init_weights_vit_jax
   minerva.models.nets.image.vit_local.vit.init_weights_vit_moco
   minerva.models.nets.image.vit_local.vit.init_weights_vit_timm
   minerva.models.nets.image.vit_local.vit.resize_pos_embed

Module Contents
---------------

.. py:class:: Attention(dim, num_heads = 8, qkv_bias = False, qk_norm = False, proj_bias = True, attn_drop = 0.0, proj_drop = 0.0, norm_layer = nn.LayerNorm)

   Bases: :py:obj:`torch.nn.Module`

   Multi-head self-attention module.

   This class implements the standard multi-head attention mechanism used in Transformer architectures. It supports both standard and fused attention implementations for improved performance when available.

   Initialize the Attention module.

   Parameters
   ----------
   dim : int
       Total dimension of the input and output features.
   num_heads : int, default=8
       Number of attention heads.
   qkv_bias : bool, default=False
       If True, add a bias term to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the output projection layer.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   proj_drop : float, default=0.0
       Dropout rate applied after the output projection.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer type applied to query and key vectors when `qk_norm=True`.

   .. py:attribute:: attn_drop

   .. py:method:: forward(x)

      Forward pass of the multi-head attention mechanism.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C), where B is the batch size, N is the sequence length, and C is the feature dimension.

      Returns
      -------
      torch.Tensor
          Output tensor of the same shape as input (B, N, C), containing the attended feature representations.

   .. py:attribute:: fused_attn
      :type: torch.jit.Final[bool]

   .. py:attribute:: head_dim

   .. py:attribute:: k_norm

   .. py:attribute:: num_heads
      :value: 8

   .. py:attribute:: proj

   .. py:attribute:: proj_drop

   .. py:attribute:: q_norm

   .. py:attribute:: qkv

   .. py:attribute:: scale

.. py:class:: Block(dim, num_heads, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, proj_drop = 0.0, attn_drop = 0.0, init_values = None, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`

   Transformer block module.

   Initialize the Transformer block.

   Parameters
   ----------
   dim : int
       Embedding dimension of the input and output features.
   num_heads : int
       Number of attention heads in the self-attention layer.
   mlp_ratio : float, default=4.0
       Expansion ratio for the hidden dimension in the MLP layer.
   qkv_bias : bool, default=False
       If True, add bias to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the projection layers.
   proj_drop : float, default=0.0
       Dropout rate applied to the output of the attention and MLP layers.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   init_values : float, optional
       If specified, enables LayerScale with this initial scaling value.
   drop_path : float, default=0.0
       Stochastic depth rate; set > 0 to apply DropPath regularization.
   act_layer : Type[nn.Module], default=nn.GELU
       Activation function used in the MLP layer.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer type applied before attention and MLP.
   mlp_layer : Type[nn.Module], default=Mlp
       Module type used for the feed-forward network.

   .. py:attribute:: attn

   .. py:attribute:: drop_path1

   .. py:attribute:: drop_path2

   .. py:method:: forward(x)

   .. py:attribute:: ls1

   .. py:attribute:: ls2

   .. py:attribute:: mlp

   .. py:attribute:: norm1

   .. py:attribute:: norm2

.. py:class:: LayerScale(dim, init_values = 1e-05, inplace = False)

   Bases: :py:obj:`torch.nn.Module`

   LayerScale module.

   Initialize the LayerScale module.

   Parameters
   ----------
   dim : int
       Number of feature dimensions (channels) to scale.
   init_values : float, default=1e-5
       Initial value for the learnable scaling parameter.
   inplace : bool, default=False
       If True, performs the scaling operation in-place to save memory.

   .. py:method:: forward(x)

      Forward pass applying per-channel scaling to the input tensor.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C) or (B, C, H, W), depending on context.

      Returns
      -------
      torch.Tensor
          Scaled tensor of the same shape as the input.

   .. py:attribute:: gamma

   .. py:attribute:: inplace
      :value: False

.. py:class:: ParallelScalingBlock(dim, num_heads, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, proj_drop = 0.0, attn_drop = 0.0, init_values = None, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = None)

   Bases: :py:obj:`torch.nn.Module`

   Parallel Scaling Vision Transformer block.

   This module implements a parallel Transformer block that computes the multi-head self-attention and MLP branches concurrently and then combines their outputs. The design follows the architecture from "Scaling Vision Transformers to 22 Billion Parameters" (https://arxiv.org/abs/2302.05442).
   The block includes LayerScale for stable deep scaling, optional DropPath for stochastic depth regularization, and supports fused attention when available for performance efficiency.

   Initialize the ParallelScalingBlock.

   Parameters
   ----------
   dim : int
       Embedding dimension of the input and output features.
   num_heads : int
       Number of attention heads in the multi-head self-attention layer.
   mlp_ratio : float, default=4.0
       Expansion ratio for the hidden dimension in the MLP branch.
   qkv_bias : bool, default=False
       If True, add bias to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to the query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the output projection layers.
   proj_drop : float, default=0.0
       Dropout rate applied after the projection layers.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   init_values : float, optional
       If specified, enables LayerScale with this initialization value.
   drop_path : float, default=0.0
       Stochastic depth rate; set > 0 to apply DropPath regularization.
   act_layer : Type[nn.Module], default=nn.GELU
       Activation function used in the MLP branch.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer applied before the parallel branches.
   mlp_layer : Type[nn.Module], optional
       Optional custom MLP implementation; defaults to a standard linear MLP.

   .. py:attribute:: attn_drop

   .. py:attribute:: attn_out_proj

   .. py:attribute:: drop_path

   .. py:method:: forward(x)

      Forward pass of the Parallel Scaling Transformer block.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C), where B is batch size, N is sequence length, and C is embedding dimension.

      Returns
      -------
      torch.Tensor
          Output tensor of shape (B, N, C), containing the updated feature representations.

   .. py:attribute:: fused_attn
      :type: torch.jit.Final[bool]

   .. py:attribute:: head_dim

   .. py:attribute:: in_norm

   .. py:attribute:: in_proj

   .. py:attribute:: in_split

   .. py:attribute:: k_norm

   .. py:attribute:: ls

   .. py:attribute:: mlp_act

   .. py:attribute:: mlp_drop

   .. py:attribute:: mlp_out_proj

   .. py:attribute:: num_heads

   .. py:attribute:: q_norm

   .. py:attribute:: scale

.. py:class:: ParallelThingsBlock(dim, num_heads, num_parallel = 2, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, init_values = None, proj_drop = 0.0, attn_drop = 0.0, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`

   Parallel Things Vision Transformer block.

   This module implements a Transformer block that processes the input through multiple parallel attention layers followed by multiple parallel MLP layers. The outputs of each parallel branch are summed together, enabling a richer representation and improved learning capacity. The design follows the architecture from "Three Things Everyone Should Know About Vision Transformers" (https://arxiv.org/abs/2203.09795).

   Initialize the ParallelThingsBlock.

   Parameters
   ----------
   dim : int
       Embedding dimension of the input and output features.
   num_heads : int
       Number of attention heads in each attention branch.
   num_parallel : int, default=2
       Number of parallel attention and MLP branches.
   mlp_ratio : float, default=4.0
       Expansion ratio for the hidden dimension in the MLP layers.
   qkv_bias : bool, default=False
       If True, add bias to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the projection layers.
   init_values : float, optional
       If specified, enables LayerScale with this initialization value.
   proj_drop : float, default=0.0
       Dropout rate applied to the output of the projection layers.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   drop_path : float, default=0.0
       Stochastic depth rate; set > 0 to apply DropPath regularization.
   act_layer : Type[nn.Module], default=nn.GELU
       Activation function used in the MLP layers.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer type applied in each sub-block.
   mlp_layer : Type[nn.Module], default=Mlp
       Module type used for the feed-forward MLP networks.

   .. py:method:: _forward(x)

   .. py:method:: _forward_jit(x)

   .. py:attribute:: attns

   .. py:attribute:: ffns

   .. py:method:: forward(x)

      Forward pass of the ParallelThingsBlock.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C).

      Returns
      -------
      torch.Tensor
          Output tensor of the same shape (B, N, C), representing the combined outputs from the parallel attention and MLP branches.

   .. py:attribute:: num_parallel
      :value: 2

.. py:class:: ResPostBlock(dim, num_heads, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, proj_drop = 0.0, attn_drop = 0.0, init_values = None, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`

   Residual Post-Norm Transformer block.

   Initialize internal Module state, shared by both nn.Module and ScriptModule.

   .. py:attribute:: attn

   .. py:attribute:: drop_path1

   .. py:attribute:: drop_path2

   .. py:method:: forward(x)

      Forward pass of the Residual Post-Norm Transformer block.

      The input tensor passes through attention and MLP sublayers, each followed by normalization and residual connections. DropPath is optionally applied for regularization.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, N, C), where B is batch size, N is sequence length, and C is embedding dimension.

      Returns
      -------
      torch.Tensor
          Output tensor of the same shape (B, N, C), representing the transformed features.

   .. py:attribute:: init_values
      :value: None

   .. py:method:: init_weights()

   .. py:attribute:: mlp

   .. py:attribute:: norm1

   .. py:attribute:: norm2

.. py:class:: VisionTransformer(img_size = 224, patch_size = 16, in_chans = 3, embed_dim = 768, depth = 12, num_heads = 12, mlp_ratio = 4.0, qkv_bias = True, qk_norm = False, proj_bias = True, init_values = None, class_token = True, pos_embed = 'learn', no_embed_class = False, reg_tokens = 0, pre_norm = False, dynamic_img_size = False, dynamic_img_pad = False, pos_drop_rate = 0.0, patch_drop_rate = 0.0, proj_drop_rate = 0.0, attn_drop_rate = 0.0, drop_path_rate = 0.0, weight_init = '', fix_init = False, embed_norm_layer = None, norm_layer = None, act_layer = None, block_fn = Block, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`

   Vision Transformer (ViT)

   A PyTorch implementation of the Vision Transformer architecture from *"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"* (https://arxiv.org/abs/2010.11929).

   This model divides an input image into fixed-size patches, embeds them, adds positional information, and processes them through a sequence of Transformer encoder blocks to learn global image representations.

   Initialize the Vision Transformer model.

   Parameters
   ----------
   img_size : int or tuple of int, default=224
       Input image size (height, width).
   patch_size : int or tuple of int, default=16
       Size of each image patch.
   in_chans : int, default=3
       Number of input channels (e.g., 3 for RGB images).
   num_classes : int, default=1000
       Number of output classes for classification.
   embed_dim : int, default=768
       Dimension of the patch embeddings.
   depth : int, default=12
       Number of Transformer encoder blocks.
   num_heads : int, default=12
       Number of attention heads per block.
   mlp_ratio : float, default=4.0
       Expansion ratio for the MLP hidden dimension.
   qkv_bias : bool, default=True
       If True, include bias in the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key vectors.
   proj_bias : bool, default=True
       If True, include bias in projection layers.
   init_values : float, optional
       Initial value for LayerScale; if None, LayerScale is disabled.
   class_token : bool, default=True
       If True, use a learnable class token.
   pos_embed : {'', 'none', 'learn'}, default='learn'
       Type of positional embedding; 'learn' enables learnable embeddings.
   no_embed_class : bool, default=False
       If True, exclude class and reg tokens from position embedding.
   reg_tokens : int, default=0
       Number of register tokens.
   pre_norm : bool, default=False
       If True, apply normalization before Transformer blocks.
   dynamic_img_size : bool, default=False
       If True, enables dynamic image resizing during inference.
   dynamic_img_pad : bool, default=False
       If True, apply padding to dynamically sized images.
   drop_rate : float, default=0.0
       Dropout rate applied globally.
   pos_drop_rate : float, default=0.0
       Dropout rate applied to positional embeddings.
   patch_drop_rate : float, default=0.0
       Probability of randomly dropping patch tokens during training.
   proj_drop_rate : float, default=0.0
       Dropout rate applied to projection layers.
   attn_drop_rate : float, default=0.0
       Dropout rate applied to attention weights.
   drop_path_rate : float, default=0.0
       Stochastic depth drop rate across layers.
   weight_init : {'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''
       Weight initialization strategy.
   fix_init : bool, default=False
       If True, rescales initialization following original ViT heuristics.
   embed_norm_layer : nn.Module, optional
       Normalization layer applied to embeddings.
   norm_layer : nn.Module, optional
       Normalization layer applied to Transformer blocks.
   act_layer : nn.Module, optional
       Activation function used in MLP layers.
   block_fn : nn.Module, default=Block
       Type of Transformer block used.
   mlp_layer : nn.Module, default=Mlp
       Type of MLP module used in each block.

   .. py:method:: _init_weights(m)

   .. py:method:: _pos_embed(x)

   .. py:attribute:: blocks

   .. py:attribute:: cls_token

   .. py:attribute:: dynamic_img_size
      :type: torch.jit.Final[bool]

   .. py:attribute:: feature_info

   .. py:method:: fix_init_weight()

   .. py:method:: forward(x)

      Forward pass of the Vision Transformer.

      Parameters
      ----------
      x : torch.Tensor
          Input image tensor of shape (B, C, H, W), where B is batch size, C is number of channels, and H, W are image dimensions.

      Returns
      -------
      torch.Tensor
          Encoded features of shape (B, N, D), where N is the number of patches (plus any prefix tokens) and D is the embedding dimension.

   .. py:attribute:: grad_checkpointing
      :value: False

   .. py:attribute:: has_class_token
      :value: True

   .. py:attribute:: in_channels
      :value: 3

   .. py:method:: init_weights(mode = '')

   .. py:attribute:: no_embed_class
      :value: False

   .. py:attribute:: norm_pre

   .. py:attribute:: num_prefix_tokens
      :value: 1

   .. py:attribute:: num_reg_tokens
      :value: 0

   .. py:attribute:: patch_embed

   .. py:attribute:: patch_size

   .. py:attribute:: pos_drop

   .. py:attribute:: reg_token

   .. py:method:: set_input_size(img_size = None, patch_size = None)

      Update the input image resolution and patch size.

      Args:
          img_size: New input resolution; if None, the current resolution is used.
          patch_size: New patch size; if None, the existing patch size is used.

.. py:function:: get_init_weights_vit(mode = 'jax', head_bias = 0.0)

.. py:function:: global_pool_nlc(x, pool_type = 'token', num_prefix_tokens = 1, reduce_include_prefix = False)

.. py:function:: init_weights_vit_jax(module, name = '', head_bias = 0.0)

   ViT weight initialization, matching JAX (Flax) impl

.. py:function:: init_weights_vit_moco(module, name = '')

   ViT weight initialization, matching moco-v3 impl minus fixed PatchEmbed

.. py:function:: init_weights_vit_timm(module, name = '')

   ViT weight initialization, original timm impl (for reproducibility)

.. py:function:: resize_pos_embed(posemb, posemb_new, num_prefix_tokens = 1, gs_new = (), interpolation = 'bicubic', antialias = False)

   Rescale the grid of position embeddings when loading from state_dict.

   *DEPRECATED* This function is being deprecated in favour of using resample_abs_pos_embed
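
As a rough illustration of the ``(B, N, D)`` output shape documented for ``VisionTransformer.forward``, the sequence length N can be estimated from the image size, the patch size, and the prefix tokens. The sketch below is plain Python and does not call this module. It assumes non-overlapping patches, no patch dropping, and that the prefix tokens are one optional class token plus ``reg_tokens``. The helper name ``expected_num_tokens`` is hypothetical and is not part of ``minerva``.

```python
def expected_num_tokens(img_size=224, patch_size=16, class_token=True, reg_tokens=0):
    """Hypothetical helper: estimate N in the (B, N, D) output of a ViT.

    Assumes non-overlapping patches, so each side holds img // patch patches.
    """
    # Accept either an int or a (height, width) tuple, as the parameter docs allow.
    height, width = (img_size, img_size) if isinstance(img_size, int) else img_size
    patch_h, patch_w = (patch_size, patch_size) if isinstance(patch_size, int) else patch_size
    num_patches = (height // patch_h) * (width // patch_w)
    # Assumed prefix tokens: one class token (if enabled) plus any reg tokens.
    num_prefix_tokens = (1 if class_token else 0) + reg_tokens
    return num_patches + num_prefix_tokens


print(expected_num_tokens())  # 14 * 14 patches + 1 class token = 197
```

With the default arguments the estimate uses one prefix token, which agrees with the ``num_prefix_tokens`` value of 1 listed for ``VisionTransformer``.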