minerva.models.nets.image.vit_local
===================================

.. py:module:: minerva.models.nets.image.vit_local


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/minerva/models/nets/image/vit_local/patch_embed/index
   /autoapi/minerva/models/nets/image/vit_local/vit/index


Classes
-------

.. autoapisummary::

   minerva.models.nets.image.vit_local.Block
   minerva.models.nets.image.vit_local.PatchEmbed
   minerva.models.nets.image.vit_local.VisionTransformer


Package Contents
----------------

.. py:class:: Block(dim, num_heads, mlp_ratio = 4.0, qkv_bias = False, qk_norm = False, proj_bias = True, proj_drop = 0.0, attn_drop = 0.0, init_values = None, drop_path = 0.0, act_layer = nn.GELU, norm_layer = nn.LayerNorm, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`


   Transformer block module.

   Initialize the Transformer block.

   Parameters
   ----------
   dim : int
       Embedding dimension of the input and output features.
   num_heads : int
       Number of attention heads in the self-attention layer.
   mlp_ratio : float, default=4.0
       Expansion ratio for the hidden dimension in the MLP layer.
   qkv_bias : bool, default=False
       If True, add bias to the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key tensors.
   proj_bias : bool, default=True
       If True, include bias in the projection layers.
   proj_drop : float, default=0.0
       Dropout rate applied to the output of the attention and MLP layers.
   attn_drop : float, default=0.0
       Dropout rate applied to the attention weights.
   init_values : float, optional
       If specified, enables LayerScale with this initial scaling value.
   drop_path : float, default=0.0
       Stochastic depth rate; set > 0 to apply DropPath regularization.
   act_layer : Type[nn.Module], default=nn.GELU
       Activation function used in the MLP layer.
   norm_layer : Type[nn.Module], default=nn.LayerNorm
       Normalization layer type applied before attention and MLP.
   mlp_layer : Type[nn.Module], default=Mlp
       Module type used for the feed-forward network.


   .. py:attribute:: attn


   .. py:attribute:: drop_path1


   .. py:attribute:: drop_path2


   .. py:method:: forward(x)


   .. py:attribute:: ls1


   .. py:attribute:: ls2


   .. py:attribute:: mlp


   .. py:attribute:: norm1


   .. py:attribute:: norm2


.. py:class:: PatchEmbed(img_size = 224, patch_size = 16, in_chans = 3, embed_dim = 768, norm_layer = None, flatten = True, output_fmt = None, bias = True, strict_img_size = True, dynamic_img_pad = False)

   Bases: :py:obj:`torch.nn.Module`


   2D Image to Patch Embedding

   Initialize the PatchEmbed module.

   Parameters
   ----------
   img_size : int or Tuple[int, int], default=224
       Input image size. If None, image size will be inferred dynamically.
   patch_size : int or Tuple[int, int], default=16
       Size of each image patch.
   in_chans : int, default=3
       Number of input channels (e.g., 3 for RGB images).
   embed_dim : int, default=768
       Dimension of the output patch embeddings.
   norm_layer : Callable, optional
       Normalization layer applied to the output embeddings.
   flatten : bool, default=True
       If True, flattens patches into a sequence (N, L, C).
   output_fmt : str, optional
       Output tensor format. If specified, overrides `flatten`.
   bias : bool, default=True
       Whether to include a bias term in the projection layer.
   strict_img_size : bool, default=True
       If True, enforces input images to match the specified size exactly.
   dynamic_img_pad : bool, default=False
       If True, applies dynamic padding for images not divisible by patch size.


   .. py:method:: _init_img_size(img_size)


   .. py:method:: dynamic_feat_size(img_size)

      Get grid (feature) size for given image size taking account of dynamic padding.
      NOTE: must be torchscript compatible so using fixed tuple indexing



   .. py:attribute:: dynamic_img_pad
      :type:  torch.jit.Final[bool]


   .. py:method:: feat_ratio(as_scalar=True)


   .. py:method:: forward(x)

      Forward pass that converts an input image into patch embeddings.

      Parameters
      ----------
      x : torch.Tensor
          Input tensor of shape (B, C, H, W), where B is batch size,
          C is number of channels, and H, W are spatial dimensions.

      Returns
      -------
      torch.Tensor
          Patch embeddings tensor. Shape depends on output format:
          - If `flatten=True`: (B, num_patches, embed_dim)
          - If `flatten=False` and `output_fmt='NCHW'`: (B, embed_dim, H_p, W_p)
          - If using another output format: tensor is converted accordingly.



   .. py:attribute:: norm


   .. py:attribute:: output_fmt
      :type:  timm.layers.format.Format


   .. py:attribute:: patch_size


   .. py:attribute:: proj


   .. py:method:: set_input_size(img_size = None, patch_size = None)


   .. py:attribute:: strict_img_size
      :value: True



.. py:class:: VisionTransformer(img_size = 224, patch_size = 16, in_chans = 3, embed_dim = 768, depth = 12, num_heads = 12, mlp_ratio = 4.0, qkv_bias = True, qk_norm = False, proj_bias = True, init_values = None, class_token = True, pos_embed = 'learn', no_embed_class = False, reg_tokens = 0, pre_norm = False, dynamic_img_size = False, dynamic_img_pad = False, pos_drop_rate = 0.0, patch_drop_rate = 0.0, proj_drop_rate = 0.0, attn_drop_rate = 0.0, drop_path_rate = 0.0, weight_init = '', fix_init = False, embed_norm_layer = None, norm_layer = None, act_layer = None, block_fn = Block, mlp_layer = Mlp)

   Bases: :py:obj:`torch.nn.Module`


   Vision Transformer (ViT)

   A PyTorch implementation of the Vision Transformer architecture from
   *"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"*
   (https://arxiv.org/abs/2010.11929).

   This model divides an input image into fixed-size patches, embeds them,
   adds positional information, and processes them through a sequence of
   Transformer encoder blocks to learn global image representations.

   Initialize the Vision Transformer model.

   Parameters
   ----------
   img_size : int or tuple of int, default=224
       Input image size (height, width).
   patch_size : int or tuple of int, default=16
       Size of each image patch.
   in_chans : int, default=3
       Number of input channels (e.g., 3 for RGB images).
   num_classes : int, default=1000
       Number of output classes for classification.
   embed_dim : int, default=768
       Dimension of the patch embeddings.
   depth : int, default=12
       Number of Transformer encoder blocks.
   num_heads : int, default=12
       Number of attention heads per block.
   mlp_ratio : float, default=4.0
       Expansion ratio for the MLP hidden dimension.
   qkv_bias : bool, default=True
       If True, include bias in the query, key, and value projections.
   qk_norm : bool, default=False
       If True, apply normalization to query and key vectors.
   proj_bias : bool, default=True
       If True, include bias in projection layers.
   init_values : float, optional
       Initial value for LayerScale; if None, LayerScale is disabled.
   class_token : bool, default=True
       If True, use a learnable class token.
   pos_embed : {'', 'none', 'learn'}, default='learn'
       Type of positional embedding; 'learn' enables learnable embeddings.
   no_embed_class : bool, default=False
       If True, exclude class and reg tokens from position embedding.
   reg_tokens : int, default=0
       Number of register tokens.
   pre_norm : bool, default=False
       If True, apply normalization before Transformer blocks.
   dynamic_img_size : bool, default=False
       If True, enables dynamic image resizing during inference.
   dynamic_img_pad : bool, default=False
       If True, apply padding to dynamically sized images.
   drop_rate : float, default=0.0
       Dropout rate applied globally.
   pos_drop_rate : float, default=0.0
       Dropout rate applied to positional embeddings.
   patch_drop_rate : float, default=0.0
       Probability of randomly dropping patch tokens during training.
   proj_drop_rate : float, default=0.0
       Dropout rate applied to projection layers.
   attn_drop_rate : float, default=0.0
       Dropout rate applied to attention weights.
   drop_path_rate : float, default=0.0
       Stochastic depth drop rate across layers.
   weight_init : {'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''
       Weight initialization strategy.
   fix_init : bool, default=False
       If True, rescales initialization following original ViT heuristics.
   embed_norm_layer : nn.Module, optional
       Normalization layer applied to embeddings.
   norm_layer : nn.Module, optional
       Normalization layer applied to Transformer blocks.
   act_layer : nn.Module, optional
       Activation function used in MLP layers.
   block_fn : nn.Module, default=Block
       Type of Transformer block used.
   mlp_layer : nn.Module, default=Mlp
       Type of MLP module used in each block.


   .. py:method:: _init_weights(m)


   .. py:method:: _pos_embed(x)


   .. py:attribute:: blocks


   .. py:attribute:: cls_token


   .. py:attribute:: dynamic_img_size
      :type:  torch.jit.Final[bool]


   .. py:attribute:: feature_info


   .. py:method:: fix_init_weight()


   .. py:method:: forward(x)

      Forward pass of the Vision Transformer.

      Parameters
      ----------
      x : torch.Tensor
          Input image tensor of shape (B, C, H, W), where B is batch size,
          C is number of channels, and H, W are image dimensions.

      Returns
      -------
      torch.Tensor
          Encoded features of shape (B, N, D), where N is the number of patches
          (plus any prefix tokens) and D is the embedding dimension.



   .. py:attribute:: grad_checkpointing
      :value: False



   .. py:attribute:: has_class_token
      :value: True



   .. py:attribute:: in_channels
      :value: 3



   .. py:method:: init_weights(mode = '')


   .. py:attribute:: no_embed_class
      :value: False



   .. py:attribute:: norm_pre


   .. py:attribute:: num_prefix_tokens
      :value: 1



   .. py:attribute:: num_reg_tokens
      :value: 0



   .. py:attribute:: patch_embed


   .. py:attribute:: patch_size


   .. py:attribute:: pos_drop


   .. py:attribute:: reg_token


   .. py:method:: set_input_size(img_size = None, patch_size = None)

      Update the input image resolution and patch size.

      Args:
          img_size: New input resolution; if None, the current resolution is used.
          patch_size: New patch size; if None, the existing patch size is used.
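
The shapes documented above can be checked with a little arithmetic. The following is a minimal, pure-Python sketch, not part of ``minerva``: the helper names ``to_2tuple``, ``grid_size`` and ``sequence_length`` are hypothetical, and two behaviours are assumptions rather than facts taken from this page: that the patch grid is rounded up when ``dynamic_img_pad`` is enabled and truncated otherwise, and that the prefix tokens consist of one optional class token plus ``reg_tokens`` (which agrees with the defaults ``num_prefix_tokens = 1`` and ``num_reg_tokens = 0`` listed above).

```python
import math
from typing import Tuple, Union

IntOrPair = Union[int, Tuple[int, int]]


def to_2tuple(value: IntOrPair) -> Tuple[int, int]:
    """Expand a scalar size to (height, width); pass pairs through."""
    if isinstance(value, int):
        return (value, value)
    return (int(value[0]), int(value[1]))


def grid_size(
    img_size: IntOrPair,
    patch_size: IntOrPair,
    dynamic_img_pad: bool = False,
) -> Tuple[int, int]:
    """Number of patches along height and width (H_p, W_p)."""
    height, width = to_2tuple(img_size)
    patch_h, patch_w = to_2tuple(patch_size)
    if dynamic_img_pad:
        # Assumption: padding extends the image to the next multiple of the
        # patch size, so a partial patch still counts as one patch.
        return (math.ceil(height / patch_h), math.ceil(width / patch_w))
    # Assumption: without padding, any remainder is not covered by a patch.
    return (height // patch_h, width // patch_w)


def sequence_length(
    img_size: IntOrPair = 224,
    patch_size: IntOrPair = 16,
    class_token: bool = True,
    reg_tokens: int = 0,
    dynamic_img_pad: bool = False,
) -> int:
    """Token count N in the (B, N, D) output: patches plus prefix tokens."""
    grid_h, grid_w = grid_size(img_size, patch_size, dynamic_img_pad)
    # Assumption: prefix tokens are the optional class token plus reg tokens.
    num_prefix_tokens = (1 if class_token else 0) + reg_tokens
    return grid_h * grid_w + num_prefix_tokens
```

With the default arguments, a 224x224 image and 16x16 patches give a 14x14 grid, so ``sequence_length()`` evaluates to 196 patches plus one class token, that is 197 tokens. The actual module behaviour should be confirmed against the source in the ``vit`` and ``patch_embed`` submodules.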