minerva.models.nets.image.vit_local
Submodules
Classes
- Block: Transformer block module.
- PatchEmbed: 2D Image to Patch Embedding
- VisionTransformer: Vision Transformer (ViT)
Package Contents
- class minerva.models.nets.image.vit_local.Block(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)
Bases: torch.nn.Module
Transformer block module.
Initialize the Transformer block.
Parameters
- dim : int
Embedding dimension of the input and output features.
- num_heads : int
Number of attention heads in the self-attention layer.
- mlp_ratio : float, default=4.0
Expansion ratio for the hidden dimension in the MLP layer.
- qkv_bias : bool, default=False
If True, add bias to the query, key, and value projections.
- qk_norm : bool, default=False
If True, apply normalization to query and key tensors.
- proj_bias : bool, default=True
If True, include bias in the projection layers.
- proj_drop : float, default=0.0
Dropout rate applied to the output of the attention and MLP layers.
- attn_drop : float, default=0.0
Dropout rate applied to the attention weights.
- init_values : float, optional
If specified, enables LayerScale with this initial scaling value.
- drop_path : float, default=0.0
Stochastic depth rate; set > 0 to apply DropPath regularization.
- act_layer : Type[nn.Module], default=nn.GELU
Activation function used in the MLP layer.
- norm_layer : Type[nn.Module], default=nn.LayerNorm
Normalization layer type applied before attention and MLP.
- mlp_layer : Type[nn.Module], default=Mlp
Module type used for the feed-forward network.
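The sizes implied by `dim`, `num_heads`, and `mlp_ratio` can be sketched in plain Python. This is a minimal illustration that assumes the common ViT convention, which this page does not confirm for `Block`: each attention head receives `dim // num_heads` channels and the MLP hidden width is `int(dim * mlp_ratio)`. `block_dims` is a hypothetical helper, not part of `vit_local`.

```python
def block_dims(dim: int, num_heads: int, mlp_ratio: float = 4.0) -> dict:
    """Hypothetical helper: derive per-head and MLP widths for one block."""
    # Assumption: the embedding is split evenly across the attention heads.
    if dim % num_heads != 0:
        raise ValueError("dim should be divisible by num_heads")
    return {
        "head_dim": dim // num_heads,
        # Assumption: the MLP expands the embedding by mlp_ratio.
        "mlp_hidden_dim": int(dim * mlp_ratio),
    }


dims = block_dims(dim=768, num_heads=12, mlp_ratio=4.0)
print(dims)
```

With the ViT-Base values used above, each head works on 64 channels and the MLP expands to 3072 hidden units before projecting back to `dim`.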
- attn
- drop_path1
- drop_path2
- ls1
- ls2
- mlp
- norm1
- norm2
- Parameters:
dim (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
proj_drop (float)
attn_drop (float)
init_values (Optional[float])
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])
- class minerva.models.nets.image.vit_local.PatchEmbed(img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, output_fmt=None, bias=True, strict_img_size=True, dynamic_img_pad=False)
Bases: torch.nn.Module
2D Image to Patch Embedding
Initialize the PatchEmbed module.
Parameters
- img_size : int or Tuple[int, int], default=224
Input image size. If None, image size will be inferred dynamically.
- patch_size : int or Tuple[int, int], default=16
Size of each image patch.
- in_chans : int, default=3
Number of input channels (e.g., 3 for RGB images).
- embed_dim : int, default=768
Dimension of the output patch embeddings.
- norm_layer : Callable, optional
Normalization layer applied to the output embeddings.
- flatten : bool, default=True
If True, flattens patches into a sequence (N, L, C).
- output_fmt : str, optional
Output tensor format. If specified, overrides flatten.
- bias : bool, default=True
Whether to include a bias term in the projection layer.
- strict_img_size : bool, default=True
If True, enforces input images to match the specified size exactly.
- dynamic_img_pad : bool, default=False
If True, applies dynamic padding for images not divisible by patch size.
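As a rough illustration of what dynamic padding has to achieve, the sketch below computes how many rows and columns must be added so that each spatial dimension becomes a multiple of the patch size. `dynamic_pad_amounts` is a hypothetical helper; where `PatchEmbed` actually places the padding is not stated on this page.

```python
from typing import Tuple


def dynamic_pad_amounts(
    img_hw: Tuple[int, int], patch_hw: Tuple[int, int]
) -> Tuple[int, int]:
    """Hypothetical helper: padding needed to reach a multiple of the patch size."""
    h, w = img_hw
    ph, pw = patch_hw
    # The outer modulo keeps the padding at 0 when the size already divides evenly.
    pad_h = (ph - h % ph) % ph
    pad_w = (pw - w % pw) % pw
    return pad_h, pad_w


print(dynamic_pad_amounts((224, 224), (16, 16)))
print(dynamic_pad_amounts((225, 230), (16, 16)))
```

A 224x224 image needs no padding with 16x16 patches, while a 225x230 image needs 15 extra rows and 10 extra columns to reach 240x240.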
- dynamic_feat_size(img_size)
Return the grid (feature map) size for a given image size, taking dynamic padding into account. Note: the method must stay TorchScript-compatible, so it uses fixed tuple indexing.
- Parameters:
img_size (Tuple[int, int])
- Return type:
Tuple[int, int]
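The grid-size rule described above can be sketched without PyTorch. This assumes, in line with similar patch-embedding implementations but not confirmed here, that the grid is rounded up when dynamic padding is enabled and rounded down otherwise. `feat_size` is a hypothetical stand-in for the method.

```python
import math
from typing import Tuple


def feat_size(
    img_hw: Tuple[int, int],
    patch_hw: Tuple[int, int],
    dynamic_img_pad: bool = False,
) -> Tuple[int, int]:
    """Hypothetical stand-in for dynamic_feat_size using fixed tuple indexing."""
    if dynamic_img_pad:
        # Padding rounds each dimension up to the next whole patch.
        return (
            math.ceil(img_hw[0] / patch_hw[0]),
            math.ceil(img_hw[1] / patch_hw[1]),
        )
    # Without padding, any partial patch at the border is not counted.
    return img_hw[0] // patch_hw[0], img_hw[1] // patch_hw[1]


print(feat_size((224, 224), (16, 16)))
print(feat_size((225, 230), (16, 16), dynamic_img_pad=True))
```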
- dynamic_img_pad: torch.jit.Final[bool]
- forward(x)
Forward pass that converts an input image into patch embeddings.
Parameters
- x : torch.Tensor
Input tensor of shape (B, C, H, W), where B is the batch size, C is the number of channels, and H, W are the spatial dimensions.
Returns
- torch.Tensor
Patch embeddings tensor. The shape depends on the output format:
- If flatten=True: (B, num_patches, embed_dim)
- If flatten=False and output_fmt='NCHW': (B, embed_dim, H_p, W_p)
- For any other output format, the tensor is converted accordingly.
- Parameters:
x (torch.Tensor)
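The two output shapes listed for `forward` can be worked out by hand. The sketch below assumes the image size divides evenly by the patch size and covers only the flattened and NCHW cases; `patch_embed_output_shape` is a hypothetical helper that computes shapes only and does not run the module.

```python
from typing import Tuple


def patch_embed_output_shape(
    batch: int,
    img_hw: Tuple[int, int],
    patch_hw: Tuple[int, int] = (16, 16),
    embed_dim: int = 768,
    flatten: bool = True,
) -> Tuple[int, ...]:
    """Hypothetical helper: expected output shape of a patch embedding."""
    grid_h = img_hw[0] // patch_hw[0]
    grid_w = img_hw[1] // patch_hw[1]
    if flatten:
        # Sequence layout: (B, num_patches, embed_dim)
        return (batch, grid_h * grid_w, embed_dim)
    # Spatial layout (NCHW): (B, embed_dim, H_p, W_p)
    return (batch, embed_dim, grid_h, grid_w)


print(patch_embed_output_shape(8, (224, 224)))
print(patch_embed_output_shape(8, (224, 224), flatten=False))
```

For a batch of eight 224x224 images and the default 16x16 patches, the flattened output has 14 * 14 = 196 patch tokens per image.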
- norm
- output_fmt: timm.layers.format.Format
- patch_size
- proj
- set_input_size(img_size=None, patch_size=None)
- Parameters:
img_size (Optional[Union[int, Tuple[int, int]]])
patch_size (Optional[Union[int, Tuple[int, int]]])
- strict_img_size = True
- Parameters:
img_size (Union[int, Tuple[int, int]])
patch_size (Union[int, Tuple[int, int]])
in_chans (int)
embed_dim (int)
norm_layer (Optional[Callable])
flatten (bool)
output_fmt (Optional[str])
bias (bool)
strict_img_size (bool)
dynamic_img_pad (bool)
- class minerva.models.nets.image.vit_local.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, qk_norm=False, proj_bias=True, init_values=None, class_token=True, pos_embed='learn', no_embed_class=False, reg_tokens=0, pre_norm=False, dynamic_img_size=False, dynamic_img_pad=False, pos_drop_rate=0.0, patch_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, weight_init='', fix_init=False, embed_norm_layer=None, norm_layer=None, act_layer=None, block_fn=Block, mlp_layer=Mlp)
Bases: torch.nn.Module
Vision Transformer (ViT)
A PyTorch implementation of the Vision Transformer architecture from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929).
This model divides an input image into fixed-size patches, embeds them, adds positional information, and processes them through a sequence of Transformer encoder blocks to learn global image representations.
Initialize the Vision Transformer model.
Parameters
- img_size : int or tuple of int, default=224
Input image size (height, width).
- patch_size : int or tuple of int, default=16
Size of each image patch.
- in_chans : int, default=3
Number of input channels (e.g., 3 for RGB images).
- num_classes : int, default=1000
Number of output classes for classification.
- embed_dim : int, default=768
Dimension of the patch embeddings.
- depth : int, default=12
Number of Transformer encoder blocks.
- num_heads : int, default=12
Number of attention heads per block.
- mlp_ratio : float, default=4.0
Expansion ratio for the MLP hidden dimension.
- qkv_bias : bool, default=True
If True, include bias in the query, key, and value projections.
- qk_norm : bool, default=False
If True, apply normalization to query and key vectors.
- proj_bias : bool, default=True
If True, include bias in projection layers.
- init_values : float, optional
Initial value for LayerScale; if None, LayerScale is disabled.
- class_token : bool, default=True
If True, use a learnable class token.
- pos_embed : {'', 'none', 'learn'}, default='learn'
Type of positional embedding; 'learn' enables learnable embeddings.
- no_embed_class : bool, default=False
If True, exclude the class and register tokens from the position embedding.
- reg_tokens : int, default=0
Number of register tokens (extra learnable prefix tokens).
- pre_norm : bool, default=False
If True, apply normalization before Transformer blocks.
- dynamic_img_size : bool, default=False
If True, enables dynamic image resizing during inference.
- dynamic_img_pad : bool, default=False
If True, apply padding to dynamically sized images.
- drop_rate : float, default=0.0
Dropout rate applied globally.
- pos_drop_rate : float, default=0.0
Dropout rate applied to positional embeddings.
- patch_drop_rate : float, default=0.0
Probability of randomly dropping patch tokens during training.
- proj_drop_rate : float, default=0.0
Dropout rate applied to projection layers.
- attn_drop_rate : float, default=0.0
Dropout rate applied to attention weights.
- drop_path_rate : float, default=0.0
Stochastic depth drop rate across layers.
- weight_init : {'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''
Weight initialization strategy.
- fix_init : bool, default=False
If True, rescales initialization following original ViT heuristics.
- embed_norm_layer : nn.Module, optional
Normalization layer applied to embeddings.
- norm_layer : nn.Module, optional
Normalization layer applied to Transformer blocks.
- act_layer : nn.Module, optional
Activation function used in MLP layers.
- block_fn : nn.Module, default=Block
Type of Transformer block used.
- mlp_layer : nn.Module, default=Mlp
Type of MLP module used in each block.
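One way a single `drop_path_rate` can be spread "across layers" is a linear ramp from 0 at the first block to the full rate at the last. The sketch below illustrates that rule only; it is an assumption borrowed from common ViT implementations, and this page does not confirm that `VisionTransformer` distributes the rate this way. `drop_path_schedule` is a hypothetical helper.

```python
from typing import List


def drop_path_schedule(drop_path_rate: float, depth: int) -> List[float]:
    """Hypothetical helper: linearly increasing stochastic depth rates."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        # A single block sits at the start of the ramp, so its rate is 0.
        return [0.0]
    # Block i receives a fraction i / (depth - 1) of the maximum rate.
    return [drop_path_rate * i / (depth - 1) for i in range(depth)]


rates = drop_path_schedule(drop_path_rate=0.1, depth=12)
print([round(r, 4) for r in rates])
```

Under this rule, early blocks are almost never skipped during training, while the deepest block is skipped with probability close to `drop_path_rate`.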
- blocks
- cls_token
- dynamic_img_size: torch.jit.Final[bool]
- feature_info
- forward(x)
Forward pass of the Vision Transformer.
Parameters
- x : torch.Tensor
Input image tensor of shape (B, C, H, W), where B is the batch size, C is the number of channels, and H, W are the image dimensions.
Returns
- torch.Tensor
Encoded features of shape (B, N, D), where N is the number of patches (plus any prefix tokens) and D is the embedding dimension.
- Parameters:
x (torch.Tensor)
- Return type:
torch.Tensor
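The sequence length N in the returned shape (B, N, D) can be estimated from the constructor arguments. The sketch below assumes a square image whose size divides evenly by the patch size, and that the prefix tokens are one class token (if enabled) plus `reg_tokens`; `encoded_shape` is a hypothetical helper that computes shapes only.

```python
from typing import Tuple


def encoded_shape(
    batch: int,
    img_size: int = 224,
    patch_size: int = 16,
    embed_dim: int = 768,
    class_token: bool = True,
    reg_tokens: int = 0,
) -> Tuple[int, int, int]:
    """Hypothetical helper: expected (B, N, D) shape of the encoded features."""
    grid = img_size // patch_size
    num_patches = grid * grid
    # Prefix tokens are placed in front of the patch tokens.
    num_prefix_tokens = (1 if class_token else 0) + reg_tokens
    return (batch, num_patches + num_prefix_tokens, embed_dim)


print(encoded_shape(2))
print(encoded_shape(2, class_token=False, reg_tokens=4))
```

With the default arguments this gives 196 patch tokens plus one class token, which matches the `num_prefix_tokens = 1` attribute listed for the class.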
- grad_checkpointing = False
- has_class_token = True
- in_channels = 3
- no_embed_class = False
- norm_pre
- num_prefix_tokens = 1
- num_reg_tokens = 0
- patch_embed
- patch_size
- pos_drop
- reg_token
- set_input_size(img_size=None, patch_size=None)
Update the input image resolution and patch size.
- Args:
img_size: New input resolution. If None, the current resolution is used.
patch_size: New patch size. If None, the existing patch size is used.
- Parameters:
img_size (Optional[Tuple[int, int]])
patch_size (Optional[Tuple[int, int]])
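Changing the input resolution or patch size changes the number of patch positions the model must embed. The sketch below estimates the resulting position-embedding length, assuming the embedding covers the prefix tokens unless `no_embed_class` is set; that behavior is inferred from the parameter descriptions on this page rather than confirmed. `pos_embed_length` is a hypothetical helper.

```python
from typing import Tuple


def pos_embed_length(
    img_hw: Tuple[int, int],
    patch_hw: Tuple[int, int],
    num_prefix_tokens: int = 1,
    no_embed_class: bool = False,
) -> int:
    """Hypothetical helper: number of positions needed after a size change."""
    num_patches = (img_hw[0] // patch_hw[0]) * (img_hw[1] // patch_hw[1])
    if no_embed_class:
        # Prefix tokens are excluded from the position embedding.
        return num_patches
    return num_patches + num_prefix_tokens


print(pos_embed_length((224, 224), (16, 16)))
print(pos_embed_length((384, 384), (16, 16)))
print(pos_embed_length((384, 384), (16, 16), no_embed_class=True))
```

Moving from 224x224 to 384x384 inputs with 16x16 patches grows the patch grid from 14x14 to 24x24, so the position embedding has to cover 576 patch positions instead of 196.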
- Parameters:
img_size (Union[int, Tuple[int, int]])
patch_size (Union[int, Tuple[int, int]])
in_chans (int)
embed_dim (int)
depth (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
init_values (Optional[float])
class_token (bool)
pos_embed (str)
no_embed_class (bool)
reg_tokens (int)
pre_norm (bool)
dynamic_img_size (bool)
dynamic_img_pad (bool)
pos_drop_rate (float)
patch_drop_rate (float)
proj_drop_rate (float)
attn_drop_rate (float)
drop_path_rate (float)
weight_init (Literal['skip', 'jax', 'jax_nlhb', 'moco', ''])
fix_init (bool)
embed_norm_layer (Optional[timm.layers.LayerType])
norm_layer (Optional[timm.layers.LayerType])
act_layer (Optional[timm.layers.LayerType])
block_fn (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])