minerva.models.nets.image.vit_local

Classes

`Block`	Transformer block module.
`PatchEmbed`	2D Image to Patch Embedding
`VisionTransformer`	Vision Transformer (ViT)

Package Contents

class minerva.models.nets.image.vit_local.Block(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)

Bases: torch.nn.Module

Transformer block module.

Initialize the Transformer block.

Parameters

dim (int): Embedding dimension of the input and output features.
num_heads (int): Number of attention heads in the self-attention layer.
mlp_ratio (float, default=4.0): Expansion ratio for the hidden dimension in the MLP layer.
qkv_bias (bool, default=False): If True, add bias to the query, key, and value projections.
qk_norm (bool, default=False): If True, apply normalization to query and key tensors.
proj_bias (bool, default=True): If True, include bias in the projection layers.
proj_drop (float, default=0.0): Dropout rate applied to the output of the attention and MLP layers.
attn_drop (float, default=0.0): Dropout rate applied to the attention weights.
init_values (float, optional): If specified, enables LayerScale with this initial scaling value.
drop_path (float, default=0.0): Stochastic depth rate; set > 0 to apply DropPath regularization.
act_layer (Type[nn.Module], default=nn.GELU): Activation function used in the MLP layer.
norm_layer (Type[nn.Module], default=nn.LayerNorm): Normalization layer type applied before attention and MLP.
mlp_layer (Type[nn.Module], default=Mlp): Module type used for the feed-forward network.
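
To make the relationship between dim, num_heads, and mlp_ratio concrete, here is a minimal sketch of the widths they imply. It assumes the usual Transformer convention (per-head width dim // num_heads, MLP hidden width int(dim * mlp_ratio)); the helper block_dims is hypothetical and not part of this module.

```python
def block_dims(dim: int, num_heads: int, mlp_ratio: float = 4.0) -> dict:
    """Derive per-head and MLP hidden widths from Block-style arguments.

    Assumption: the conventional formulas below; the module itself may
    compute these differently.
    """
    if dim % num_heads != 0:
        raise ValueError(f"dim={dim} must be divisible by num_heads={num_heads}")
    return {
        "head_dim": dim // num_heads,           # width of each attention head
        "mlp_hidden_dim": int(dim * mlp_ratio),  # hidden width of the MLP
    }


dims = block_dims(dim=768, num_heads=12)
print(dims)  # {'head_dim': 64, 'mlp_hidden_dim': 3072}
```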

attn

drop_path1

drop_path2

forward(x)

Parameters: x (torch.Tensor)
Return type: torch.Tensor
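
The attribute names (norm1, attn, ls1, drop_path1, norm2, mlp, ls2, drop_path2) suggest a pre-norm residual layout. The sketch below shows that wiring under the assumption that forward follows the common timm-style order; the function block_forward is hypothetical, and plain callables stand in for the real submodules.

```python
def identity(v):
    """Stand-in for LayerScale or DropPath when they are disabled."""
    return v


def block_forward(x, norm1, attn, norm2, mlp,
                  ls1=identity, ls2=identity,
                  drop_path1=identity, drop_path2=identity):
    """Assumed pre-norm residual wiring of a Transformer block."""
    # Attention branch: normalize, attend, scale, maybe drop, add residual.
    x = x + drop_path1(ls1(attn(norm1(x))))
    # MLP branch: same pattern with the feed-forward network.
    x = x + drop_path2(ls2(mlp(norm2(x))))
    return x


# Toy check with floats instead of tensors.
out = block_forward(
    1.0,
    norm1=identity, attn=lambda v: 2.0 * v,
    norm2=identity, mlp=lambda v: 0.5 * v,
)
print(out)  # 4.5
```

Because the function only uses calls and addition, the same wiring applies when x is a tensor and the callables are torch.nn.Module instances.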

ls1

ls2

mlp

norm1

norm2

Parameters:

dim (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
proj_drop (float)
attn_drop (float)
init_values (Optional[float])
drop_path (float)
act_layer (Type[torch.nn.Module])
norm_layer (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])

class minerva.models.nets.image.vit_local.PatchEmbed(img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, output_fmt=None, bias=True, strict_img_size=True, dynamic_img_pad=False)

Bases: torch.nn.Module

2D Image to Patch Embedding

Initialize the PatchEmbed module.

Parameters

img_size (int or Tuple[int, int], default=224): Input image size. If None, the image size is inferred dynamically.
patch_size (int or Tuple[int, int], default=16): Size of each image patch.
in_chans (int, default=3): Number of input channels (e.g., 3 for RGB images).
embed_dim (int, default=768): Dimension of the output patch embeddings.
norm_layer (Callable, optional): Normalization layer applied to the output embeddings.
flatten (bool, default=True): If True, flattens patches into a sequence (N, L, C).
output_fmt (str, optional): Output tensor format. If specified, overrides flatten.
bias (bool, default=True): Whether to include a bias term in the projection layer.
strict_img_size (bool, default=True): If True, requires input images to match the specified size exactly.
dynamic_img_pad (bool, default=False): If True, applies dynamic padding to images whose size is not divisible by the patch size.

_init_img_size(img_size)

Parameters: img_size (Union[int, Tuple[int, int]])

dynamic_feat_size(img_size)

Return the grid (feature-map) size for a given image size, taking dynamic padding into account. Note: this method must remain TorchScript-compatible, so it uses fixed tuple indexing.

Parameters: img_size (Tuple[int, int])
Return type: Tuple[int, int]
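
As an illustration of how dynamic padding changes the grid size, here is a minimal sketch. It assumes that padding rounds each dimension up to the next multiple of the patch size (ceiling division) and that the unpadded case rounds down; the standalone function feat_size is hypothetical and only mirrors the documented inputs and outputs.

```python
import math
from typing import Tuple


def feat_size(img_size: Tuple[int, int],
              patch_size: Tuple[int, int],
              dynamic_img_pad: bool = False) -> Tuple[int, int]:
    """Grid size (rows, cols) of patches for an image of the given size."""
    if dynamic_img_pad:
        # Assumption: padding extends the image to a multiple of the patch size.
        return (math.ceil(img_size[0] / patch_size[0]),
                math.ceil(img_size[1] / patch_size[1]))
    # Without padding, only whole patches are counted.
    return (img_size[0] // patch_size[0], img_size[1] // patch_size[1])


print(feat_size((224, 224), (16, 16)))                        # (14, 14)
print(feat_size((230, 230), (16, 16)))                        # (14, 14)
print(feat_size((230, 230), (16, 16), dynamic_img_pad=True))  # (15, 15)
```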

dynamic_img_pad: torch.jit.Final[bool]

feat_ratio(as_scalar=True)

Return type: Union[Tuple[int, int], int]

forward(x)

Forward pass that converts an input image into patch embeddings.

Parameters

x (torch.Tensor): Input tensor of shape (B, C, H, W), where B is the batch size, C is the number of channels, and H, W are the spatial dimensions.

Returns

torch.Tensor: Patch embeddings tensor. The shape depends on the output format:
- If flatten=True: (B, num_patches, embed_dim)
- If flatten=False and output_fmt='NCHW': (B, embed_dim, H_p, W_p)
- If another output format is used: the tensor is converted accordingly.

Parameters: x (torch.Tensor)
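
The two documented output shapes can be worked out by hand. The sketch below does the arithmetic for the flattened and NCHW cases; the helper patch_embed_output_shape is hypothetical, and as a simplification it only handles image sizes that are exact multiples of the patch size (no dynamic padding).

```python
from typing import Tuple


def patch_embed_output_shape(batch: int,
                             img_size: Tuple[int, int],
                             patch_size: Tuple[int, int],
                             embed_dim: int,
                             flatten: bool = True) -> Tuple[int, ...]:
    """Expected output shape of a patch embedding, computed without tensors."""
    if img_size[0] % patch_size[0] or img_size[1] % patch_size[1]:
        # Simplification of this sketch: no padding path is modeled.
        raise ValueError("img_size must be divisible by patch_size")
    h_p = img_size[0] // patch_size[0]  # patches along the height
    w_p = img_size[1] // patch_size[1]  # patches along the width
    if flatten:
        return (batch, h_p * w_p, embed_dim)  # (B, num_patches, embed_dim)
    return (batch, embed_dim, h_p, w_p)       # NCHW layout


print(patch_embed_output_shape(2, (224, 224), (16, 16), 768))
# (2, 196, 768)
print(patch_embed_output_shape(2, (224, 224), (16, 16), 768, flatten=False))
# (2, 768, 14, 14)
```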

norm

output_fmt: timm.layers.format.Format

patch_size

proj

set_input_size(img_size=None, patch_size=None)

Parameters:

img_size (Optional[Union[int, Tuple[int, int]]])
patch_size (Optional[Union[int, Tuple[int, int]]])

strict_img_size = True

Parameters:

img_size (Union[int, Tuple[int, int]])
patch_size (Union[int, Tuple[int, int]])
in_chans (int)
embed_dim (int)
norm_layer (Optional[Callable])
flatten (bool)
output_fmt (Optional[str])
bias (bool)
strict_img_size (bool)
dynamic_img_pad (bool)

class minerva.models.nets.image.vit_local.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, qk_norm=False, proj_bias=True, init_values=None, class_token=True, pos_embed='learn', no_embed_class=False, reg_tokens=0, pre_norm=False, dynamic_img_size=False, dynamic_img_pad=False, pos_drop_rate=0.0, patch_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, weight_init='', fix_init=False, embed_norm_layer=None, norm_layer=None, act_layer=None, block_fn=Block, mlp_layer=Mlp)

Bases: torch.nn.Module

Vision Transformer (ViT)

A PyTorch implementation of the Vision Transformer architecture from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929).

This model divides an input image into fixed-size patches, embeds them, adds positional information, and processes them through a sequence of Transformer encoder blocks to learn global image representations.

Initialize the Vision Transformer model.

Parameters

img_size (int or tuple of int, default=224): Input image size (height, width).
patch_size (int or tuple of int, default=16): Size of each image patch.
in_chans (int, default=3): Number of input channels (e.g., 3 for RGB images).
num_classes (int, default=1000): Number of output classes for classification. This argument does not appear in the constructor signature above.
embed_dim (int, default=768): Dimension of the patch embeddings.
depth (int, default=12): Number of Transformer encoder blocks.
num_heads (int, default=12): Number of attention heads per block.
mlp_ratio (float, default=4.0): Expansion ratio for the MLP hidden dimension.
qkv_bias (bool, default=True): If True, include bias in the query, key, and value projections.
qk_norm (bool, default=False): If True, apply normalization to query and key vectors.
proj_bias (bool, default=True): If True, include bias in projection layers.
init_values (float, optional): Initial value for LayerScale; if None, LayerScale is disabled.
class_token (bool, default=True): If True, use a learnable class token.
pos_embed ({'', 'none', 'learn'}, default='learn'): Type of positional embedding; 'learn' enables learnable embeddings.
no_embed_class (bool, default=False): If True, exclude class and register tokens from the position embedding.
reg_tokens (int, default=0): Number of auxiliary register tokens.
pre_norm (bool, default=False): If True, apply normalization before the Transformer blocks.
dynamic_img_size (bool, default=False): If True, enables dynamic image resizing during inference.
dynamic_img_pad (bool, default=False): If True, apply padding to dynamically sized images.
drop_rate (float, default=0.0): Dropout rate applied globally. This argument does not appear in the constructor signature above.
pos_drop_rate (float, default=0.0): Dropout rate applied to positional embeddings.
patch_drop_rate (float, default=0.0): Probability of randomly dropping patch tokens during training.
proj_drop_rate (float, default=0.0): Dropout rate applied to projection layers.
attn_drop_rate (float, default=0.0): Dropout rate applied to attention weights.
drop_path_rate (float, default=0.0): Stochastic depth drop rate across layers.
weight_init ({'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''): Weight initialization strategy.
fix_init (bool, default=False): If True, rescales initialization following original ViT heuristics.
embed_norm_layer (nn.Module, optional): Normalization layer applied to embeddings.
norm_layer (nn.Module, optional): Normalization layer applied to Transformer blocks.
act_layer (nn.Module, optional): Activation function used in MLP layers.
block_fn (nn.Module, default=Block): Type of Transformer block used.
mlp_layer (nn.Module, default=Mlp): Type of MLP module used in each block.
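
The description of drop_path_rate does not say how the rate is distributed over the depth blocks. A common convention, used in timm-style ViTs, is a linear ramp from 0 in the first block to drop_path_rate in the last. The sketch below illustrates that convention only; it is an assumption about this module, and drop_path_schedule is a hypothetical helper.

```python
from typing import List


def drop_path_schedule(drop_path_rate: float, depth: int) -> List[float]:
    """Per-block stochastic depth rates under a linear-ramp assumption."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    if depth == 1:
        return [0.0]  # a single block gets the start of the ramp
    # Block i receives a rate proportional to its index.
    return [drop_path_rate * i / (depth - 1) for i in range(depth)]


rates = drop_path_schedule(drop_path_rate=0.1, depth=12)
# Each rate would be passed as drop_path to the corresponding Block.
for i, r in enumerate(rates):
    print(f"block {i:2d}: drop_path={r:.4f}")
```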

_init_weights(m)

Parameters: m (torch.nn.Module)
Return type: None

_pos_embed(x)

Parameters: x (torch.Tensor)
Return type: torch.Tensor

blocks

cls_token

dynamic_img_size: torch.jit.Final[bool]

feature_info

fix_init_weight()

forward(x)

Forward pass of the Vision Transformer.

Parameters

x (torch.Tensor): Input image tensor of shape (B, C, H, W), where B is the batch size, C is the number of channels, and H, W are the image dimensions.

Returns

torch.Tensor: Encoded features of shape (B, N, D), where N is the number of patches (plus any prefix tokens) and D is the embedding dimension.

Parameters: x (torch.Tensor)
Return type: torch.Tensor
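
To see where N comes from, the following sketch counts tokens for a given configuration. It assumes the prefix tokens are one class token (when class_token=True) plus reg_tokens register tokens, which matches the defaults num_prefix_tokens = 1 and num_reg_tokens = 0 listed on this page; the helper vit_output_shape is hypothetical.

```python
from typing import Tuple


def vit_output_shape(batch: int,
                     img_size: Tuple[int, int] = (224, 224),
                     patch_size: Tuple[int, int] = (16, 16),
                     embed_dim: int = 768,
                     class_token: bool = True,
                     reg_tokens: int = 0) -> Tuple[int, int, int]:
    """Expected (B, N, D) shape of the encoded features."""
    num_patches = (img_size[0] // patch_size[0]) * (img_size[1] // patch_size[1])
    # Assumption: prefix tokens = optional class token + register tokens.
    num_prefix_tokens = (1 if class_token else 0) + reg_tokens
    return (batch, num_patches + num_prefix_tokens, embed_dim)


print(vit_output_shape(4))                                   # (4, 197, 768)
print(vit_output_shape(4, class_token=False, reg_tokens=4))  # (4, 200, 768)
```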

grad_checkpointing = False

has_class_token = True

in_channels = 3

init_weights(mode='')

Parameters: mode (str)
Return type: None

no_embed_class = False

norm_pre

num_prefix_tokens = 1

num_reg_tokens = 0

patch_embed

patch_size

pos_drop

reg_token

set_input_size(img_size=None, patch_size=None)

Update the input image resolution and patch size.

Args:
img_size: New input resolution; if None, the current resolution is used.
patch_size: New patch size; if None, the existing patch size is used.

Parameters:

img_size (Optional[Tuple[int, int]])
patch_size (Optional[Tuple[int, int]])

Parameters:

img_size (Union[int, Tuple[int, int]])
patch_size (Union[int, Tuple[int, int]])
in_chans (int)
embed_dim (int)
depth (int)
num_heads (int)
mlp_ratio (float)
qkv_bias (bool)
qk_norm (bool)
proj_bias (bool)
init_values (Optional[float])
class_token (bool)
pos_embed (str)
no_embed_class (bool)
reg_tokens (int)
pre_norm (bool)
dynamic_img_size (bool)
dynamic_img_pad (bool)
pos_drop_rate (float)
patch_drop_rate (float)
proj_drop_rate (float)
attn_drop_rate (float)
drop_path_rate (float)
weight_init (Literal['skip', 'jax', 'jax_nlhb', 'moco', ''])
fix_init (bool)
embed_norm_layer (Optional[timm.layers.LayerType])
norm_layer (Optional[timm.layers.LayerType])
act_layer (Optional[timm.layers.LayerType])
block_fn (Type[torch.nn.Module])
mlp_layer (Type[torch.nn.Module])