minerva.models.nets.image.vit_local

Classes

Block

Transformer block module.

PatchEmbed

2D Image to Patch Embedding

VisionTransformer

Vision Transformer (ViT)

Package Contents

class minerva.models.nets.image.vit_local.Block(dim, num_heads, mlp_ratio=4.0, qkv_bias=False, qk_norm=False, proj_bias=True, proj_drop=0.0, attn_drop=0.0, init_values=None, drop_path=0.0, act_layer=nn.GELU, norm_layer=nn.LayerNorm, mlp_layer=Mlp)

Bases: torch.nn.Module

Transformer block module.

Initialize the Transformer block.

Parameters

dim : int

Embedding dimension of the input and output features.

num_heads : int

Number of attention heads in the self-attention layer.

mlp_ratio : float, default=4.0

Expansion ratio for the hidden dimension in the MLP layer.

qkv_bias : bool, default=False

If True, add bias to the query, key, and value projections.

qk_norm : bool, default=False

If True, apply normalization to query and key tensors.

proj_bias : bool, default=True

If True, include bias in the projection layers.

proj_drop : float, default=0.0

Dropout rate applied to the output of the attention and MLP layers.

attn_drop : float, default=0.0

Dropout rate applied to the attention weights.

init_values : float, optional

If specified, enables LayerScale with this initial scaling value.

drop_path : float, default=0.0

Stochastic depth rate; set > 0 to apply DropPath regularization.

act_layer : Type[nn.Module], default=nn.GELU

Activation function used in the MLP layer.

norm_layer : Type[nn.Module], default=nn.LayerNorm

Normalization layer type applied before attention and MLP.

mlp_layer : Type[nn.Module], default=Mlp

Module type used for the feed-forward network.
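
To make the size-related parameters concrete, the sketch below computes the per-head width and the MLP hidden width implied by dim, num_heads, and mlp_ratio. It assumes the usual ViT conventions (dim split evenly across heads, hidden width equal to int(dim * mlp_ratio)); the helper name block_sizes is hypothetical and not part of this module.

```python
from typing import Tuple


def block_sizes(dim: int, num_heads: int, mlp_ratio: float = 4.0) -> Tuple[int, int]:
    """Return (head_dim, mlp_hidden_dim) under the usual ViT conventions.

    Hypothetical helper for illustration; not part of vit_local.
    """
    # Assumption: the embedding is split evenly across attention heads.
    if dim % num_heads != 0:
        raise ValueError("dim must be divisible by num_heads")
    head_dim = dim // num_heads
    # Assumption: the MLP hidden width is the embedding width times mlp_ratio.
    mlp_hidden_dim = int(dim * mlp_ratio)
    return head_dim, mlp_hidden_dim


head_dim, mlp_hidden_dim = block_sizes(dim=768, num_heads=12)
print(head_dim, mlp_hidden_dim)
```

With dim=768, num_heads=12, and the default mlp_ratio, each head works on 64 channels and the MLP expands to 3072 hidden units.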

attn
drop_path1
drop_path2
forward(x)
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

ls1
ls2
mlp
norm1
norm2
Parameters:
  • dim (int)

  • num_heads (int)

  • mlp_ratio (float)

  • qkv_bias (bool)

  • qk_norm (bool)

  • proj_bias (bool)

  • proj_drop (float)

  • attn_drop (float)

  • init_values (Optional[float])

  • drop_path (float)

  • act_layer (Type[torch.nn.Module])

  • norm_layer (Type[torch.nn.Module])

  • mlp_layer (Type[torch.nn.Module])
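
The parameter descriptions above say that normalization is applied before attention and before the MLP, which is the pre-norm residual layout. The following is a minimal sketch of that data flow, not the library's Block: it swaps in torch.nn.MultiheadAttention for the attention layer, and it leaves out LayerScale, DropPath, qk_norm, and all dropout. The class name SketchBlock is hypothetical; the attribute names mirror the ones listed for Block.

```python
import torch
from torch import nn


class SketchBlock(nn.Module):
    """Hypothetical pre-norm Transformer block; not the library's Block."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: float = 4.0) -> None:
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Stand-in attention layer (assumption): PyTorch's built-in module.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual branch 1: normalize, self-attend, add back to the input.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Residual branch 2: normalize, apply the MLP, add back.
        x = x + self.mlp(self.norm2(x))
        return x


block = SketchBlock(dim=64, num_heads=4)
tokens = torch.randn(2, 10, 64)  # (batch, sequence length, dim)
out = block(tokens)
print(out.shape)
```

The output keeps the input shape, which is what allows blocks of this kind to be stacked depth times.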

class minerva.models.nets.image.vit_local.PatchEmbed(img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, output_fmt=None, bias=True, strict_img_size=True, dynamic_img_pad=False)

Bases: torch.nn.Module

2D Image to Patch Embedding

Initialize the PatchEmbed module.

Parameters

img_size : int or Tuple[int, int], default=224

Input image size. If None, image size will be inferred dynamically.

patch_size : int or Tuple[int, int], default=16

Size of each image patch.

in_chans : int, default=3

Number of input channels (e.g., 3 for RGB images).

embed_dim : int, default=768

Dimension of the output patch embeddings.

norm_layer : Callable, optional

Normalization layer applied to the output embeddings.

flatten : bool, default=True

If True, flattens patches into a sequence (N, L, C).

output_fmt : str, optional

Output tensor format. If specified, overrides flatten.

bias : bool, default=True

Whether to include a bias term in the projection layer.

strict_img_size : bool, default=True

If True, enforces input images to match the specified size exactly.

dynamic_img_pad : bool, default=False

If True, applies dynamic padding for images not divisible by patch size.

_init_img_size(img_size)
Parameters:

img_size (Union[int, Tuple[int, int]])

dynamic_feat_size(img_size)

Return the grid (feature) size for the given image size, taking dynamic padding into account. Note: the implementation uses fixed tuple indexing so that it remains TorchScript compatible.

Parameters:

img_size (Tuple[int, int])

Return type:

Tuple[int, int]
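
The grid-size calculation can be sketched in plain Python. This assumes that, with dynamic padding enabled, the image is padded up to the next multiple of the patch size (so the grid rounds up), and that without it any remainder is dropped (so the grid rounds down). The function name grid_size is hypothetical, and the actual method may differ in detail.

```python
import math
from typing import Tuple


def grid_size(
    img_size: Tuple[int, int],
    patch_size: Tuple[int, int],
    dynamic_img_pad: bool,
) -> Tuple[int, int]:
    """Hypothetical sketch of a patch-grid size calculation."""
    if dynamic_img_pad:
        # Assumption: padding extends each side to a multiple of the patch size.
        return (
            math.ceil(img_size[0] / patch_size[0]),
            math.ceil(img_size[1] / patch_size[1]),
        )
    # Assumption: without padding, only whole patches are counted.
    return img_size[0] // patch_size[0], img_size[1] // patch_size[1]


print(grid_size((224, 224), (16, 16), dynamic_img_pad=False))
print(grid_size((230, 224), (16, 16), dynamic_img_pad=True))
```

A 224 x 224 image with 16 x 16 patches gives a 14 x 14 grid either way; a height of 230 only fits a 15th row of patches when padding is enabled.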

dynamic_img_pad: torch.jit.Final[bool]
feat_ratio(as_scalar=True)
Return type:

Union[Tuple[int, int], int]

forward(x)

Forward pass that converts an input image into patch embeddings.

Parameters

x : torch.Tensor

Input tensor of shape (B, C, H, W), where B is batch size, C is number of channels, and H, W are spatial dimensions.

Returns

torch.Tensor

Patch embeddings tensor. Shape depends on the output format:
  • If flatten=True: (B, num_patches, embed_dim)

  • If flatten=False and output_fmt='NCHW': (B, embed_dim, H_p, W_p)

  • If using another output format: the tensor is converted accordingly.

Parameters:

x (torch.Tensor)
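
The two main output shapes can be reproduced with a few lines of PyTorch. This is a sketch that assumes the projection is a strided convolution whose kernel and stride both equal the patch size, which is a common way to implement patch embedding; it does not call PatchEmbed itself, and the small sizes are chosen only to keep the example fast.

```python
import torch
from torch import nn

# Assumption: patches are extracted with a convolution whose kernel size and
# stride both equal the patch size, so each output position covers one patch.
proj = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=16, stride=16)

images = torch.randn(2, 3, 64, 64)  # (B, C, H, W)

# NCHW layout: (B, embed_dim, H_p, W_p), with H_p = W_p = 64 / 16 = 4.
feature_map = proj(images)

# Flattened layout: (B, num_patches, embed_dim), with num_patches = 4 * 4 = 16.
tokens = feature_map.flatten(2).transpose(1, 2)

print(feature_map.shape, tokens.shape)
```

The flattened layout is the one a Transformer block consumes, since it treats each patch as one token in a sequence.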

norm
output_fmt: timm.layers.format.Format
patch_size
proj
set_input_size(img_size=None, patch_size=None)
Parameters:
  • img_size (Optional[Union[int, Tuple[int, int]]])

  • patch_size (Optional[Union[int, Tuple[int, int]]])

strict_img_size = True
Parameters:
  • img_size (Union[int, Tuple[int, int]])

  • patch_size (Union[int, Tuple[int, int]])

  • in_chans (int)

  • embed_dim (int)

  • norm_layer (Optional[Callable])

  • flatten (bool)

  • output_fmt (Optional[str])

  • bias (bool)

  • strict_img_size (bool)

  • dynamic_img_pad (bool)

class minerva.models.nets.image.vit_local.VisionTransformer(img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4.0, qkv_bias=True, qk_norm=False, proj_bias=True, init_values=None, class_token=True, pos_embed='learn', no_embed_class=False, reg_tokens=0, pre_norm=False, dynamic_img_size=False, dynamic_img_pad=False, pos_drop_rate=0.0, patch_drop_rate=0.0, proj_drop_rate=0.0, attn_drop_rate=0.0, drop_path_rate=0.0, weight_init='', fix_init=False, embed_norm_layer=None, norm_layer=None, act_layer=None, block_fn=Block, mlp_layer=Mlp)

Bases: torch.nn.Module

Vision Transformer (ViT)

A PyTorch implementation of the Vision Transformer architecture from “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” (https://arxiv.org/abs/2010.11929).

This model divides an input image into fixed-size patches, embeds them, adds positional information, and processes them through a sequence of Transformer encoder blocks to learn global image representations.

Initialize the Vision Transformer model.

Parameters

img_size : int or tuple of int, default=224

Input image size (height, width).

patch_size : int or tuple of int, default=16

Size of each image patch.

in_chans : int, default=3

Number of input channels (e.g., 3 for RGB images).

num_classes : int, default=1000

Number of output classes for classification. This parameter is documented here but does not appear in the constructor signature above.

embed_dim : int, default=768

Dimension of the patch embeddings.

depth : int, default=12

Number of Transformer encoder blocks.

num_heads : int, default=12

Number of attention heads per block.

mlp_ratio : float, default=4.0

Expansion ratio for the MLP hidden dimension.

qkv_bias : bool, default=True

If True, include bias in the query, key, and value projections.

qk_norm : bool, default=False

If True, apply normalization to query and key vectors.

proj_bias : bool, default=True

If True, include bias in projection layers.

init_values : float, optional

Initial value for LayerScale; if None, LayerScale is disabled.

class_token : bool, default=True

If True, use a learnable class token.

pos_embed : {'', 'none', 'learn'}, default='learn'

Type of positional embedding; 'learn' enables learnable embeddings.

no_embed_class : bool, default=False

If True, exclude class and reg tokens from position embedding.

reg_tokens : int, default=0

Number of register tokens.

pre_norm : bool, default=False

If True, apply normalization before Transformer blocks.

dynamic_img_size : bool, default=False

If True, allow input images whose size differs from img_size instead of requiring a fixed resolution.

dynamic_img_pad : bool, default=False

If True, apply padding to dynamically sized images.

drop_rate : float, default=0.0

Dropout rate applied globally. This parameter is documented here but does not appear in the constructor signature above.

pos_drop_rate : float, default=0.0

Dropout rate applied to positional embeddings.

patch_drop_rate : float, default=0.0

Probability of randomly dropping patch tokens during training.

proj_drop_rate : float, default=0.0

Dropout rate applied to projection layers.

attn_drop_rate : float, default=0.0

Dropout rate applied to attention weights.

drop_path_rate : float, default=0.0

Stochastic depth drop rate across layers.

weight_init : {'skip', 'jax', 'jax_nlhb', 'moco', ''}, default=''

Weight initialization strategy.

fix_init : bool, default=False

If True, rescales initialization following original ViT heuristics.

embed_norm_layer : nn.Module, optional

Normalization layer applied to embeddings.

norm_layer : nn.Module, optional

Normalization layer applied to Transformer blocks.

act_layer : nn.Module, optional

Activation function used in MLP layers.

block_fn : nn.Module, default=Block

Type of Transformer block used.

mlp_layer : nn.Module, default=Mlp

Type of MLP module used in each block.
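
The phrase "across layers" in the drop_path_rate description suggests a per-block schedule rather than one shared rate. A common convention in ViT implementations is to increase the rate linearly from 0 at the first block to drop_path_rate at the last. Whether this module follows that exact rule is an assumption, and the function name drop_path_schedule is hypothetical.

```python
from typing import List


def drop_path_schedule(drop_path_rate: float, depth: int) -> List[float]:
    """Hypothetical linear stochastic-depth schedule, one rate per block."""
    if depth == 1:
        # A single block has no range to interpolate over; start at zero.
        return [0.0]
    # Assumption: rates rise linearly from 0 to drop_path_rate with depth.
    return [drop_path_rate * i / (depth - 1) for i in range(depth)]


rates = drop_path_schedule(drop_path_rate=0.1, depth=12)
print([round(r, 3) for r in rates])
```

Under this schedule the early blocks are almost never skipped and the deepest block is skipped with probability drop_path_rate during training.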

_init_weights(m)
Parameters:

m (torch.nn.Module)

Return type:

None

_pos_embed(x)
Parameters:

x (torch.Tensor)

Return type:

torch.Tensor

blocks
cls_token
dynamic_img_size: torch.jit.Final[bool]
feature_info
fix_init_weight()
forward(x)

Forward pass of the Vision Transformer.

Parameters

x : torch.Tensor

Input image tensor of shape (B, C, H, W), where B is batch size, C is number of channels, and H, W are image dimensions.

Returns

torch.Tensor

Encoded features of shape (B, N, D), where N is the number of patches (plus any prefix tokens) and D is the embedding dimension.

Parameters:

x (torch.Tensor)

Return type:

torch.Tensor
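
The sequence length N in the output shape can be worked out from the constructor arguments. The sketch below assumes square images and patches, an image size divisible by the patch size, and that the prefix tokens are the class token (if enabled) plus the register tokens, which matches the default attribute values num_prefix_tokens = 1 and num_reg_tokens = 0. The function name sequence_length is hypothetical.

```python
def sequence_length(
    img_size: int,
    patch_size: int,
    class_token: bool = True,
    reg_tokens: int = 0,
) -> int:
    """Hypothetical helper: number of tokens N in the (B, N, D) output."""
    # Assumption: square inputs and an image size divisible by the patch size.
    num_patches = (img_size // patch_size) ** 2
    # Assumption: prefix tokens are the class token plus any register tokens.
    num_prefix_tokens = (1 if class_token else 0) + reg_tokens
    return num_patches + num_prefix_tokens


print(sequence_length(224, 16))
print(sequence_length(224, 16, class_token=False, reg_tokens=4))
```

With the default arguments this gives 196 patch tokens plus one class token, so a batch of B images would yield features of shape (B, 197, 768).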

grad_checkpointing = False
has_class_token = True
in_channels = 3
init_weights(mode='')
Parameters:

mode (str)

Return type:

None

no_embed_class = False
norm_pre
num_prefix_tokens = 1
num_reg_tokens = 0
patch_embed
patch_size
pos_drop
reg_token
set_input_size(img_size=None, patch_size=None)

Update the input image resolution and patch size.

Args:

  • img_size: New input resolution. If None, the current resolution is kept.

  • patch_size: New patch size. If None, the existing patch size is kept.

Parameters:
  • img_size (Optional[Tuple[int, int]])

  • patch_size (Optional[Tuple[int, int]])

Parameters:
  • img_size (Union[int, Tuple[int, int]])

  • patch_size (Union[int, Tuple[int, int]])

  • in_chans (int)

  • embed_dim (int)

  • depth (int)

  • num_heads (int)

  • mlp_ratio (float)

  • qkv_bias (bool)

  • qk_norm (bool)

  • proj_bias (bool)

  • init_values (Optional[float])

  • class_token (bool)

  • pos_embed (str)

  • no_embed_class (bool)

  • reg_tokens (int)

  • pre_norm (bool)

  • dynamic_img_size (bool)

  • dynamic_img_pad (bool)

  • pos_drop_rate (float)

  • patch_drop_rate (float)

  • proj_drop_rate (float)

  • attn_drop_rate (float)

  • drop_path_rate (float)

  • weight_init (Literal['skip', 'jax', 'jax_nlhb', 'moco', ''])

  • fix_init (bool)

  • embed_norm_layer (Optional[timm.layers.LayerType])

  • norm_layer (Optional[timm.layers.LayerType])

  • act_layer (Optional[timm.layers.LayerType])

  • block_fn (Type[torch.nn.Module])

  • mlp_layer (Type[torch.nn.Module])