minerva.models.nets.image.vit_local.patch_embed

Classes

PatchEmbed

2D Image to Patch Embedding

Functions

resample_patch_embed(patch_embed, new_size[, ...])

Resample the weights of the patch embedding kernel to target resolution.

Module Contents

class minerva.models.nets.image.vit_local.patch_embed.PatchEmbed(img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, output_fmt=None, bias=True, strict_img_size=True, dynamic_img_pad=False)

Bases: torch.nn.Module

2D Image to Patch Embedding

Initialize the PatchEmbed module.

Parameters

img_size (int or Tuple[int, int], default=224): Input image size. If None, the image size is inferred dynamically.
patch_size (int or Tuple[int, int], default=16): Size of each image patch.
in_chans (int, default=3): Number of input channels (e.g., 3 for RGB images).
embed_dim (int, default=768): Dimension of the output patch embeddings.
norm_layer (Callable, optional): Normalization layer applied to the output embeddings.
flatten (bool, default=True): If True, flattens patches into a sequence (N, L, C).
output_fmt (str, optional): Output tensor format. If specified, overrides flatten.
bias (bool, default=True): Whether to include a bias term in the projection layer.
strict_img_size (bool, default=True): If True, requires input images to match the specified size exactly.
dynamic_img_pad (bool, default=False): If True, pads input images whose height or width is not divisible by the patch size.
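To make the relationship between img_size, patch_size, and the number of patches concrete, here is a minimal pure-Python sketch. The helpers to_2tuple and patch_grid are hypothetical names used only for illustration and are not part of this module; the sketch assumes non-overlapping patches and no dynamic padding.

```python
def to_2tuple(value):
    """Expand an int to (value, value); pass a 2-element sequence through."""
    if isinstance(value, int):
        return (value, value)
    return tuple(value)


def patch_grid(img_size=224, patch_size=16):
    """Return (grid_size, num_patches) for non-overlapping patches.

    Illustrative helper, not the library's implementation.
    """
    img_h, img_w = to_2tuple(img_size)
    patch_h, patch_w = to_2tuple(patch_size)
    # Each grid axis counts how many whole patches fit along that axis.
    grid = (img_h // patch_h, img_w // patch_w)
    return grid, grid[0] * grid[1]


grid, num_patches = patch_grid()  # defaults: 224x224 image, 16x16 patches
print(grid, num_patches)          # (14, 14) 196

rect_grid, rect_patches = patch_grid(img_size=(224, 160), patch_size=16)
print(rect_grid, rect_patches)    # (14, 10) 140
```

With the default arguments, each image yields 196 patch tokens, each of dimension embed_dim.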

_init_img_size(img_size)

Parameters: img_size (Union[int, Tuple[int, int]])

dynamic_feat_size(img_size)

Get the grid (feature) size for a given image size, taking dynamic padding into account. Note: this method must be TorchScript compatible, so it uses fixed tuple indexing.

Parameters: img_size (Tuple[int, int])
Return type: Tuple[int, int]
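The grid-size computation described above can be sketched as a standalone function. This is an assumption about the behavior, not the module's source: it supposes that dynamic padding rounds each axis up to the next whole patch, while the unpadded case rounds down. Here patch_size and dynamic_img_pad are passed explicitly instead of being read from the module.

```python
import math
from typing import Tuple


def dynamic_feat_size(
    img_size: Tuple[int, int],
    patch_size: Tuple[int, int],
    dynamic_img_pad: bool,
) -> Tuple[int, int]:
    """Sketch of the grid size for an image, with or without padding."""
    if dynamic_img_pad:
        # Padding fills the last partial patch, so round up.
        return (
            math.ceil(img_size[0] / patch_size[0]),
            math.ceil(img_size[1] / patch_size[1]),
        )
    # Without padding, only whole patches count, so round down.
    return (img_size[0] // patch_size[0], img_size[1] // patch_size[1])


print(dynamic_feat_size((230, 224), (16, 16), dynamic_img_pad=True))   # (15, 14)
print(dynamic_feat_size((230, 224), (16, 16), dynamic_img_pad=False))  # (14, 14)
```

Indexing the tuples with fixed positions (0 and 1) instead of unpacking in a loop mirrors the TorchScript note above.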

dynamic_img_pad: torch.jit.Final[bool]

feat_ratio(as_scalar=True)

Return type: Union[Tuple[int, int], int]

forward(x)

Forward pass that converts an input image into patch embeddings.

Parameters

x (torch.Tensor): Input tensor of shape (B, C, H, W), where B is the batch size, C is the number of channels, and H, W are the spatial dimensions.

Returns

torch.Tensor: Patch embeddings tensor. The shape depends on the output format:
- If flatten=True: (B, num_patches, embed_dim)
- If flatten=False and output_fmt='NCHW': (B, embed_dim, H_p, W_p)
- If using another output format: the tensor is converted accordingly.
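The output shapes listed above can be traced with a small shape-only sketch. It does no tensor math; patch_embed_output_shape is a hypothetical helper, and the padding rule (pad each axis up to the next multiple of the patch size) is an assumption about how dynamic_img_pad behaves.

```python
def patch_embed_output_shape(
    x_shape,
    patch_size=(16, 16),
    embed_dim=768,
    flatten=True,
    dynamic_img_pad=False,
):
    """Return the expected output shape for an input of shape (B, C, H, W)."""
    batch, _channels, height, width = x_shape
    patch_h, patch_w = patch_size
    if dynamic_img_pad:
        # Assumed rule: pad up to the next multiple of the patch size.
        height += (patch_h - height % patch_h) % patch_h
        width += (patch_w - width % patch_w) % patch_w
    grid_h, grid_w = height // patch_h, width // patch_w
    if flatten:
        # Sequence layout: one token per patch.
        return (batch, grid_h * grid_w, embed_dim)
    # Spatial layout, as for output_fmt='NCHW'.
    return (batch, embed_dim, grid_h, grid_w)


print(patch_embed_output_shape((2, 3, 224, 224)))                 # (2, 196, 768)
print(patch_embed_output_shape((2, 3, 224, 224), flatten=False))  # (2, 768, 14, 14)
print(patch_embed_output_shape((1, 3, 230, 224), dynamic_img_pad=True))  # (1, 210, 768)
```

In the last call, the height 230 is padded to 240, giving a 15 x 14 grid and 210 tokens.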

norm

output_fmt: timm.layers.format.Format

patch_size

proj

set_input_size(img_size=None, patch_size=None)

Parameters:

img_size (Optional[Union[int, Tuple[int, int]]])
patch_size (Optional[Union[int, Tuple[int, int]]])

strict_img_size = True

Parameters:

img_size (Union[int, Tuple[int, int]])
patch_size (Union[int, Tuple[int, int]])
in_chans (int)
embed_dim (int)
norm_layer (Optional[Callable])
flatten (bool)
output_fmt (Optional[str])
bias (bool)
strict_img_size (bool)
dynamic_img_pad (bool)

minerva.models.nets.image.vit_local.patch_embed.resample_patch_embed(patch_embed, new_size, interpolation='bicubic', antialias=True)

Resample the weights of the patch embedding kernel to a target resolution. The kernel is resampled by approximately inverting the effect of patch resizing.

Code based on: https://github.com/google-research/big_vision/blob/b00544b81f8694488d5f36295aeb7972f3755ffe/big_vision/models/proj/flexi/vit.py

With this resizing we can, for example, load a B/8 filter into a B/16 model and, on a 2x larger input image, the result will match.

Args:
patch_embed: original parameter to be resized.
new_size (tuple(int, int)): target shape (height, width) only.
interpolation (str): interpolation mode for the resize.
antialias (bool): use an anti-aliasing filter in the resize.

Returns: Resized patch embedding kernel.

Parameters:

patch_embed (torch.nn.Parameter)
new_size (List[int])
interpolation (str)
antialias (bool)
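The idea of "inverting the effect of patch resizing" can be illustrated in one dimension with plain Python. This is a simplified sketch, not this function's implementation: it uses 1-D linear interpolation instead of 2-D bicubic resizing with anti-aliasing, and the helpers resize_1d, solve, and resample_kernel_1d are hypothetical. The resize is written as a matrix R, and the new kernel is pinv(R^T) applied to the old kernel, computed here as R (R^T R)^-1 w, which is valid when R has full column rank.

```python
def resize_1d(values, new_len):
    """Linearly interpolate a list to new_len samples (end points aligned)."""
    old_len = len(values)
    if new_len == 1 or old_len == 1:
        return [values[0]] * new_len
    out = []
    for i in range(new_len):
        t = i * (old_len - 1) / (new_len - 1)
        lo = min(int(t), old_len - 2)
        frac = t - lo
        out.append((1 - frac) * values[lo] + frac * values[lo + 1])
    return out


def solve(a, b):
    """Solve a @ y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def resample_kernel_1d(kernel, new_len):
    """Resample a 1-D kernel so that it acts on resized patches."""
    old_len = len(kernel)
    # Column j of the resize matrix R is the resized j-th basis vector.
    cols = [
        resize_1d([1.0 if k == j else 0.0 for k in range(old_len)], new_len)
        for j in range(old_len)
    ]
    R = [[cols[j][i] for j in range(old_len)] for i in range(new_len)]
    # Gram matrix G = R^T R, then new_kernel = R @ G^-1 @ kernel.
    G = [
        [sum(R[i][a] * R[i][b] for i in range(new_len)) for b in range(old_len)]
        for a in range(old_len)
    ]
    y = solve(G, kernel)
    return [sum(R[i][j] * y[j] for j in range(old_len)) for i in range(new_len)]


kernel = [0.5, -1.0]
patch = [2.0, 3.0]
new_kernel = resample_kernel_1d(kernel, 4)
big_patch = resize_1d(patch, 4)

dot_old = sum(k * p for k, p in zip(kernel, patch))
dot_new = sum(k * p for k, p in zip(new_kernel, big_patch))
print(dot_old, dot_new)  # the two values agree up to floating-point error
```

In this upsampling case, the resampled kernel applied to the resized patch reproduces the response of the original kernel on the original patch, which is the 1-D analogue of the B/8 to B/16 statement above.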