minerva.models.nets.image.vit_local.patch_embed

Classes

PatchEmbed

2D Image to Patch Embedding

Functions

resample_patch_embed(patch_embed, new_size[, ...])

Resample the weights of the patch embedding kernel to the target resolution.

Module Contents

class minerva.models.nets.image.vit_local.patch_embed.PatchEmbed(img_size=224, patch_size=16, in_chans=3, embed_dim=768, norm_layer=None, flatten=True, output_fmt=None, bias=True, strict_img_size=True, dynamic_img_pad=False)

Bases: torch.nn.Module

2D Image to Patch Embedding

Initialize the PatchEmbed module.

Parameters

img_size : int or Tuple[int, int], default=224

Input image size. If None, image size will be inferred dynamically.

patch_size : int or Tuple[int, int], default=16

Size of each image patch.

in_chans : int, default=3

Number of input channels (e.g., 3 for RGB images).

embed_dim : int, default=768

Dimension of the output patch embeddings.

norm_layer : Callable, optional

Normalization layer applied to the output embeddings.

flatten : bool, default=True

If True, flattens patches into a sequence (N, L, C).

output_fmt : str, optional

Output tensor format. If specified, overrides flatten.

bias : bool, default=True

Whether to include a bias term in the projection layer.

strict_img_size : bool, default=True

If True, enforces input images to match the specified size exactly.

dynamic_img_pad : bool, default=False

If True, applies dynamic padding for images not divisible by patch size.
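As a rough illustration of dynamic_img_pad, the sketch below computes how much padding would bring an image up to the next multiple of the patch size. It assumes the same arithmetic as timm's PatchEmbed, which this module appears to be modelled on; pad_to_patch_multiple is a hypothetical helper and not part of this module, and where the padding is applied (which edges) is not shown.

```python
from typing import Tuple


def pad_to_patch_multiple(
    height: int, width: int, patch_size: Tuple[int, int]
) -> Tuple[int, int]:
    """Return (pad_h, pad_w) needed to make each side divisible by the patch size.

    Hypothetical helper for illustration only.
    """
    patch_h, patch_w = patch_size
    # The outer modulo maps "already divisible" to 0 instead of a full patch.
    pad_h = (patch_h - height % patch_h) % patch_h
    pad_w = (patch_w - width % patch_w) % patch_w
    return pad_h, pad_w


# A 225 x 224 image with 16 x 16 patches needs 15 extra rows and no extra columns.
print(pad_to_patch_multiple(225, 224, (16, 16)))  # (15, 0)
```

With strict_img_size=True, an input such as 225 x 224 would instead be rejected when the module was built for 224 x 224, so padding only matters when the strict check is disabled.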

_init_img_size(img_size)
Parameters:

img_size (Union[int, Tuple[int, int]])

dynamic_feat_size(img_size)

Get the grid (feature map) size for a given image size, taking dynamic padding into account. Note: this method must be TorchScript-compatible, so it uses fixed tuple indexing.

Parameters:

img_size (Tuple[int, int])

Return type:

Tuple[int, int]
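The grid size can be sketched with plain integer arithmetic. This assumes the rule used by timm's PatchEmbed: round up when dynamic padding is enabled, round down otherwise. grid_size is a hypothetical stand-alone function, not the method itself, and patch_size is passed explicitly here rather than read from the module.

```python
import math
from typing import Tuple


def grid_size(
    img_size: Tuple[int, int],
    patch_size: Tuple[int, int],
    dynamic_img_pad: bool = False,
) -> Tuple[int, int]:
    """Number of patches along (height, width) for a given image size."""
    if dynamic_img_pad:
        # Padding fills the last partial patch, so partial patches count.
        return (
            math.ceil(img_size[0] / patch_size[0]),
            math.ceil(img_size[1] / patch_size[1]),
        )
    # Without padding, only whole patches fit.
    return img_size[0] // patch_size[0], img_size[1] // patch_size[1]


print(grid_size((224, 224), (16, 16)))                        # (14, 14)
print(grid_size((225, 224), (16, 16), dynamic_img_pad=True))  # (15, 14)
```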

dynamic_img_pad: torch.jit.Final[bool]
feat_ratio(as_scalar=True)
Return type:

Union[Tuple[int, int], int]

forward(x)

Forward pass that converts an input image into patch embeddings.

Parameters

x : torch.Tensor

Input tensor of shape (B, C, H, W), where B is batch size, C is number of channels, and H, W are spatial dimensions.

Returns

torch.Tensor

Patch embeddings tensor. Shape depends on the output format:

  • If flatten=True: (B, num_patches, embed_dim)

  • If flatten=False and output_fmt='NCHW': (B, embed_dim, H_p, W_p)

  • If another output format is used: the tensor is converted accordingly.

Parameters:

x (torch.Tensor)
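The output shapes for forward can be checked with a small shape calculator. This is a sketch under the assumption that the projection tiles the image into non-overlapping patches (as a convolution with kernel and stride equal to patch_size would) and that the input sides are divisible by the patch size; patch_embed_output_shape is a hypothetical helper that only does the arithmetic and performs no tensor computation.

```python
from typing import Tuple


def patch_embed_output_shape(
    input_shape: Tuple[int, int, int, int],
    patch_size: Tuple[int, int] = (16, 16),
    embed_dim: int = 768,
    flatten: bool = True,
) -> Tuple[int, ...]:
    """Expected output shape for an input of shape (B, C, H, W)."""
    batch, _channels, height, width = input_shape
    h_p = height // patch_size[0]  # patches along the height
    w_p = width // patch_size[1]   # patches along the width
    if flatten:
        # Sequence layout: (B, num_patches, embed_dim)
        return (batch, h_p * w_p, embed_dim)
    # Spatial NCHW layout: (B, embed_dim, H_p, W_p)
    return (batch, embed_dim, h_p, w_p)


print(patch_embed_output_shape((2, 3, 224, 224)))                 # (2, 196, 768)
print(patch_embed_output_shape((2, 3, 224, 224), flatten=False))  # (2, 768, 14, 14)
```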

norm
output_fmt: timm.layers.format.Format
patch_size
proj
set_input_size(img_size=None, patch_size=None)
Parameters:
  • img_size (Optional[Union[int, Tuple[int, int]]])

  • patch_size (Optional[Union[int, Tuple[int, int]]])

strict_img_size = True
Parameters:
  • img_size (Union[int, Tuple[int, int]])

  • patch_size (Union[int, Tuple[int, int]])

  • in_chans (int)

  • embed_dim (int)

  • norm_layer (Optional[Callable])

  • flatten (bool)

  • output_fmt (Optional[str])

  • bias (bool)

  • strict_img_size (bool)

  • dynamic_img_pad (bool)

minerva.models.nets.image.vit_local.patch_embed.resample_patch_embed(patch_embed, new_size, interpolation='bicubic', antialias=True)

Resample the weights of the patch embedding kernel to the target resolution. The kernel is resampled by approximately inverting the effect of resizing the patches.

Code based on:

https://github.com/google-research/big_vision/blob/b00544b81f8694488d5f36295aeb7972f3755ffe/big_vision/models/proj/flexi/vit.py

With this resizing we can, for example, load a B/8 filter into a B/16 model; on an input image that is 2x larger, the result will match.

Args:

patch_embed: Original parameter to be resized.

new_size (tuple(int, int)): Target shape (height, width) only.

interpolation (str): Interpolation mode used for the resize.

antialias (bool): Whether to use an anti-aliasing filter in the resize.

Returns:

Resized patch embedding kernel.

Parameters:
  • patch_embed (torch.nn.Parameter)

  • new_size (List[int])

  • interpolation (str)

  • antialias (bool)
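The idea of "approximately inverting the effect of patch resizing" can be illustrated in one dimension. The sketch below is not this function's implementation: it swaps in linear interpolation for bicubic, handles only a 2-tap kernel, and uses exact fractions instead of tensors. All names in it are hypothetical. If R is the matrix that resizes a patch, the new kernel is chosen as w' = pinv(R^T) w, so that the response of w' on a resized patch equals the response of w on the original patch.

```python
from fractions import Fraction


def linear_resize_matrix(old, new):
    """Matrix R (new x old) that upsamples a 1D signal by linear interpolation.

    Requires old >= 2 and new >= 2. Stand-in for the bicubic image resize.
    """
    rows = []
    for i in range(new):
        pos = Fraction(i * (old - 1), new - 1)  # sample position in old coordinates
        lo = min(int(pos), old - 2)             # index of the left neighbour
        frac = pos - lo                         # weight of the right neighbour
        row = [Fraction(0)] * old
        row[lo] = 1 - frac
        row[lo + 1] = frac
        rows.append(row)
    return rows


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def resample_kernel_1d(w, new):
    """Return w' = R (R^T R)^{-1} w, which equals pinv(R^T) @ w when R has
    full column rank. Sketch limited to a 2-tap kernel so the 2x2 inverse
    can be written out by hand."""
    assert len(w) == 2, "sketch only handles a 2-tap kernel"
    R = linear_resize_matrix(2, new)
    # Gram matrix G = R^T R = [[a, b], [b, d]]
    a = sum(r[0] * r[0] for r in R)
    b = sum(r[0] * r[1] for r in R)
    d = sum(r[1] * r[1] for r in R)
    det = a * d - b * b
    # v = G^{-1} w
    v0 = (d * w[0] - b * w[1]) / det
    v1 = (-b * w[0] + a * w[1]) / det
    return matvec(R, [v0, v1])


w = [Fraction(3), Fraction(-1)]   # original 2-tap kernel
x = [Fraction(2), Fraction(5)]    # original 2-sample patch
R = linear_resize_matrix(2, 4)
w_new = resample_kernel_1d(w, 4)  # 4-tap kernel for the 2x larger patch

# The resampled kernel on the upsampled patch matches the original response.
assert dot(matvec(R, x), w_new) == dot(x, w)
```

This mirrors the B/8 to B/16 example in miniature: the kernel doubles in size, the input is upsampled by the same factor, and the output is unchanged. The match is exact here because upsampling loses no information; when the kernel is shrunk instead, the inversion can only be approximate.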