Combine multiple images and annotations into one image via a mosaic grid. Additional images are supplied through metadata; commonly used in object detection training.
Mosaic creates a grid of images by placing the primary image and additional images from metadata into cells of a larger canvas, then crops a region to produce the final output. This is commonly used in object detection training to increase data diversity and help models learn to detect objects at different scales and contexts.
The transform takes a primary input image (and its annotations) and combines it with additional images/annotations provided via metadata. It calculates the geometry for a mosaic grid, selects additional items, preprocesses annotations consistently (handling label encoding updates), applies geometric transformations, and assembles the final output.
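The assembly idea can be illustrated with a simplified numpy sketch (this is not the library's internal code; cell sizes, the 2x2 layout, and the center-sampling range are assumptions mirroring the defaults below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Four equally sized images are tiled into a 2x2 canvas, then a
# target-size window is cropped around a randomly sampled center.
cell_h, cell_w = 120, 120
target_h, target_w = 200, 200

images = [rng.integers(0, 256, (cell_h, cell_w, 3), dtype=np.uint8) for _ in range(4)]

# Assemble the 2x2 grid canvas.
canvas = np.zeros((2 * cell_h, 2 * cell_w, 3), dtype=np.uint8)
for idx, img in enumerate(images):
    r, c = divmod(idx, 2)
    canvas[r * cell_h:(r + 1) * cell_h, c * cell_w:(c + 1) * cell_w] = img

# Sample the crop center within the central region of the canvas
# (mirroring the role of center_range), then clamp so the crop fits.
cy = int(rng.uniform(0.4, 0.6) * canvas.shape[0])
cx = int(rng.uniform(0.4, 0.6) * canvas.shape[1])
y0 = min(max(cy - target_h // 2, 0), canvas.shape[0] - target_h)
x0 = min(max(cx - target_w // 2, 0), canvas.shape[1] - target_w)
mosaic = canvas[y0:y0 + target_h, x0:x0 + target_w]
```

The same crop offsets would then be applied to bboxes, keypoints, and masks so all targets stay aligned with the cropped view.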
grid_yx: The number of rows (y) and columns (x) in the mosaic grid. Determines the maximum number of images involved (grid_yx[0] * grid_yx[1]). Default: (2, 2).
target_size: The desired output (height, width) of the final mosaic image after cropping the assembled grid.
cell_shape: The (height, width) of each cell in the mosaic grid.
fit_mode: How to fit images into mosaic cells ("cover" or "contain").
metadata_key: Key in the input dictionary specifying the list of additional data dictionaries
for the mosaic. Each dictionary in the list should represent one potential additional item.
Expected keys: 'image' (required, np.ndarray), and optionally 'mask' (np.ndarray),
'masks' (np.ndarray, stacked instance masks), 'bboxes' (np.ndarray), 'keypoints' (np.ndarray),
and label fields supplied via the bbox_labels and keypoint_labels wrapper dicts
(see Metadata Format below). Default: "mosaic_metadata".
center_range: Range [0.0-1.0] to sample the center point of the mosaic view relative to the valid central region of the conceptual large grid. This affects which parts of the assembled grid are visible in the final crop. Default: (0.3, 0.7).
interpolation: OpenCV interpolation flag used for resizing images during geometric processing. Default: cv2.INTER_LINEAR.
mask_interpolation: OpenCV interpolation flag used for resizing masks during geometric processing. Default: cv2.INTER_NEAREST.
fill: Value used for padding images if needed during geometric processing. Default: 0.
fill_mask: Value used for padding masks if needed during geometric processing. Default: 0.
p: Probability of applying the transform. Default: 0.5.
>>> import numpy as np
>>> import albumentations as A
>>> import cv2
>>>
>>> # Prepare primary data
>>> primary_image = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)
>>> primary_mask = np.random.randint(0, 2, (100, 100), dtype=np.uint8)
>>> primary_bboxes = np.array([[10, 10, 40, 40], [50, 50, 90, 90]], dtype=np.float32)
>>> primary_bbox_classes = [1, 2]
>>> primary_keypoints = np.array([[25, 25], [75, 75]], dtype=np.float32)
>>> primary_keypoint_classes = ['eye', 'nose']
>>>
>>> # Prepare additional images for mosaic.
>>> # bbox_labels and keypoint_labels are dicts mapping field name -> list of values.
>>> mosaic_metadata = [
... {
... 'image': np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8),
... 'mask': np.random.randint(0, 2, (100, 100), dtype=np.uint8),
... 'bboxes': np.array([[20, 20, 60, 60]], dtype=np.float32),
... 'bbox_labels': {'bbox_classes': [3]},
... 'keypoints': np.array([[40, 40]], dtype=np.float32),
... 'keypoint_labels': {'keypoint_classes': ['mouth']},
... },
... {
... 'image': np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8),
... 'mask': np.random.randint(0, 2, (100, 100), dtype=np.uint8),
... 'bboxes': np.array([[30, 30, 70, 70]], dtype=np.float32),
... 'bbox_labels': {'bbox_classes': [4]},
... 'keypoints': np.array([[50, 50], [65, 65]], dtype=np.float32),
... 'keypoint_labels': {'keypoint_classes': ['eye', 'eye']},
... },
... ]
>>>
>>> transform = A.Compose([
... A.Mosaic(
... grid_yx=(2, 2),
... target_size=(200, 200),
... cell_shape=(120, 120),
... center_range=(0.4, 0.6),
... fit_mode="cover",
... p=1.0
... ),
... ], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['bbox_classes']),
... keypoint_params=A.KeypointParams(format='xy', label_fields=['keypoint_classes']))
>>>
>>> transformed = transform(
... image=primary_image,
... mask=primary_mask,
... bboxes=primary_bboxes,
... bbox_classes=primary_bbox_classes,
... keypoints=primary_keypoints,
... keypoint_classes=primary_keypoint_classes,
... mosaic_metadata=mosaic_metadata,
... )
>>>
>>> mosaic_image = transformed['image']
>>> mosaic_bboxes = transformed['bboxes']
>>> mosaic_bbox_classes = transformed['bbox_classes']
>>> mosaic_keypoint_classes = transformed['keypoint_classes']

If fewer additional images are provided than needed to fill the grid, the primary image
will be replicated to fill the remaining cells. For example, with a 2x2 grid, if only
one additional image is provided, the mosaic will contain the primary image in two cells
and the additional image in one cell, with one visible cell selected from these three.
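A hedged sketch of that fill behavior (fill_item_pool is a hypothetical helper for illustration, not part of the library API):

```python
def fill_item_pool(primary, additional, num_cells):
    """Return num_cells items, padding with copies of the primary item
    when not enough additional items are available."""
    pool = [primary] + list(additional[:num_cells - 1])
    while len(pool) < num_cells:
        pool.append(primary)  # replicate the primary item to fill the grid
    return pool

pool = fill_item_pool("primary", ["extra_1"], num_cells=4)
# pool -> ["primary", "extra_1", "primary", "primary"]
```

With a 2x2 grid and one additional item, the pool holds the primary item three times and the extra item once; the final crop then shows a subset of those cells.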
Stacked instance masks passed via the masks key (shape (N, H, W)) are transformed through
apply_to_masks like any other DualTransform target; _targets lists only Targets enum values
(there is no Targets.MASKS member).
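In practice this means every mask in the stack receives the same geometric operation as the image. A numpy-only sketch of applying one shared crop to a mask stack (the offsets here are arbitrary illustration values):

```python
import numpy as np

# Three instance masks stacked along the first axis: shape (N, H, W).
masks = np.zeros((3, 240, 240), dtype=np.uint8)
masks[0, 30:60, 30:60] = 1  # mark one instance region

# The same crop used for the image is applied to every mask in the stack.
y0, x0, th, tw = 20, 20, 200, 200
cropped = masks[:, y0:y0 + th, x0:x0 + tw]
```

Nearest-neighbor interpolation (the mask_interpolation default) keeps mask values discrete when resizing is involved.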