Abstract
Identifying fish species from underwater imagery is challenging because water clarity, lighting, and background complexity vary widely, and models often fail when applied beyond their original training domains. This study addresses these challenges using the detection of saithe (Pollachius virens) and pollock (Pollachius pollachius) as a representative case, and applies an ensemble-based approach to improve generalization across diverse aquatic environments. To diversify domain coverage, copy-paste augmentation is first applied: cropped fish are inserted into varied seafloor backgrounds, expanding two public sources and two in-house collections to 4002 images. Two YOLO backbones (YOLOv8m and YOLOv12m) are then trained, and separate model-level ensembles are created for each using Stochastic Weight Averaging, which averages multiple stochastic gradient descent solutions to reach flatter optima, and Fast Geometric Ensembling, which generates diverse model solutions along training trajectories. At inference time, Test-Time Augmentation is fused with Monte-Carlo Dropout to form a single prediction-level ensemble that averages outputs from geometrically transformed inputs while sampling the network stochastically to capture uncertainty. Finally, the predictions of the YOLOv8m and YOLOv12m ensembles are aggregated into single predictions. The experiments show that ensembling YOLOv12 and YOLOv8 improves recall by 2.9% and precision by 2.5% in multi-class classification, and recall by 4.3% and precision by 0.87% in single-class classification. The models are also evaluated on an unseen underwater environment to assess generalization performance. These findings demonstrate the effectiveness of combining ensemble-based strategies with data augmentation to enhance robustness in real-world marine biodiversity monitoring systems.
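As an illustration of the weight-averaging idea behind Stochastic Weight Averaging, a minimal sketch follows. It assumes that checkpoints are available as dictionaries mapping parameter names to NumPy arrays; the function name `average_checkpoints` and the toy values are hypothetical and are not taken from the study's code. The sketch covers only the averaging step and omits the learning-rate schedule and the recomputation of batch-normalization statistics that a full SWA procedure would include.

```python
import numpy as np


def average_checkpoints(checkpoints):
    """Return the element-wise mean of a list of weight dictionaries.

    Each checkpoint is assumed (for illustration) to be a dict that maps
    parameter names to arrays of identical shape across checkpoints.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")

    averaged = {}
    for name in checkpoints[0]:
        # Stack the same parameter from every checkpoint and take the mean.
        stacked = [np.asarray(ckpt[name], dtype=np.float64) for ckpt in checkpoints]
        averaged[name] = np.mean(stacked, axis=0)
    return averaged


# Toy usage with two hypothetical checkpoints of a single parameter "w".
toy_checkpoints = [
    {"w": np.array([1.0, 2.0])},
    {"w": np.array([3.0, 4.0])},
]
swa_weights = average_checkpoints(toy_checkpoints)
```

Averaging in weight space yields a single model, so inference cost stays the same as for one network, in contrast to prediction-level ensembles that run several forward passes.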