Created for IWU Portfolio (Artifact 4)

Artifact 4: Building a Biological Anomaly Detection World Model

Audience

This artifact is intended for technical reviewers, AI/ML peers, and domain experts interested in the application of computer vision to bioinformatics. It demonstrates my ability to design, train, and evaluate self-supervised learning pipelines on raw scientific data.

Artifact Overview

An architectural proof-of-concept for detecting biological anomalies in C. elegans worms using the WormSwin dataset. The project evaluates different Vision Transformer (ViT) backbones for extracting latent features and building a temporal world model.

Why it matters

This demonstrates advanced ML competencies including dataset parsing (COCO annotations), managing hardware constraints, comparing neural network architectures (DINOv2 vs ViT), extracting deep embeddings, and temporal sequential modeling with LSTMs.


1. The Dataset (WormSwin)

We leveraged the open-source WormSwin dataset (DOI: 10.5281/zenodo.7456803), a robust instance segmentation dataset for C. elegans containing three primary splits (~13 GB uncompressed). For anomaly detection, we grouped the worms into two classes:

  • Healthy Baseline: Wild-type and non-irradiated worms.
  • Anomalous Targets: Mutant or UV-irradiated worms.

[Figure: Dataset split sizes (GB)]

2. Efficient Data Pipeline (COCO Annotations)

Instead of feeding full 912x736 video frames, complete with noisy petri dish backgrounds, into our Vision Transformer, we used the COCO bounding boxes to dynamically crop fixed 128x128 px patches that isolate each worm. This sharply reduced the computational cost.
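The cropping step can be sketched as follows. This is a minimal illustration assuming a standard COCO annotations file; the function name, file paths, and center-crop-with-clamping strategy are illustrative choices, not the exact implementation.

```python
import json
from PIL import Image

CROP_SIZE = 128  # fixed patch size used throughout the pipeline

def crop_worm_patches(coco_json_path, image_dir):
    """Yield fixed-size crops centered on each COCO bounding box."""
    with open(coco_json_path) as f:
        coco = json.load(f)
    # Map image ids to file names so annotations can be resolved to frames.
    images = {img["id"]: img["file_name"] for img in coco["images"]}
    for ann in coco["annotations"]:
        x, y, w, h = ann["bbox"]           # COCO bbox format: [x, y, width, height]
        cx, cy = x + w / 2, y + h / 2      # center the crop on the worm
        frame = Image.open(f"{image_dir}/{images[ann['image_id']]}")
        left = int(cx - CROP_SIZE / 2)
        top = int(cy - CROP_SIZE / 2)
        # Clamp so the crop never runs past the frame boundary.
        left = max(0, min(left, frame.width - CROP_SIZE))
        top = max(0, min(top, frame.height - CROP_SIZE))
        yield frame.crop((left, top, left + CROP_SIZE, top + CROP_SIZE))
```

Centering on the box (rather than resizing the box itself) keeps every patch at a uniform scale, so the Vision Transformer never sees distorted worm geometry.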

3. Early World Model Baseline Results

We successfully ported the dataset and architecture to an RTX 4090 virtual machine to bypass local CPU bottlenecks.

  • A frozen vision-transformer extracted 768-dimensional latent embeddings for each bounding-box cropped worm.
  • A lightweight PyTorch Autoencoder was trained exclusively on the embeddings from the 6,987 healthy wild-type worms.
  • The Autoencoder then evaluated 60,571 mixed test embeddings to calculate reconstruction errors (MSE).
Result: The pipeline achieved a baseline AUROC of 0.808, i.e. an 80.8% probability that a randomly chosen anomalous worm receives a higher reconstruction error than a randomly chosen healthy worm.
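The embedding autoencoder and its MSE-based anomaly score can be sketched as below. Layer widths and the bottleneck size are illustrative assumptions, not the exact configuration used in the experiment.

```python
import torch
import torch.nn as nn

class EmbeddingAutoencoder(nn.Module):
    """Compress 768-d ViT embeddings through a bottleneck and reconstruct them.

    Trained only on healthy embeddings, so anomalous embeddings reconstruct poorly.
    """
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, bottleneck))
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 256), nn.ReLU(), nn.Linear(256, dim))

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_scores(model, embeddings):
    """Per-sample reconstruction MSE: higher error -> more likely anomalous."""
    model.eval()
    with torch.no_grad():
        recon = model(embeddings)
        return ((recon - embeddings) ** 2).mean(dim=1)
```

Because the autoencoder operates on 768-d vectors rather than raw pixels, training over tens of thousands of samples takes seconds rather than hours.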

4. Architectural Sweep (DINOv2 Superiority)

We ran a fully automated sweep, executing the entire Phase 1 pipeline across all 60,571 images for each of several foundation vision models.

  • facebook/dinov2-small: 0.860 AUROC 🏆
  • facebook/dinov2-base: 0.838 AUROC
  • google/vit-base-patch16-224: 0.756 AUROC
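The AUROC used to rank backbones is equivalent to the Mann-Whitney U statistic, which can be computed directly from the rank order of reconstruction errors. Below is a minimal sketch; `run_sweep` and `score_fn` are hypothetical stand-ins for the full embed-then-score pipeline, not the actual sweep harness.

```python
import numpy as np

def auroc(healthy_scores, anomalous_scores):
    """Rank-based AUROC: probability that a random anomalous score
    exceeds a random healthy score (assumes no tied scores)."""
    scores = np.concatenate([healthy_scores, anomalous_scores])
    ranks = scores.argsort().argsort() + 1      # 1-based ranks
    n_h, n_a = len(healthy_scores), len(anomalous_scores)
    rank_sum = ranks[n_h:].sum()                # ranks of anomalous samples
    return (rank_sum - n_a * (n_a + 1) / 2) / (n_h * n_a)

def run_sweep(backbones, score_fn, labels):
    """Score every image with each backbone's pipeline and report AUROC.

    score_fn(name) is a placeholder returning per-image reconstruction errors.
    """
    labels = np.asarray(labels)
    results = {}
    for name in backbones:
        scores = np.asarray(score_fn(name))
        results[name] = auroc(scores[labels == 0], scores[labels == 1])
    return results
```

Computing AUROC from ranks avoids choosing any single error threshold, which is what makes it a fair yardstick across backbones whose error scales differ.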

[Figure: AUROC multi-model performance comparison]

Conclusion: Meta's self-supervised DINOv2 architecture yields denser, more statistically separable anomaly clusters for microscopic biological structures than standard supervised ImageNet-trained Vision Transformers.

5. Temporal Video Model Execution

We successfully migrated the codebase from static 2D image analysis into a sequential video representation to build the full Temporal Phase 2 World Model.

  • High-resolution 3D spatio-temporal tracking containers were loaded from the Zenodo OpenWorm tracking API.
  • A PyTorch LSTM network was trained exclusively on healthy movement trajectories (frames 0-7) to predict future movement vectors (frame 15).
  • Anomalous tracking records produced consistently elevated prediction errors relative to the healthy baseline.
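The temporal predictor described above can be sketched as a small LSTM that reads an 8-frame track and regresses the frame-15 coordinates. The class name, hidden size, and 2-d (x, y) input are illustrative assumptions; the actual model may use richer per-frame features.

```python
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    """Read an 8-frame (x, y) track and predict the worm's position at frame 15."""
    def __init__(self, input_dim=2, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, input_dim)

    def forward(self, tracks):               # tracks: (batch, 8, input_dim)
        _, (h_n, _) = self.lstm(tracks)      # final hidden state summarizes motion
        return self.head(h_n[-1])            # (batch, input_dim) predicted position

def trajectory_anomaly_score(model, tracks, targets):
    """Per-track prediction MSE; anomalous movement yields larger errors."""
    model.eval()
    with torch.no_grad():
        return ((model(tracks) - targets) ** 2).mean(dim=1)
```

As with the static autoencoder, the model is trained only on healthy trajectories, so its prediction error doubles as the anomaly score.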

Reflection

This project pushed my understanding of how foundation computer vision models adapt to specialized non-human datasets. It highlighted the power of self-supervision (DINOv2) over standard supervised ImageNet pretraining when dealing with microscopic biological morphology. The shift from purely spatial analysis to a recurrent temporal pipeline (LSTM) also emphasized the importance of robust data pipelining. If I were to iterate further, I would transition the temporal sequence predictor from an LSTM to a scalable decoder-only transformer to better capture long-range movement dependencies.