Authors: Akshat Jha (B23336) & Harsh Vardhan Sharma (B23373)
Institute: Indian Institute of Technology (IIT) Mandi
Course: AR–522 Robot Vision
Professor: Dr. Praful Hambarde
This project implements an automated Crowd Surveillance System designed for extremely dense pedestrian environments. By fine-tuning YOLOv8s on the MOT20 dataset and integrating it with BoT-SORT, a SORT-family tracker that adds camera-motion compensation to the association step, we achieve robust detection and tracking even under heavy occlusion.
Unlike standard detectors that fail in high-density scenarios, this system is hyper-specialized for crowd analytics, capable of generating heatmaps, flow trajectories, and dwell-time statistics from raw CCTV footage.
- Dense Crowd Detection: Optimized for scenes with 100+ pedestrians per frame.
- Robust Tracking: Utilizes BoT-SORT to maintain IDs across camera movements and occlusions.
- Automated Analytics: Generates occupancy counts, dwell times, and density heatmaps automatically.
- Privacy-Aware: Focuses on full-body patterns rather than facial recognition.
Initially, we trained a YOLOv8 Nano model on the JHU-Crowd++ dataset to detect heads.
- Limitation: Head bounding boxes are too small and unstable for motion-based trackers; a head box roughly 20 pixels wide that shifts by its own width between frames has zero IoU with its previous position, breaking association.
- Pivot: We switched to Body Detection using MOT20. Bodies offer larger surface areas, ensuring stable Intersection-over-Union (IoU) for the Kalman filters used in tracking.
We utilized the MOT20 benchmark, known for its extreme density (avg. 149 pedestrians/frame) and challenging indoor/outdoor scenarios.
- Training Sequences: MOT20-01, MOT20-02, MOT20-03, MOT20-05
- Validation Sequence: MOT20-04
- Testing Sequences: MOT20-06, MOT20-07, MOT20-08
We developed a custom script to convert the MOTChallenge format (gt.txt) into YOLO format (a sketch follows the list below):
- Filtered out ground-truth entries with confidence < 0.5.
- Normalized coordinates (x_center, y_center, width, height) to the [0,1] range.
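A minimal sketch of such a converter is shown below. It assumes the standard MOTChallenge gt.txt layout (frame, id, bb_left, bb_top, bb_width, bb_height, conf, class, visibility); the frame resolution is passed in explicitly here, though in practice it would be read from each sequence's `seqinfo.ini`.

```python
# mot20_to_yolo.py -- sketch of a MOTChallenge -> YOLO label converter.
from pathlib import Path
from collections import defaultdict

def convert_sequence(gt_path: str, out_dir: str, img_w: int, img_h: int,
                     min_conf: float = 0.5) -> None:
    labels = defaultdict(list)  # frame number -> list of YOLO label lines
    for line in Path(gt_path).read_text().splitlines():
        frame, _, left, top, w, h, conf, *_ = line.split(",")
        if float(conf) < min_conf:  # drop low-confidence annotations
            continue
        # Convert top-left corner + size to normalized center coordinates.
        xc = (float(left) + float(w) / 2) / img_w
        yc = (float(top) + float(h) / 2) / img_h
        labels[int(frame)].append(
            f"0 {xc:.6f} {yc:.6f} {float(w) / img_w:.6f} {float(h) / img_h:.6f}"
        )
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for frame, lines in labels.items():
        # One label file per frame, matching the frame image naming (000001.txt, ...).
        (out / f"{frame:06d}.txt").write_text("\n".join(lines))

# Example: MOT20-01 frames are 1920x1080.
# convert_sequence("MOT20/train/MOT20-01/gt/gt.txt", "labels/MOT20-01", 1920, 1080)
```

Grouping labels by frame number mirrors YOLO's one-label-file-per-image convention.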
- Ingestion: Load MOT20 frames.
- Fine-tuning: Train YOLOv8s (Small) pretrained on COCO.
- Config: Image Size 640, Batch 4, Epochs 50, Optimizer AdamW (see the training sketch after this list).
- Augmentations: Blur, MedianBlur (for motion/noise), ToGray (for low light), CLAHE (for contrast).
- Export: Save weights as `.pt` (PyTorch) and `.onnx` (deployment).
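With the Ultralytics API, the fine-tuning and export steps above reduce to a few calls. This is a sketch rather than the exact training script: `mot20.yaml` is a hypothetical dataset config pointing at the converted images and labels, and the Albumentations transforms listed above (Blur, MedianBlur, ToGray, CLAHE) are applied automatically by Ultralytics when the `albumentations` package is installed.

```python
from ultralytics import YOLO

# Start from COCO-pretrained YOLOv8s and fine-tune on the converted MOT20 data.
model = YOLO("yolov8s.pt")
model.train(
    data="mot20.yaml",   # hypothetical dataset config pointing at the YOLO-format labels
    imgsz=640,
    batch=4,
    epochs=50,
    optimizer="AdamW",
)

# Export for deployment; the .pt checkpoints are saved automatically under runs/.
model.export(format="onnx")
```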
- Input: Inject novel test video (CCTV/Drone/Phone).
- Inference: Run detection + BoT-SORT tracking (see the tracking sketch after this list).
- Spatial Analytics: Generate density heatmaps and velocity flow fields.
- Temporal Analytics: Calculate dwell time per ID and occupancy per frame (see the analytics sketch after this list).
- Event Logic: Monitor zone entries/exits (ROI).
- Visualization: Render annotated video with bounding boxes and trajectory trails.
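The detection-plus-tracking step maps directly onto the Ultralytics tracking API, which ships with a BoT-SORT configuration (`botsort.yaml`). A sketch, with an illustrative weights path:

```python
from ultralytics import YOLO

# Illustrative path to the fine-tuned weights from the training phase.
model = YOLO("runs/detect/train/weights/best.pt")

# stream=True yields results frame by frame instead of buffering the whole video.
for result in model.track(source="test_video.mp4", tracker="botsort.yaml", stream=True):
    if result.boxes.id is None:   # no confirmed tracks on this frame
        continue
    ids = result.boxes.id.int().tolist()
    boxes = result.boxes.xywh.tolist()  # (x_center, y_center, w, h) per track
    # ... feed ids/boxes into the analytics stage (see the next sketch) ...
```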
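Occupancy, dwell time, and the density heatmap then fall out of the per-frame track IDs and boxes. A minimal sketch, assuming a 25 fps video and 1920x1080 frames (both illustrative values):

```python
import numpy as np

FPS = 25.0                  # assumed frame rate of the input video
GRID_H, GRID_W = 135, 240   # heatmap grid: 1080x1920 downsampled by 8 (an assumption)

heat = np.zeros((GRID_H, GRID_W))   # accumulated foot-point density
first_seen, last_seen = {}, {}      # track ID -> frame index
occupancy = []                      # people visible per frame

def update(frame_idx, track_ids, boxes_xywh, frame_w=1920, frame_h=1080):
    """Consume one frame of tracker output: boxes_xywh is (x_center, y_center, w, h) per ID."""
    occupancy.append(len(track_ids))
    for tid, (xc, yc, w, h) in zip(track_ids, boxes_xywh):
        first_seen.setdefault(tid, frame_idx)
        last_seen[tid] = frame_idx
        # Accumulate the foot point (bottom-center of the box) into the density grid.
        gx = min(int(xc / frame_w * GRID_W), GRID_W - 1)
        gy = min(int((yc + h / 2) / frame_h * GRID_H), GRID_H - 1)
        heat[gy, gx] += 1

def dwell_times():
    """Seconds each track ID remained in view."""
    return {tid: (last_seen[tid] - first_seen[tid] + 1) / FPS for tid in first_seen}
```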
The model was validated on the MOT20-04 sequence after 50 epochs.
- mAP @ 50: 0.982 (Extremely high accuracy for detecting pedestrians)
- mAP @ 50-95: 0.837 (Strong bounding box precision even at strict thresholds)
- Precision: 0.986 (Very few false positives/ghost detections)
- Recall: 0.956 (Detects 95.6% of people even in dense crowds)
We provide a Google Colab notebook that automates the entire testing pipeline. You can access the notebook and model weights via the link below:
Instructions:
- Open the file `robot_vision_crowd_surveillance_demo_1.ipynb` in Google Colab.
- Execute all the cells (use the Run all button).
- Upload your test video when prompted in Cell 4.
- Download the processed output from the `output_videos` folder and view the analytics in the cells below it.
- Re-Identification (ReID): Integrate DeepSORT to use visual appearance features, reducing ID switches after long occlusions.
- Crowd Counting via Regression: For densities where boxes are impossible to separate, switch to density map regression (CSRNet).
- Anomaly Detection: Implement logic to detect sudden running or dispersion events.
