YOLO (You Only Look Once) is a family of object detection models known for their real-time processing capabilities, delivering a strong balance of accuracy and speed. YOLOv4, released in 2020, improves on its predecessor, YOLOv3, by offering higher accuracy at comparable real-time speeds.
Many accurate object detection models require multiple GPUs running in parallel, but YOLOv4 can be trained and run on a single conventional GPU with 8-16 GB of VRAM, such as a GTX 1080 Ti, making it far more accessible for widespread use.
This blog will delve deeper into the architecture of YOLOv4, exploring the changes that enable it to run on a single GPU, and examine some real-life applications of the model.
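To give a sense of how lightweight deployment can be, here is a minimal inference sketch using OpenCV's DNN module. The yolov4.cfg and yolov4.weights file names (as published in the official Darknet repository), the 416x416 input size, and the 0.5 objectness threshold are illustrative assumptions, not the only valid choices.

```python
# Minimal YOLOv4 inference sketch using OpenCV's DNN module.
# Assumes yolov4.cfg / yolov4.weights are in the working directory
# along with a local test image.
import cv2

net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")

image = cv2.imread("test.jpg")
# YOLOv4 expects a square, normalized RGB input (416x416 here).
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)

# Each output row: [cx, cy, w, h, objectness, class scores...]
outputs = net.forward(net.getUnconnectedOutLayersNames())
for out in outputs:
    for det in out:
        if det[4] > 0.5:  # objectness threshold (assumed)
            print("detection:", det[:5])
```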
The YOLO Family of Models
The original YOLO model was introduced in 2016, revolutionizing object detection with its one-stage approach. YOLOv1 ran in real time at 45 frames per second, but its accuracy lagged behind two-stage models such as Faster R-CNN.
YOLOv2
Introduced in 2016, YOLOv2 improved accuracy while maintaining speed by predicting bounding boxes relative to predefined anchor boxes (priors derived by k-means clustering over the training set).
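To make the anchor-box idea concrete, the sketch below decodes one YOLOv2-style prediction: the network outputs offsets (tx, ty, tw, th) that are interpreted relative to a grid cell and an anchor's width and height, following the formulas from the YOLOv2 paper.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode a YOLOv2-style prediction relative to an anchor (prior).

    (cx, cy) is the top-left corner of the grid cell and (pw, ph) the
    anchor's width/height; all values are in grid-cell units.
    """
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)   # sigmoid keeps the center inside its cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # anchor scaled by the predicted factor
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```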
YOLOv3
Released in 2018, YOLOv3 introduced the Darknet-53 backbone, a 53-layer convolutional network with residual connections, for better feature extraction. It also predicts an objectness score for each bounding box, and the YOLOv3-SPP variant adds Spatial Pyramid Pooling to enlarge the model's receptive field.
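As a rough illustration, a Darknet-53 residual block pairs a 1x1 bottleneck convolution with a 3x3 convolution and adds a skip connection; a minimal PyTorch sketch:

```python
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Sketch of a Darknet-53 residual block: a 1x1 bottleneck conv
    followed by a 3x3 conv, with a skip connection around both."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1, bias=False),
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # residual (skip) connection
```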
Key Innovations in YOLOv4
YOLOv4 enhances the efficiency of its predecessor by allowing training and operation on a single GPU. The main architecture changes in YOLOv4 include:
- Replacing Darknet-53 with CSPDarknet53 as the backbone
- Using PANet instead of FPN
- Implementing Self-Adversarial Training (SAT)
The authors also ran extensive experiments on what they call Bag of Freebies (BoF), training-time techniques such as Mosaic data augmentation and CIoU loss that improve accuracy at no inference cost, and Bag of Specials (BoS), modules such as Mish activation and SPP that add a small inference cost in exchange for a significant accuracy gain.
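As a taste of BoF, here is a deliberately simplified Mosaic augmentation sketch; the real version also jitters the join point between the four images and remaps each image's bounding boxes, which this pixel-only sketch omits.

```python
import cv2
import numpy as np

def mosaic(images, out_size=416):
    """Simplified Mosaic augmentation: tile four images into a 2x2 grid.
    The full BoF version also jitters the join point and remaps each
    image's bounding boxes; this sketch handles only the pixels."""
    half = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    corners = [(0, 0), (0, half), (half, 0), (half, half)]
    for img, (y, x) in zip(images, corners):
        canvas[y:y + half, x:x + half] = cv2.resize(img, (half, half))
    return canvas
```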
Architecture of YOLOv4
The main components of YOLOv4 are CSPDarknet53 as the backbone, SPP + PANet in the neck, and the YOLOv3 head. The model offers flexibility through Bag of Freebies and Bag of Specials options for further customization.
CSPDarkNet53 Backbone
YOLOv4 applies the Cross Stage Partial (CSP) strategy to the Darknet-53 backbone: the feature map entering each stage is split into two parts, one routed through the stage's residual blocks and the other bypassing them, and the two are merged by concatenation afterward. This removes redundant gradient computation, reducing cost while preserving accuracy.
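A minimal PyTorch sketch of this split-and-merge pattern; the dense path here is a simple stand-in for the stage's actual residual blocks, and the channel split via 1x1 convolutions is one common realization.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Sketch of the Cross Stage Partial idea: split the channels in two,
    route one half through the heavy convolutional stack, let the other
    half bypass it, then merge the two paths by concatenation."""
    def __init__(self, channels, num_blocks=2):
        super().__init__()
        half = channels // 2
        self.bypass = nn.Conv2d(channels, half, 1, bias=False)
        self.dense_in = nn.Conv2d(channels, half, 1, bias=False)
        # Stand-in for the Darknet residual blocks of this stage.
        self.dense = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, 3, padding=1, bias=False),
                nn.BatchNorm2d(half),
                nn.LeakyReLU(0.1),
            ) for _ in range(num_blocks)
        ])
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        a = self.bypass(x)                # cheap path: skips the blocks
        b = self.dense(self.dense_in(x))  # expensive path
        return self.fuse(torch.cat([a, b], dim=1))
```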
SPP and PANet Neck
The neck of YOLOv4 incorporates SPP, a modified PANet, and a modified SAM for improved feature aggregation and attention mapping.
SPP
YOLOv4 uses a modified Spatial Pyramid Pooling block that applies several max-pooling operations with different kernel sizes at stride 1 and concatenates the results, expanding the receptive field while preserving spatial resolution.
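A sketch of this block, using the kernel sizes 5, 9, and 13 commonly cited for YOLOv4; stride 1 with matching padding keeps the spatial size unchanged.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """YOLOv4-style SPP block: parallel max-pools with different kernel
    sizes at stride 1 (padding keeps the spatial size fixed), whose
    outputs are concatenated with the original feature map."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        # Output has 4x the input channels: original + three pooled views.
        return torch.cat([x] + [pool(x) for pool in self.pools], dim=1)
```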
PAN
The modified PANet in YOLOv4 fuses multi-scale features by concatenating them instead of adding them element-wise, so the information from each feature level is kept intact rather than summed away.
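The difference is easy to see in a toy example; the tensor shapes below are arbitrary placeholders for two feature maps being fused.

```python
import torch

# Original PANet fuses a top-down feature map with a lateral one by
# element-wise addition; YOLOv4's modified PAN concatenates instead,
# keeping both sets of channels (at the cost of a wider tensor).
lateral = torch.randn(1, 256, 52, 52)
top_down = torch.randn(1, 256, 52, 52)

fused_original = lateral + top_down                   # shape (1, 256, 52, 52)
fused_yolov4 = torch.cat([lateral, top_down], dim=1)  # shape (1, 512, 52, 52)
```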
Modified SAM
The modified SAM in YOLOv4 drops the pooling step of the standard module: a convolution is applied directly to the input feature map, a sigmoid turns the result into a point-wise attention map, and that map is multiplied element-wise with the input.
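A minimal sketch of this point-wise attention; the 3x3 kernel size is an assumption for illustration.

```python
import torch.nn as nn

class ModifiedSAM(nn.Module):
    """YOLOv4's point-wise spatial attention: a convolution applied
    directly to the feature map (no pooling), a sigmoid to form the
    attention map, and an element-wise product with the input."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # kernel size assumed
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return x * self.sigmoid(self.conv(x))
```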
How does a Standard SAM work?
The standard Spatial Attention Module (SAM), introduced in CBAM, first applies max-pooling and average-pooling across the channel dimension, concatenates the two resulting maps, and passes them through a convolution and a sigmoid; the resulting attention map highlights the spatial regions most relevant for detection and classification.
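For contrast with the modified version above, here is a sketch of the standard CBAM-style spatial attention, using the 7x7 kernel from the CBAM paper.

```python
import torch
import torch.nn as nn

class StandardSAM(nn.Module):
    """Standard SAM (as in CBAM): max- and average-pool across the
    channel axis, stack the two single-channel maps, then convolve
    and apply a sigmoid to get a spatial attention map."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        max_map, _ = torch.max(x, dim=1, keepdim=True)   # channel-wise max
        avg_map = torch.mean(x, dim=1, keepdim=True)     # channel-wise mean
        attention = self.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))
        return x * attention
```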
YOLOv3 Head
The head of a YOLO model performs the final detection and classification step. YOLOv4 retains the YOLOv3 head, which predicts bounding box offsets relative to anchor boxes, an objectness score, and class probabilities at three different scales.
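The head's output layout follows directly from this design; for example, with the 80 COCO classes:

```python
# Each YOLOv3-style head outputs, per grid cell and per anchor:
# 4 box offsets + 1 objectness score + num_classes class scores.
# With 3 anchors per scale, the output has 3 * (5 + C) channels.
num_classes = 80        # e.g., COCO
anchors_per_scale = 3
channels = anchors_per_scale * (5 + num_classes)
print(channels)         # 255 -> tensor shape (B, 255, H, W) at each scale
```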
Self-Adversarial Training (SAT)
SAT in YOLOv4 is a two-stage data augmentation technique. In the first stage, the network modifies the input image rather than its own weights, creating an adversarial example in which the objects it would normally detect are obscured; in the second stage, it is trained to detect objects in that modified image, improving robustness and generalization.
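The paper does not spell out the exact attack, so the sketch below uses an FGSM-style perturbation as one plausible realization; the step size eps and the sign-based update are assumptions, not the authors' prescribed method.

```python
import torch

def self_adversarial_step(model, images, targets, loss_fn, eps=0.01):
    """Sketch of one SAT step: first perturb the *image* using the
    model's own gradients, then train on the perturbed image.
    eps and the FGSM-style update are assumed for illustration."""
    # Stage 1: attack the image. Weight gradients accumulated here
    # are discarded by zero_grad() before the real training step.
    images = images.clone().detach().requires_grad_(True)
    loss_fn(model(images), targets).backward()
    adversarial = (images + eps * images.grad.sign()).detach()

    # Stage 2: a normal training step on the adversarial image.
    model.zero_grad()
    loss = loss_fn(model(adversarial), targets)
    loss.backward()
    return loss
```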
Real-Life Applications of YOLOv4
YOLOv4 has been successfully used in various applications such as oil palm harvesting, animal monitoring, pest control, pothole detection, and train detection, showcasing its versatility in computer vision tasks.
What’s Next
This blog has explored the architecture of YOLOv4, highlighting its single-GPU training capability and key features. Adaptations such as CSPDarknet53, SPP, PANet, and SAT improve the model's efficiency and performance, and its real-life applications demonstrate its practicality and effectiveness across a range of scenarios.