A Comprehensive Guide to Implementing Baidu’s RT-DETR with Paperspace

Bring this project to life

In this article, we introduce the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector, which addresses the high computational cost of DETRs. In the recent research paper DETRs Beat YOLOs on Real-Time Object Detection, Baidu Inc. analyzes the negative impact of non-maximum suppression (NMS) on real-time detectors and proposes an efficient hybrid encoder for multi-scale feature processing, together with IoU-aware query selection to enhance performance. RT-DETR-L achieves 53.0% AP on COCO val2017 at 114 FPS, and RT-DETR-X achieves 54.8% AP at 74 FPS, outperforming YOLO detectors of the same scale in both speed and accuracy. RT-DETR-R50 achieves 53.1% AP at 108 FPS, outperforming DINO-Deformable-DETR-R50 by 2.2% AP in accuracy and by about 21 times in FPS.


Source

Object detection is the task of identifying and localizing objects of interest in an image or video. Object detection models have practical applications across many domains, such as:

  1. Autonomous Vehicles: Object detection is crucial for enabling autonomous vehicles to identify and track pedestrians, vehicles, traffic signs, and other objects on the road.
  2. Retail Analytics: In retail, object detection helps track and analyze customer behavior, monitor inventory levels, and reduce theft through the identification of suspicious activities.
  3. Facial Recognition: Object detection is a fundamental component of facial recognition systems, used in applications such as access control, identity verification, and security.
  4. Environmental Monitoring: Object detection models can be applied in environmental monitoring to track and analyze wildlife movements, monitor deforestation, or assess changes in ecosystems.
  5. Gesture Recognition: Object detection is used to interpret and recognize human gestures, facilitating interaction with devices through gestures in applications like gaming or virtual reality.
  6. Agriculture: Object detection models can assist in crop monitoring, pest detection, and yield estimation by identifying and analyzing objects such as plants, fruits, or pests in agricultural images.

These are just a few examples; object detection plays a crucial role in many more use cases.

Recently, transformer-based detectors have achieved remarkable performance by building on Vision Transformers (ViT). RT-DETR extends this line of work: it processes multi-scale features effectively by separating intra-scale interaction from cross-scale fusion, and it is highly adaptable, allowing inference speed to be adjusted flexibly through different decoder layers without retraining.

To enable real-time object detection, a streamlined hybrid encoder replaces the original transformer encoder. This redesigned encoder efficiently processes multi-scale features by separating intra-scale interaction from cross-scale fusion.

To further enhance performance, IoU-aware query selection is introduced during training, providing higher-quality initial object queries to the decoder through IoU constraints. Additionally, the detector allows convenient adjustment of inference speed by using different numbers of decoder layers, leveraging the DETR architecture’s decoder design. This streamlines practical deployment of the real-time detector without requiring retraining, making RT-DETR the new state of the art for real-time object detection.

Real-time object detectors can be roughly classified into two categories: anchor-based and anchor-free.

  1. Anchor-Based Object Detectors:

  • In anchor-based detectors, predefined anchor boxes or regions of interest are used to predict the presence of objects and their bounding boxes.
  • These anchor boxes are generated at various scales and aspect ratios across the image.
  • The detector predicts two main components for each anchor box: class probabilities (is there an object or not) and bounding box offsets (adjustments to the anchor box to tightly fit the object).
  • Popular examples of anchor-based detectors include Faster R-CNN, R-FCN (Region-based Fully Convolutional Networks), and RetinaNet.

  2. Anchor-Free Object Detectors:

  • Anchor-free detectors do not rely on predefined anchor boxes. Instead, they directly predict bounding boxes and object presence without the need for anchor boxes.
  • These detectors often employ keypoint-based or center-ness prediction methods.
  • Keypoint-based methods identify key points (corners, center, etc.) and use them to estimate object bounding boxes.
  • Center-ness prediction focuses on determining the likelihood of a pixel being the center of an object, and bounding boxes are constructed based on these centers.
  • Popular examples of anchor-free detectors include CenterNet and FCOS (Fully Convolutional One-Stage Object Detection).

End-to-end object detectors based on Transformers were first proposed by Carion et al. Their DETR (DEtection TRansformer) has attracted significant attention due to its distinctive design: it removes the hand-designed anchor and Non-Maximum Suppression (NMS) components found in traditional detection pipelines and instead uses bipartite matching to directly predict a one-to-one set of objects. This simplifies the detection pipeline and removes the performance bottleneck associated with NMS. However, DETR faces its own challenges, including slow training convergence and difficulty in optimizing queries.

Use of NMS

Non-Maximum Suppression (NMS) is a widely used post-processing algorithm in object detection. It addresses overlapping prediction boxes by filtering out those with scores below a specified threshold and discarding lower-scored boxes when their Intersection over Union (IoU) exceeds a second threshold. NMS iteratively processes all boxes for each category, making its execution time dependent on the number of input prediction boxes and the two hyperparameters: score threshold and IoU threshold.
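To make this concrete, below is a minimal single-class NMS sketch written with NumPy (an assumption for illustration; production detectors use optimized, batched implementations). It makes the two hyperparameters explicit: predictions below the score threshold are dropped first, and boxes overlapping a higher-scored box beyond the IoU threshold are then suppressed.

import numpy as np

def nms(boxes, scores, score_thresh=0.25, iou_thresh=0.5):
    """Minimal single-class NMS sketch. boxes: (N, 4) as [x1, y1, x2, y2]."""
    # 1) Drop low-confidence predictions (score threshold).
    keep_mask = scores >= score_thresh
    boxes, scores = boxes[keep_mask], scores[keep_mask]

    # 2) Process boxes in descending score order.
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the highest-scored remaining box against the rest.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_r - inter)
        # 3) Discard boxes that overlap the kept box too much (IoU threshold).
        order = order[1:][iou < iou_thresh]
    return keep

Because this loop runs for every category over all surviving boxes, NMS latency grows with the number of prediction boxes and is sensitive to both thresholds, which is exactly the post-processing cost that an end-to-end, NMS-free detector such as RT-DETR avoids.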

Model Architecture

The RT-DETR model comprises a backbone, a hybrid encoder, and a transformer decoder with auxiliary prediction heads. The architecture leverages features from the last three stages of the backbone {S3, S4, S5} as input to the encoder, which uses intra-scale interaction and cross-scale fusion to transform multi-scale features into an image feature sequence. IoU-aware query selection is then applied to choose a fixed number of image features from the encoder output as initial queries for the decoder. The decoder, along with auxiliary prediction heads, iteratively refines these queries to generate object boxes and confidence scores.


Overview of RT-DETR (Source)

A novel Efficient Hybrid Encoder is proposed for RT-DETR. This encoder consists of two modules: the Attention-based Intra-scale Feature Interaction (AIFI) module and the CNN-based Cross-scale Feature Fusion Module (CCFM). Additionally, to create scaled versions of RT-DETR, the ResNet backbone can be replaced with HGNetv2.
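To tie the pieces together, here is a schematic sketch of the data flow in PyTorch. Every function below is a simplified placeholder (the channel counts, shapes, and scoring rule are assumptions made for illustration), not the actual RT-DETR implementation.

import torch

# Schematic only: placeholders standing in for the real RT-DETR modules,
# so the data flow described above can be traced end to end.
def backbone(image):
    b = image.shape[0]
    # Last three backbone stages S3/S4/S5 (256 channels is an assumption).
    return (torch.randn(b, 256, 80, 80),   # S3, stride 8
            torch.randn(b, 256, 40, 40),   # S4, stride 16
            torch.randn(b, 256, 20, 20))   # S5, stride 32

def hybrid_encoder(s3, s4, s5):
    # AIFI would apply attention-based intra-scale interaction on S5, and CCFM
    # would fuse the scales with convolutions; here we simply flatten and
    # concatenate the maps to show the shape of the output feature sequence.
    tokens = [f.flatten(2).transpose(1, 2) for f in (s3, s4, s5)]
    return torch.cat(tokens, dim=1)        # (batch, num_tokens, 256)

def select_queries(memory, k=300):
    # Stand-in for IoU-aware query selection: rank tokens by a score and keep
    # the top k as initial object queries (the real score is learned).
    scores = memory.mean(-1)
    idx = scores.topk(k, dim=1).indices
    return torch.gather(memory, 1,
                        idx.unsqueeze(-1).expand(-1, -1, memory.size(-1)))

image = torch.randn(1, 3, 640, 640)
s3, s4, s5 = backbone(image)
memory = hybrid_encoder(s3, s4, s5)        # image feature sequence
queries = select_queries(memory)           # 300 initial queries for the decoder
print(memory.shape, queries.shape)         # the decoder and heads refine these into boxes and scores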

Dataset Used

The model was trained on COCO train2017 and validated on COCO val2017. ResNet and HGNetv2 series backbones pretrained on ImageNet with SSLD from PaddleClas were used. For IoU-aware query selection, the top 300 encoder features are chosen to initialize the object queries of the decoder. The training strategy and hyperparameters of the decoder closely follow the DINO approach. The detectors were trained with the AdamW optimizer, and data augmentation included random colour distortion, expansion, cropping, flipping, and resizing.
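As a rough illustration, this training strategy (AdamW plus random colour distortion, flipping, and resizing) maps onto the Ultralytics trainer, which we introduce later in this article, as sketched below. The exact values are illustrative assumptions, not the paper's recipe.

from ultralytics import RTDETR

# Illustrative only: roughly mirrors the training strategy described above
# (AdamW optimizer, random colour/flip/resize augmentation). These values are
# assumptions for demonstration, not the paper's exact configuration.
model = RTDETR("rtdetr-l.pt")
model.train(
    data="coco.yaml",        # full COCO train2017/val2017 (large download)
    epochs=72,
    imgsz=640,
    optimizer="AdamW",
    lr0=1e-4,
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,   # random colour distortion
    fliplr=0.5,              # random horizontal flip
    scale=0.5,               # random resize / scale jitter
)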

Comparisons with Other SOTA Models

Compared to other real-time and end-to-end object detectors, RT-DETR demonstrates superior performance. Specifically, RT-DETR-L achieves 53.0% Average Precision (AP) at 114 Frames Per Second (FPS), and RT-DETR-X achieves 54.8% AP at 74 FPS, outperforming current state-of-the-art YOLO detectors in both speed and accuracy. Additionally, RT-DETR-R50 achieves 53.1% AP at 108 FPS, and RT-DETR-R101 achieves 54.3% AP at 74 FPS, surpassing state-of-the-art end-to-end detectors with the same backbone in both speed and accuracy. RT-DETR also allows flexible adjustment of inference speed by applying different decoder layers, all without requiring retraining, which enhances its practical applicability.

Ultralytics RT-DETR Pre-trained Model

Ultralytics is committed to the development of top-notch artificial intelligence models globally. Their open-source projects on GitHub showcase state-of-the-art solutions across a diverse array of AI tasks, encompassing detection, segmentation, classification, tracking, and pose estimation.

Ultralytics Python API provides pre-trained RT-DETR models with different scales:

  • RT-DETR-L: 53.0% AP on COCO val2017, 114 FPS on T4 GPU
  • RT-DETR-X: 54.8% AP on COCO val2017, 74 FPS on T4 GPU

The example code snippets below offer straightforward training and inference illustrations for RT-DETR using the Ultralytics pre-trained model. For comprehensive documentation on these and other modes, refer to the Predict, Train, Val, and Export pages in the documentation.


!pip install ultralytics


from ultralytics import RTDETR

# Load a COCO-pretrained RT-DETR-l model
model = RTDETR('rtdetr-l.pt')

# Display model information (optional)
model.info()

# Train the model on the COCO8 example dataset for 100 epochs
results = model.train(data="coco8.yaml", epochs=100, imgsz=640)

# Run inference with the RT-DETR-l model on the 'bus.jpg' image
results = model('path/to/bus.jpg')

A Snippet Displaying Model Summary

Let us check inference on an image (fetched here from a URL) and on a video saved in the local folder!


results = model.predict('https://m.media-amazon.com/images/I/61fNoq7Y6+L._AC_UF894,1000_QL80_.jpg', show=True)


results = model.predict(source="input_video/input_video.mp4", show=True)
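Beyond displaying the output with show=True, the predict calls return a list of Results objects that can be inspected in code. The sketch below is a hedged example of reading class names, confidences, and box coordinates from such a result and saving an annotated copy; the image path is a placeholder, and the save call assumes a recent Ultralytics release.

# Inspect the predictions programmatically ('path/to/bus.jpg' is a placeholder path).
results = model.predict('path/to/bus.jpg')

for result in results:                          # one Results object per image/frame
    for box in result.boxes:
        cls_id = int(box.cls)                   # predicted class index
        conf = float(box.conf)                  # confidence score
        x1, y1, x2, y2 = box.xyxy[0].tolist()   # box corners in pixels
        print(f"{result.names[cls_id]}: {conf:.2f} at ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    result.save(filename="annotated.jpg")       # save an annotated copy (recent Ultralytics versions)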

Conclusions

In this article, we discussed Baidu’s Real-Time Detection Transformer (RT-DETR), a model that stands out for its end-to-end object detection, delivering real-time performance without compromising accuracy. RT-DETR harnesses the capabilities of Vision Transformers to effectively handle multi-scale features. The model’s key features include an Efficient Hybrid Encoder, IoU-aware Query Selection, and Adaptable Inference Speed. We used the pre-trained model from Ultralytics to demonstrate its performance on images and videos. We recommend that readers click the link and get hands-on experience with this model on the Paperspace platform.

We hope you enjoyed reading the article!

References

RT-DETR (Realtime Detection Transformer)

Discover the features and benefits of RT-DETR, Baidu’s efficient and adaptable real-time object detector powered by Vision Transformers, including pre-trained models.

DETRs Beat YOLOs on Real-time Object Detection

Recently, end-to-end transformer-based detectors (DETRs) have achieved remarkable performance. However, the issue of the high computational cost of DETRs has not been effectively addressed, limiting their practical application and preventing them from fully exploiting the benefits of no post-processing, such as non-maximum suppression (NMS). In this paper, we first analyze the influence of NMS in modern real-time object detectors on inference speed, and establish an end-to-end speed benchmark. To avoid the inference delay caused by NMS, we propose a Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector to our best knowledge. Specifically, we design an efficient hybrid encoder to efficiently process multi-scale features by decoupling the intra-scale interaction and cross-scale fusion, and propose IoU-aware query selection to improve the initialization of object queries. In addition, our proposed detector supports flexible adjustment of the inference speed by using different decoder layers without the need for retraining, which facilitates the practical application of real-time object detectors. Our RT-DETR-L achieves 53.0% AP on COCO val2017 and 114 FPS on T4 GPU, while RT-DETR-X achieves 54.8% AP and 74 FPS, outperforming all YOLO detectors of the same scale in both speed and accuracy.