Towards Dev

A publication for sharing projects, ideas, codes, and new theories.


YOLOv7 now Outperforms All Known Object Detectors!


Benchmarks

In terms of speed and accuracy, YOLOv7 surpasses all known object detectors, including YOLOR, YOLOX, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, and ViT-Adapter-B. Its speed ranges from 5 FPS to 160 FPS, and it achieves the highest accuracy, 56.8% AP, among all known real-time object detectors running at 30 FPS or higher on a V100 GPU.

The YOLOv7-E6 object detector (56 FPS on V100, 55.9% AP) outperforms both the transformer-based detector SWIN-L Cascade-Mask R-CNN (9.2 FPS on A100, 53.9% AP) by 509% in speed and 2% AP in accuracy, and the convolution-based detector ConvNeXt-XL Cascade-Mask R-CNN (8.6 FPS on A100, 55.2% AP) by 551% in speed and 0.7% AP in accuracy.

Comparison of state-of-the-art real-time object detectors.

Faster and more accurate than:

  • YOLOv5 by 120% FPS
  • YOLOX by 180% FPS
  • Dual-Swin-T by 1200% FPS
  • ConvNext by 550% FPS
  • SWIN-L CM-RCNN by 500% FPS
  • PPYOLOE-X by 150% FPS

Performance & Improvements

  • Several trainable bag-of-freebies methods greatly improve detection accuracy without increasing the inference cost.
  • Proposed "extend" and "compound scaling" methods for real-time object detectors that utilize parameters and computation effectively.
  • Reduced the parameters of a state-of-the-art real-time object detector by about 40% and its computation by about 50%, while achieving faster inference and higher detection accuracy.

Architecture

Extended efficient layer aggregation networks
  • CSPVoVNet is a variation of VoVNet. It analyzes the gradient path so that the weights of different layers learn more diverse features, which makes inference faster and more accurate.
  • E-ELAN is a variation of ELAN. It uses group convolution to increase the cardinality of the added features, and combines the features of different groups in a shuffle-and-merge-cardinality manner. This enhances the features learned by different feature maps and improves the use of parameters and computation.
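The shuffle step above relies on ShuffleNet-style channel shuffling across convolution groups, so that features computed by different groups get mixed before the merge. A minimal NumPy sketch of that shuffle (the tensor shapes and group count here are illustrative, not YOLOv7's actual configuration):

```python
import numpy as np

def channel_shuffle(x, groups):
    """Interleave channels across groups, ShuffleNet-style.

    x: feature map of shape (channels, height, width); channels must be
    divisible by `groups`.
    """
    c, h, w = x.shape
    assert c % groups == 0
    # Split channels into (groups, channels_per_group), swap the two axes,
    # then flatten back: channel i of group g ends up next to channel i of
    # every other group, mixing information between groups.
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# Toy example: 4 channels in 2 groups. Channel order [0, 1 | 2, 3]
# becomes [0, 2, 1, 3] after the shuffle.
x = np.arange(4).reshape(4, 1, 1) * np.ones((4, 2, 2))
print(channel_shuffle(x, groups=2)[:, 0, 0])  # [0. 2. 1. 3.]
```

After a shuffle like this, a subsequent convolution over the concatenated (merged) channels sees features from every group, which is the intent of the merge-cardinality step.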
Model scaling for concatenation-based models
  • Model scaling adjusts certain attributes of a model to generate variants at different scales, meeting the needs of different inference speeds.
  • The proposed compound scaling method for concatenation-based models preserves the properties the model had at its initial design and thus maintains the optimal structure.
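The core of compound scaling for concatenation-based blocks is simple arithmetic: scaling the block's depth changes its concatenated output width, so the transition layer that follows must be widened by the same ratio. A hypothetical sketch (the function name, rounding scheme, and numbers are illustrative, not the paper's exact formulation):

```python
def compound_scale(base_depth, base_transition_width, depth_factor):
    """Sketch of compound scaling for a concatenation-based block.

    When the depth (number of stacked layers whose outputs are concatenated)
    grows, the concatenated output width grows with it, so the following
    transition layer is widened by the same ratio to preserve the block's
    original design properties.
    """
    new_depth = round(base_depth * depth_factor)
    # Output width of a concatenation-based block scales with its depth.
    width_ratio = new_depth / base_depth
    new_transition_width = round(base_transition_width * width_ratio)
    return new_depth, new_transition_width

# Scaling depth by 1.5x widens the transition layer by the same ratio.
print(compound_scale(base_depth=2, base_transition_width=64, depth_factor=1.5))  # (3, 96)
```

Scaling depth alone, without this compensating width change, would break the ratio between the block's output and the transition layer, which is the failure mode compound scaling avoids.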

Trainable bag-of-freebies

  • Bag of freebies refers to training methods that improve the overall accuracy of an object detection model without increasing its inference cost.
  • Training details: (1) A batch normalization layer is connected directly to the convolutional layer, so that at the inference stage the mean and variance of batch normalization are integrated into the bias and weight of the convolutional layer. (2) Implicit knowledge in YOLOR is combined with the convolutional feature map by addition and multiplication: the implicit knowledge can be simplified to a vector by pre-computing it at the inference stage, and that vector can be merged into the bias and weight of the previous or subsequent convolutional layer. (3) EMA model: EMA is a technique used in mean teacher; here the EMA model is used purely as the final inference model.
  • They observed that the identity connection in RepConv destroys the residual in ResNet and the concatenation in DenseNet, which provide more diversity of gradients for different feature maps. So they used RepConv without identity connection (RepConvN) to design the architecture of planned re-parameterized convolution.
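The batch-norm folding mentioned in point (1) of the training details can be sketched in a few lines. The sketch below uses a 1×1 convolution written as a plain matrix product for brevity; all names are illustrative, and real frameworks do the same folding over 4D conv kernels:

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding convolution.

    w: conv weights of shape (out_ch, in_ch) -- a 1x1 conv for simplicity.
    b: conv bias of shape (out_ch,).
    gamma, beta, mean, var: BatchNorm parameters/statistics, shape (out_ch,).
    Returns fused (w, b) so that fused_conv(x) == bn(conv(x)) at inference.
    """
    scale = gamma / np.sqrt(var + eps)   # per-output-channel scale
    w_fused = w * scale[:, None]         # scale each output channel's weights
    b_fused = (b - mean) * scale + beta  # shift absorbed into the bias
    return w_fused, b_fused

# Check on random data: conv followed by BN equals the single fused conv.
rng = np.random.default_rng(0)
w, b = rng.normal(size=(8, 4)), rng.normal(size=8)
gamma, beta = rng.normal(size=8), rng.normal(size=8)
mean = rng.normal(size=8)
var = rng.uniform(0.5, 2.0, size=8)
x = rng.normal(size=(4, 10))  # (in_ch, pixels)
eps = 1e-5

# Reference path: convolution, then batch normalization.
y_ref = gamma[:, None] * (w @ x + b[:, None] - mean[:, None]) \
        / np.sqrt(var + eps)[:, None] + beta[:, None]
# Fused path: one convolution with folded weights and bias.
w_f, b_f = fold_bn_into_conv(w, b, gamma, beta, mean, var, eps)
y_fused = w_f @ x + b_f[:, None]
print(np.allclose(y_ref, y_fused))  # True
```

Because the fused layer produces identical outputs, the BN layer can be dropped entirely at inference time, saving one elementwise pass per feature map.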

Inference with Huggingface Space

https://huggingface.co/spaces/akhaliq/yolov7

Inference of YOLOv7 with TensorRT

View this thread.

Installation

> git clone https://github.com/WongKinYiu/yolov7.git
> cd yolov7
> conda create -n yolov7 python=3.7 -y && conda activate yolov7
> pip install -r requirements.txt

Inference

Download pretrained model from here.

python detect.py --weights yolov7.pt --conf 0.25 --img-size 640 --source inference/images/ --view-img --device cpu

Output will be saved at runs/detect/exp by default.

--weights : model path for inference
--source : image file/image directory
--yaml : YAML file for data
--img-size : image size (h, w) for inference
--conf-thres : confidence threshold for inference
--iou-thres : NMS IoU threshold for inference
--max-det : maximum detections per image
--device : device to run the model on, e.g. 0, 1, 2, 3, or cpu
--save-txt : save results to *.txt
--save-img : save visualized inference results
--classes : filter results by class
--project : save inference results to project/name

Results

When compared with the recently released YOLOv6, YOLOv7 outperforms it in both accuracy and speed.

Inference on Video

python detect.py --weights yolov7.pt --conf 0.25 --img-size 640 --source inference/bird.mp4 --view-img --device cpu

Output will be saved at runs/detect/exp by default.

Impressive in terms of speed and accuracy!

Reference:

  • Github
  • Paper: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors by Chien-Yao Wang, Alexey Bochkovskiy, and Hong-Yuan Mark Liao.

Hope you learned something new today, Happy Learning!


Written by Amal

