ConvNeXt : A Convnet for 2020’s

With the introduction of Vision Transformers(ViT) in 2020’s it superseded ConvNets as the state-of-the-art image classification model. But is it capable to do more complex computer vision tasks?

A simple ViT faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation.

That’s when Swin Transformers are introduced with several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks.

In this paper , they have “modernize” a standard ResNet toward the design of a hierarchical vision Transformer, and discovered key performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. It maintains the efficiency of standard ConvNets, and the fully-convolutional nature for both training and testing makes it extremely simple to implement.

Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

Comparison of ConvNext with existing SOTA models.

ConvNets are also inherently efficient due to the fact that when used in a sliding-window manner, the computations are shared.

Transformers replaced recurrent neural networks to become the dominant backbone architecture. Despite the disparity in the task of interest between language and vision domains, the two streams surprisingly converged in the year 2020, as the introduction of Vision Transformers (ViT).

Swin Transformer demonstrates that it can be adopted as a generic vision backbone and achieve state-of-the-art performance across a range of computer vision tasks beyond image classification. Swin Transformer’s success and rapid adoption also revealed one thing: the essence of convolution is not becoming irrelevant; rather, it remains much desired and has never faded.

The only reason ConvNets appear to be losing steam is that (hierarchical) Transformers surpass them in many vision tasks, and the performance difference is usually attributed to the superior scaling behavior of Transformers, with multi-head self-attention being the key component.

How did they “modernize” a ConvNet?

  • Starting point is a ResNet-50 model, they have train it with similar training techniques used to train vision Transformers and obtain much improved results compared to the original ResNet-50. This will the baseline architechture.
  • Design decisions that they studied are 1) macro design, 2) ResNeXt, 3) inverted bottleneck, 4) large kernel size, and 5) various layer-wise micro designs.

Architecture :

FineTune Results on ImageNet :

ImageNet-1K/22K (pre-)training settings
ImageNet-1K fine-tuning settings. Multiple values (e.g., 0.8/0.95) are for each model (e.g., ConvNeXt-B/L) respectively

Results :

Classification accuracy on ImageNet-1K & 22k. Similar to Transformers, ConvNeXt also shows promising scaling behavior with higher-capacity models and a larger (pre-training) dataset
COCO object detection and segmentation results using Mask-RCNN and Cascade Mask-RCNN

Paper :

Github :

Colab Demo :

Huggingface Demo :

Hope you learned something new today, Happy Learning!



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store