HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions

Yongming Rao*¹ Wenliang Zhao*¹ Yansong Tang¹

Jie Zhou¹ Ser-Nam Lim†² Jiwen Lu†¹

¹Tsinghua University ²Meta AI

Figure 1: Illustration of our main idea. We show representative spatial modeling operations that perform interactions of different orders. In this paper, we focus on explicit spatial interactions between a feature (red) and its neighboring region (light gray). (a) The standard convolution operation does not explicitly consider spatial interactions. (b) Dynamic convolution and SE introduce dynamic weights that improve the modeling power of convolutions with extra spatial interactions. (c) The self-attention operation performs two-order spatial interactions with two successive matrix multiplications. (d) gⁿConv realizes arbitrary-order spatial interactions using a highly efficient implementation with gated convolutions and recursive designs.

Abstract

Recent progress in vision Transformers exhibits great success in various tasks, driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (gⁿConv), which performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gⁿConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on this operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architecture and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Beyond its effectiveness in visual encoders, we also show that gⁿConv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that gⁿConv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs.

Recursive Gated Convolution (gⁿConv)

gⁿConv is an efficient operation for achieving long-range and high-order spatial interactions. It is built from standard convolutions, linear projections and element-wise multiplications, yet it provides input-adaptive spatial mixing similar to that of self-attention.
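The recursion can be illustrated with a small, dependency-free sketch. This is a toy 1-D version written for this page, not the released HorNet code. It rests on several assumptions: the names `linear`, `depthwise_conv1d`, `make_params` and `gn_conv_1d` are hypothetical, the channel widths double at each step, the weights are random, and biases and any output scaling are omitted. It only shows how a linear projection, a depthwise convolution and repeated element-wise gating combine into a higher-order interaction.

```python
import random

def linear(x, w):
    # Per-position linear projection. x: [L][C_in], w: [C_out][C_in] -> [L][C_out]
    return [[sum(wi * xi for wi, xi in zip(row, pos)) for row in w] for pos in x]

def depthwise_conv1d(x, kernels):
    # One kernel per channel, zero padding. x: [L][C], kernels: [C][K] with odd K
    length, half = len(x), len(kernels[0]) // 2
    out = []
    for t in range(length):
        row = []
        for c, kern in enumerate(kernels):
            acc = 0.0
            for j, k in enumerate(kern):
                s = t + j - half
                if 0 <= s < length:
                    acc += k * x[s][c]
            row.append(acc)
        out.append(row)
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def make_params(dim, order, kernel_size=3, seed=0):
    # Assumed layout: widths double per step, e.g. dim=8, order=3 -> [2, 4, 8].
    # dim must be divisible by 2 ** (order - 1).
    rng = random.Random(seed)
    dims = [dim // 2 ** (order - 1 - k) for k in range(order)]
    return {
        "dims": dims,
        "proj_in": rand_matrix(2 * dim, dim, rng),       # dim -> dims[0] + sum(dims)
        "dw": rand_matrix(sum(dims), kernel_size, rng),  # depthwise kernels for all q_k
        "pw": [rand_matrix(dims[k + 1], dims[k], rng) for k in range(order - 1)],
        "proj_out": rand_matrix(dim, dim, rng),
    }

def gn_conv_1d(x, params):
    dims = params["dims"]
    fused = linear(x, params["proj_in"])
    p = [pos[:dims[0]] for pos in fused]                 # gating branch p_0
    q_all = depthwise_conv1d([pos[dims[0]:] for pos in fused], params["dw"])
    offset = 0
    for k, width in enumerate(dims):
        q = [pos[offset:offset + width] for pos in q_all]
        offset += width
        if k > 0:
            p = linear(p, params["pw"][k - 1])           # widen p to match q_k
        # Element-wise gating: each pass adds one more order of interaction.
        p = [[a * b for a, b in zip(pp, qq)] for pp, qq in zip(p, q)]
    return linear(p, params["proj_out"])
```

Because this sketch has no biases, each gating step multiplies in one more factor that is linear in the input, so with `order=3` the output is a degree-4 polynomial of the input. Doubling the input therefore scales the output by 16, which is a quick sanity check that the recursion is in fact producing higher-order terms.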

Results

Our models achieve favorable accuracy/complexity trade-offs on both ImageNet and downstream tasks.
These results show that HorNet is a competitive alternative to Transformer-style models and CNNs.

Table 1: ImageNet classification results. We compare our models with state-of-the-art vision Transformers and CNNs that have comparable FLOPs and parameters. We report the top-1 accuracy on the validation set of ImageNet as well as the number of parameters and FLOPs. We also show the improvements over Swin Transformers that have similar overall architectures and training configurations to our models. “↑384” indicates that the model is fine-tuned on 384×384 images for 30 epochs. Our models are highlighted in gray.

Table 2: Object detection and semantic segmentation results with different backbones. We use UperNet for semantic segmentation and Cascade Mask R-CNN for object detection. ‡ indicates that the model is pre-trained on ImageNet-22K. For semantic segmentation, we report both single-scale (SS) and multi-scale (MS) mIoU. The FLOPs are calculated with image size (2048, 512) for ImageNet-1K pre-trained models and (2560, 640) for ImageNet-22K pre-trained models. For object detection, we report the box AP and the mask AP. FLOPs are measured on input sizes of (1280, 800). Our models are highlighted in gray.

Table 3: Object detection results with recent state-of-the-art frameworks. We report the single-scale box AP and mask AP on the validation set of COCO. Our models are highlighted in gray.

Table 4: Semantic segmentation results with recent state-of-the-art frameworks. We report the single-scale (SS) and multi-scale (MS) mIoU on the validation set of ADE20K. Our models are highlighted in gray.

BibTeX

@article{rao2022hornet,
  title={HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions},
  author={Rao, Yongming and Zhao, Wenliang and Tang, Yansong and Zhou, Jie and Lim, Ser-Nam and Lu, Jiwen},
  journal={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2022}
}