HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions
Yongming Rao*,1 Wenliang Zhao*,1 Yansong Tang1
Jie Zhou1 Ser-Nam Lim†,2 Jiwen Lu†,1
1Tsinghua University 2Meta AI
[Paper (arXiv)] [Code (GitHub)]
Figure 1: Illustration of our main idea. We show representative spatial modeling operations that perform different orders of interactions. In this paper, we focus on studying explicit spatial interactions between a feature (red) and its neighboring region (light gray). (a) The standard convolution operation does not explicitly consider spatial interactions. (b) Dynamic convolution and squeeze-and-excitation (SE) introduce dynamic weights to improve the modeling power of convolutions with extra spatial interactions. (c) The self-attention operation performs two-order spatial interactions with two successive matrix multiplications. (d) gnConv realizes arbitrary-order spatial interactions with a highly efficient implementation based on gated convolutions and recursive designs.
Recent progress in vision Transformers exhibits great success in various tasks driven by the new spatial modeling mechanism based on dot-product self-attention. In this paper, we show that the key ingredients behind vision Transformers, namely input-adaptive, long-range and high-order spatial interactions, can also be efficiently implemented with a convolution-based framework. We present the Recursive Gated Convolution (gnConv) that performs high-order spatial interactions with gated convolutions and recursive designs. The new operation is highly flexible and customizable: it is compatible with various variants of convolution and extends the two-order interactions in self-attention to arbitrary orders without introducing significant extra computation. gnConv can serve as a plug-and-play module to improve various vision Transformers and convolution-based models. Based on the operation, we construct a new family of generic vision backbones named HorNet. Extensive experiments on ImageNet classification, COCO object detection and ADE20K semantic segmentation show that HorNet outperforms Swin Transformers and ConvNeXt by a significant margin with similar overall architectures and training configurations. HorNet also shows favorable scalability to more training data and larger model sizes. Apart from its effectiveness in visual encoders, we also show that gnConv can be applied to task-specific decoders and consistently improves dense prediction performance with less computation. Our results demonstrate that gnConv can be a new basic module for visual modeling that effectively combines the merits of both vision Transformers and CNNs.
gnConv is an efficient operation that achieves long-range and high-order spatial interactions. It is built from standard convolutions, linear projections and element-wise multiplications, yet performs input-adaptive spatial mixing in a manner similar to self-attention; a minimal sketch is given below.
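To make the construction concrete, below is a minimal PyTorch sketch of an n-order recursive gated convolution built only from these primitives (1x1 projections, a depthwise convolution, and element-wise gating). The class name, the default order, and the channel-splitting scheme are our assumptions for illustration; the official implementation in the linked repository may differ in details.

```python
import torch
import torch.nn as nn

class GnConv(nn.Module):
    """Sketch of a recursive gated convolution (gnConv) with n-order interactions."""
    def __init__(self, dim, order=3, kernel_size=7):
        super().__init__()
        self.order = order
        # channel widths for each recursion step, smaller at lower orders (assumed split)
        self.dims = [dim // 2 ** i for i in range(order)][::-1]
        self.proj_in = nn.Conv2d(dim, 2 * dim, 1)  # linear projection (1x1 conv)
        # one depthwise conv mixes spatial context for all gating branches at once
        self.dwconv = nn.Conv2d(sum(self.dims), sum(self.dims), kernel_size,
                                padding=kernel_size // 2, groups=sum(self.dims))
        # 1x1 projections that carry the feature to the next (higher) order
        self.pws = nn.ModuleList(
            nn.Conv2d(self.dims[i], self.dims[i + 1], 1) for i in range(order - 1)
        )
        self.proj_out = nn.Conv2d(dim, dim, 1)  # output projection

    def forward(self, x):
        # project the input, then split into the base feature and the gating features
        p0, q = torch.split(self.proj_in(x), (self.dims[0], sum(self.dims)), dim=1)
        gates = torch.split(self.dwconv(q), self.dims, dim=1)
        x = p0 * gates[0]  # first-order interaction via element-wise multiplication
        for i in range(self.order - 1):
            # each recursion step gates the projected feature again, raising the order
            x = self.pws[i](x) * gates[i + 1]
        return self.proj_out(x)

# usage example (hypothetical shapes): dim should be divisible by 2**(order-1)
x = torch.randn(1, 64, 56, 56)
y = GnConv(64, order=3)(x)
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

Because the gating is realized with convolutions and element-wise products rather than attention matrices, the cost grows only mildly with the interaction order in this sketch.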
Our models exhibit favorable accuracy/complexity trade-offs on both ImageNet and downstream tasks.
We demonstrate that HorNet can be a very competitive alternative to transformer-style models and CNNs.
@article{rao2022hornet,
title={HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions},
author={Rao, Yongming and Zhao, Wenliang and Tang, Yansong and Zhou, Jie and Lim, Ser-Nam and Lu, Jiwen},
journal={Advances in Neural Information Processing Systems (NeurIPS)},
year={2022}
}