In computer vision, the CNN has long been the standard backbone: essentially every model is built on convolutions. In NLP, meanwhile, the Transformer has dominated nearly every task since its debut.
Building on the Transformer's strength, a line of work has tried to carry it over into CV. The paper AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE largely succeeds in porting the Transformer to vision, and it achieves quite strong results.
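The title's "16x16 words" refers to the paper's core trick: cut the image into non-overlapping 16x16 patches, flatten each patch, and linearly project it into an embedding, so the image becomes a token sequence a standard Transformer can consume. A minimal NumPy sketch of that patch-embedding step (the function name, embedding size, and random untrained projection are my own illustrative choices, not from the paper):

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, d_model=64, seed=0):
    """Split an (H, W, C) image into non-overlapping patches, flatten each,
    and project it to d_model dims with a random (untrained) linear map."""
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    nh, nw = H // patch_size, W // patch_size
    # (H, W, C) -> (nh, p, nw, p, C) -> (nh, nw, p, p, C) -> (nh*nw, p*p*C)
    patches = (image.reshape(nh, patch_size, nw, patch_size, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(nh * nw, patch_size * patch_size * C))
    # stand-in for the learned patch-embedding matrix
    W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
    return patches @ W_proj  # (num_patches, d_model): the Transformer's input sequence

img = np.zeros((224, 224, 3))
tokens = image_to_patch_embeddings(img)
print(tokens.shape)  # a 224x224 image becomes 14*14 = 196 tokens
```

In the full model, a learnable [class] token and position embeddings are added to this sequence before it enters the Transformer encoder.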
Abstract
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.