In computer vision, the CNN has long been the standard backbone: essentially every model is built on convolutions. In NLP, meanwhile, the Transformer has dominated nearly every task since its debut.
Building on the Transformer's strength, a line of work has tried to carry it over into CV. The paper AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE largely succeeds in porting the Transformer to vision, and it achieves quite strong results.
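The title's "16x16 words" refers to the paper's core trick: cut the image into non-overlapping 16x16 patches, flatten each patch, and linearly project it into an embedding, so the image becomes a token sequence a standard Transformer can consume. A minimal NumPy sketch of that patch-embedding step (the function name, embedding size, and random untrained projection are my own illustrative choices, not from the paper):

```python
import numpy as np

def image_to_patch_embeddings(image, patch_size=16, d_model=64, seed=0):
    """Split an (H, W, C) image into non-overlapping patches, flatten each,
    and project it to d_model dims with a random (untrained) linear map."""
    rng = np.random.default_rng(seed)
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    nh, nw = H // patch_size, W // patch_size
    # (H, W, C) -> (nh, p, nw, p, C) -> (nh, nw, p, p, C) -> (nh*nw, p*p*C)
    patches = (image.reshape(nh, patch_size, nw, patch_size, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(nh * nw, patch_size * patch_size * C))
    # stand-in for the learned patch-embedding matrix
    W_proj = rng.standard_normal((patches.shape[1], d_model)) * 0.02
    return patches @ W_proj  # (num_patches, d_model): the Transformer's input sequence

img = np.zeros((224, 224, 3))
tokens = image_to_patch_embeddings(img)
print(tokens.shape)  # a 224x224 image becomes 14*14 = 196 tokens
```

In the full model, a learnable [class] token and position embeddings are added to this sequence before it enters the Transformer encoder.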
Abstract
While the Transformer architecture has become the de-facto standard for natural
language processing tasks, its applications to computer vision remain limited. In
vision, attention is either applied in conjunction with convolutional networks, or
used to replace certain components of convolutional networks while keeping their
overall structure in place. We show that this reliance on CNNs is not necessary
and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.