Tsubato's Posting Log

I mainly write about what I have learned in machine learning and data science.

Paper reading: MetaFormer Is Actually What You Need for Vision

Papers, Machine Learning

Article overview

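In my previous post, I introduced a paper showing that performance does not change even when the self-attention in Vision Transformer is replaced with an MLP. (aburaku.hatenablog.com)
The paper I introduce this time goes one step further and proposes that the part that mixes features can be almost anything, and that what really matters is the rest of the structure. (arxiv.org)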

Paper overview

The paper proposes MetaFormer, which abstracts the self-attention in Transformer and the token-mixing MLP in MLP-Mixer into a generic "Token Mixer".
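To make the abstraction concrete, here is a minimal NumPy sketch of a block in which the token mixer is just a function you pass in. This is my own illustration, not the authors' code: the names `metaformer_block`, `layer_norm`, `channel_mlp` and `mean_mixer` are hypothetical, the normalization is a plain per-token layer norm without learned parameters, and ReLU stands in for the activation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over its channel dimension (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def channel_mlp(x, w1, w2):
    # Two-layer MLP applied to every token independently (ReLU is an assumption).
    return np.maximum(x @ w1, 0.0) @ w2

def metaformer_block(x, token_mixer, w1, w2):
    # Sub-block 1: mix information across tokens, plus a residual connection.
    x = x + token_mixer(layer_norm(x))
    # Sub-block 2: mix information across channels, plus a residual connection.
    x = x + channel_mlp(layer_norm(x), w1, w2)
    return x

def mean_mixer(x):
    # Toy token mixer: every token receives the average of all tokens.
    return np.broadcast_to(x.mean(axis=0, keepdims=True), x.shape)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))    # 16 tokens, 8 channels
w1 = rng.normal(size=(8, 32)) * 0.1  # channel MLP expansion weights
w2 = rng.normal(size=(32, 8)) * 0.1  # channel MLP projection weights
out = metaformer_block(tokens, mean_mixer, w1, w2)
print(out.shape)
```

Swapping `mean_mixer` for attention, a token-mixing MLP, or pooling changes only the argument, while the norm, residual connections and channel MLP stay the same. That fixed part is what the paper calls MetaFormer.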

We thus hypothesize compared with specific token mixers, MetaFormer is more essential for the model to achieve competitive performance.

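A simple token mixer that does nothing but pooling performs about as well as the others, which supports the claim that the surrounding structure matters more than what is inside the token mixer.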
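As a rough idea of how little the pooling mixer does, here is a small NumPy sketch. It is my own simplified version, not the official implementation: the function name `pool_token_mixer` and the `(H, W, C)` layout are assumptions, and the loop-based averaging is written for readability rather than speed. As I understand the paper's formulation, the input is subtracted from the pooled result because the surrounding block adds it back through the residual connection.

```python
import numpy as np

def pool_token_mixer(x, pool_size=3):
    # x: (H, W, C) grid of tokens. Each token is averaged with its spatial neighbours.
    h, w, _ = x.shape
    pad = pool_size // 2
    out = np.empty_like(x, dtype=float)
    for i in range(h):
        for j in range(w):
            # Clip the window at the border so padding is not counted in the average.
            window = x[max(i - pad, 0):i + pad + 1, max(j - pad, 0):j + pad + 1]
            out[i, j] = window.mean(axis=(0, 1))
    # Subtract the input: the block's residual connection adds it back afterwards.
    return out - x

grid = np.arange(9, dtype=float).reshape(3, 3, 1)  # 3x3 tokens, 1 channel
mixed = pool_token_mixer(grid)
print(mixed[..., 0])
```

Note that this mixer has no learnable parameters at all, unlike self-attention or a token-mixing MLP.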

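On image classification (ImageNet-1K), object detection (COCO benchmark), and semantic segmentation (ADE20K), the paper shows comparable accuracy with fewer parameters and less computation than existing models such as CNNs.
The ablation study is the interesting part. It shows that normalization, the residual connection, and the channel MLP are all essential. In addition, combining pooling and attention as token mixers improves accuracy.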

Thoughts

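As the authors say, I am curious whether this idea also holds for natural language processing. It would be interesting if image processing and language processing could both be handled by the same simple network.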

Moreover, it is interesting to see whether PoolFormer still works on NLP tasks to further support the claim “MetaFormer is actually what you need” in the NLP domain.

For those who want to understand Vision Transformer first, I recommend the book below. It is great to have a high-quality textbook that can be read in Japanese.

Vision Transformer入門 (Introduction to Vision Transformer), Computer Vision Library