The Vision Transformer differs from convolutional architectures in that it operates on a single scale and has a receptive field that covers the whole image. Several approaches that process the image at multiple scales have been proposed. Similar to convolutional networks, these generally start with small high-resolution patches and few channels and gradually enlarge the receptive field, decrease the spatial resolution, and increase the number of channels (embedding dimension).
Swin Transformer
The Swin (shifted-window) Transformer is a representative example. This is an encoder transformer that divides the image into patches and groups these patches into a grid of windows, within which self-attention is applied independently. The windows are shifted between adjacent transformer layers, so the effective receptive field at a given patch can expand beyond a window border.
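The windowing and shifting described above can be sketched in NumPy. This is a minimal illustration, not the published implementation: the function names, the 8×8 grid, the window size of 4, and the shift of half a window are assumptions made for the example, the learned query/key/value projections are replaced by the identity, and the attention mask that prevents wrapped-around patches from attending to one another after a cyclic shift is omitted.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) grid of patch embeddings into non-overlapping windows.

    Returns an array of shape (num_windows, window * window, C).
    """
    H, W, C = x.shape
    assert H % window == 0 and W % window == 0
    x = x.reshape(H // window, window, W // window, window, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/window, W/window, window, window, C)
    return x.reshape(-1, window * window, C)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def window_self_attention(windows):
    """Single-head self-attention applied independently within each window.

    Simplification: queries, keys, and values are the inputs themselves.
    """
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ windows

def which_window(win_ids, patch_id):
    """Index of the window that contains the patch with the given id."""
    return int(np.argwhere(win_ids == patch_id)[0, 0])

H = W = 8        # grid of patches (hypothetical size)
C = 4            # channels per patch
window = 4       # window side length, in patches
shift = window // 2

rng = np.random.default_rng(0)
x = rng.normal(size=(H, W, C))

# Layer k: regular windows. Layer k + 1: windows on a cyclically shifted grid.
regular = window_self_attention(window_partition(x, window))
rolled = np.roll(x, (-shift, -shift), axis=(0, 1))
shifted = window_self_attention(window_partition(rolled, window))

# Track patch identities to see which patches share a window in each layer.
ids = np.arange(H * W).reshape(H, W, 1)
reg_ids = window_partition(ids, window)[..., 0]
shf_ids = window_partition(np.roll(ids, (-shift, -shift), axis=(0, 1)), window)[..., 0]

a, b = 3 * W + 3, 4 * W + 4  # patches at grid positions (3, 3) and (4, 4)
print(which_window(reg_ids, a) == which_window(reg_ids, b))  # different windows
print(which_window(shf_ids, a) == which_window(shf_ids, b))  # same window after the shift
```

Patches (3, 3) and (4, 4) sit on opposite sides of a window border in the regular layout but fall into the same window once the grid is shifted, which is how information crosses window borders over successive layers.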

The scale is reduced periodically by concatenating the features from each non-overlapping 2×2 block of patches and applying a linear transformation that maps the concatenated features to twice the original number of channels. This architecture does not have a <cls> token but instead averages the output features at the last layer. The averaged features are then mapped via a linear layer to the desired number of classes and passed through a softmax function to output class probabilities.
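The downsampling step and the classification head can be sketched as follows. This is an illustrative NumPy version under stated assumptions: `patch_merge` and `classify` are hypothetical names, the weights are random rather than learned, the order in which the four patches are concatenated is one arbitrary choice, and any normalization layers used in practice are left out.

```python
import numpy as np

def patch_merge(x, weight):
    """Halve the spatial resolution of an (H, W, C) feature grid.

    Features from each non-overlapping 2x2 block are concatenated (4C values)
    and mapped linearly to 2C channels. weight has shape (4C, 2C).
    """
    H, W, C = x.shape
    assert H % 2 == 0 and W % 2 == 0
    x = x.reshape(H // 2, 2, W // 2, 2, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape(H // 2, W // 2, 4 * C)  # concatenate the 2x2 block
    return x @ weight

def classify(x, weight, bias):
    """Average the final features over all positions, then linear layer + softmax."""
    pooled = x.mean(axis=(0, 1))       # (C,), takes the place of a <cls> token
    logits = pooled @ weight + bias    # (num_classes,)
    logits = logits - logits.max()     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
H, W, C, num_classes = 8, 8, 4, 10     # hypothetical sizes
x = rng.normal(size=(H, W, C))

w_merge = rng.normal(size=(4 * C, 2 * C))
merged = patch_merge(x, w_merge)
print(merged.shape)                    # spatial size halves, channels double

w_head = rng.normal(size=(2 * C, num_classes))
b_head = np.zeros(num_classes)
probs = classify(merged, w_head, b_head)
print(probs.shape, probs.sum())
```

Each merge trades spatial resolution for channels: the grid goes from 8×8 with 4 channels to 4×4 with 8 channels, mirroring the resolution/channel trade-off in convolutional networks.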
DaViT
The dual attention vision transformer (DaViT) periodically integrates information from across the whole image. This is done by alternating two types of transformer layer.
- In the first, image patches attend to one another, and the self-attention computation uses all the channels.
- In the second, the channels attend to one another, and the self-attention computation uses all the image patches.
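
The difference between the two attention types comes down to which axis of the patch-by-channel feature matrix supplies the tokens. The sketch below is a simplified NumPy illustration, not the published model: the function names are hypothetical, a single head is used with identity query/key/value projections, and details such as windowing or channel grouping are ignored.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(x):
    """Patches attend to one another.

    x has shape (N, C). Each of the N patches is a token described by all
    C channels, so the attention matrix is N x N.
    """
    N, C = x.shape
    attn = softmax(x @ x.T / np.sqrt(C))
    return attn @ x, attn

def channel_attention(x):
    """Channels attend to one another.

    Each of the C channels is a token described by its values at all
    N patches, so the attention matrix is C x C.
    """
    N, C = x.shape
    xt = x.T                                  # (C, N): one row per channel
    attn = softmax(xt @ xt.T / np.sqrt(N))
    return (attn @ xt).T, attn                # back to (N, C)

rng = np.random.default_rng(0)
N, C = 16, 6                                  # hypothetical sizes
x = rng.normal(size=(N, C))

# Alternate the two layer types.
h, spatial_attn = spatial_attention(x)
y, channel_attn = channel_attention(h)

print(spatial_attn.shape)  # one weight per pair of patches
print(channel_attn.shape)  # one weight per pair of channels
print(y.shape)             # output keeps the input shape
```

Because every channel token summarizes all patch positions, the channel-attention step mixes information from across the whole image, and its attention matrix scales with the number of channels rather than the number of patches.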