Transformers were not immediately promising for image-based tasks. First, an image contains many more pixels than a sentence contains words, so the cost of self-attention, which grows quadratically with the number of tokens, becomes a practical bottleneck. Second, CNNs have a useful inductive bias built in: each layer is equivariant to spatial translation and takes the 2D structure of the image into account. Nonetheless, transformers have since eclipsed CNNs for image tasks. This is partly because they can be built at enormous scale and pre-trained on very large amounts of data.
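To give a rough sense of the first point, the sketch below counts the pairwise attention scores that a single self-attention head would compute for a sentence and for an image. The specific sizes (a 30-word sentence, a 224×224 image) and the choice to treat every pixel as one token are illustrative assumptions, not figures from the text.

```python
def attention_entries(num_tokens: int) -> int:
    """Number of pairwise attention scores for one head in one layer."""
    # Every token attends to every token, so the score matrix is N x N.
    return num_tokens * num_tokens


sentence_tokens = 30      # assumed sentence length, one token per word
image_tokens = 224 * 224  # assumed image size, one token per pixel

sentence_cost = attention_entries(sentence_tokens)
image_cost = attention_entries(image_tokens)

print(f"sentence: {sentence_tokens} tokens -> {sentence_cost:,} scores")
print(f"image:    {image_tokens} tokens -> {image_cost:,} scores")
print(f"ratio:    {image_cost / sentence_cost:,.0f}x")
```

Under these assumptions the image needs billions of scores per head where the sentence needs only hundreds, which is why applying self-attention directly to pixels is impractical.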