Translation between languages is an example of a sequence-to-sequence task. A common approach is to use both an encoder (to compute a good representation of the source sentence) and a decoder (to generate the sentence in the target language). This is called an encoder-decoder model.
Consider translating from English to French. The encoder receives the sentence in English and processes it through a series of transformer layers to create an output representation for each token. During training, the decoder receives the ground truth translation in French and passes it through a series of transformer layers that use masked self-attention and predict the following word at each position. However, the decoder layers also attend to the output of the encoder. Thus, each French output word is conditioned on both the previous output words and the source English sentence.
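
The training setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token ids are made up, and the start-of-sentence id `bos_id` and the helper names `causal_mask` and `shift_for_teacher_forcing` are hypothetical, not taken from the text. It shows how the ground truth translation is shifted so that each position predicts the following word, and how a mask restricts each position to the words before it.

```python
import numpy as np

def causal_mask(n):
    """Boolean (n, n) mask: entry [i, j] is True when position i may attend to position j."""
    # Lower-triangular: each position sees itself and earlier positions only.
    return np.tril(np.ones((n, n), dtype=bool))

def shift_for_teacher_forcing(target_ids, bos_id):
    """Build decoder inputs and labels from a ground truth target sequence.

    Assumption: a start-of-sentence token (bos_id) is prepended so that the
    first target word can also be predicted.
    """
    decoder_inputs = [bos_id] + list(target_ids[:-1])  # what the decoder reads
    labels = list(target_ids)                          # what it should predict
    return decoder_inputs, labels

# Hypothetical token ids for a three-word French sentence.
french_ids = [5, 6, 7]
decoder_inputs, labels = shift_for_teacher_forcing(french_ids, bos_id=0)
mask = causal_mask(len(decoder_inputs))
```

At position `i`, the decoder reads `decoder_inputs[:i + 1]` (enforced by row `i` of `mask`) and is trained to output `labels[i]`, so every position is trained in a single pass.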

Cross-attention
This is achieved by modifying the transformer layers in the decoder. Originally, these consisted of masked self-attention followed by a neural network applied individually to each embedding. A new attention layer is added between these two components, in which the decoder embeddings attend to the encoder embeddings. This uses a variant of self-attention known as encoder-decoder attention or cross-attention, where the queries are computed from the decoder embeddings and the keys and values are computed from the encoder embeddings.
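
A minimal NumPy sketch of this layer may make the asymmetry concrete. It assumes a single head, no bias terms, scaled dot-product scores, and embeddings stored as rows. The weight matrices `W_q`, `W_k`, `W_v` and the function name `cross_attention` are illustrative, and the random inputs stand in for real encoder and decoder embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_emb, encoder_emb, W_q, W_k, W_v):
    """Single-head cross-attention (a sketch, not a full transformer layer).

    decoder_emb: (n_dec, d) array, one row per target token
    encoder_emb: (n_enc, d) array, one row per source token
    """
    Q = decoder_emb @ W_q  # queries come from the decoder, shape (n_dec, d_k)
    K = encoder_emb @ W_k  # keys come from the encoder, shape (n_enc, d_k)
    V = encoder_emb @ W_v  # values come from the encoder, shape (n_enc, d_v)
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (n_dec, n_enc)
    weights = softmax(scores, axis=-1)       # each decoder row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 8
encoder_emb = rng.normal(size=(5, d))  # 5 source (English) tokens
decoder_emb = rng.normal(size=(3, d))  # 3 target (French) tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

out, weights = cross_attention(decoder_emb, encoder_emb, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (3, 8) (3, 5)
```

The attention weights form an `n_dec` by `n_enc` matrix rather than a square one. No causal mask is applied in this sketch, because the whole source sentence is available to every target position.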

dl Cross-attention::Queries are computed from decoder embeddings, keys and values are from encoder embeddings.