Embed to Control is a 2015 paper that introduces several of the core ideas behind world modeling. Although it has fewer citations than today’s most prominent papers, it has nonetheless been influential among robot learning researchers.

Variational Autoencoders

A cheap way to extract features from images efficiently is to use a convolutional architecture. Thus, we add some convolution blocks in the encoder and some de-convolution (transposed convolution) blocks in the decoder.
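
To make the pairing concrete, here is a minimal sketch of the shape arithmetic only, not of the paper's actual architecture. It uses the standard output-size formulas for a convolution and a transposed convolution to show how strided convolutions shrink an image on the way in and how mirrored transposed convolutions grow it back on the way out. The 64x64 input, the three blocks, and the kernel/stride/padding values are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Spatial output size of a transposed ("de-") convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def trace_sizes(size, blocks, layer_fn):
    """Apply layer_fn block by block and record every intermediate size."""
    sizes = [size]
    for block in blocks:
        size = layer_fn(size, **block)
        sizes.append(size)
    return sizes


# Assumed settings (not taken from the paper): 64x64 images and three
# blocks, each with kernel 4, stride 2, padding 1.
IMAGE_SIZE = 64
BLOCKS = [dict(kernel=4, stride=2, padding=1)] * 3

# Encoder: each strided convolution halves the spatial size.
encoder_sizes = trace_sizes(IMAGE_SIZE, BLOCKS, conv_out)

# Decoder: each transposed convolution doubles it again.
decoder_sizes = trace_sizes(encoder_sizes[-1], BLOCKS, deconv_out)

print("encoder:", encoder_sizes)
print("decoder:", decoder_sizes)
```

With these assumed settings the encoder goes 64, 32, 16, 8 and the decoder goes 8, 16, 32, 64, so the reconstruction ends up the same size as the input. In a real model the small feature map at the bottleneck would be flattened and mapped to the mean and variance of the latent distribution before decoding.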