next up: spatiotemporal autoencoder for videos

I would like to implement something like this: https://arxiv.org/pdf/1511.06309.pdf

Essentially, the idea is to nest a temporal autoencoder (probably using LSTMs) inside the spatial autoencoder, in order to warp the encoding to predict the NEXT frame in a video sequence.