Conditional Language Models are not limited to Text Summarization and Machine Translation. They can also be used for Image Captioning!

Here is a great example from Machine Learning Mastery of how to connect the feature-extraction component of a state-of-the-art Computer Vision model (e.g., VGG, ResNet, Inception, or Xception) to the input of a Language Model in order to generate a caption for an image.
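
As a rough illustration, here is a minimal Keras sketch of this kind of "merge" architecture, in the spirit of that tutorial: a pre-trained VGG16 with its classifier head removed encodes the image, and its feature vector is merged with an LSTM encoding of the partial caption to predict the next word. The `vocab_size` and `max_length` values below are placeholders you would derive from your own dataset and tokenizer.

```python
from tensorflow.keras.applications.vgg16 import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add

# Placeholder sizes: adjust to your dataset and tokenizer.
vocab_size = 10000   # number of words in the caption vocabulary
max_length = 34      # longest caption length (in tokens)

# 1) Feature extractor: pre-trained VGG16 with the classifier head removed.
#    The 4096-d output of the penultimate layer becomes the image encoding.
base = VGG16()
feature_extractor = Model(inputs=base.inputs, outputs=base.layers[-2].output)

# 2) Caption model: the image features and a partial caption are merged,
#    and the decoder predicts the next word of the caption.
inputs1 = Input(shape=(4096,))             # image feature vector
fe1 = Dropout(0.5)(inputs1)
fe2 = Dense(256, activation='relu')(fe1)

inputs2 = Input(shape=(max_length,))       # partial caption (word indices)
se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
se2 = Dropout(0.5)(se1)
se3 = LSTM(256)(se2)

decoder1 = add([fe2, se3])                 # merge image and text encodings
decoder2 = Dense(256, activation='relu')(decoder1)
outputs = Dense(vocab_size, activation='softmax')(decoder2)

model = Model(inputs=[inputs1, inputs2], outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam')
```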

The whole deep learning architecture can be trained end-to-end. It is a simple encoder-decoder architecture, but it can be extended and improved with an attention interface between the encoder and decoder, or even with Transformer layers!
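
For instance, a Bahdanau-style (additive) attention interface between the encoder and decoder might look like the sketch below. It assumes the encoder outputs a grid of spatial features (e.g., the 8x8x2048 map from Inception-V3, flattened to 64 locations) rather than a single vector, so the decoder can weight image regions differently at each step.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention over spatial image features.

    Assumes `features` is a grid of encoder outputs with shape
    (batch, num_locations, feature_dim), e.g. (batch, 64, 2048).
    """
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder state
        self.V = tf.keras.layers.Dense(1)       # scores each image location

    def call(self, features, hidden):
        # hidden: (batch, decoder_units) -> add a time axis so it
        # broadcasts across all image locations.
        hidden_with_time = tf.expand_dims(hidden, 1)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time)))
        weights = tf.nn.softmax(scores, axis=1)              # one weight per location
        context = tf.reduce_sum(weights * features, axis=1)  # weighted sum of features
        return context, weights
```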

Adding attention not only enables the model to attend differently to various parts of the input image, but also helps explain its decisions: for each word in the generated caption, we can visualize which part of the input image the model attended to.
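
As a hypothetical illustration, if we keep the attention weights returned by the layer above for each decoding step, we can reshape each word's weights back into the spatial grid and overlay them on the image:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_attention(image, words, attention_weights):
    """Overlay each generated word's attention map on the input image.

    Assumes `attention_weights` has shape (num_words, 64), i.e. one
    weight per 8x8 spatial location, as produced by the layer above.
    """
    fig = plt.figure(figsize=(10, 10))
    for i, word in enumerate(words):
        attn = np.resize(attention_weights[i], (8, 8))
        ax = fig.add_subplot((len(words) + 1) // 2, 2, i + 1)
        ax.set_title(word)
        ax.imshow(image)
        # Stretch the 8x8 attention map over the full image.
        ax.imshow(attn, cmap='gray', alpha=0.6,
                  extent=(0, image.shape[1], image.shape[0], 0))
        ax.axis('off')
    plt.tight_layout()
    plt.show()
```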