Paragraph-length image captioning using hierarchical recurrent neural networks

Show full item record

Title: Paragraph-length image captioning using hierarchical recurrent neural networks
Author: Polis, Arturs
Contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2019
Language: eng
Thesis level: master's thesis
Degree program: Datatieteen maisteriohjelma
Master's Programme in Data Science
Magisterprogrammet i data science
Specialisation: ei opintosuuntaa
no specialization
ingen studieinriktning
Discipline: none
Abstract: Recently, a neural network based approach to automatic generation of image descriptions has become popular. Originally introduced as neural image captioning, it refers to a family of models where several neural network components are connected end-to-end to infer the most likely caption given an input image. Neural image captioning models usually comprise a Convolutional Neural Network (CNN) based image encoder and a Recurrent Neural Network (RNN) language model for generating image captions based on the output of the CNN. Generating long image captions – commonly referred to as paragraph captions – is more challenging than producing shorter, sentence-length captions. When generating paragraph captions, the model has more degrees of freedom, due to a larger total number of combinations of possible sentences that can be produced. In this thesis, we describe a combination of two approaches to improve paragraph captioning: using a hierarchical RNN model that adds a top-level RNN to keep track of the sentence context, and using richer visual features obtained from dense captioning networks. In addition to the standard MS-COCO Captions dataset used for image captioning, we also utilize the Stanford-Paragraph dataset specifically designed for paragraph captioning. This thesis describes experiments performed on three variants of RNNs for generating paragraph captions. The flat model uses a non-hierarchical RNN, the hierarchical model implements a two-level, hierarchical RNN, and the hierarchical-coherent model improves the hierarchical model by optimizing the coherence between sentences. In the experiments, the flat model outperforms the published non-hierarchical baseline and reaches similar results to our hierarchical model. The hierarchical model performs similarly to the corresponding published model, thus validating it. The hierarchical-coherent model gives us inconclusive results – it outperforms our hierarchical model but does not reach the same scores as the corresponding published model. With our flat model implementation, we have shown that with minor improvements to a simple image captioning model, one can obtain much higher scores on standard metrics than previously reported. However, it is yet unclear whether a hierarchical RNN is required to model the paragraph captions, or whether a single RNN layer on its own can be powerful enough. Our initial human evaluation indicates that the captions produced by a hierarchical RNN may in fact be more fluent, however the standard automatic evaluation metrics do not capture this.

Files in this item

Total number of downloads: Loading...

Files Size Format View
arturs_polis_thesis_final.pdf 6.930Mb PDF View/Open

This item appears in the following Collection(s)

Show full item record