# image_captioning

**Repository Path**: srwpf/image_captioning

## Basic Information

- **Project Name**: image_captioning
- **Description**: Tensorflow implementation of "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention"
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2019-10-21
- **Last Updated**: 2020-12-19

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

### Introduction

This neural system for image captioning is roughly based on the paper "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention" by Xu et al. (ICML 2015). The input is an image, and the output is a sentence describing its content. A convolutional neural network extracts visual features from the image, and an LSTM recurrent neural network decodes these features into a sentence. A soft attention mechanism is incorporated to improve the quality of the captions. The project is implemented with the Tensorflow library and allows end-to-end training of both the CNN and RNN parts.
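At each decoding step, the attention mechanism scores every spatial region of the CNN feature map against the current LSTM state and feeds the resulting weighted context vector back into the decoder. The snippet below is a minimal sketch of such a soft-attention step, written against modern TensorFlow (`tf.keras`) rather than this repository's code; the class and variable names are illustrative assumptions.

```python
import tensorflow as tf

class SoftAttention(tf.keras.layers.Layer):
    """Bahdanau-style soft attention over CNN region features (illustrative sketch)."""

    def __init__(self, attn_dim):
        super().__init__()
        self.w_feat = tf.keras.layers.Dense(attn_dim)  # projects each region feature
        self.w_hid = tf.keras.layers.Dense(attn_dim)   # projects the LSTM hidden state
        self.score = tf.keras.layers.Dense(1)          # scalar relevance score per region

    def call(self, features, hidden):
        # features: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        hidden_exp = tf.expand_dims(hidden, 1)                     # (batch, 1, hidden_dim)
        scores = self.score(tf.nn.tanh(self.w_feat(features) + self.w_hid(hidden_exp)))
        alpha = tf.nn.softmax(scores, axis=1)                      # (batch, regions, 1)
        context = tf.reduce_sum(alpha * features, axis=1)          # (batch, feat_dim)
        return context, tf.squeeze(alpha, -1)

# Example with a 14x14 VGG16-style feature map flattened to 196 regions of 512 channels.
features = tf.random.normal([2, 196, 512])
hidden = tf.random.normal([2, 512])
context, alpha = SoftAttention(attn_dim=256)(features, hidden)
print(context.shape, alpha.shape)  # (2, 512) (2, 196)
```

In the paper's formulation, the context vector is fed into the LSTM together with the previous word embedding at every step, and the attention weights `alpha` can be visualized to see which image regions drove each generated word.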
### Prerequisites

* **Tensorflow** ([instructions](https://www.tensorflow.org/install/))
* **NumPy** ([instructions](https://scipy.org/install.html))
* **OpenCV** ([instructions](https://pypi.python.org/pypi/opencv-python))
* **Natural Language Toolkit (NLTK)** ([instructions](http://www.nltk.org/install.html))
* **Pandas** ([instructions](https://scipy.org/install.html))
* **Matplotlib** ([instructions](https://scipy.org/install.html))
* **tqdm** ([instructions](https://pypi.python.org/pypi/tqdm))

### Usage

* **Preparation:** Download the COCO train2014 and val2014 data [here](http://cocodataset.org/#download). Put the COCO train2014 images in the folder `train/images`, and put the file `captions_train2014.json` in the folder `train`. Similarly, put the COCO val2014 images in the folder `val/images`, and put the file `captions_val2014.json` in the folder `val`. Furthermore, download the pretrained VGG16 net [here](https://app.box.com/s/idt5khauxsamcg3y69jz13w6sc6122ph) or ResNet50 net [here](https://app.box.com/s/17vthb1zl0zeh340m4gaw0luuf2vscne) if you want to use it to initialize the CNN part.

* **Training:** To train a model on the COCO train2014 data, first set up the various parameters in the file `config.py` and then run a command like this:

```shell
python main.py --phase=train \
    --load_cnn \
    --cnn_model_file='./vgg16_no_fc.npy' \
    [--train_cnn]
```

Turn on `--train_cnn` if you want to jointly train the CNN and RNN parts. Otherwise, only the RNN part is trained. The checkpoints will be saved in the folder `models`. If you want to resume training from a checkpoint, run a command like this:

```shell
python main.py --phase=train \
    --load \
    --model_file='./models/xxxxxx.npy' \
    [--train_cnn]
```

To monitor the progress of training, run the following command:

```shell
tensorboard --logdir='./summary/'
```

* **Evaluation:** To evaluate a trained model on the COCO val2014 data, run a command like this:

```shell
python main.py --phase=eval \
    --model_file='./models/xxxxxx.npy' \
    --beam_size=3
```

The results will be printed to stdout, and the generated captions will be saved in the file `val/results.json`.

* **Inference:** You can use the trained model to generate captions for any JPEG images! Put such images in the folder `test/images`, and run a command like this:

```shell
python main.py --phase=test \
    --model_file='./models/xxxxxx.npy' \
    --beam_size=3
```

The generated captions will be saved in the folder `test/results`.

### Results

A pretrained model with the default configuration can be downloaded [here](https://app.box.com/s/xuigzzaqfbpnf76t295h109ey9po5t8p). This model was trained solely on the COCO train2014 data. It achieves the following BLEU scores on the COCO val2014 data (with `beam_size=3`); a sketch of how such corpus-level BLEU scores can be computed with NLTK appears at the end of this README:

* **BLEU-1 = 70.3%**
* **BLEU-2 = 53.6%**
* **BLEU-3 = 39.8%**
* **BLEU-4 = 29.5%**

Here are some captions generated by this model:

![examples](examples/examples.jpg)

### References

* [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044). Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio. ICML 2015.
* [The original implementation in Theano](https://github.com/kelvinxu/arctic-captions)
* [An earlier implementation in Tensorflow](https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow)
* [Microsoft COCO dataset](http://mscoco.org/)
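### Computing BLEU with NLTK (sketch)

The BLEU numbers in the Results section are corpus-level scores over the val2014 captions. As a rough cross-check, they can be recomputed from `val/results.json` with NLTK's `corpus_bleu` (NLTK is already a prerequisite). The snippet below is only a sketch: it assumes the results file follows the standard COCO caption-results format (`[{"image_id": ..., "caption": ...}, ...]`) and that the val annotations sit at `val/captions_val2014.json`; the repository's own evaluation may tokenize and smooth differently, so the numbers will not match exactly.

```python
import json

from nltk.tokenize import word_tokenize            # requires nltk.download('punkt')
from nltk.translate.bleu_score import corpus_bleu

# Reference captions from the COCO val2014 annotations (path as described above).
with open('val/captions_val2014.json') as f:
    annotations = json.load(f)['annotations']

# Generated captions; assumed to follow the COCO results format:
# [{"image_id": 42, "caption": "a cat sitting on a couch"}, ...]
with open('val/results.json') as f:
    results = json.load(f)

# Group the (usually five) reference captions per image id.
references = {}
for ann in annotations:
    references.setdefault(ann['image_id'], []).append(word_tokenize(ann['caption'].lower()))

list_of_references, hypotheses = [], []
for res in results:
    list_of_references.append(references[res['image_id']])
    hypotheses.append(word_tokenize(res['caption'].lower()))

# corpus_bleu defaults to BLEU-4; adjust `weights` for BLEU-1/2/3.
bleu4 = corpus_bleu(list_of_references, hypotheses)
bleu1 = corpus_bleu(list_of_references, hypotheses, weights=(1.0, 0.0, 0.0, 0.0))
print('BLEU-1 = {:.1%}, BLEU-4 = {:.1%}'.format(bleu1, bleu4))
```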