# chinese_im2text.pytorch

**Repository Path**: knifecms/chinese_im2text.pytorch

## Basic Information

- **Project Name**: chinese_im2text.pytorch
- **Description**: PyTorch implementation of Chinese image captioning on the AI Challenger dataset
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 1
- **Forks**: 0
- **Created**: 2020-08-07
- **Last Updated**: 2022-05-24

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# chinese_im2text.pytorch

This project is based on ruotian's [neuraltalk2.pytorch](https://github.com/ruotianluo/neuraltalk2.pytorch).

## Requirements

### Software environment

Python 2.7 and Python 3 ([coco-caption](https://github.com/tylin/coco-caption)), and PyTorch 0.2 (along with torchvision).

### Dataset

You need to download a pretrained ResNet model for both training and evaluation. You also need to register for AI Challenger and then download its training and validation datasets.

## Pretrained models

TODO

## Train your own network on AI Challenger

### Download the AI Challenger dataset and preprocess it

First, download the Chinese image caption dataset (图像中文描述数据库) from [link](https://challenger.ai/datasets), or from Baidu Pan:

```
https://pan.baidu.com/s/1zG-qwf8otow-QZk7XsrERw
code: 1s4q
```

We need the training images (210,000) and validation images (30,000). Put `ai_challenger_caption_train_20170902/` and the corresponding validation directory in the same directory, denoted `$IMAGE_ROOT`.

Once we have these, we can invoke the `json_preprocess.py` and `prepro_ai_challenger.py` scripts, which read all of this in and create the dataset (two feature folders, an HDF5 label file, and a JSON file):

```bash
$ python scripts/json_preprocess.py
$ python -m scripts.prepro_ai_challenger
```

`json_preprocess.py` first converts the AI Challenger image-caption JSON into MSCOCO JSON format. It then maps all words that occur <= 5 times to a special `UNK` token and builds a vocabulary from the remaining words. The image information and vocabulary are dumped into `coco_ai_challenger_raw.json`.

`prepro_ai_challenger.py` extracts the ResNet-101 features (both the fc feature and the last conv feature) of each image. The features are saved in `coco_ai_challenger_talk_fc.h5` and `coco_ai_challenger_talk_att.h5`; the resulting files are about 359GB.

### Start training

The following training procedure is adapted from ruotian's project; if you want a REINFORCE-based approach, you can clone it from [here](https://github.com/ruotianluo/self-critical.pytorch). Since AI Challenger provides a large validation set, you can set `--val_images_use` to a bigger value.

```bash
$ python train.py --id st --caption_model show_tell \
    --input_json data/cocotalk.json \
    --input_fc_h5 data/coco_ai_challenger_talk_fc.h5 \
    --input_att_h5 data/coco_ai_challenger_talk_att.h5 \
    --input_label_h5 data/coco_ai_challenger_talk_label.h5 \
    --batch_size 10 --learning_rate 5e-4 \
    --learning_rate_decay_start 0 --scheduled_sampling_start 0 \
    --checkpoint_path log_st --save_checkpoint_every 6000 \
    --val_images_use 5000 --max_epochs 25
```

The train script dumps checkpoints into the folder specified by `--checkpoint_path` (default = `save/`). Only the best-performing checkpoint on validation and the latest checkpoint are kept, to save disk space. To resume training, set the `--start_from` option to the path containing the saved `infos.pkl` and `model.pth` (usually you can just set `--start_from` and `--checkpoint_path` to be the same).
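For example, resuming the run shown above might look like the following. This is only an illustrative sketch, not part of the original instructions; it reuses the same flags and assumes the checkpoints were written to `log_st` as in the command above:

```bash
$ python train.py --id st --caption_model show_tell \
    --input_json data/cocotalk.json \
    --input_fc_h5 data/coco_ai_challenger_talk_fc.h5 \
    --input_att_h5 data/coco_ai_challenger_talk_att.h5 \
    --input_label_h5 data/coco_ai_challenger_talk_label.h5 \
    --batch_size 10 --learning_rate 5e-4 \
    --learning_rate_decay_start 0 --scheduled_sampling_start 0 \
    --start_from log_st --checkpoint_path log_st \
    --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 25
```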
If you have TensorFlow installed, the loss histories are automatically dumped into `--checkpoint_path` and can be visualized with TensorBoard.

The command above uses scheduled sampling; set `--scheduled_sampling_start` to -1 to turn scheduled sampling off.

If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross-entropy loss, use the `--language_eval 1` option, but don't forget to download the [coco-caption code](https://github.com/tylin/coco-caption) into the `coco-caption` directory. For more options, see `opts.py`.

I am training this model with Stack-Captioning, and the training loss is as follows:

![](./vis/training_log_mine.png)

Some predicted descriptions are shown below (in "image xxx", xxx is the image ID):

```bash
Beam size: 5, image 3750: 球场上有一个穿着运动服的男人在说话
Beam size: 5, image 3751: 一个右手拿着包的男人在T台场上走秀
Beam size: 5, image 3752: 一个穿着西装的男人和一个穿着西装的男人站在室外
Beam size: 5, image 3753: 道路上手有一个穿着着包的女人和一个男人
Beam size: 5, image 3754: 两个穿着运动服的男人在运动场上奔跑
Beam size: 5, image 3755: 房间里有一个穿着短袖的男人在给一个穿着短袖的孩子
Beam size: 5, image 3756: 舞台上手拿着话筒的男人在舞台上唱歌
Beam size: 5, image 3757: 两个穿着运动服的男人在球场上踢足球
Beam size: 5, image 3758: 室外有着长起的女人坐在椅子上
Beam size: 5, image 3759: 两个穿着球衣的男人在运动场上打篮球
Beam size: 5, image 3760: 两个穿着球衣的男人在球在球场上奔跑
Beam size: 5, image 3761: 一个右手拿着包的女人站在道路上
Beam size: 5, image 3762: 一个穿着裙子的女人和一个穿着裙子的女人走在道路上
Beam size: 5, image 3763: 宽敞油的球服的男人旁有一个穿着球
Beam size: 5, image 3764: 一个穿着裙子的女人站在广告牌前
Beam size: 5, image 3765: 球场上有两个穿着球衣的男人在打篮球
Beam size: 5, image 3766: 室外有一个人前有一个戴着帽子的男人在给一个孩子
Beam size: 5, image 3767: 一个穿着西装的男人和一个双手站在道路上
Beam size: 5, image 3768: 一个右手拿着话筒的男人站在广告牌前的红毯上
Beam size: 5, image 3769: 球场上的球场上有一个穿动场上踢足球
Beam size: 5, image 3770: 两个人旁有一个人旁边有一个戴着帽子的孩的男人在道
Beam size: 5, image 3771: 一个穿着裙子的女人站在广告牌前
Beam size: 5, image 3772: 运动场上拿着球拍的女人在打羽毛球
Beam size: 5, image 3773: 广告牌前有一个机的男人旁有一个双手拿话筒的男人在说话
Beam size: 5, image 3774: 一个人旁有一个穿着裙子的女人坐在室内的椅子上
Beam size: 5, image 3775: 一个右手拿着手机的女人在道路上
Beam size: 5, image 3776: 一个戴着帽子的男人在室内
Beam size: 5, image 3777: 道路上有一个右手拿着包的女人在走秀
Beam size: 5, image 3778: 一个戴着墨镜的女人走在道路上
Beam size: 5, image 3779: 一个双手拿着话筒的女人站在广告牌前
Beam size: 5, image 3780: 一个戴着墨镜的女人走在大厅里
Beam size: 5, image 3781: 室内一个人旁有一个右手拿着笔的男人在下围棋
Beam size: 5, image 3782: 一个右手拿着东西的女人站在道路上
Beam size: 5, image 3783: 屋子里有一个右手拿着手机的女人坐在椅子上
Beam size: 5, image 3784: 一个右手拿着话筒的女人和一个穿着裙子的女人站在舞台上
Beam size: 5, image 3785: 一个戴着墨镜的女人和一个戴着墨镜的女人坐在船上
Beam size: 5, image 3786: 绿油的草地上有两个戴着帽子的男人在草地上
Beam size: 5, image 3787: 球场上有一个右手拿着球拍的男人在打羽毛球
Beam size: 5, image 3788: 两个穿着球衣的男人走在球场上奔跑
Beam size: 5, image 3789: 球场上有两个穿着球衣的运动员在打排球
Beam size: 5, image 3790: 一个穿着裙子的女人坐在室内的沙发上
Beam size: 5, image 3791: 大厅里有一个人旁有一起的男人坐在室
Beam size: 5, image 3792: 舞台上两个的舞着旁有一个男人
Beam size: 5, image 3793: 一个穿着裙子的女人站在大厅内
Beam size: 5, image 3794: 球场上的球的男人旁有一个穿着球衣的男人在踢足球
Beam size: 5, image 3795: 两个穿着球衣穿着球衣的男人在运动场上争抢足球
Beam size: 5, image 3796: 运动场的前面上有着运动服的男人和一个
Beam size: 5, image 3797: 一个穿着裙子的衣服的女人站在道路上
Beam size: 5, image 3798: 室内有一个左腿上坐在一个坐发上
Beam size: 5, image 3799: 屋子里有一个右手拿着衣服的男人在下围棋
```

## Generate image captions

### Evaluate on raw images

Place all your images of interest into a folder, e.g. `blah`, and run the eval script:

```bash
$ python eval.py --model model.pth --infos_path infos.pkl --image_folder blah --num_images 10
```

This tells the eval script to run on up to 10 images from the given folder. If you have a big GPU, you can speed up the evaluation by increasing `batch_size`. Use `--num_images -1` to process all images.
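For instance, captioning every image in the folder with a larger batch might look like this (illustrative flag values only; tune `--batch_size` to your GPU memory):

```bash
$ python eval.py --model model.pth --infos_path infos.pkl --image_folder blah --num_images -1 --batch_size 32
```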
The eval script creates a `vis.json` file inside the `vis` folder, which can then be visualized with the provided HTML interface:

```bash
$ cd vis
$ python -m SimpleHTTPServer
```

(With Python 3, the equivalent is `python -m http.server 8000`.) Now visit `localhost:8000` in your browser and you should see your predicted captions.

### Evaluate on the validation split

```bash
$ python eval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1
```

The default split to evaluate is `test`. The default inference method is greedy decoding (`--sample_max 1`); to sample from the posterior instead, set `--sample_max 0`.

**Beam Search**. Beam search can improve performance over greedy decoding by roughly 5%, but it is somewhat more expensive. To turn on beam search, use `--beam_size N` with N greater than 1 (we set the beam size to 5 in our evaluation); an example invocation is given at the end of this README.

## Acknowledgements

Thanks to the original [neuraltalk2](https://github.com/karpathy/neuraltalk2), the PyTorch-based [neuraltalk2.pytorch](https://github.com/ruotianluo/neuraltalk2.pytorch), and the awesome PyTorch team.

## Paper

1. Jiuxiang Gu, Gang Wang, Jianfei Cai, and Tsuhan Chen. ["An Empirical Study of Language CNN for Image Captioning."](https://arxiv.org/pdf/1612.07086.pdf) ICCV, 2017.

```
@article{gu2016recurrent,
  title={An Empirical Study of Language CNN for Image Captioning},
  author={Gu, Jiuxiang and Wang, Gang and Cai, Jianfei and Chen, Tsuhan},
  journal={ICCV},
  year={2017}
}
```

2. Jiuxiang Gu, Jianfei Cai, Gang Wang, and Tsuhan Chen. ["Stack-Captioning: Coarse-to-Fine Learning for Image Captioning."](https://arxiv.org/abs/1709.03376) arXiv preprint arXiv:1709.03376 (2017).

```
@article{gu2017stack_cap,
  title={Stack-Captioning: Coarse-to-Fine Learning for Image Captioning},
  author={Gu, Jiuxiang and Cai, Jianfei and Wang, Gang and Chen, Tsuhan},
  journal={arXiv preprint arXiv:1709.03376},
  year={2017}
}
```
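The Beam Search note above refers to this example. A beam-search evaluation on the validation split might be invoked as follows (an illustrative command assembled from the flags shown earlier; `--beam_size 5` matches the setting used for the sample captions above):

```bash
$ python eval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1 --beam_size 5
```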