Document Type: Original/Review Paper

Authors

1 Computer Engineering Department, Yazd University, Yazd, Iran

2 Computer Science Department, University of Copenhagen, Copenhagen, Denmark

3 Computer Engineering Department, Yazd University, Yazd, Iran

10.22044/jadm.2021.10837.2224

Abstract

This research concerns the development of automatic text-to-image generation. Two main goals are pursued: first, the generated image should look as realistic as possible; second, the generated image should be a meaningful depiction of the input text. Our proposed method is a Multi Sentences Hierarchical GAN (MSH-GAN) for text-to-image generation. We follow two main strategies: 1) produce a higher-quality image in the first step, and 2) use two additional descriptions to refine the initial image in subsequent steps. Our goal is to exploit more information, namely input text consisting of more than one sentence, to generate higher-resolution images. We propose several models based on GANs and memory networks. We also use a more challenging dataset called ids-ade; this is the first time this dataset has been used in this area. We evaluate our models with the IS, FID, and R-precision metrics. Experimental results demonstrate that our best model performs favorably against state-of-the-art baselines such as StackGAN and AttnGAN.
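As background for the FID metric mentioned above [16], the following is a minimal numpy sketch of the Fréchet distance between two Gaussians, which is the quantity FID computes over Inception-v3 feature statistics of real and generated images. Function names here are illustrative and not taken from the paper's code; in practice the means and covariances would be estimated from Inception activations.

```python
import numpy as np

def sqrtm_psd(a):
    # Matrix square root of a symmetric positive semi-definite matrix
    # via eigendecomposition (eigenvalues clipped at zero for stability).
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2):
    # FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)),
    # using Tr((S1 S2)^(1/2)) = Tr((S1^(1/2) S2 S1^(1/2))^(1/2))
    # so that only symmetric PSD matrices need a square root.
    s1_half = sqrtm_psd(sigma1)
    covmean = sqrtm_psd(s1_half @ sigma2 @ s1_half)
    diff = mu1 - mu2
    return float(diff @ diff
                 + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * np.trace(covmean))
```

A lower FID indicates that the generated feature distribution is closer to the real one; identical statistics give a score of zero.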

Keywords

References

[1] Z. Zhang, Y. Xie, and L. Yang, “Photographic text-to-image synthesis with a hierarchically-nested adversarial network”, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, pp. 6199-6208, 2018.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial networks”, in Advances in neural information processing systems, pp. 2672-2680, 2014.
[3] Y. Li, Y. Chen, and Y. Shi, “Brain tumor segmentation using 3D generative adversarial networks”, International Journal of Pattern Recognition and Artificial Intelligence, p. 2157002, 2020.
[4] Y. Li, Z. He, Y. Zhang, and Z. Yang, “High-quality many-to-many voice conversion using transitive star generative adversarial networks with adaptive instance normalization”, Journal of Circuits, Systems and Computers, 2020.
[5] A. Fakhari and K. Kiani, “An image restoration architecture using abstract features and generative models”, Journal of AI and Data Mining, Vol. 9, No. 1, pp. 129-139, 2021.
[6] M.M. Haji-Esmaeili and G. Montazer, “Automatic coloring of grayscale images using generative adversarial networks”, Journal of Signal and Data Processing (JSDP), Vol. 16, No. 1, pp. 57-74, 2019.
[7] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis”, arXiv preprint arXiv:1605.05396, 2016.
[8] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X.Wang, and D. Metaxas, “Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks”, in Proc. of the IEEE int. conference on computer vision, pp. 5907-5915, 2017.
[9] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, “Stackgan++: Realistic image synthesis with stacked generative adversarial networks”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, pp. 1947-1962, 2018.
[10] K.J. Joseph, A. Pal, S. Rajanala, and V.N. Balasubramanian, “C4synth: Cross-caption cycle-consistent text-to-image synthesis”, in IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 358-366, 2019.
[11] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “Attngan: Fine-grained text to image generation with attentional generative adversarial networks”, in Proc. of the IEEE conf. on computer vision and pattern recognition, pp. 1316-1324, 2018.
[12] M. Zhu, P. Pan, W. Chen, and Y. Yang, “Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802-5810, 2019.
[13] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The Caltech-UCSD Birds-200-2011 Dataset”, 2011.
[14] N. Ilinykh, S. Zarrieß, and D. Schlangen, “Tell me more: A dataset of visual scene description sequences”, in Proc. of the 12th International Conference on Natural Language Generation, pp. 152-157, 2019.
[15] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans”, in Advances in Neural Information Processing Systems (NIPS), pp. 2234-2242, 2016.
[16] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium”, arXiv preprint arXiv:1706.08500, 2017.
[17] P. Zhou, W. Shi, J. Tian, Z. Qi, B. Li, H. Hao, and B. Xu, “Attention-based bidirectional long short-term memory networks for relation classification”, in Proceedings of the 54th annual meeting of the association for computational linguistics, pp. 207-212, 2016.
[18] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision”, in Proc. of the IEEE conf. on computer vision and pattern recognition, pp. 2818-2826, 2016.
[19] C. Gulcehre, S. Chandar, K. Cho, and Y. Bengio, “Dynamic neural Turing machine with continuous and discrete addressing schemes”, Neural Computation, Vol. 30, pp. 857-884, 2018.
[20] A. Miller, A. Fisch, J. Dodge, A. H. Karimi, A. Bordes, and J. Weston, “Key-value memory networks for directly reading documents”, in Proc. of Empirical Methods in Natural Language Processing (EMNLP), 2016.
[21] X. Yan, J. Yang, K. Sohn, and H. Lee, “Attribute2image: Conditional image generation from visual attributes”, in European Conf. on Computer Vision, pp. 776-791, 2016.
[22] X. Zhu, A.B. Goldberg, M. Eldawy, C.R. Dyer, and B. Strock, “A text-to-picture synthesis system for augmenting communication”, in Proc. of the AAAI Conference on Artificial Intelligence (AAAI), pp. 1590-1595, 2007.
[23] A. Dash, J.C.B. Gamboa, S. Ahmed, M. Liwicki, and M. Z. Afzal, “Tac-gan-text conditioned auxiliary classifier generative adversarial network”, arXiv preprint arXiv:1703.06412, 2017.
[24] J. Y. Koh, J. Baldridge, H. Lee, and Y. Yang, “Text-to-image generation grounded by fine-grained user attention”, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 237-246, 2021.
[25] T. Baltrusaitis, C. Ahuja, and L. P. Morency, “Multi-modal machine learning: A survey and taxonomy”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41, pp. 423-443, 2018.
[26] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, and J. Gao, “Object-driven text-to-image synthesis via adversarial training”, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12174-12182, 2019.
[27] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, “Semantics disentangling for text-to-image generation”, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 2327-2336, 2019.