Generating Accurate Human Face Sketches from Text Descriptions

Shorya Sharma (1)
(1) Indian Institute of Technology Bhubaneswar, Bhubaneswar, Odisha, 751013, India
How to cite (IJASEIT):
Sharma, S. (2024). Generating Accurate Human Face Sketches from Text Descriptions. International Journal of Advanced Science Computing and Engineering, 6(1), 20–26. https://doi.org/10.62527/ijasce.6.1.195

Drawing a suspect's face based solely on eyewitness descriptions is a difficult task. Several state-of-the-art methods exist for generating images from text, but little research addresses generating face images from text, and almost none addresses generating sketches from text. As a result, no dataset is available for this task. We developed a text-to-sketch dataset derived from the CelebA dataset, which comprises more than 200,000 celebrity images, thereby enabling investigation of the novel task of generating police sketches from textual descriptions. Furthermore, we demonstrated that applying AttnGAN to sketch generation effectively captures the facial features described in the text. We identified the optimal configuration for AttnGAN and its variants through experiments with different recurrent neural network types and embedding sizes. We reported commonly used metrics, such as the Inception Score and the Fréchet Inception Distance (FID), for the attention-based state-of-the-art model we obtained. However, we also identified areas for improvement in the model's application. Experiments conducted with a new dataset of 200 sketch images from Beijing Normal University revealed that the model struggles with longer sentences and unfamiliar terms in the descriptions. This limitation in capturing features from such text reduces image diversity and realism, adversely impacting the overall performance of the model. Future work could explore alternative models such as StackGAN, Conditional GAN, DCGAN, and StyleGAN, which are known for their capabilities in face image generation. Simplifying the architecture while maintaining performance could also help deploy the model on mobile devices for real-world use.
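The abstract does not state how the CelebA photographs were converted into sketch images for the derived dataset. The following is a minimal, illustrative sketch of one way such a conversion could be done, assuming a classical pencil-sketch filter implemented with OpenCV; the directory names, the blur kernel size, and the photo_to_sketch helper are hypothetical and serve only to make the preprocessing step concrete.

import os
import cv2

def photo_to_sketch(image_path, blur_ksize=21):
    # Hypothetical helper: render a face photo as a grayscale pencil sketch
    # using the classical "colour dodge" blend of the grayscale image with
    # its blurred inversion.
    img = cv2.imread(image_path)
    if img is None:
        raise ValueError(f"Could not read {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    inverted = cv2.bitwise_not(gray)
    blurred = cv2.GaussianBlur(inverted, (blur_ksize, blur_ksize), 0)
    # Dodge blend: the result is bright wherever gray is close to (255 - blurred),
    # leaving dark strokes along edges, which reads as a pencil sketch.
    sketch = cv2.divide(gray, 255 - blurred, scale=256)
    return sketch

if __name__ == "__main__":
    # Illustrative paths; CelebA's aligned images are usually unpacked
    # into img_align_celeba/.
    src_dir, dst_dir = "celeba/img_align_celeba", "celeba_sketches"
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir)):
        sketch = photo_to_sketch(os.path.join(src_dir, name))
        cv2.imwrite(os.path.join(dst_dir, name), sketch)

A learned photo-to-sketch translator such as CycleGAN could replace this classical filter if higher-fidelity sketches are needed, at the cost of training an additional model.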

X. Chen, L. Qing, X. He, X. Luo, and Y. Xu, “FTGAN: A fully-trained generative adversarial networks for text to face generation,” arXiv preprint arXiv:1904.05729, 2019.

W. Zhang, X. Wang, and X. Tang, “Coupled information-theoretic encoding for face photo-sketch recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2011, doi:10.1109/CVPR.2011.5995324.

Y. Wang et al., “Text2Sketch: Learning face sketch from facial attribute text,” in Proc. IEEE Int. Conf. Image Process. (ICIP), Oct. 2018, doi:10.1109/ICIP.2018.8451236.

X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, 2009, doi:10.1109/TPAMI.2008.222.

E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Generating images from captions with attention,” arXiv preprint arXiv:1511.02793, 2016.

S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016.

O. R. Nasir, S. K. Jha, and M. S. Grover, “Text2FaceGAN: Face generation from fine-grained textual descriptions,” in Proc. IEEE Int. Conf. Multimedia Big Data (BigMM), 2019, doi:10.1109/BigMM.2019.00-42.

T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, “AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks,” arXiv preprint arXiv:1711.10485, 2017.

T. Wang, T. Zhang, and B. C. Lovell, “Faces à la Carte: Text-to-face generation via attribute disentanglement,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), 2021, doi:10.1109/WACV48630.2021.00342.

A. M. Martinez and R. Benavente, “The AR face database,” CVC Tech. Rep. 24, Jun. 1998.

K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The extended M2VTS database,” in Proc. Int. Conf. Audio- and Video-Based Person Authentication, 1999, pp. 72–77.

K. Shmelkov, C. Schmid, and K. Alahari, “How good is my GAN?,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, doi:10.1007/978-3-030-01216-8_14.

IIIT-D Sketch Database. [Online]. Available: http://www.iab-rubric.org/resources/sketchDatabase.html.

H. Zhang et al., “StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, doi:10.1109/ICCV.2017.629.

Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2015, doi:10.1109/ICCV.2015.425.

J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, doi:10.1109/ICCV.2017.244.