Document Type: Original/Review Paper

Authors

Human-Computer Interaction Lab., Faculty of Electrical and Computer Engineering, Tarbiat Modares University, Tehran, Iran.

Abstract

Every facial expression involves one or more facial action units appearing on the face. Therefore, action unit recognition is commonly used to enhance facial expression detection performance. It is important to identify the subtle changes that appear on the face when particular action units occur. In this paper, we propose an architecture that employs local features extracted from specific regions of the face together with global features taken from the whole face. To this end, we combine the SPPNet and FPN modules into an end-to-end network for facial action unit recognition. First, predefined regions of the face are detected. Next, the SPPNet module captures deformations within each detected region. Since the SPPNet module focuses on each region separately, it cannot account for possible changes in other areas of the face. In parallel, the FPN module extracts global features related to each of the facial regions. By combining the two modules, the proposed architecture captures both local and global facial features and improves the performance of the action unit recognition task. Experimental results on the DISFA dataset demonstrate the effectiveness of our method.
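To make the described local/global fusion concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: the ResNet-18 backbone, the number of predefined face regions, the pyramid pooling levels, the reduction of the FPN side to a single lateral layer, and the twelve output action units are illustrative assumptions.

```python
# Hedged sketch of the SPP + FPN fusion described in the abstract.
# NOT the authors' implementation; backbone, region count, pyramid levels,
# and the 12 output AUs are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class SPPBranch(nn.Module):
    """Spatial pyramid pooling over a cropped face-region feature map (local cues)."""
    def __init__(self, in_channels, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels
        self.out_dim = in_channels * sum(l * l for l in levels)

    def forward(self, region_feat):  # region_feat: (B, C, H, W)
        pooled = [F.adaptive_max_pool2d(region_feat, l).flatten(1) for l in self.levels]
        return torch.cat(pooled, dim=1)  # fixed-length vector per region


class AUNetSketch(nn.Module):
    """Fuses region-wise SPP features with a pooled global (FPN-style) feature."""
    def __init__(self, num_regions=5, num_aus=12):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        self.stem = nn.Sequential(*list(backbone.children())[:-2])  # (B, 512, h, w)
        self.spp = SPPBranch(in_channels=512)
        self.fpn_lateral = nn.Conv2d(512, 256, kernel_size=1)  # stand-in for one FPN level
        fused_dim = num_regions * self.spp.out_dim + 256
        self.classifier = nn.Linear(fused_dim, num_aus)  # multi-label AU logits

    def forward(self, image, region_boxes):
        feat = self.stem(image)  # shared backbone features
        # Local branch: SPP over each predefined face region (boxes in feature-map coords).
        local = []
        for x1, y1, x2, y2 in region_boxes:
            local.append(self.spp(feat[:, :, y1:y2, x1:x2]))
        local = torch.cat(local, dim=1)
        # Global branch: whole-face features pooled from the lateral projection.
        global_feat = F.adaptive_avg_pool2d(self.fpn_lateral(feat), 1).flatten(1)
        return self.classifier(torch.cat([local, global_feat], dim=1))


if __name__ == "__main__":
    model = AUNetSketch()
    img = torch.randn(2, 3, 224, 224)
    boxes = [(0, 0, 3, 3), (3, 0, 7, 3), (0, 3, 3, 7), (3, 3, 7, 7), (2, 2, 5, 5)]
    print(model(img, boxes).shape)  # torch.Size([2, 12])
```

The sketch keeps the two branches separate until the final layer: each region yields a fixed-length SPP vector regardless of its size, and these are concatenated with one pooled whole-face vector before a multi-label classification head, mirroring the abstract's combination of local deformation cues and global context.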

Keywords
