Document Type: Original/Review Paper


1 Department of Computer, Faculty of Engineering, Bozorgmehr University of Qaenat, Qaen, Iran.

2 Faculty of Electrical Engineering, Shahrood University of Technology.


Optimizers are vital components of deep neural networks because they perform the weight updates during training. This paper introduces a new update method for gradient-descent-based optimizers, called whitened gradient descent (WGD). The method is easy to implement, can be used in any optimizer based on the gradient descent algorithm, and does not significantly increase the training time of the network. It smooths the training curve and improves classification metrics. To evaluate the proposed algorithm, we performed 48 different tests on two datasets, CIFAR-100 and Animals-10, using three network structures: DenseNet121, ResNet18, and ResNet50. The experiments show that using WGD in gradient-descent-based optimizers improves the classification results significantly. For example, integrating WGD into the RAdam optimizer increased the accuracy of DenseNet from 87.69% to 90.02% on the Animals-10 dataset.
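The abstract does not state the WGD update rule, so the following is only an illustrative sketch of how a gradient transform might be plugged into a gradient descent step. It assumes that "whitening" means standardizing the gradient to zero mean and unit variance, which may differ from the method in the paper. The names `whiten`, `gd_step`, `transform`, and `eps` are hypothetical.

```python
import math


def whiten(grad, eps=1e-8):
    """Assumed whitening: shift the gradient to zero mean and scale it
    to unit variance. This is a guess at the idea, not the paper's rule."""
    n = len(grad)
    mean = sum(grad) / n
    var = sum((g - mean) ** 2 for g in grad) / n
    std = math.sqrt(var)
    # eps guards against division by zero for a constant gradient
    return [(g - mean) / (std + eps) for g in grad]


def gd_step(weights, grad, lr=0.01, transform=None):
    """One plain gradient descent update. If `transform` is given,
    it preprocesses the gradient before the update is applied."""
    if transform is not None:
        grad = transform(grad)
    return [w - lr * g for w, g in zip(weights, grad)]


weights = [0.5, -1.0, 2.0]
grad = [0.2, 0.4, 0.6]

plain = gd_step(weights, grad, lr=0.1)                        # unmodified update
whitened = gd_step(weights, grad, lr=0.1, transform=whiten)   # with the assumed transform
```

Because the transform only touches the gradient before the update, the same hook could in principle sit in front of momentum-based or adaptive optimizers, which matches the abstract's claim that the method applies to any optimizer based on gradient descent.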

