Automated learning rate search using batch-level cross-validation

Deep learning researchers and practitioners have accumulated a significant amount of experience in training a wide variety of architectures on various datasets. However, given a network architecture and a dataset, obtaining the best model (i.e., the model with the smallest test-set error) while keeping the training cost low is still a challenging task. Hyper-parameters of deep neural networks, especially the learning rate and its (decay) schedule, strongly affect the network's final performance. The general approach is to search for the best learning rate and learning-rate decay parameters within a cross-validation framework, a process that usually requires extensive experimentation and considerable time. In classical cross-validation (CV), a random portion of the dataset is held out to evaluate model performance on unseen data. This procedure is usually repeated multiple times with different random validation sets to decide on the learning rate settings. In this paper, we explore batch-level cross-validation as an alternative to the classical dataset-level, or macro, CV. The advantage of batch-level, or micro, CV methods is that the gradient computed during training is re-used to evaluate several different learning rates. We propose an algorithm based on micro CV and stochastic gradient descent with momentum, which automatically produces a learning rate schedule during training by selecting a learning rate for each epoch. In our algorithm, a random half of the current batch (of examples) is used for training and the other half is used to validate several different step sizes or learning rates. We conducted comprehensive experiments on three datasets (CIFAR10, SVHN and Adience) using three different network architectures (a custom CNN, ResNet and VGG) to compare the performance of our micro-CV algorithm with that of the widely used stochastic gradient descent with momentum in an early-stopping macro-CV setup. The results show that our micro-CV algorithm achieves test accuracy comparable to macro-CV at a much lower computational cost.

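The following is a minimal PyTorch sketch of the batch-level (micro) CV idea summarized above: each batch is split in half, the gradient computed on the training half is re-used to try several candidate step sizes, and the candidates are scored on the validation half. The candidate grid, the per-epoch selection rule (a majority vote over batches), and the exact interleaving of trial steps with the momentum update are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch of batch-level (micro) cross-validation for learning
# rate selection. Details are assumptions for the sake of a runnable example.
import torch
import torch.nn.functional as F

def micro_cv_epoch(model, loader, lr_candidates, momentum_buf, momentum=0.9):
    """Run one epoch; return the candidate learning rate that most often
    gave the lowest loss on the validation half (assumed selection rule)."""
    votes = {lr: 0 for lr in lr_candidates}
    for x, y in loader:
        half = x.size(0) // 2
        x_tr, y_tr = x[:half], y[:half]      # training half of the batch
        x_va, y_va = x[half:], y[half:]      # validation half of the batch

        # One backward pass on the training half; this gradient is re-used
        # to evaluate every candidate step size.
        model.zero_grad()
        F.cross_entropy(model(x_tr), y_tr).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        backup = [p.detach().clone() for p in model.parameters()]

        best_lr, best_loss = None, float("inf")
        for lr in lr_candidates:
            with torch.no_grad():
                for p, g in zip(model.parameters(), grads):
                    p.add_(g, alpha=-lr)          # trial step
                loss = F.cross_entropy(model(x_va), y_va).item()
                for p, b in zip(model.parameters(), backup):
                    p.copy_(b)                    # undo trial step
            if loss < best_loss:
                best_lr, best_loss = lr, loss
        votes[best_lr] += 1

        # Real update: SGD with momentum using the per-batch winner
        # (whether the winner or the current epoch's rate is applied here
        # is an assumption of this sketch).
        with torch.no_grad():
            for i, (p, g) in enumerate(zip(model.parameters(), grads)):
                momentum_buf[i].mul_(momentum).add_(g)
                p.add_(momentum_buf[i], alpha=-best_lr)
    return max(votes, key=votes.get)

# Example usage (assuming `model` is an nn.Module and `loader` yields (x, y)):
#   buf = [torch.zeros_like(p) for p in model.parameters()]
#   next_epoch_lr = micro_cv_epoch(model, loader, [0.001, 0.01, 0.1], buf)
```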