TY - JOUR
T1 - Fine-tuning adaptive stochastic optimizers
T2 - determining the optimal hyperparameter ϵ via gradient magnitude histogram analysis
AU - Silva, Gustavo
AU - Rodriguez, Paul
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
PY - 2024/12
Y1 - 2024/12
AB - Stochastic optimizers play a crucial role in the successful training of deep neural network models. To achieve optimal model performance, designers must carefully select both model and optimizer hyperparameters. However, this process is frequently demanding in terms of computational resources and processing time. While it is a well-established practice to tune the entire set of optimizer hyperparameters for peak performance, there is still a lack of clarity regarding the individual influence of hyperparameters mislabeled as “low priority”, including the safeguard factor ϵ and decay rate β, in leading adaptive stochastic optimizers such as the Adam optimizer. In this manuscript, we introduce a new framework based on the empirical probability density function of the loss’s gradient magnitude, termed the “gradient magnitude histogram”, for a thorough analysis of adaptive stochastic optimizers and the safeguard hyperparameter ϵ. This framework reveals and justifies valuable relationships and dependencies among hyperparameters in connection with optimal performance across diverse tasks, such as classification, language modeling, and machine translation. Furthermore, we propose a novel algorithm that uses gradient magnitude histograms to automatically estimate a refined and accurate search space for the optimal safeguard hyperparameter ϵ, surpassing the conventional trial-and-error methodology by establishing a worst-case search space that is half as wide.
KW - Deep neural network
KW - Fine-tuning
KW - Hyperparameter
KW - Stochastic optimizers
UR - http://www.scopus.com/inward/record.url?scp=85204285831&partnerID=8YFLogxK
U2 - 10.1007/s00521-024-10302-2
DO - 10.1007/s00521-024-10302-2
M3 - Article
AN - SCOPUS:85204285831
SN - 0941-0643
VL - 36
SP - 22223
EP - 22243
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 35
ER -