TY - GEN
T1 - Evolution and Recent Trends for the SGD Algorithm
T2 - 32nd European Signal Processing Conference, EUSIPCO 2024
AU - Rodriguez, Paul
N1 - Publisher Copyright:
© 2024 European Signal Processing Conference, EUSIPCO. All rights reserved.
PY - 2024
Y1 - 2024
AB - One of the key challenges in designing a ten-hour educational short course, entitled “A Hands-on Approach for Implementing Stochastic Optimization Algorithms from Scratch” and accepted for inclusion at ICASSP’23, was how to introduce the Stochastic Gradient Descent (SGD) algorithm and its variants in a consistent, accessible fashion. From a simplistic perspective, the SGD algorithm is nothing more than the classical gradient descent (GD) algorithm with a (very) noisy gradient. Nonetheless, arguably neither SGD’s most influential variants (e.g. AdaGrad, RMSprop and Adam) nor more recent ones (LookAhead, ϵAdam, MadGrad, among several others) can be explained in such superficial terms. Moreover, such variants are usually provided as black boxes by most deep-learning (DL) libraries (e.g. TensorFlow, PyTorch). In this article, based on the experience of the aforementioned short course, I propose to link the SGD algorithm and its variants via an “evolutionary path”, in which each SGD variant may be understood as a set of add-on features over vanilla SGD. The result is a generalized algorithm along with a “family tree” graph, both of which are intuitive and useful when implementing a given SGD variant.
KW - Adam
KW - Short course
KW - stochastic gradient descent
UR - https://www.scopus.com/pages/publications/85208440173
U2 - 10.23919/eusipco63174.2024.10715078
DO - 10.23919/eusipco63174.2024.10715078
M3 - Conference contribution
AN - SCOPUS:85208440173
T3 - European Signal Processing Conference
SP - 1761
EP - 1765
BT - 32nd European Signal Processing Conference, EUSIPCO 2024 - Proceedings
PB - European Signal Processing Conference, EUSIPCO
Y2 - 26 August 2024 through 30 August 2024
ER -