9. Adam Generalization Gap vs SGD
medium
A practitioner switches from SGD to Adam and finds that training loss decreases faster but final test performance is slightly worse. What phenomenon might explain this?