9. Adam Generalization Gap vs SGD
medium

A practitioner switches from SGD to Adam and finds that training loss decreases faster but final test performance is slightly worse. What phenomenon might explain this?
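
For context, the two update rules being compared can be sketched as below. This is a minimal, framework-free illustration, not any library's implementation: the function names `sgd_step` and `adam_step`, the toy weights and gradients, and the hyperparameter values are all assumptions chosen for the example. It shows only the mechanical difference between the optimizers: SGD applies one global learning rate, while Adam rescales each coordinate by running estimates of the gradient's first and second moments.

```python
import math

def sgd_step(w, g, lr=0.1):
    """Plain SGD: every coordinate is scaled by the same learning rate."""
    return [wi - lr * gi for wi, gi in zip(w, g)]

def adam_step(w, g, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t is the 1-based step count used for bias correction)."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    # Bias correction for the zero-initialized averages.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Per-coordinate step: direction from m_hat, scale normalized by sqrt(v_hat).
    w = [wi - lr * mh / (math.sqrt(vh) + eps)
         for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# Toy setup (hypothetical values): one large and one tiny gradient coordinate.
w = [1.0, 1.0]
g = [10.0, 0.01]

w_sgd = sgd_step(w, g)
w_adam, m, v = adam_step(w, g, m=[0.0, 0.0], v=[0.0, 0.0], t=1)

print("SGD :", w_sgd)   # step size is proportional to each gradient's magnitude
print("Adam:", w_adam)  # both coordinates move by roughly lr on the first step
```

In this toy case SGD moves the two coordinates by very different amounts, while Adam's first step moves both by approximately the learning rate regardless of gradient magnitude.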