720. SGD Convergence Speed (easy)
Why does SGD often converge faster in practice than batch gradient descent for large datasets?
A. It uses a larger effective learning rate, since the noise in gradient estimates accelerates escape from flat regions.
B. It exploits GPU parallelism more efficiently, since single-sample gradients require less memory bandwidth per update.
C. It avoids computing redundant gradients, since many training samples carry similar information about the loss.
D. It performs many parameter updates per epoch, since each update uses only one sample, providing faster early progress.
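The options above hinge on the mechanical difference between the two methods: batch gradient descent computes one gradient over all n samples per epoch, while SGD makes n single-sample updates in the same pass. The following is a minimal sketch of that difference on a hypothetical synthetic least-squares problem (the data, step size, and epoch count are illustrative assumptions, not part of the question); it is not intended to reveal which answer choice is correct.

```python
# Sketch: per-epoch update counts of batch GD vs. SGD on synthetic least squares.
# All quantities here (n, d, learning rate 0.1, 5 epochs) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

def loss(w):
    # Mean squared error over the full dataset.
    return 0.5 * np.mean((X @ w - y) ** 2)

# Batch gradient descent: one update per epoch, each using all n samples.
w_batch = np.zeros(d)
for epoch in range(5):
    grad = X.T @ (X @ w_batch - y) / n
    w_batch -= 0.1 * grad

# SGD: n updates per epoch, each using a single randomly ordered sample.
w_sgd = np.zeros(d)
for epoch in range(5):
    for i in rng.permutation(n):
        grad_i = X[i] * (X[i] @ w_sgd - y[i])
        w_sgd -= 0.1 * grad_i

print(f"batch GD loss after 5 epochs: {loss(w_batch):.4f}")
print(f"SGD loss after 5 epochs:      {loss(w_sgd):.4f}")
```

After the same number of epochs, batch GD has taken 5 parameter updates while SGD has taken 5,000, which is the update-count contrast the question asks you to reason about.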