Scaled dot-product attention divides the query-key dot product by √d_k before applying the softmax. Why is this scaling necessary?
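
A quick numerical sketch (an illustration added here, assuming query/key components drawn i.i.d. from a standard normal, the usual assumption behind this question) shows the effect the question is getting at: unscaled dot products grow in magnitude like √d_k, which pushes the softmax toward a near one-hot output with vanishingly small gradients, while the 1/√d_k factor keeps the logits at roughly unit scale. The value d_k = 512 below is arbitrary and only for demonstration.

```python
import numpy as np

np.random.seed(0)
d_k = 512          # illustrative key/query dimension
n_trials = 10_000

# Each of the d_k componentwise products has mean 0 and variance 1,
# so the unscaled dot product has variance d_k (std ~ sqrt(d_k)).
q = np.random.randn(n_trials, d_k)
k = np.random.randn(n_trials, d_k)
raw = np.sum(q * k, axis=1)            # unscaled dot products
scaled = raw / np.sqrt(d_k)            # scaled as in the attention formula

print(f"std of raw dot products:    {raw.std():.1f}  (~ sqrt(d_k) = {np.sqrt(d_k):.1f})")
print(f"std of scaled dot products: {scaled.std():.2f} (~ 1)")

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits with typical unscaled magnitudes saturate the softmax (near one-hot),
# whereas the scaled logits keep it in a well-behaved regime.
logits = np.random.randn(8) * np.sqrt(d_k)
print("softmax of unscaled logits:", np.round(softmax(logits), 3))
print("softmax of scaled logits:  ", np.round(softmax(logits / np.sqrt(d_k)), 3))
```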