711. Scaling in Dot-Product Attention
medium

The scaled dot-product in self-attention divides the dot product by √d_k. Why is this scaling necessary?