Scaled dot-product attention divides the query-key dot product by √d_k before applying the softmax. Why is this scaling necessary?
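
A quick numerical sketch (an illustration added here, assuming query/key components drawn i.i.d. from a standard normal, the usual assumption behind this question) shows the effect the question is getting at: unscaled dot products grow in magnitude like √d_k, which pushes the softmax toward a near one-hot output with vanishingly small gradients, while the 1/√d_k factor keeps the logits at roughly unit scale. The value d_k = 512 below is arbitrary and only for demonstration.

```python
import numpy as np

np.random.seed(0)
d_k = 512          # illustrative key/query dimension
n_trials = 10_000

# Each of the d_k componentwise products has mean 0 and variance 1,
# so the unscaled dot product has variance d_k (std ~ sqrt(d_k)).
q = np.random.randn(n_trials, d_k)
k = np.random.randn(n_trials, d_k)
raw = np.sum(q * k, axis=1)            # unscaled dot products
scaled = raw / np.sqrt(d_k)            # scaled as in the attention formula

print(f"std of raw dot products:    {raw.std():.1f}  (~ sqrt(d_k) = {np.sqrt(d_k):.1f})")
print(f"std of scaled dot products: {scaled.std():.2f} (~ 1)")

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Logits with typical unscaled magnitudes saturate the softmax (near one-hot),
# whereas the scaled logits keep it in a well-behaved regime.
logits = np.random.randn(8) * np.sqrt(d_k)
print("softmax of unscaled logits:", np.round(softmax(logits), 3))
print("softmax of scaled logits:  ", np.round(softmax(logits / np.sqrt(d_k)), 3))
```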