StackedML
Deep Learning / Architectures (Conceptual) / Transformers (high-level intuition)
493. Masked Self-Attention in Decoder (medium)
The transformer decoder uses masked self-attention during training. Why is masking necessary?
A. It prevents gradient flow through positions with zero attention weight, stabilizing training of deep transformer decoders
B. It prevents each position from attending to padding tokens, ensuring attention weights are not wasted on meaningless inputs
C. It prevents each position from attending to future positions, ensuring the model cannot cheat by looking at tokens it has not yet generated
D. It prevents the encoder output from influencing the decoder's self-attention, maintaining the separation between encoder and decoder
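To make the mechanism behind this question concrete, here is a minimal NumPy sketch of scaled dot-product self-attention with a causal (look-ahead) mask, the masking scheme used in transformer decoders. The function name and shapes are illustrative, not from any particular library:

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Scaled dot-product attention with a causal (look-ahead) mask.

    q, k, v: arrays of shape (seq_len, d_k).
    The mask ensures position i can only attend to positions j <= i.
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) attention logits

    # Boolean mask that is True strictly above the diagonal (j > i),
    # i.e. at the "future" positions each query must not see.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

    # Setting masked logits to -inf drives their softmax weight to exactly 0.
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Self-attention over a toy sequence: the weight matrix is lower-triangular,
# so no position receives information from later tokens.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
_, w = causal_self_attention(x, x, x)
assert np.allclose(np.triu(w, k=1), 0.0)
```

Because the masked logits become -inf before the softmax, the corresponding weights are exactly zero, so during training each output position is computed as if the future tokens did not exist, matching what the model will see at generation time.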