Multi-head attention runs several attention heads in parallel, each over its own learned subspace of the model dimension. What is the benefit of using multiple heads instead of a single attention head of the same total dimensionality?
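For concreteness, here is a minimal sketch of the mechanism the question refers to, written in PyTorch. All names (`MultiHeadAttention`, `w_q`, `w_o`, the dimensions in the usage line) are illustrative assumptions, not taken from any particular implementation; masking and dropout are omitted to keep the parallel-heads structure visible.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention (no masking, no dropout)."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # One projection each for queries, keys, and values, plus an output merge.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape

        # Project, then split the model dimension into independent heads:
        # (batch, seq, d_model) -> (batch, num_heads, seq, d_head)
        def split(t: torch.Tensor) -> torch.Tensor:
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))

        # Scaled dot-product attention, computed per head in parallel.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = scores.softmax(dim=-1)   # each head gets its own attention pattern
        context = weights @ v              # (batch, num_heads, seq, d_head)

        # Concatenate the heads back together and mix them with w_o.
        context = context.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.w_o(context)

# Usage: 8 heads, each attending over a 64-dim subspace of a 512-dim model.
mha = MultiHeadAttention(d_model=512, num_heads=8)
out = mha(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])
```

Note the key structural point for answering the question: because each head has its own slice of the projected queries, keys, and values, the softmax in each head produces an independent attention distribution over the sequence, and the heads are only mixed afterward by the output projection.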