lucidrains / coordinate-descent-attention

Implementation of an attention layer where each head can attend to more than just one token, using coordinate descent to pick the top-k
46 · Updated last year

Related projects: