microsoft / MInference

[NeurIPS'24 Spotlight] To speed up long-context LLMs' inference, MInference computes attention with approximate, dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
874 stars · Updated 2 weeks ago
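
The description refers to approximate, dynamic sparse attention for pre-filling. As a rough illustration only (not MInference's actual implementation or API), the sketch below shows one common form of dynamic block-sparse attention, in which each query block attends only to the top-k key blocks estimated from pooled query/key similarities. All function names and parameters here are hypothetical, and the causal mask used in real pre-filling is omitted for brevity.

```python
# Hypothetical sketch of dynamic block-sparse attention (not MInference's API).
# Each query block attends only to its top-k most relevant key blocks,
# with relevance estimated cheaply from mean-pooled queries and keys.
import torch
import torch.nn.functional as F

def dynamic_block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """q, k, v: [seq_len, head_dim]; seq_len assumed divisible by block_size."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size

    # Estimate block-level importance from mean-pooled queries and keys.
    q_pooled = q.view(n_blocks, block_size, dim).mean(dim=1)   # [n_blocks, dim]
    k_pooled = k.view(n_blocks, block_size, dim).mean(dim=1)   # [n_blocks, dim]
    block_scores = q_pooled @ k_pooled.T                       # [n_blocks, n_blocks]

    out = torch.zeros_like(q)
    for i in range(n_blocks):
        # Dynamically select the top-k key blocks for this query block.
        keep = torch.topk(block_scores[i], k=min(top_k, n_blocks)).indices
        cols = torch.cat(
            [torch.arange(j * block_size, (j + 1) * block_size) for j in keep]
        )
        qi = q[i * block_size:(i + 1) * block_size]
        # Dense attention restricted to the selected key/value columns.
        attn = F.softmax(qi @ k[cols].T / dim ** 0.5, dim=-1)
        out[i * block_size:(i + 1) * block_size] = attn @ v[cols]
    return out

# Example: 1024-token sequence with a 64-dim head.
q = torch.randn(1024, 64); k = torch.randn(1024, 64); v = torch.randn(1024, 64)
print(dynamic_block_sparse_attention(q, k, v).shape)  # torch.Size([1024, 64])
```

Because only top_k key blocks are touched per query block, the attention cost scales with top_k rather than with the full sequence length, which is the general idea behind the pre-filling speedups claimed above.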

Alternatives and similar repositories for MInference:

Users interested in MInference are comparing it to the libraries listed below.