microsoft / MInference

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLMs' inference, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
1,035 stars · Updated this week
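
The description names dynamic sparse attention as the core technique. Below is a minimal, illustrative PyTorch sketch of block-level dynamic sparse attention: a cheap pass estimates which key blocks matter for each query block, and full attention is computed only over those blocks. This is not MInference's actual implementation or API (the real library uses optimized sparse kernels and per-head sparse patterns); every name and parameter here is hypothetical.

```python
# Illustrative sketch of dynamic block-sparse attention (names are hypothetical,
# not MInference's API). Real implementations use custom sparse GPU kernels.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k=4):
    """Approximate attention: each query block attends only to the top-k key
    blocks ranked by a cheap, mean-pooled importance estimate."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size
    top_k = min(top_k, n_blocks)

    # Cheap importance estimate: score mean-pooled query blocks against
    # mean-pooled key blocks instead of the full seq_len x seq_len matrix.
    q_blocks = q.view(n_blocks, block_size, dim).mean(dim=1)  # (n_blocks, dim)
    k_blocks = k.view(n_blocks, block_size, dim).mean(dim=1)  # (n_blocks, dim)
    block_scores = q_blocks @ k_blocks.T                      # (n_blocks, n_blocks)
    top_blocks = block_scores.topk(top_k, dim=-1).indices     # (n_blocks, top_k)

    out = torch.zeros_like(q)
    scale = dim ** -0.5
    for i in range(n_blocks):
        qi = q[i * block_size:(i + 1) * block_size]  # current query block
        # Gather only the selected key/value blocks for this query block.
        ks = torch.cat([k[j * block_size:(j + 1) * block_size] for j in top_blocks[i]])
        vs = torch.cat([v[j * block_size:(j + 1) * block_size] for j in top_blocks[i]])
        attn = F.softmax(qi @ ks.T * scale, dim=-1)
        out[i * block_size:(i + 1) * block_size] = attn @ vs
    return out

# Toy usage: with top_k << n_blocks, attention cost drops roughly by n_blocks/top_k.
q, k, v = (torch.randn(512, 64) for _ in range(3))
out = block_sparse_attention(q, k, v, block_size=64, top_k=2)
print(out.shape)  # torch.Size([512, 64])
```

The pre-filling speedup comes from the loop body: instead of a dense (seq_len × seq_len) score matrix, each query block touches only top_k key blocks, so cost scales with the sparsity budget rather than the full context length.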

Alternatives and similar repositories for MInference

Users who are interested in MInference are comparing it to the libraries listed below.