microsoft / MoPQ
☆13 Updated 4 years ago
Alternatives and similar repositories for MoPQ
Users who are interested in MoPQ are comparing it to the repositories listed below
- Retrieval with Learned Similarities (http://arxiv.org/abs/2407.15462, WWW'25 Oral) ☆51 Updated 7 months ago
- Inference framework for MoE layers based on TensorRT with Python binding ☆41 Updated 4 years ago
- Official code for "Binary embedding based retrieval at Tencent" ☆44 Updated last year
- AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers ☆48 Updated 3 years ago
- This package implements THOR: Transformer with Stochastic Experts. ☆65 Updated 4 years ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 Updated last year
- [KDD'22] Learned Token Pruning for Transformers ☆102 Updated 2 years ago
- ☆19 Updated last year
- ☆74 Updated 2 years ago
- Summary of system papers/frameworks/codes/tools on training or serving large model ☆57 Updated 2 years ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆112 Updated 9 months ago
- Code for the preprint "Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?" ☆47 Updated 4 months ago
- An LLM inference engine, written in C++ ☆18 Updated 6 months ago
- Repository of LV-Eval Benchmark ☆72 Updated last year
- ☆21 Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆78 Updated last year
- Manages vllm-nccl dependency ☆17 Updated last year
- [ICLR 2022] Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators ☆26 Updated 2 years ago
- Linear Attention Sequence Parallelism (LASP) ☆88 Updated last year
- ☆21 Updated 8 months ago
- ACL 2024 | LooGLE: Long Context Evaluation for Long-Context Language Models ☆193 Updated last year
- This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022). ☆112 Updated 3 years ago
- Ongoing research training transformer models at scale ☆18 Updated 2 years ago
- hnsw implemented by python ☆71 Updated 6 years ago
- The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction. ☆52 Updated last year
- ☆142 Updated last year
- Distributed DataLoader For Pytorch Based On Ray ☆24 Updated 4 years ago
- Examples for MS-AMP package. ☆30 Updated 5 months ago
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆121 Updated last year
- Implementation of a Quantized Transformer Model ☆19 Updated 6 years ago