microsoft / MoPQ
☆13 · Updated 4 years ago
Alternatives and similar repositories for MoPQ
Users who are interested in MoPQ are comparing it to the libraries listed below.
- AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers ☆48 · Updated 3 years ago
- Inference framework for MoE layers based on TensorRT with Python binding ☆41 · Updated 4 years ago
- This package implements THOR: Transformer with Stochastic Experts. ☆65 · Updated 4 years ago
- Retrieval with Learned Similarities (http://arxiv.org/abs/2407.15462, WWW'25 Oral) ☆52 · Updated 9 months ago
- Official code for "Binary embedding based retrieval at Tencent" ☆44 · Updated last year
- TSDG: An efficient index graph for graph-based nearest neighbor search ☆10 · Updated 3 years ago
- ☆70 · Updated 3 years ago
- A memory efficient DLRM training solution using ColossalAI ☆105 · Updated 3 years ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆25 · Updated 6 months ago
- [KDD'22] Learned Token Pruning for Transformers ☆102 · Updated 2 years ago
- ☆25 · Updated 4 years ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 · Updated last year
- Sparse Backpropagation for Mixture-of-Expert Training ☆29 · Updated last year
- Block Sparse movement pruning ☆83 · Updated 5 years ago
- [ICLR 2022] Pretraining Text Encoders with Adversarial Mixture of Training Signal Generators ☆26 · Updated 2 years ago
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆123 · Updated last year
- Differentiable Product Quantization for End-to-End Embedding Compression. ☆64 · Updated 3 years ago
- This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022). ☆113 · Updated 3 years ago
- Linear Attention Sequence Parallelism (LASP) ☆88 · Updated last year
- An LLM inference engine, written in C++ ☆18 · Updated 2 weeks ago
- Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method; GKD: A General Knowledge Distillation… ☆33 · Updated 2 years ago
- Asynchronous Stochastic Gradient Descent with Delay Compensation ☆22 · Updated 8 years ago
- Examples for MS-AMP package. ☆30 · Updated 6 months ago
- ☆109 · Updated 6 months ago
- BANG is a new pretraining model to Bridge the gap between Autoregressive (AR) and Non-autoregressive (NAR) Generation. AR and NAR generat… ☆28 · Updated 4 years ago
- Repository of LV-Eval Benchmark ☆73 · Updated last year
- Retrieval as Attention ☆82 · Updated 3 years ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆31 · Updated last year
- Summary of system papers/frameworks/codes/tools on training or serving large model ☆57 · Updated 2 years ago
- ☆64 · Updated last year