microsoft / MoPQ
☆12 · Updated 3 years ago
Alternatives and similar repositories for MoPQ
Users interested in MoPQ are comparing it to the libraries listed below.
- Retrieval with Learned Similarities (http://arxiv.org/abs/2407.15462, WWW'25 Oral) ☆51 · Updated 5 months ago
- Inference framework for MoE layers based on TensorRT with Python binding ☆41 · Updated 4 years ago
- Official code for "Binary embedding based retrieval at Tencent" ☆43 · Updated last year
- ☆19 · Updated last year
- Summary of system papers/frameworks/codes/tools on training or serving large models ☆57 · Updated last year
- [KDD'22] Learned Token Pruning for Transformers ☆100 · Updated 2 years ago
- This package implements THOR: Transformer with Stochastic Experts ☆65 · Updated 4 years ago
- A memory-efficient DLRM training solution using ColossalAI ☆106 · Updated 2 years ago
- ☆74 · Updated 2 years ago
- Code for the preprint "Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?" ☆46 · Updated 2 months ago
- Official codebase for the paper "A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone" ☆22 · Updated 4 months ago
- Dynamic Context Selection for Efficient Long-Context LLMs ☆40 · Updated 4 months ago
- Ongoing research training transformer language models at scale, including BERT & GPT-2 ☆68 · Updated 2 years ago
- Repository of the LV-Eval benchmark ☆70 · Updated last year
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 · Updated last year
- Train LLMs (BLOOM, LLaMA, Baichuan2-7B, ChatGLM3-6B) with DeepSpeed pipeline mode; faster than ZeRO/ZeRO++/FSDP ☆98 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆77 · Updated last year
- [ICLR 2024] Official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models" ☆30 · Updated last year
- Block-sparse movement pruning ☆81 · Updated 4 years ago
- Implementation of the NAACL 2024 Outstanding Paper "LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models" ☆148 · Updated 7 months ago
- A MoE implementation for PyTorch, [ATC'23] SmartMoE ☆71 · Updated 2 years ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆110 · Updated 6 months ago
- Implementation of "IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs" (ICLR 2024) ☆25 · Updated 3 months ago
- Official code for "Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM" ☆14 · Updated last year
- Manages the vllm-nccl dependency ☆17 · Updated last year
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆120 · Updated last year
- [NeurIPS 2024] Official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exiting" ☆60 · Updated last year
- ☆121 · Updated last year
- ☆21 · Updated last year
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆19 · Updated last year