eth-easl / deltazip
Compression for Foundation Models
☆19, updated 2 weeks ago

Related projects
Alternatives and complementary repositories for deltazip
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts (☆34, updated 8 months ago)
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry (☆38, updated 9 months ago)
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding (☆70, updated this week)
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs (☆72, updated 3 weeks ago)
- Make triton easier (☆41, updated 4 months ago)
- Open-sourced backend for Martian's LLM Inference Provider Leaderboard (☆17, updated 2 months ago)
- Cascade Speculative Drafting (☆26, updated 7 months ago)
- PostText is a QA system for querying your text data. When appropriate structured views are in place, PostText is good at answering querie… (☆31, updated last year)
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" (☆56, updated 3 weeks ago)
- Code repository for the public reproduction of the language modelling experiments on "MatFormer: Nested Transformer for Elastic Inference…" (☆18, updated 11 months ago)
- Hydragen: High-Throughput LLM Inference with Shared Prefixes (☆22, updated 6 months ago)
- Using FlexAttention to compute attention with different masking patterns (☆40, updated last month)
- [EMNLP 2024 Main] Virtual Personas for Language Models via an Anthology of Backstories (☆14, updated 4 months ago)
- FlexAttention w/ FlashAttention3 Support (☆26, updated last month)
- Beyond KV Caching: Shared Attention for Efficient LLMs (☆13, updated 3 months ago)
- A toolkit that enhances PyTorch with specialized functions for low-bit quantized neural networks (☆28, updated 4 months ago)
- Code repository for the paper "AdANNS: A Framework for Adaptive Semantic Search" (☆59, updated last year)
- Low-Rank Llama Custom Training (☆19, updated 7 months ago)
- DPO, but faster 🚀 (☆20, updated last week)
- Fast Inference of MoE Models with CPU-GPU Orchestration (☆170, updated last week)
- Repository for CPU Kernel Generation for LLM Inference (☆24, updated last year)
- Experiments to assess SPADE on different LLM pipelines (☆16, updated 7 months ago)
- [ICML 2023] "Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation", Wenqing Zheng, S P Sharan, Ajay Kumar Jaiswal, … (☆36, updated last year)