MooreThreads / TurboRAG
☆74 · Updated 4 months ago
Alternatives and similar repositories for TurboRAG:
Users interested in TurboRAG are comparing it to the libraries listed below.
- Modular and structured prompt caching for low-latency LLM inference ☆91 · Updated 5 months ago
- A simple extension to vLLM that speeds up reasoning models without additional training ☆146 · Updated last month
- PGRAG ☆48 · Updated 9 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆101 · Updated 3 weeks ago
- ☆94 · Updated 4 months ago
- ☆30 · Updated 8 months ago
- [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement ☆181 · Updated last year
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs ☆82 · Updated last month
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆113 · Updated 4 months ago
- REST: Retrieval-Based Speculative Decoding (NAACL 2024) ☆199 · Updated 4 months ago
- ☆47 · Updated 4 months ago
- ☆14 · Updated this week
- FuseAI Project ☆85 · Updated 2 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆131 · Updated 10 months ago
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆47 · Updated 5 months ago
- Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens" ☆130 · Updated 8 months ago
- Evaluation tools for Retrieval-Augmented Generation (RAG) methods ☆150 · Updated 4 months ago
- Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆236 · Updated 2 months ago
- Repository for "PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers" (NAACL 2024) ☆139 · Updated 10 months ago
- ☆197 · Updated this week
- The driver for LMCache core to run in vLLM ☆36 · Updated 2 months ago
- Code for the paper "SirLLM: Streaming Infinite Retentive LLM" ☆57 · Updated 10 months ago
- [EMNLP 2024] LongRAG: A Dual-Perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering ☆102 · Updated 2 months ago
- KV cache compression for high-throughput LLM inference ☆126 · Updated 2 months ago
- A highly capable 2.4B lightweight LLM using only 1T pre-training tokens, with all details released ☆170 · Updated this week
- ☆30 · Updated last month
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs ☆158 · Updated this week
- A flexible and efficient training framework for large-scale alignment tasks ☆342 · Updated 2 months ago
- LLM Serving Performance Evaluation Harness ☆76 · Updated last month
- Official repository for "RAGViz: Diagnose and Visualize Retrieval-Augmented Generation" [EMNLP 2024] ☆82 · Updated 2 months ago