MooreThreads / TurboRAG
☆66 Updated last month
Alternatives and similar repositories for TurboRAG:
Users interested in TurboRAG are comparing it to the libraries listed below.
- Modular and structured prompt caching for low-latency LLM inference ☆84 Updated 2 months ago
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆107 Updated last month
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆85 Updated 3 months ago
- LLM Serving Performance Evaluation Harness ☆65 Updated 4 months ago
- ☆28 Updated 5 months ago
- ☆159 Updated last month
- StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization ☆98 Updated last week
- [EMNLP 2024 (Oral)] Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA ☆108 Updated 2 months ago
- Evaluation tools for Retrieval-augmented Generation (RAG) methods. ☆144 Updated 2 months ago
- ☆112 Updated 2 months ago
- Implementation of the LongRoPE paper: Extending LLM Context Window Beyond 2 Million Tokens ☆125 Updated 6 months ago
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. ☆78 Updated this week
- AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark ☆118 Updated last month
- Official repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale" ☆210 Updated 3 months ago
- ☆43 Updated 6 months ago
- ☆212 Updated 8 months ago
- The official code for the paper "Parallel Speculative Decoding with Adaptive Draft Length" ☆32 Updated 4 months ago
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025) ☆16 Updated 7 months ago
- [ACL24] Official repo for "Synthesizing Text-to-SQL Data from Weak and Strong LLMs" ☆63 Updated 5 months ago
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆186 Updated last month
- ☆51 Updated 3 months ago
- PGRAG ☆44 Updated 6 months ago
- A pipeline for LLM knowledge distillation ☆83 Updated 5 months ago
- Layer-Condensed KV cache w/ 10 times larger batch size, fewer params and less computation. Dramatic speed up with better task performance… ☆147 Updated last month
- KV cache compression for high-throughput LLM inference ☆104 Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆152 Updated 6 months ago
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆149 Updated 10 months ago
- ☆52 Updated 3 months ago
- Code implementation of synthetic continued pretraining ☆79 Updated 2 weeks ago
- [ICML'24] Data and code for our paper "Training-Free Long-Context Scaling of Large Language Models" ☆378 Updated 3 months ago