JerryYin777 / PaperHelper
PaperHelper: Knowledge-Based LLM QA Paper Reading Assistant with Reliable References
☆9 · Updated 5 months ago
Related projects
Alternatives and complementary repositories for PaperHelper
- An auxiliary project analyzing the characteristics of KV caches in DiT attention. ☆15 · Updated this week
- PyTorch implementation of our ICML 2024 paper -- CaM: Cache Merging for Memory-efficient LLMs Inference ☆26 · Updated 5 months ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆38 · Updated this week
- Code for paper: "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines" ☆11 · Updated last month
- Quantized Attention on GPU ☆30 · Updated 2 weeks ago
- ☆18 · Updated last week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆24 · Updated 2 weeks ago
- A Sparse-tensor Communication Framework for Distributed Deep Learning ☆13 · Updated 3 years ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆34 · Updated 8 months ago
- Accepted LLM Papers in NeurIPS 2024 ☆22 · Updated last month
- Course project code for the graduate course "Theory and Applications of Network Big Data Management" ☆12 · Updated last year
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆20 · Updated 5 months ago
- ☆32 · Updated last week
- ☆42 · Updated 6 months ago
- Multi-Agent System for Science of Science ☆49 · Updated this week
- Official repository for "QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices" (IPDPS 2024) ☆19 · Updated 8 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- Source code for VB-LoRA: Extreme Parameter Efficient Fine-Tuning with Vector Banks (NeurIPS 2024) ☆24 · Updated last month
- My learning notes/codes for ML SYS ☆135 · Updated this week
- GIFT: Generative Interpretable Fine-Tuning ☆18 · Updated last month
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆13 · Updated 4 months ago
- Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking" ☆41 · Updated 4 months ago
- [ICML 2023] "Data Efficient Neural Scaling Law via Model Reusing" by Peihao Wang, Rameswar Panda, Zhangyang Wang ☆13 · Updated 10 months ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆32 · Updated 3 months ago
- Official implementation of SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction ☆38 · Updated last month
- ☆13 · Updated 7 months ago
- An Attention Superoptimizer ☆20 · Updated 6 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆57 · Updated 5 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference ☆25 · Updated 2 weeks ago
- A comprehensive overview of Data Distillation and Condensation (DDC). DDC is a data-centric task where a representative (i.e., small but …) ☆13 · Updated last year