JerryYin777 / PaperHelper
PaperHelper: Knowledge-Based LLM QA Paper Reading Assistant with Reliable References
☆13 · Updated 8 months ago
Alternatives and similar repositories for PaperHelper:
Users interested in PaperHelper are comparing it to the repositories listed below.
- Efficient Mixture of Experts for LLM Paper List ☆36 · Updated 2 months ago
- SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference ☆40 · Updated 3 months ago
- Code for the paper "Executing Arithmetic: Fine-Tuning Large Language Models as Turing Machines" ☆11 · Updated 4 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆42 · Updated 4 months ago
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators ☆18 · Updated 2 weeks ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆51 · Updated last week
- [NeurIPS 2024] A Novel Rank-Based Metric for Evaluating Large Language Models ☆40 · Updated 3 months ago
- A repo showcasing the use of MCTS with LLMs to solve GSM8K problems ☆49 · Updated last month
- ☆14 · Updated last year
- PyTorch implementation of the ICML 2024 paper "CaM: Cache Merging for Memory-efficient LLMs Inference" ☆32 · Updated 8 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆38 · Updated 11 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆22 · Updated 8 months ago
- ☆62 · Updated last week
- The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference ☆58 · Updated 3 weeks ago
- ☆37 · Updated 4 months ago
- Manages vllm-nccl dependency ☆17 · Updated 8 months ago
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning (COLM 2024) ☆28 · Updated 8 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference ☆29 · Updated 3 months ago
- Course materials for MIT 6.5940: TinyML and Efficient Deep Learning Computing ☆28 · Updated last month
- [ICLR 2025] MiniPLM: Knowledge Distillation for Pre-Training Language Models ☆34 · Updated 2 months ago
- Self-reproduction code for the paper "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention" (MIT CSAIL) ☆12 · Updated 8 months ago
- The source code of "Merging Experts into One: Improving Computational Efficiency of Mixture of Experts" (EMNLP 2023) ☆35 · Updated 10 months ago
- LLMem: GPU Memory Estimation for Fine-Tuning Pre-Trained LLMs ☆17 · Updated last year
- Quantized Attention on GPU ☆34 · Updated 3 months ago
- Beyond KV Caching: Shared Attention for Efficient LLMs ☆14 · Updated 7 months ago
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design (KDD 2025) ☆17 · Updated 8 months ago
- Pretrain, decay, and SFT a CodeLLM from scratch 🧙‍♂️ ☆36 · Updated 9 months ago
- [ICML 2023] "Data Efficient Neural Scaling Law via Model Reusing" by Peihao Wang, Rameswar Panda, Zhangyang Wang ☆14 · Updated last year