LambdaLabsML / llama
Inference code for LLaMA models
☆38 · Updated last year
Related projects
Alternatives and complementary repositories for llama
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆89 · Updated this week
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆70 · Updated last week
- QuIP quantization ☆46 · Updated 7 months ago
- Data preparation code for Amber 7B LLM ☆82 · Updated 6 months ago
- A toolkit for fine-tuning, running inference with, and evaluating GreenBitAI's LLMs. ☆73 · Updated 3 weeks ago
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry ☆38 · Updated 9 months ago
- KV cache compression for high-throughput LLM inference ☆82 · Updated last week
- Experiments on speculative sampling with Llama models ☆117 · Updated last year
- Bamboo-7B Large Language Model ☆88 · Updated 7 months ago
- A collection of available inference solutions for LLMs ☆72 · Updated last month
- Official code for "SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient" ☆127 · Updated 11 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆24 · Updated last year
- Boosting 4-bit inference kernels with 2:4 sparsity ☆51 · Updated 2 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆58 · Updated last week
- Demonstration that fine-tuning a RoPE model on sequences longer than those seen in pre-training extends the model's context limit ☆63 · Updated last year
- Layer-Condensed KV cache with a 10x larger batch size, fewer parameters, and less computation. Dramatic speed-up with better task performance… ☆137 · Updated this week
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation ☆144 · Updated 8 months ago
- Repo hosting code and materials related to speeding up LLM inference using token merging. ☆29 · Updated 6 months ago
- Comparison of Language Model Inference Engines ☆190 · Updated 2 months ago
- Manage scalable open LLM inference endpoints in Slurm clusters ☆236 · Updated 4 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models". ☆261 · Updated last year
- Example of applying CUDA graphs to LLaMA-v2 ☆10 · Updated last year
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆184 · Updated last month
- Fast Inference of MoE Models with CPU-GPU Orchestration ☆170 · Updated 2 weeks ago
- Modular and structured prompt caching for low-latency LLM inference ☆65 · Updated this week
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees", adapted for Llama models ☆36 · Updated last year