abhisheknair10 / llama3.cu
Lightweight Llama 3 8B Inference Engine in CUDA C
☆43 · Updated last week
Alternatives and similar repositories for llama3.cu:
Users who are interested in llama3.cu are comparing it to the libraries listed below.
- ☆44 · Updated 6 months ago
- Easy-to-use, high-performance knowledge distillation for LLMs · ☆40 · Updated 2 weeks ago
- ☆79 · Updated 2 months ago
- Repository hosting code and materials related to speeding up LLM inference using token merging. · ☆34 · Updated 9 months ago
- Official repository for the paper "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks". This rep… · ☆48 · Updated 2 months ago
- llama.cpp fork with additional SOTA quants and improved performance · ☆133 · Updated this week
- ☆42 · Updated last week
- Make Triton easier · ☆44 · Updated 7 months ago
- Training hybrid models for dummies. · ☆18 · Updated 2 weeks ago
- [WIP] Transformer to embed Danbooru labelsets · ☆13 · Updated 10 months ago
- Testing LLM reasoning abilities with family relationship quizzes. · ☆57 · Updated this week
- A minimalistic C++ Jinja templating engine for LLM chat templates · ☆104 · Updated this week
- Fast approximate inference on a single GPU with sparsity-aware offloading · ☆38 · Updated last year
- ☆27 · Updated 5 months ago
- SGEMM that beats cuBLAS · ☆68 · Updated last week
- LLM training in simple, raw C/CUDA · ☆18 · Updated 8 months ago
- RWKV-7: Surpassing GPT · ☆73 · Updated 2 months ago
- GitHub repo for Peifeng's internship project · ☆13 · Updated last year
- The code repository for the CURLoRA research paper. Stable LLM continual fine-tuning and catastrophic forgetting mitigation. · ☆41 · Updated 5 months ago
- Latent Large Language Models · ☆17 · Updated 5 months ago
- An open-source replication of the Strawberry method that leverages Monte Carlo Search with PPO and/or DPO · ☆27 · Updated this week
- Multi-Layer Key-Value sharing experiments on Pythia models · ☆32 · Updated 7 months ago
- ☆52 · Updated 7 months ago
- A fork of llama3.c used to do some R&D on inference · ☆17 · Updated last month
- FlexAttention w/ FlashAttention3 support · ☆27 · Updated 3 months ago
- Nexusflow function call, tool use, and agent benchmarks. · ☆19 · Updated last month
- Train, tune, and infer Bamba model · ☆80 · Updated 2 weeks ago
- Explore training for quantized models · ☆13 · Updated 3 weeks ago
- Experiments with BitNet inference on CPU · ☆52 · Updated 9 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ☆113 · Updated last month