viai957 / llama-inference
A simple implementation of Llama 1 and 2. The Llama architecture is built from scratch in PyTorch; all model components are implemented from scratch, including GQA (Grouped-Query Attention), RoPE (Rotary Positional Embeddings), RMSNorm, the FeedForward block, the Encoder (as this repository is intended only for inference), and SwiGLU (the activation function).
☆13 · Updated last year
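The components the description lists follow standard Llama-style conventions. As a rough illustration only (this is not code from the repository; the class and argument names here are my own), RMSNorm and the SwiGLU feed-forward block can be sketched in PyTorch as:

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Root-mean-square normalization, as used in Llama-style models.

    Unlike LayerNorm, it does not subtract the mean; it only rescales
    by the RMS of the last dimension, then applies a learned gain.
    """

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # rsqrt(mean(x^2) + eps) normalizes each vector to unit RMS.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)


class SwiGLUFeedForward(nn.Module):
    """Feed-forward block with SwiGLU gating (names are illustrative)."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # up projection
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: gate one linear projection with the SiLU of another,
        # then project back down to the model dimension.
        return self.w2(nn.functional.silu(self.w1(x)) * self.w3(x))
```

Both modules preserve the input shape, so they can be stacked freely inside a transformer block alongside attention.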
Alternatives and similar repositories for llama-inference
Users that are interested in llama-inference are comparing it to the libraries listed below
- Fine-tuning the Llama3-8B LLM in a multi-GPU environment using DeepSpeed ☆18 · Updated last year
- llama3.cuda is a pure C/CUDA implementation of the Llama 3 model. ☆348 · Updated 8 months ago
- Step-by-step explanation/tutorial of llama2.c ☆225 · Updated 2 years ago
- Easy and Efficient Quantization for Transformers ☆202 · Updated 6 months ago
- ☆52 · Updated last year
- Sakura-SOLAR-DPO: Merge, SFT, and DPO ☆116 · Updated 2 years ago
- Accelerate Model Training with PyTorch 2.X, published by Packt ☆50 · Updated 3 weeks ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆195 · Updated 7 months ago
- A hackable, simple, and research-friendly GRPO training framework with high-speed weight synchronization in a multi-node environment. ☆35 · Updated 4 months ago
- LoRA and DoRA from-scratch implementations ☆215 · Updated last year
- Showing various ways to serve Keras-based Stable Diffusion ☆111 · Updated 2 years ago
- Google TPU optimizations for Transformers models ☆132 · Updated 3 weeks ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆267 · Updated last month
- Efficient Finetuning for OpenAI GPT-OSS ☆23 · Updated 3 months ago
- Inference of Llama/Llama2/Llama3 models in NumPy ☆21 · Updated 2 years ago
- 1-Click is all you need. ☆63 · Updated last year
- Collection of autoregressive model implementations ☆85 · Updated this week
- A simple Byte Pair Encoding (BPE) tokenizer, written purely in C ☆144 · Updated last year
- An extension of the nanoGPT repository for training small MoE models. ☆225 · Updated 10 months ago
- ☆45 · Updated 7 months ago
- This repository contains the code used for my "Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch" blog po… ☆92 · Updated 2 years ago
- A set of scripts and notebooks on LLM fine-tuning and dataset creation ☆113 · Updated last year
- Training small GPT-2-style models using Kolmogorov-Arnold networks. ☆122 · Updated last year
- Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!) ☆161 · Updated last month
- Large-scale 4D-parallelism pre-training for 🤗 transformers in Mixture of Experts *(still a work in progress)* ☆86 · Updated 2 years ago
- LLaMA 3 is one of the most promising open-source models after Mistral; we will recreate its architecture in a simpler manner. ☆196 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆201 · Updated last year
- Notebooks and scripts that showcase running quantized diffusion models on consumer GPUs ☆38 · Updated last year
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (NeurIPS 2024) ☆163 · Updated 9 months ago
- ☆233 · Updated last year