viai957 / llama-inference
A simple implementation of Llama 1 and 2. The Llama architecture is built from scratch in PyTorch: all models are implemented from the ground up, including GQA (Grouped-Query Attention), RoPE (Rotary Positional Embeddings), RMSNorm, the feed-forward block with SwiGLU activation, and the encoder stack (inference only; no training code).
☆13 · Updated 11 months ago
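The description above names the repository's main building blocks. As a rough orientation, here is a minimal sketch of two of them, RMSNorm and the SwiGLU feed-forward block, written from the published Llama architecture rather than taken from this repository (class names and dimensions are illustrative, not the repo's own):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square layer norm as used in Llama (no mean-centering, no bias)."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal RMS over the last dimension, then apply a learned gain.
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """Llama-style feed-forward block: SwiGLU gating followed by a down-projection."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(W_gate x) elementwise-multiplied with (W_up x).
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 8, 64)                          # (batch, seq_len, dim)
y = SwiGLUFeedForward(64, 172)(RMSNorm(64)(x))     # norm then feed-forward
print(y.shape)  # torch.Size([2, 8, 64])
```

In the full architecture this pair sits inside each transformer block (norm → attention → norm → feed-forward, with residual connections around both sublayers); see the repository itself for the GQA and RoPE pieces.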
Alternatives and similar repositories for llama-inference:
Users interested in llama-inference are comparing it to the repositories listed below.
- Inference of Llama/Llama2/Llama3 models in NumPy☆20 · Updated last year
- A simplified version of Google's Gemma model to be used for learning☆24 · Updated last year
- Learn CUDA with PyTorch☆20 · Updated 2 months ago
- Collection of autoregressive model implementations☆85 · Updated 2 months ago
- Google TPU optimizations for Transformers models☆108 · Updated 3 months ago
- Making the official Triton tutorials actually comprehensible☆26 · Updated last month
- Experiments with BitNet inference on CPU☆53 · Updated last year
- Sakura-SOLAR-DPO: Merge, SFT, and DPO☆117 · Updated last year
- llama3.cuda: a pure C/CUDA implementation of the Llama 3 model☆331 · Updated 10 months ago
- The code that went into our practical deep dive on using Mamba for information extraction☆54 · Updated last year
- ☆157 · Updated 3 months ago
- Easy and Efficient Quantization for Transformers☆197 · Updated 2 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand☆178 · Updated last week
- NanoGPT speedrunning for the poor T4 enjoyers☆62 · Updated this week
- 1-Click is all you need.☆61 · Updated 11 months ago
- Fully fine-tune large models like Mistral, Llama-2-13B, or Qwen-14B completely for free☆231 · Updated 5 months ago
- The simplest, fastest repository for training/finetuning medium-sized xLSTMs.☆42 · Updated 11 months ago
- This repository shows various ways of deploying a vision model (TensorFlow) from 🤗 Transformers.☆30 · Updated 2 years ago
- Minimal example scripts for the Hugging Face Trainer, focused on staying under 150 lines☆198 · Updated 11 months ago
- An extension of the nanoGPT repository for training small MoE models.☆131 · Updated last month
- ☆94 · Updated 3 months ago
- Video + code lecture on building nanoGPT from scratch☆65 · Updated 10 months ago
- Quantization of LLMs and benchmarking☆10 · Updated last year
- Reference implementation of the Mistral AI 7B v0.1 model☆28 · Updated last year
- ☆46 · Updated 5 months ago
- Complete implementation of Llama2 with/without KV cache & inference 🚀☆47 · Updated 11 months ago
- Inference of Mamba models in pure C☆187 · Updated last year
- Simple implementation of Speculative Sampling in NumPy for GPT-2☆93 · Updated last year
- Simple Byte Pair Encoding (BPE) tokenizer, written purely in C☆129 · Updated 5 months ago
- Fine-tuning the Llama3-8B LLM in a multi-GPU environment using DeepSpeed☆17 · Updated 10 months ago