viai957 / llama-inference
A simple implementation of Llama 1 and 2. The architecture is built from scratch in PyTorch, including GQA (Grouped-Query Attention), RoPE (Rotary Positional Embeddings), RMSNorm, the SwiGLU-activated FeedForward block, and the full model stack — inference only, with no training code.
☆13 · Updated last year
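Two of the components named above can be sketched briefly in PyTorch. This is a minimal, illustrative version of RMSNorm and a SwiGLU feed-forward block — the class names and dimensions here are assumptions for illustration, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm as used in Llama: no mean subtraction, learned scale."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then apply the scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """Llama-style feed-forward: SwiGLU gating instead of a plain ReLU/GELU MLP."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x @ W_gate) element-wise multiplied by x @ W_up.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 8, 64)                       # (batch, seq, dim)
y = SwiGLUFeedForward(64, 172)(RMSNorm(64)(x))
print(y.shape)  # torch.Size([2, 8, 64])
```

The hidden dimension (172 here) is arbitrary; Llama derives it from the model dimension with a 2/3 scaling plus rounding, which varies by model size.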
Alternatives and similar repositories for llama-inference
Users that are interested in llama-inference are comparing it to the libraries listed below
- Google TPU optimizations for transformer models ☆114 · Updated 5 months ago
- llama3.cuda: a pure C/CUDA implementation of the Llama 3 model ☆335 · Updated 2 months ago
- Inference of Llama/Llama 2/Llama 3 models in NumPy ☆21 · Updated last year
- ☆40 · Updated last month
- Step-by-step explanation/tutorial of llama2.c ☆222 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 · Updated last month
- Learn CUDA with PyTorch ☆29 · Updated this week
- Easy and efficient quantization for Transformers ☆198 · Updated 3 weeks ago
- LoRA and DoRA from-scratch implementations ☆206 · Updated last year
- An extension of the nanoGPT repository for training small MoE models ☆160 · Updated 4 months ago
- ☆179 · Updated 6 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆264 · Updated 9 months ago
- Collection of autoregressive model implementations ☆85 · Updated 2 months ago
- Training-free post-training sub-quadratic-complexity attention, implemented with OpenAI Triton ☆139 · Updated this week
- Fully fine-tune large models like Mistral, Llama-2-13B, or Qwen-14B completely for free ☆232 · Updated 8 months ago
- Fine-tuning the Llama 3 8B LLM in a multi-GPU environment using DeepSpeed ☆18 · Updated last year
- Recreating PyTorch from scratch (C/C++, CUDA, NCCL, and Python, with multi-GPU support and automatic differentiation!) ☆150 · Updated last year
- Mixed-precision training from scratch with tensors and CUDA ☆24 · Updated last year
- Making the official Triton tutorials actually comprehensible ☆45 · Updated 3 months ago
- 1-Click is all you need. ☆62 · Updated last year
- Inference of Mamba models in pure C ☆188 · Updated last year
- LoRA: Low-Rank Adaptation of Large Language Models, implemented in PyTorch ☆110 · Updated last year
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆66 · Updated 3 months ago
- Simple byte-pair encoding (BPE) mechanism for tokenization, written purely in C ☆134 · Updated 8 months ago
- This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT)… ☆68 · Updated last year
- ☆198 · Updated 5 months ago
- Notes on quantization in neural networks ☆89 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆198 · Updated 11 months ago
- Manage histories of LLM-based applications ☆91 · Updated last year
- NanoGPT speedrunning for the poor T4 enjoyers ☆68 · Updated 2 months ago