viai957 / llama-inference
A simple implementation of Llama 1 and 2. The architecture is built from scratch in PyTorch, including GQA (Grouped-Query Attention), RoPE (Rotary Positional Embeddings), RMSNorm, the SwiGLU-activated FeedForward block, and the full model stack — inference only, with no training code.
☆13 · Updated last year
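Two of the components named above can be sketched briefly in PyTorch. This is a minimal, illustrative version of RMSNorm and a SwiGLU feed-forward block — the class names and dimensions here are assumptions for illustration, not the repository's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root-mean-square norm as used in Llama: no mean subtraction, learned scale."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then apply the scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLUFeedForward(nn.Module):
    """Llama-style feed-forward: SwiGLU gating instead of a plain ReLU/GELU MLP."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: silu(x @ W_gate) element-wise multiplied by x @ W_up.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

x = torch.randn(2, 8, 64)                       # (batch, seq, dim)
y = SwiGLUFeedForward(64, 172)(RMSNorm(64)(x))
print(y.shape)  # torch.Size([2, 8, 64])
```

The hidden dimension (172 here) is arbitrary; Llama derives it from the model dimension with a 2/3 scaling plus rounding, which varies by model size.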
Alternatives and similar repositories for llama-inference
Users that are interested in llama-inference are comparing it to the libraries listed below
- Google TPU optimizations for transformer models ☆114 · Updated 5 months ago
- llama3.cuda: a pure C/CUDA implementation of the Llama 3 model ☆335 · Updated 2 months ago
- Inference of Llama/Llama 2/Llama 3 models in NumPy ☆21 · Updated last year
- ☆40 · Updated last month
- Step-by-step explanation/tutorial of llama2.c ☆222 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 · Updated last month
- Learn CUDA with PyTorch ☆29 · Updated this week
- Easy and efficient quantization for Transformers ☆198 · Updated 3 weeks ago
- LoRA and DoRA from-scratch implementations ☆206 · Updated last year
- An extension of the nanoGPT repository for training small MoE models ☆160 · Updated 4 months ago
- ☆179 · Updated 6 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆264 · Updated 9 months ago
- Collection of autoregressive model implementations ☆85 · Updated 2 months ago
- Training-free post-training sub-quadratic-complexity attention, implemented with OpenAI Triton ☆139 · Updated this week
- Fully fine-tune large models like Mistral, Llama-2-13B, or Qwen-14B completely for free ☆232 · Updated 8 months ago
- Fine-tuning the Llama 3 8B LLM in a multi-GPU environment using DeepSpeed ☆18 · Updated last year
- Recreating PyTorch from scratch (C/C++, CUDA, NCCL, and Python, with multi-GPU support and automatic differentiation!) ☆150 · Updated last year
- Mixed-precision training from scratch with tensors and CUDA ☆24 · Updated last year
- Making the official Triton tutorials actually comprehensible ☆45 · Updated 3 months ago
- 1-Click is all you need. ☆62 · Updated last year
- Inference of Mamba models in pure C ☆188 · Updated last year
- LoRA: Low-Rank Adaptation of Large Language Models, implemented in PyTorch ☆110 · Updated last year
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆66 · Updated 3 months ago
- Simple byte-pair encoding (BPE) mechanism for tokenization, written purely in C ☆134 · Updated 8 months ago
- This repository contains an implementation of the LLaMA 2 (Large Language Model Meta AI) model, a Generative Pretrained Transformer (GPT)… ☆68 · Updated last year
- ☆198 · Updated 5 months ago
- Notes on quantization in neural networks ☆89 · Updated last year
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients ☆198 · Updated 11 months ago
- Manage histories of LLM-based applications ☆91 · Updated last year
- NanoGPT speedrunning for the poor T4 enjoyers ☆68 · Updated 2 months ago