microsoft / Llama-2-Onnx

☆1,028

Alternatives and similar repositories for Llama-2-Onnx

Users who are interested in Llama-2-Onnx are comparing it to the libraries listed below.

skeskinen / bert.cpp
ggml implementation of BERT
☆495 Updated last year
NouamaneTazi / bloomz.cpp
C++ implementation for BLOOM
☆810 Updated 2 years ago
intel / intel-extension-for-transformers
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl…
☆2,169 Updated 9 months ago
bigcode-project / starcoder.cpp
C++ implementation for 💫StarCoder
☆455 Updated last year
trholding / llama2.c
Llama 2 Everywhere (L2E)
☆1,519 Updated 6 months ago
punica-ai / punica
Serving multiple LoRA finetuned LLM as one
☆1,075 Updated last year
kuleshov-group / llmtools
Finetuning Large Language Models on One Consumer GPU in 2 Bits
☆726 Updated last year
marella / ctransformers
Python bindings for the Transformer models implemented in C/C++ using GGML library.
☆1,871 Updated last year
Vahe1994 / SpQR
☆544 Updated 7 months ago
tairov / llama2.mojo
Inference Llama 2 in one file of pure 🔥
☆2,115 Updated last year
RWKV / rwkv.cpp
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
☆1,532 Updated 3 months ago
Maknee / minigpt4.cpp
Port of MiniGPT4 in C++ (4bit, 5bit, 6bit, 8bit, 16bit CPU inference with GGML)
☆567 Updated last year
abacaj / mpt-30B-inference
Run inference on MPT-30B using CPU
☆575 Updated 2 years ago
scaleapi / llm-engine
Scale LLM Engine public repository
☆808 Updated this week
tomaarsen / attention_sinks
Extend existing LLMs way beyond the original training length with constant memory usage, without retraining
☆702 Updated last year
okuvshynov / slowllama
Finetune llama2-70b and codellama on MacBook Air without quantization
☆447 Updated last year
salesforce / xgen
Salesforce open-source LLMs with 8k sequence length.
☆720 Updated 5 months ago
rmihaylov / falcontune
Tune any FALCON in 4-bit
☆467 Updated last year
SqueezeAILab / SqueezeLLM
[ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization
☆695 Updated 11 months ago
sahil280114 / codealpaca
☆1,478 Updated 2 years ago
mit-han-lab / TinyChatEngine
TinyChatEngine: On-Device LLM Inference Library
☆875 Updated last year
kuleshov / minillm
MiniLLM is a minimal system for running modern LLMs on consumer-grade GPUs
☆915 Updated 2 years ago
abacaj / fine-tune-mistral
Fine-tune mistral-7B on 3090s, a100s, h100s
☆715 Updated last year
tloen / llama-int8
Quantized inference code for LLaMA models
☆1,049 Updated 2 years ago
huggingface / optimum-nvidia
☆988 Updated 5 months ago
persimmon-ai-labs / adept-inference
Inference code for Persimmon-8B
☆415 Updated last year
MDK8888 / GPTFast
Accelerate your Hugging Face Transformers 7.6-9x. Native to Hugging Face and PyTorch.
☆685 Updated 10 months ago
tpoisonooo / llama.onnx
LLaMa/RWKV onnx models, quantization and testcase
☆363 Updated 2 years ago
mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆846 Updated last week
S-LoRA / S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,844 Updated last year