jorahn / llama-int8
Quantized inference code for LLaMA models
☆13Updated last year
Alternatives and similar repositories for llama-int8:
Users who are interested in llama-int8 are comparing it to the libraries listed below.
- A library for incremental loading of large PyTorch checkpoints☆56Updated 2 years ago
- Trying to deconstruct RWKV in understandable terms☆14Updated last year
- ☆26Updated last year
- GGML implementation of BERT model with Python bindings and quantization.☆54Updated last year
- Demonstration that fine-tuning a RoPE model on longer sequences than those used in pre-training adapts the model's context limit☆63Updated last year
- Rust bindings for CTranslate2☆14Updated last year
- Preprint: Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning☆28Updated last year
- ☆27Updated last year
- ☆37Updated 2 years ago
- A playground to make it easy to try crazy things☆33Updated this week
- Experimental sampler to make LLMs more creative☆30Updated last year
- Simple LLM inference server☆20Updated 8 months ago
- ☆40Updated last year
- ☆22Updated 9 months ago
- An OpenAI API compatible LLM inference server based on ExLlamaV2.☆25Updated last year
- Image Generation API Server - Similar to https://text-generator.io but for images☆50Updated 3 months ago
- A fork of llama3.c used to do some R&D on inferencing☆19Updated 2 months ago
- Training hybrid models for dummies.☆20Updated last month
- Command-line script for inferencing from models such as falcon-7b-instruct☆76Updated last year
- Merge LLMs that are split into parts☆26Updated last year
- A library for simplifying fine-tuning with multi-GPU setups in the Hugging Face ecosystem.☆16Updated 4 months ago
- ☆32Updated last year
- Latent Large Language Models☆17Updated 6 months ago
- Hidden Engrams: Long Term Memory for Transformer Model Inference☆35Updated 3 years ago
- 🤗Transformers: State-of-the-art Natural Language Processing for PyTorch and TensorFlow 2.0.☆56Updated 3 years ago
- Efficiently computing & storing token n-grams from large corpora☆18Updated 4 months ago
- Embeddings-focused small version of the Llama NLP model☆103Updated last year
- Command-line script for inferencing from models such as LLaMA, in a chat scenario, with LoRA adaptations☆33Updated last year
- Text-writing denoising diffusion (and much more)☆30Updated last year
- Chatbot that answers frequently asked questions in French, English, and Tunisian using the Rasa NLU framework and RWKV-4-Raven☆13Updated last year