kolinko / effort
An implementation of bucketMul LLM inference
☆214 · Updated 6 months ago
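The description names bucketMul, effort's approximate-multiplication scheme in which you choose what fraction of the multiply-adds to actually perform, spending them on the highest-importance weight-times-input products. As a hedged illustration only (a pure-Python sketch of that general idea, not the repository's actual kernel; `approx_matvec` and its `effort` parameter are hypothetical names):

```python
def approx_matvec(W, x, effort=0.25):
    """Approximate W @ x by performing only the top `effort` fraction
    of multiply-adds, ranked by |w * x_j|. Illustrative sketch of the
    "spend less effort, skip small contributions" idea; not effort's
    real bucketed implementation.
    """
    # Rank every individual product by magnitude.
    scores = sorted(
        ((abs(w * x[j]), i, j) for i, row in enumerate(W) for j, w in enumerate(row)),
        reverse=True,
    )
    # Keep only the largest-magnitude fraction of products.
    keep = scores[: max(1, int(effort * len(scores)))]
    out = [0.0] * len(W)
    for _, i, j in keep:
        out[i] += W[i][j] * x[j]
    return out


# With effort=1.0 this reduces to the exact matrix-vector product;
# lowering effort trades accuracy for fewer multiply-adds.
W = [[1.0, 2.0], [3.0, 4.0]]
x = [5.0, 6.0]
print(approx_matvec(W, x, effort=1.0))   # exact: [17.0, 39.0]
print(approx_matvec(W, x, effort=0.25))  # only the largest product kept
```

The real project precomputes weight orderings (buckets) offline so the top-fraction selection is cheap at inference time; the sort above is just for clarity.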
Alternatives and similar repositories for effort:
Users interested in effort are comparing it to the repositories listed below.
- Mistral7B playing DOOM ☆123 · Updated 6 months ago
- Visualize the intermediate output of Mistral 7B ☆333 · Updated 11 months ago
- ☆112 · Updated 2 months ago
- WebGPU LLM inference tuned by hand ☆148 · Updated last year
- Visualizing the internal board state of a GPT trained on chess PGN strings, and performing interventions on its internal board state and … ☆198 · Updated last month
- Stop messing around with finicky sampling parameters and just use DRµGS! ☆336 · Updated 7 months ago
- Fast parallel LLM inference for MLX ☆152 · Updated 6 months ago
- a small code base for training large models ☆283 · Updated 3 weeks ago
- Bayesian Optimization as a Coverage Tool for Evaluating LLMs. Accurate evaluation (benchmarking) that's 10 times faster with just a few l… ☆276 · Updated 3 weeks ago
- ☆237 · Updated 9 months ago
- Lightweight Nearest Neighbors with Flexible Backends ☆194 · Updated last week
- Stateful load balancer custom-tailored for llama.cpp 🏓🦙 ☆666 · Updated last week
- Run GGML models with Kubernetes. ☆173 · Updated last year
- This project collects GPU benchmarks from various cloud providers and compares them to fixed per token costs. Use our tool for efficient … ☆216 · Updated last month
- ☆163 · Updated 7 months ago
- 1.58 Bit LLM on Apple Silicon using MLX ☆178 · Updated 8 months ago
- A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and full… ☆605 · Updated last month
- Finetune llama2-70b and codellama on MacBook Air without quantization ☆448 · Updated 9 months ago
- run paligemma in real time ☆129 · Updated 7 months ago
- a curated list of data for reasoning ai ☆118 · Updated 5 months ago
- Absolute minimalistic implementation of a GPT-like transformer using only numpy (<650 lines). ☆250 · Updated last year
- An mlx project to train a base model on your whatsapp chats using (Q)Lora finetuning ☆162 · Updated last year
- ☆185 · Updated last month
- Tiny inference-only implementation of LLaMA ☆91 · Updated 9 months ago
- ☆250 · Updated this week
- LLM-based code completion engine ☆178 · Updated last month
- Code to train and evaluate Neural Attention Memory Models to obtain universally-applicable memory systems for transformers. ☆273 · Updated 2 months ago
- ☆85 · Updated 3 months ago
- an implementation of Self-Extend, to expand the context window via grouped attention ☆118 · Updated last year
- LLaVA server (llama.cpp). ☆176 · Updated last year