AllenJWZhu / LlamaInfer
LLM Inference Engine: a high-performance, CUDA-accelerated framework for large language model (LLM) inference. A cutting-edge, open-source implementation of an LLM inference engine optimized for consumer-grade hardware, this project showcases advanced techniques in GPU acceleration, memory management, and algorithmic optimization.
☆11 · Updated last year
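As a rough illustration of the kind of CUDA kernel such an engine typically relies on for batch-1 token generation (this is not code from the LlamaInfer repository; the kernel name, signature, and launch shape are illustrative assumptions), a warp-per-row FP32 matrix-vector multiply might look like the sketch below:

```cuda
// Illustrative sketch only: a warp-per-row FP32 matrix-vector kernel of the
// kind an LLM decoding engine might use for batch-1 token generation.
// All names are hypothetical and not taken from the LlamaInfer sources.
#include <cuda_runtime.h>

__global__ void gemv_warp_per_row(const float* __restrict__ W,  // [rows, cols], row-major
                                  const float* __restrict__ x,  // [cols]
                                  float* __restrict__ y,        // [rows]
                                  int rows, int cols) {
    // One warp computes one output row.
    int row  = blockIdx.x * (blockDim.x / 32) + threadIdx.x / 32;
    int lane = threadIdx.x % 32;
    if (row >= rows) return;

    // Each lane accumulates a strided partial dot product over the row.
    float acc = 0.0f;
    for (int c = lane; c < cols; c += 32)
        acc += W[(size_t)row * cols + c] * x[c];

    // Warp-level tree reduction of the 32 partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        acc += __shfl_down_sync(0xffffffffu, acc, offset);

    if (lane == 0) y[row] = acc;
}

// Example launch with 4 warps (128 threads) per block:
//   gemv_warp_per_row<<<(rows + 3) / 4, 128>>>(W, x, y, rows, cols);
```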
Alternatives and similar repositories for LlamaInfer
Users that are interested in LlamaInfer are comparing it to the libraries listed below
- Efficient and Scalable Estimation of Tool Representations in Vector Space ☆29 · Updated last year
- Repository for Go shared libraries (for now). ☆11 · Updated 2 months ago
- CUDA extensions for PyTorch ☆12 · Updated 2 months ago
- Pure C inference for the GTE Small embedding model ☆101 · Updated 3 weeks ago
- Example using echo conversational agent server ☆14 · Updated last year
- Proof of concept minimal Linux distribution written in Lua and built with Bazel. Bazel enables one command to download Lua, compile Lua b… ☆17 · Updated last month
- Memory map objects in S3 ☆23 · Updated 6 years ago
- Asynchronous/distributed speculative evaluation for llama3 ☆40 · Updated last year
- The last-write-wins register CRDT ☆16 · Updated last year
- Notes, config, tools, etc. for kicking the tires on CockroachDB ☆11 · Updated 10 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆11 · Updated last year
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning ☆417 · Updated last month
- Toy distributed PostgreSQL by implementing SQL over KV ☆11 · Updated 3 weeks ago
- First token cutoff sampling inference example ☆30 · Updated 2 years ago
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆53 · Updated 10 months ago
- Thin wrapper around GGML to make life easier ☆42 · Updated 3 months ago
- Multiplexer for AI protocols, agents, models, and tools ☆24 · Updated 6 months ago
- A minimalistic C++ Jinja templating engine for LLM chat templates ☆203 · Updated 4 months ago
- A massively parallel, optimal functional runtime in Rust ☆31 · Updated last year
- Iterate quickly with llama.cpp hot reloading. Use the llama.cpp bindings with bun.sh ☆50 · Updated 2 years ago
- Fun with wgpu: Simulating slime mold ☆24 · Updated last year
- mHC kernels implemented in CUDA ☆249 · Updated 3 weeks ago
- Inference of RWKV v7 in pure C. ☆44 · Updated 4 months ago
- 🛰 Uplink cluster management for Instellar ☆13 · Updated last year
- Inference of Mamba and Mamba2 models in pure C ☆196 · Updated 2 weeks ago
- Extracts structured data from unstructured input. Programming language agnostic. Uses llama.cpp ☆45 · Updated last year
- High-Performance FP32 GEMM on CUDA devices ☆117 · Updated last year
- An LLM inference engine, written in C++ ☆18 · Updated this week
- Unleash the full potential of exascale LLMs on consumer-class GPUs, proven by extensive benchmarks, with no long-term adjustments and min… ☆26 · Updated last year
- Because it's there. ☆16 · Updated last year