LastBotInc / llama2j
Pure Java Llama2 inference with optional multi-GPU CUDA implementation
☆13 · Updated 2 years ago
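As a rough illustration of what "pure Java Llama2 inference" involves, here is a minimal sketch of one of the kernels such a port has to implement: RMSNorm, the normalization used throughout the Llama 2 architecture. The class and method names are illustrative only, not llama2j's actual API.

```java
import java.util.Arrays;

// Illustrative sketch only (not llama2j's API): RMSNorm as used in Llama 2,
// written in plain Java with no native dependencies.
final class RmsNormSketch {
    // out[i] = weight[i] * x[i] / sqrt(mean(x^2) + eps)
    static void rmsnorm(float[] out, float[] x, float[] weight, float eps) {
        float ss = 0f;
        for (float v : x) {
            ss += v * v;
        }
        float scale = (float) (1.0 / Math.sqrt(ss / x.length + eps));
        for (int i = 0; i < x.length; i++) {
            out[i] = weight[i] * x[i] * scale;
        }
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] w = {1f, 1f, 1f, 1f};
        float[] out = new float[x.length];
        rmsnorm(out, x, w, 1e-5f);
        System.out.println(Arrays.toString(out));
    }
}
```

In a multi-GPU CUDA variant, loops like this are the pieces that get replaced by device kernels while the surrounding model plumbing stays in Java.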
Alternatives and similar repositories for llama2j
Users interested in llama2j are comparing it to the libraries listed below.
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆128 · Updated 10 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆197 · Updated 3 weeks ago
- Modular and structured prompt caching for low-latency LLM inference ☆100 · Updated 11 months ago
- The driver for LMCache core to run in vLLM ☆52 · Updated 8 months ago
- LLM Serving Performance Evaluation Harness ☆79 · Updated 7 months ago
- ☆25 · Updated 5 months ago
- ☆47 · Updated last year
- ☆95 · Updated 6 months ago
- ☆58 · Updated 4 months ago
- vLLM Router ☆45 · Updated last year
- Fast and memory-efficient exact attention ☆94 · Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆115 · Updated last year
- JAX backend for SGL ☆71 · Updated this week
- ☆46 · Updated 9 months ago
- ☆72 · Updated 6 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆83 · Updated this week
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆58 · Updated 2 weeks ago
- Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks ☆144 · Updated 2 years ago
- ☆39 · Updated 2 months ago
- Efficient Compute-Communication Overlap for Distributed LLM Inference ☆58 · Updated last week
- A from-scratch C implementation of the multi-head latent attention described in the DeepSeek-V3 technical report. ☆19 · Updated 8 months ago
- vLLM performance dashboard ☆36 · Updated last year
- Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system ☆83 · Updated 3 weeks ago
- An extension of the Llama2.java implementation, accelerated on GPUs using TornadoVM ☆26 · Updated last year
- Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport ☆63 · Updated 5 months ago
- Ongoing research training transformer models at scale ☆29 · Updated this week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆270 · Updated this week
- Inference Llama 2 in one file of pure Java ☆18 · Updated last year
- KV cache compression for high-throughput LLM inference ☆141 · Updated 8 months ago
- ☆64 · Updated 5 months ago