LastBotInc / llama2j
Pure Java Llama2 inference with optional multi-GPU CUDA implementation
☆13 · Updated 2 years ago
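As a rough illustration of what "pure Java Llama2 inference" involves, here is a minimal sketch of one of the kernels such a port has to implement: RMSNorm, the normalization used throughout the Llama 2 architecture. The class and method names are illustrative only, not llama2j's actual API.

```java
import java.util.Arrays;

// Illustrative sketch only (not llama2j's API): RMSNorm as used in Llama 2,
// written in plain Java with no native dependencies.
final class RmsNormSketch {
    // out[i] = weight[i] * x[i] / sqrt(mean(x^2) + eps)
    static void rmsnorm(float[] out, float[] x, float[] weight, float eps) {
        float ss = 0f;
        for (float v : x) {
            ss += v * v;
        }
        float scale = (float) (1.0 / Math.sqrt(ss / x.length + eps));
        for (int i = 0; i < x.length; i++) {
            out[i] = weight[i] * x[i] * scale;
        }
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] w = {1f, 1f, 1f, 1f};
        float[] out = new float[x.length];
        rmsnorm(out, x, w, 1e-5f);
        System.out.println(Arrays.toString(out));
    }
}
```

In a multi-GPU CUDA variant, loops like this are the pieces that get replaced by device kernels while the surrounding model plumbing stays in Java.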
Alternatives and similar repositories for llama2j
Users interested in llama2j are comparing it to the libraries listed below.
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆128 · Updated 10 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆197 · Updated 3 weeks ago
- Modular and structured prompt caching for low-latency LLM inference ☆100 · Updated 11 months ago
- The driver for LMCache core to run in vLLM ☆52 · Updated 8 months ago
- LLM Serving Performance Evaluation Harness ☆79 · Updated 7 months ago
- ☆25 · Updated 5 months ago
- ☆47 · Updated last year
- ☆95 · Updated 6 months ago
- ☆58 · Updated 4 months ago
- vLLM Router ☆45 · Updated last year
- Fast and memory-efficient exact attention ☆94 · Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆115 · Updated last year
- JAX backend for SGL ☆71 · Updated this week
- ☆46 · Updated 9 months ago
- ☆72 · Updated 6 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆83 · Updated this week
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆58 · Updated 2 weeks ago
- Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks ☆144 · Updated 2 years ago
- ☆39 · Updated 2 months ago
- Efficient Compute-Communication Overlap for Distributed LLM Inference ☆58 · Updated last week
- A from-scratch C implementation of the multi-head latent attention described in the DeepSeek-V3 technical report. ☆19 · Updated 8 months ago
- vLLM performance dashboard ☆36 · Updated last year
- Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system ☆83 · Updated 3 weeks ago
- An extension of the Llama2.java implementation, accelerated on GPUs using TornadoVM ☆26 · Updated last year
- Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport ☆63 · Updated 5 months ago
- Ongoing research training transformer models at scale ☆29 · Updated this week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference ☆270 · Updated this week
- Inference Llama 2 in one file of pure Java ☆18 · Updated last year
- KV cache compression for high-throughput LLM inference ☆141 · Updated 8 months ago
- ☆64 · Updated 5 months ago