Example of applying CUDA graphs to LLaMA-v2
☆12 · Aug 25, 2023 · Updated 2 years ago
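The titular repository applies CUDA graphs to LLaMA-v2 inference: capture the decode step's kernel launches once, then replay the whole sequence with a single launch to cut CPU launch overhead. A minimal sketch of that capture/replay pattern using PyTorch's `torch.cuda.CUDAGraph` API, with a stand-in linear layer rather than the repository's actual LLaMA code:

```python
# Sketch of CUDA graph capture around a forward pass (assumes PyTorch
# with a CUDA device; the Linear model is a placeholder, not LLaMA-v2).
import torch

def capture_and_replay():
    model = torch.nn.Linear(64, 64).cuda()
    # Graph replay reuses fixed buffers, so inputs/outputs must be static.
    static_input = torch.randn(8, 64, device="cuda")

    # Warm up on a side stream so capture sees steady-state allocations.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_input)
    torch.cuda.current_stream().wait_stream(s)

    # Capture the forward pass into a graph.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_output = model(static_input)

    # Replay: copy fresh data into the static buffer, then relaunch the
    # entire captured kernel sequence with one call.
    static_input.copy_(torch.randn(8, 64, device="cuda"))
    g.replay()
    torch.cuda.synchronize()
    return static_output.shape

if torch.cuda.is_available():
    print(capture_and_replay())
```

For an autoregressive decoder the same idea applies per decode step, with the KV cache held in static buffers between replays.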
Alternatives and similar repositories for llama-cuda-graph-example
Users that are interested in llama-cuda-graph-example are comparing it to the libraries listed below.
- Distributed SDDMM Kernel ☆12 · Jul 8, 2022 · Updated 3 years ago
- GPU operators for sparse tensor operations ☆35 · Mar 11, 2024 · Updated 2 years ago
- Experiment of using Tangent to autodiff triton ☆82 · Jan 22, 2024 · Updated 2 years ago
- torch.compile artifacts for common deep learning models, can be used as a learning resource for torch.compile ☆19 · Dec 22, 2023 · Updated 2 years ago
- Pure Java Llama2 inference with optional multi-GPU CUDA implementation ☆13 · Sep 2, 2023 · Updated 2 years ago
- JAX Scalify: end-to-end scaled arithmetics ☆18 · Oct 30, 2024 · Updated last year
- ☆15 · Jul 13, 2025 · Updated 8 months ago
- Frontend for v2.opyn.co ☆11 · May 28, 2023 · Updated 2 years ago
- a fast implementation of BM25 ☆10 · Sep 15, 2022 · Updated 3 years ago
- A flexible Handlebars view engine for Express ☆12 · Jul 6, 2016 · Updated 9 years ago
- Storytelling With Matplotlib (SWMat) ☆13 · Jul 25, 2019 · Updated 6 years ago
- Factories over fixtures. Chai Assertion Library. ☆23 · Nov 1, 2016 · Updated 9 years ago
- Todo List example using React and Apollo Client ☆10 · Mar 2, 2017 · Updated 9 years ago
- ☆25 · Sep 9, 2024 · Updated last year
- train with kittens! ☆64 · Oct 25, 2024 · Updated last year
- Mathematical expression evaluator with just-in-time code generation. ☆12 · Apr 7, 2013 · Updated 12 years ago
- Sardeenz is a proof-of-concept application that allows you to load more than one model on a given GPU. It allows you to add more and more… ☆38 · Mar 5, 2026 · Updated 2 weeks ago
- This repository contains the results and code for the MLPerf™ Inference v4.0 benchmark. ☆11 · Jul 24, 2025 · Updated 8 months ago
- A fork of the PEFT library, supporting Robust Adaptation (RoSA) ☆15 · Aug 16, 2024 · Updated last year
- Service for estimating gas on a series of dependent transactions ☆19 · Jun 16, 2023 · Updated 2 years ago
- A PyTorch implementation of focal loss ☆10 · Jan 9, 2020 · Updated 6 years ago
- ☆12 · Mar 31, 2021 · Updated 4 years ago
- ☆49 · Apr 15, 2024 · Updated last year
- IPFS-related scripts and utilities ☆15 · Sep 23, 2021 · Updated 4 years ago
- An ultra-fast, distributed Safetensors loader ☆31 · Updated this week
- [IEEE CAL 2025] Accelerating Page Migrations in Operating Systems with Intel DSA ☆16 · Nov 20, 2024 · Updated last year
- ☆12 · Jun 3, 2019 · Updated 6 years ago
- Generative Agents: Interactive Simulacra of Human Behavior - with Local LLMs ☆18 · Aug 15, 2023 · Updated 2 years ago
- Accelerating GPU Data Processing using FastLanes Compression ☆17 · May 9, 2024 · Updated last year
- ☆13 · Jan 7, 2025 · Updated last year
- Unleash the full potential of exascale LLMs on consumer-class GPUs, proven by extensive benchmarks, with no long-term adjustments and min… ☆26 · Nov 11, 2024 · Updated last year
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆95 · Feb 20, 2026 · Updated last month
- Inference Llama/Llama2/Llama3 Models in NumPy ☆21 · Nov 22, 2023 · Updated 2 years ago
- Draft-Target Disaggregation LLM Serving System via Parallel Speculative Decoding. ☆180 · Mar 18, 2026 · Updated last week
- UVA command-line client to upload solutions and search for statistics ☆10 · Dec 23, 2016 · Updated 9 years ago
- Keyformer proposes KV cache reduction through key-token identification, without the need for fine-tuning ☆57 · Mar 26, 2024 · Updated last year
- ☆13 · May 25, 2023 · Updated 2 years ago
- A simple CUDA C application for Nvidia GPUs ☆11 · Jun 7, 2022 · Updated 3 years ago
- Demonstration that fine-tuning a RoPE model on longer sequences than the pre-trained model adapts the model's context limit ☆63 · Jun 21, 2023 · Updated 2 years ago