adamydwang / mobilellama
A lightweight C++ LLaMA inference engine for mobile devices
☆13, updated last year
Alternatives and similar repositories for mobilellama
Users interested in mobilellama are comparing it to the repositories listed below.
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators (☆19, updated this week)
- This repository is a read-only mirror of https://gitlab.arm.com/kleidi/kleidiai (☆51, updated this week)
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference (☆38, updated 2 weeks ago)
- Nsight Compute in Docker (☆12, updated last year)
- ☢️ TensorRT 2023 competition, second round: Llama model inference acceleration and optimization based on TensorRT-LLM (☆48, updated last year)
- Qwen2 and Llama 3 C++ implementation (☆44, updated last year)
- Performance of the C++ interface of FlashAttention and FlashAttention v2 in large language model (LLM) inference scenarios (☆16, updated last year)
- Snapdragon Neural Processing Engine (SNPE) SDK: a Qualcomm Snapdragon software accelerate… (☆34, updated 3 years ago)
- Benchmark tests supporting the TiledCUDA library (☆16, updated 7 months ago)
- LLM deployment project based on ONNX (☆42, updated 8 months ago)
- A practical way of learning Swizzle (☆20, updated 4 months ago)
- Inference of RWKV v5, v6, and v7 with the Qualcomm AI Engine Direct SDK (☆72, updated last week)
- Step-by-step SGEMM optimization with CUDA (☆19, updated last year)
- Tiny C++11 GPT-2 inference implementation from scratch (☆62, updated last month)
- Standalone Flash Attention v2 kernel without libtorch dependency (☆110, updated 9 months ago)
- Quantized Attention on GPU (☆44, updated 7 months ago)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆16, updated last year)
- Awesome code, projects, books, etc. related to CUDA (☆17, updated last week)
- Experiments with BitNet inference on CPU (☆54, updated last year)
- Triton adapter for Ascend; mirror of https://gitee.com/ascend/triton-ascend (☆54, updated this week)
- Compression for Foundation Models (☆32, updated 3 months ago)