apple / ml-cross-entropy
☆331 · Updated 2 weeks ago
Alternatives and similar repositories for ml-cross-entropy:
Users interested in ml-cross-entropy are comparing it to the libraries listed below.
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in PyTorch · ☆501 · Updated 3 months ago
- Efficient LLM Inference over Long Sequences · ☆356 · Updated this week
- LLM KV cache compression made easy · ☆384 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton · ☆513 · Updated this week
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (see the sketch after this list) · ☆294 · Updated 2 months ago
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… · ☆221 · Updated this week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" · ☆215 · Updated 2 weeks ago
- Helpful tools and examples for working with flex-attention · ☆630 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads · ☆424 · Updated this week
- ☆165 · Updated 2 months ago
- Ring attention implementation with flash attention · ☆673 · Updated last month
- Large Context Attention · ☆681 · Updated 3 weeks ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs · ☆228 · Updated this week
- Triton-based implementation of Sparse Mixture of Experts · ☆196 · Updated 2 months ago
- This repository contains the experimental PyTorch native float8 training UX · ☆221 · Updated 6 months ago
- [ICML 2024] CLLMs: Consistency Large Language Models · ☆371 · Updated 3 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) · ☆385 · Updated 2 months ago
- Muon optimizer: roughly +30% sample efficiency with <3% wall-clock overhead (a minimal sketch of its core step follows this list) · ☆251 · Updated last week
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" · ☆220 · Updated 2 months ago
- ring-attention experiments · ☆123 · Updated 3 months ago
- ☆175 · Updated this week
- ☆195 · Updated 3 weeks ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · ☆331 · Updated 6 months ago
- ☆251 · Updated 5 months ago
- PyTorch implementation of models from the Zamba2 series · ☆176 · Updated 3 weeks ago
- KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch -> CUDA problems · ☆166 · Updated this week
- Scalable and Performant Data Loading · ☆217 · Updated this week
- Minimalistic 4D-parallelism distributed training framework for education purposes · ☆717 · Updated this week
- Some preliminary explorations of Mamba's context scaling · ☆213 · Updated last year
- ☆111 · Updated 4 months ago
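
The memory-layers entry above is truncated, but the mechanism it describes is simple enough to sketch. Below is a minimal, hypothetical PyTorch module (all names are ours, not the repo's API) showing the core idea: a large trainable key-value table from which each token retrieves only its top-k entries, so parameter count grows while per-token compute stays nearly flat.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVMemoryLayer(nn.Module):
    """Toy key-value memory layer (hypothetical, for illustration only).

    Parameters scale with n_keys, but each token only aggregates its
    top-k values. Note: this toy version still scores *all* keys densely;
    real memory layers factor the keys (e.g. product keys) so that even
    the scoring step is sub-linear in n_keys.
    """
    def __init__(self, d_model: int, n_keys: int = 4096, topk: int = 4):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_keys, d_model) / d_model**0.5)
        self.values = nn.Parameter(torch.randn(n_keys, d_model) / d_model**0.5)
        self.topk = topk

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        scores = x @ self.keys.T                  # (batch, seq, n_keys)
        w, idx = scores.topk(self.topk, dim=-1)   # keep only top-k entries
        w = F.softmax(w, dim=-1)
        v = self.values[idx]                      # (batch, seq, topk, d_model)
        return x + (w.unsqueeze(-1) * v).sum(dim=-2)  # residual update
```

Usage is drop-in, e.g. `KVMemoryLayer(512)(torch.randn(2, 16, 512))` in place of (or alongside) an FFN block.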
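For the Muon entry, the headline numbers come from orthogonalizing the momentum update of 2D weight matrices via a Newton-Schulz iteration. The sketch below paraphrases the publicly described algorithm; the coefficients, step count, and hyperparameters are assumptions from memory and may differ from the repo, which also omits details such as per-shape learning-rate scaling and falling back to AdamW for non-2D parameters.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately map a 2D matrix to the nearest semi-orthogonal matrix
    # with a quintic Newton-Schulz iteration. Constants are assumptions.
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)          # normalize so the iteration converges
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T                        # iterate on the wide orientation
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(param, grad, momentum_buf, lr=0.02, beta=0.95):
    # One common variant: SGD momentum, then orthogonalize the update.
    momentum_buf.mul_(beta).add_(grad)
    update = newton_schulz_orthogonalize(momentum_buf)
    param.data.add_(update, alpha=-lr)
```

The design intuition is that orthogonalizing the update equalizes the scale of its singular directions, which is cheap relative to a training step (hence the small wall-clock overhead claimed above).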