apple / ml-cross-entropy
★489 · Updated last week
Alternatives and similar repositories for ml-cross-entropy
Users interested in ml-cross-entropy are comparing it to the libraries listed below.
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ★529 · Updated 2 months ago
- Helpful tools and examples for working with flex-attention ★876 · Updated this week
- Large Context Attention ★718 · Updated 5 months ago
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★255 · Updated last week
- Scalable toolkit for efficient model reinforcement ★499 · Updated this week
- ★290 · Updated 2 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ★429 · Updated 2 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ★560 · Updated this week
- Muon is an optimizer for hidden layers in neural networks ★1,092 · Updated this week
- Efficient LLM Inference over Long Sequences ★382 · Updated 3 weeks ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (a minimal sketch of this lookup follows the list) ★341 · Updated 7 months ago
- LLM KV cache compression made easy ★535 · Updated last week
- [ICML 2024] CLLMs: Consistency Large Language Models ★396 · Updated 8 months ago
- Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ★725 · Updated 3 months ago
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ★318 · Updated 2 months ago
- A minimal training framework for scaling FLA models ★188 · Updated last month
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ★243 · Updated 5 months ago
- Normalized Transformer (nGPT) ★184 · Updated 7 months ago
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ★668 · Updated last month
- ★200 · Updated 5 months ago
- H-Net: Hierarchical Network with Dynamic Chunking ★115 · Updated this week
- ★181 · Updated 7 months ago
- [ICLR 2025 Spotlight] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ★564 · Updated 5 months ago
- Load compute kernels from the Hub ★207 · Updated this week
- Scalable and Performant Data Loading ★288 · Updated last week
- A family of compressed models obtained via pruning and knowledge distillation ★344 · Updated 8 months ago
- Ring attention implementation with flash attention ★802 · Updated 2 weeks ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ★237 · Updated last month
- Efficient Triton implementation of Native Sparse Attention. ★175 · Updated last month
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ★287 · Updated last month
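
The memory-layers entry above describes a trainable key-value lookup that adds parameters to a model without a matching increase in per-token FLOPs. The sketch below illustrates that idea only; the `MemoryLayer` class, slot count, and top-k routing are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MemoryLayer(nn.Module):
    """Illustrative trainable key-value memory lookup (a sketch, not the repo's implementation)."""

    def __init__(self, dim: int, num_slots: int = 256, top_k: int = 4):
        super().__init__()
        # Memory keys and values are plain trainable parameters (assumed layout).
        self.keys = nn.Parameter(torch.randn(num_slots, dim) / dim ** 0.5)
        self.values = nn.Parameter(torch.randn(num_slots, dim) / dim ** 0.5)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). Score every key (real implementations use
        # product-key routing to avoid this full scan), then read from only
        # the top_k value slots, so the value table can grow without the
        # per-token read growing with it.
        scores = x @ self.keys.t()                          # (batch, seq, num_slots)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)            # (batch, seq, top_k)
        retrieved = self.values[topk_idx]                   # (batch, seq, top_k, dim)
        # Weighted sum over the retrieved slots, added residually.
        return x + (weights.unsqueeze(-1) * retrieved).sum(dim=-2)


# Quick check on toy shapes.
layer = MemoryLayer(dim=64)
out = layer(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

This sketch keeps a dense key scan for brevity; practical memory-layer implementations rely on product-key lookup and sharded value tables so that neither the scoring nor the read scales with the full slot count.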