apple / ml-cross-entropy
★510 · Updated last week
Alternatives and similar repositories for ml-cross-entropy
Users interested in ml-cross-entropy are comparing it to the libraries listed below.
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ★533 · Updated 2 months ago
- Helpful tools and examples for working with flex-attention ★908 · Updated 3 weeks ago
- Scalable toolkit for efficient model reinforcement ★578 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★258 · Updated 2 weeks ago
- Large Context Attention ★720 · Updated 6 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ★344 · Updated 7 months ago
- Efficient LLM Inference over Long Sequences ★385 · Updated last month
- ★293 · Updated 3 months ago
- LLM KV cache compression made easy ★566 · Updated last week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ★567 · Updated this week
- Ring attention implementation with flash attention ★831 · Updated this week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ★324 · Updated 3 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ★433 · Updated 2 months ago
- 🔥 A minimal training framework for scaling FLA models ★220 · Updated last month
- Normalized Transformer (nGPT) ★185 · Updated 8 months ago
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ★778 · Updated 4 months ago
- Muon is an optimizer for hidden layers in neural networks ★1,454 · Updated 3 weeks ago
- Load compute kernels from the Hub ★220 · Updated last week
- Efficient Triton implementation of Native Sparse Attention. ★186 · Updated 2 months ago
- ★190 · Updated 7 months ago
- Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper ★700 · Updated last month
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ★483 · Updated 5 months ago
- [ICML 2024] CLLMs: Consistency Large Language Models ★397 · Updated 8 months ago
- Scalable and Performant Data Loading ★291 · Updated this week
- H-Net: Hierarchical Network with Dynamic Chunking ★632 · Updated last week
- ★182 · Updated this week
- [ICLR 2025 Spotlight 🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ★567 · Updated 5 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ★244 · Updated 6 months ago
- Megatron's multi-modal data loader ★232 · Updated last week
- ★208 · Updated 5 months ago
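For context on the repository these projects are being compared against: apple/ml-cross-entropy implements Cut Cross-Entropy (CCE), which computes the cross-entropy loss for large-vocabulary language models without materializing the full logit matrix. The sketch below is not the library's API; it is a plain-PyTorch illustration, with assumed tensor shapes, of the simpler token-chunking idea that the same memory pressure motivates.

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, classifier, targets, chunk_size=4096):
    """Cross-entropy over a large vocabulary, computing logits for only
    `chunk_size` tokens at a time instead of the full (num_tokens, vocab_size)
    matrix. Shapes here are assumptions for illustration:
      hidden:     (num_tokens, d_model)  final hidden states
      classifier: (vocab_size, d_model)  output-projection weight
      targets:    (num_tokens,)          target token ids
    """
    total = hidden.new_zeros(())
    for start in range(0, hidden.shape[0], chunk_size):
        h = hidden[start:start + chunk_size]          # (chunk, d_model)
        logits = h @ classifier.T                     # (chunk, vocab_size)
        total = total + F.cross_entropy(
            logits, targets[start:start + chunk_size], reduction="sum"
        )
    return total / hidden.shape[0]
```

Note that under autograd this naive loop still retains each chunk's logits for the backward pass, so it mainly helps forward-only evaluation; CCE's kernels avoid materializing the logits altogether, which is why it sits alongside the memory-focused attention and KV-cache projects listed above.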