fla-org / hybrid-distillation
☆27Dec 31, 2025Updated last month
Alternatives and similar repositories for hybrid-distillation
Users who are interested in hybrid-distillation are comparing it to the repositories listed below
- Use the tokenizer in parallel to achieve superior acceleration☆20Mar 21, 2024Updated last year
- 🔥 A minimal training framework for scaling FLA models☆344Nov 15, 2025Updated 3 months ago
- ☆221Nov 19, 2025Updated 2 months ago
- Code and data for paper "(How) do Language Models Track State?"☆21Mar 31, 2025Updated 10 months ago
- Expanding linear RNN state-transition matrix eigenvalues to include negatives improves state-tracking tasks and language modeling without…☆20Mar 15, 2025Updated 11 months ago
- Official implementation of Log-linear Sparse Attention (LLSA).☆56Feb 2, 2026Updated 2 weeks ago
- Stick-breaking attention☆62Jul 1, 2025Updated 7 months ago
- Experiments on the impact of depth in transformers and SSMs.☆40Oct 23, 2025Updated 3 months ago
- ☆129Jun 6, 2025Updated 8 months ago
- Official Code Repository for the paper "Key-value memory in the brain"☆31Feb 25, 2025Updated 11 months ago
- ☆44Nov 1, 2025Updated 3 months ago
- M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models☆46Jul 17, 2025Updated 6 months ago
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l…☆54Jan 12, 2026Updated last month
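- Re-identification of campus bicycles in surveillance-quality footage, including ReID model recognition, vector database retrieval, and a UI for display☆10Feb 13, 2024Updated 2 years ago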
- Workshop materials for AI Engineer World's Fair☆13Jun 3, 2025Updated 8 months ago
- [ICLR 2026] GRAPE: Group Representational Position Encoding (https://arxiv.org/abs/2512.07805)☆78Jan 27, 2026Updated 2 weeks ago
- Spectral Sphere Optimizer☆96Jan 14, 2026Updated last month
- Efficient Long-context Language Model Training by Core Attention Disaggregation☆89Jan 29, 2026Updated 2 weeks ago
- ☆12Jan 29, 2021Updated 5 years ago
- Implementation of Reinforce for educational purposes.☆12Jun 12, 2023Updated 2 years ago
- ☆12Jun 15, 2023Updated 2 years ago
- ☆11Dec 15, 2025Updated 2 months ago
- Math evaluations of llama models.☆10Jan 3, 2024Updated 2 years ago
- Persistent dense gemm for Hopper in `CuTeDSL`☆15Aug 9, 2025Updated 6 months ago
- LaTeX template for ITMO style presentations☆10Jan 19, 2025Updated last year
- JsonTuning: Towards Generalizable, Robust, and Controllable Instruction Tuning☆10Nov 3, 2024Updated last year
- High-performance tokenized language data-loader for Python C++ extension☆14Jul 22, 2024Updated last year
- LongAttn: Selecting Long-context Training Data via Token-level Attention☆15Jul 16, 2025Updated 7 months ago
- ☆10Dec 18, 2023Updated 2 years ago
- Fork of HyenaDNA, a long-range genomic foundation model built with Hyena☆10Aug 14, 2023Updated 2 years ago
- A toolkit for developers to simplify the transformation of nn.Module instances. It currently corresponds to PyTorch FX (torch.fx).☆13Apr 7, 2023Updated 2 years ago
- An implementation of the paper "Retentive Network: A Successor to Transformer for Large Language Models" https://arxiv.org/pdf/2307.08621.pdf☆11Jul 25, 2023Updated 2 years ago
- Efficient retrieval head analysis with triton flash attention that supports topK probability☆13Jun 15, 2024Updated last year
- Fully open reproduction of DeepSeek-R1☆12Mar 24, 2025Updated 10 months ago
- Code for "AtTGen: Attribute Tree Generation for Real-World Attribute Joint Extraction", ACL 2023☆13May 19, 2023Updated 2 years ago
- The 30 must-read deep learning papers recommended by Ilya Sutskever (bilingual Chinese-English translated edition)☆13Dec 18, 2024Updated last year
- ☆14Dec 25, 2024Updated last year
- ☆15Jul 13, 2025Updated 7 months ago
- Shows how to use CUDA on Google Colab☆13Feb 24, 2025Updated 11 months ago