microsoft / encoder-decoder-slm
Efficient encoder-decoder architecture for small language models (≤1B parameters) with cross-architecture knowledge distillation and vision-language capabilities
☆23 · Updated 2 months ago
Alternatives and similar repositories for encoder-decoder-slm:
Users interested in encoder-decoder-slm are comparing it to the libraries listed below.
- NanoGPT (124M) quality in 2.67B tokens ☆28 · Updated this week
- This repo is based on https://github.com/jiaweizzhao/GaLore ☆26 · Updated 7 months ago
- Train a SmolLM-style LLM on fineweb-edu in JAX/Flax with an assortment of optimizers. ☆17 · Updated last month
- Truly flash implementation of the DeBERTa disentangled attention mechanism. ☆45 · Updated last week
- ☆77 · Updated 8 months ago
- ☆48 · Updated 5 months ago
- ☆33 · Updated 10 months ago
- ☆41 · Updated 2 months ago
- Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812) ☆28 · Updated last month
- Official implementation of "BERTs are Generative In-Context Learners" ☆26 · Updated last month
- ☆47 · Updated 7 months ago
- ☆43 · Updated last year
- ☆79 · Updated last year
- GoldFinch and other hybrid transformer components ☆45 · Updated 9 months ago
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆37 · Updated last year
- Fast, Modern, Memory Efficient, and Low Precision PyTorch Optimizers ☆90 · Updated 9 months ago
- One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation ☆39 · Updated 6 months ago
- XTR: Rethinking the Role of Token Retrieval in Multi-Vector Retrieval ☆50 · Updated 10 months ago
- ☆25 · Updated last year
- Repository containing the SPIN experiments on the DIBT 10k ranked prompts ☆24 · Updated last year
- The simplest, fastest repository for training/finetuning medium-sized xLSTMs. ☆42 · Updated 11 months ago
- A byte-level decoder architecture that matches the performance of tokenized Transformers. ☆63 · Updated last year
- Triton implementation of the HyperAttention algorithm ☆47 · Updated last year
- NanoGPT speedrunning for the poor T4 enjoyers ☆62 · Updated this week
- DPO, but faster 🚀 ☆40 · Updated 4 months ago
- ☆49 · Updated last year
- Code repository for the paper "MrT5: Dynamic Token Merging for Efficient Byte-level Language Models" ☆38 · Updated last week
- Collection of autoregressive model implementations ☆85 · Updated 2 months ago
- Supercharge Hugging Face Transformers with model parallelism. ☆76 · Updated 6 months ago
- Minimum Description Length probing for neural network representations ☆19 · Updated 2 months ago