huggingface / fineweb-2
☆116 Updated 5 months ago
Alternatives and similar repositories for fineweb-2
Users who are interested in fineweb-2 are comparing it to the repositories listed below
- Manage scalable open LLM inference endpoints in Slurm clusters ☆258 Updated 10 months ago
- PyTorch building blocks for the OLMo ecosystem ☆222 Updated this week
- Official repository for "Scaling Retrieval-Based Language Models with a Trillion-Token Datastore". ☆201 Updated 3 weeks ago
- This is the official repository for Inheritune. ☆111 Updated 3 months ago
- ☆121 Updated last month
- 🚢 Data Toolkit for Sailor Language Models ☆91 Updated 3 months ago
- Official repository for paper "ReasonIR Training Retrievers for Reasoning Tasks". ☆162 Updated last month
- Lightweight demos for finetuning LLMs. Powered by 🤗 transformers and open-source datasets. ☆77 Updated 7 months ago
- Lightweight toolkit package to train and fine-tune 1.58bit Language models ☆69 Updated 2 weeks ago
- General Reasoner: Advancing LLM Reasoning Across All Domains ☆126 Updated last week
- Code for Zero-Shot Tokenizer Transfer ☆128 Updated 4 months ago
- Reproducible, flexible LLM evaluations ☆204 Updated 3 weeks ago
- ☆27 Updated last month
- Maya: An Instruction Finetuned Multilingual Multimodal Model using Aya ☆110 Updated 2 weeks ago
- Let's build better datasets, together! ☆259 Updated 5 months ago
- Pretraining Efficiently on S2ORC! ☆164 Updated 7 months ago
- ☆120 Updated 8 months ago
- EvolKit is an innovative framework designed to automatically enhance the complexity of instructions used for fine-tuning Large Language M… ☆220 Updated 7 months ago
- Code for ExploreTom ☆83 Updated 5 months ago
- MultilingualSIFT: Multilingual Supervised Instruction Fine-tuning ☆89 Updated last year
- ☆149 Updated last year
- LongRoPE is a novel method that can extend the context window of pre-trained LLMs to 2048k tokens. ☆228 Updated 9 months ago
- Code for KaLM-Embedding models ☆78 Updated 2 months ago
- Awesome synthetic (text) datasets ☆281 Updated 7 months ago
- The official evaluation suite and dynamic data release for MixEval. ☆242 Updated 6 months ago
- Code for training & evaluating Contextual Document Embedding models ☆191 Updated 3 weeks ago
- This project studies the performance and robustness of language models and task-adaptation methods. ☆150 Updated last year
- EvaByte: Efficient Byte-level Language Models at Scale ☆98 Updated last month
- LOFT: A 1 Million+ Token Long-Context Benchmark ☆198 Updated last month
- A pipeline for LLM knowledge distillation ☆104 Updated 2 months ago