r-three / common-pile
Code for collecting, processing, and preparing datasets for the Common Pile
☆250 Updated this week
Alternatives and similar repositories for common-pile
Users who are interested in common-pile are comparing it to the repositories listed below.
- ☆17 Aug 5, 2025 Updated 6 months ago
- ☆16 Nov 26, 2024 Updated last year
- ☆32 Dec 2, 2024 Updated last year
- Using Fourier interpolation to merge large language models ☆11 Jan 6, 2026 Updated last month
- A place to document learning, projects and ideas for a carbon neutral internet. ☆11 Feb 21, 2020 Updated 5 years ago
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆52 Jun 12, 2020 Updated 5 years ago
- Code for the ACL 2023 paper: "Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Sc… ☆35 Sep 16, 2023 Updated 2 years ago
- A simple uv workspace ☆19 Apr 5, 2025 Updated 10 months ago
- ☆15 Updated this week
- Code for co-training large language models (e.g. T0) with smaller ones (e.g. BERT) to boost few-shot performance ☆17 Sep 23, 2022 Updated 3 years ago
- A curated list of software, tools, resources and projects by and for libraries. ☆17 May 25, 2020 Updated 5 years ago
- Collection of academic works in natural language processing, computational linguistics, and computational cognitive science that study th… ☆22 Mar 20, 2024 Updated last year
- Repository containing the open source code of works published at the FBK MT unit. ☆59 Jan 16, 2026 Updated 3 weeks ago
- ☆42 Aug 5, 2025 Updated 6 months ago
- Pretraining Efficiently on S2ORC! ☆179 Oct 23, 2024 Updated last year
- Model Merging with Functional Dual Anchors ☆45 Nov 23, 2025 Updated 2 months ago
- A PDF classifier ensemble with REST API service ☆23 Mar 5, 2021 Updated 4 years ago
- Process, enhance and evaluate multiple OCR output. ☆24 Dec 2, 2025 Updated 2 months ago
- ☆48 Aug 29, 2024 Updated last year
- Utilities for PyTorch distributed ☆25 Feb 27, 2025 Updated 11 months ago
- Explore the Linux kernel source code with AI-generated summaries ☆31 Dec 20, 2024 Updated last year
- Local emulator for Hugging Face Inference Endpoints custom handlers ☆27 Jul 25, 2023 Updated 2 years ago
- Code and Data Repo for the CoNLL Paper -- Future Lens: Anticipating Subsequent Tokens from a Single Hidden State ☆20 Oct 24, 2025 Updated 3 months ago
- The official repo for “Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem” [EMNLP25] ☆34 Sep 1, 2025 Updated 5 months ago
- Ongoing research training transformer models at scale ☆43 Updated this week
- Efficiently computing & storing token n-grams from large corpora ☆26 Oct 6, 2024 Updated last year
- It's a cooler way to store simple linear models. ☆27 Jul 15, 2024 Updated last year
- Verifiers for LLM Reinforcement Learning ☆80 Apr 15, 2025 Updated 9 months ago
- ☆63 Dec 29, 2025 Updated last month
- An example self-hosted map with all dependencies included ☆26 Jul 9, 2024 Updated last year
- The pipeline for the OSCAR corpus ☆176 Nov 9, 2025 Updated 3 months ago
- Small Python package to measure OCR quality and other related metrics. ☆27 Feb 19, 2024 Updated last year
- Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from e… ☆28 May 23, 2024 Updated last year
- ☆27 Nov 4, 2024 Updated last year
- KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models ☆25 Aug 24, 2024 Updated last year
- Data and tools for generating and inspecting OLMo pre-training data. ☆1,404 Nov 5, 2025 Updated 3 months ago
- The official code of LM-Debugger, an interactive tool for inspection and intervention in transformer-based language models. ☆183 May 13, 2022 Updated 3 years ago
- The corresponding code for our paper: "Exploring the Challenges of Open Domain Multi-Document Summarization". Do not hesitate to open an … ☆33 Jun 24, 2023 Updated 2 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆35 May 24, 2024 Updated last year