EleutherAI/stackexchange-dataset

Python tools for processing the stackexchange data dumps into a text dataset for Language Models

☆87

Alternatives and similar repositories for stackexchange-dataset

Users who are interested in stackexchange-dataset are comparing it to the repositories listed below.

noanabeshima / github-downloader
Script for downloading GitHub.
☆13 · Sep 24, 2020 · Updated 5 years ago
cverluise / openPatstat
Load, build and explore Patstat using the Google Cloud Platform
☆10 · Jan 19, 2019 · Updated 7 years ago
laurentromary / stdfSpec
Specification of a stand-off element for the TEI guidelines
☆12 · Apr 29, 2021 · Updated 5 years ago
istex-archives / istex-browser-extension
ISTEX button: a web extension that can dynamically insert, into the web page being viewed, a link to the full text of a document if the latt…
☆11 · May 30, 2023 · Updated 3 years ago
softcite / softcite_kb
A Knowledge Base for research software relying on large-scale text mining and curated knowledge sources
☆18 · May 14, 2023 · Updated 3 years ago
vwoloszyn / diaa
Inter-annotator agreement for Doccano
☆28 · May 3, 2020 · Updated 6 years ago
sdtblck / Opensubtitles_dataset
downloads and parses subtitle dataset from opensubtitles.org
☆15 · Apr 19, 2024 · Updated 2 years ago
noanabeshima / wikipedia-downloader
Downloads 2020 English Wikipedia articles as plaintext
☆27 · Mar 25, 2023 · Updated 3 years ago
Cohere-Labs-Community / aya-annotations-ui
Web UI & Backend for Data Annotations in Aya
☆30 · Mar 16, 2024 · Updated 2 years ago
CarperAI / Code-Pile
This repository contains all the code for collecting large scale amounts of code from GitHub.
☆109 · Feb 17, 2023 · Updated 3 years ago
kermitt2 / grisp
Knowledge Base stuff
☆23 · Mar 1, 2026 · Updated 4 months ago
conceptmath / conceptmath
[ACL 2024 Findings] The official repo for "ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large …
☆26 · May 29, 2024 · Updated 2 years ago
neukg / KAT-TSLF
Source code of paper “A Novel Three-Stage Learning Framework for Low-Resource Knowledge-Grounded Dialogue Generation”
☆16 · Nov 25, 2021 · Updated 4 years ago
nikitakit / sabertooth
Standalone pre-training recipe with JAX+Flax
☆35 · Apr 3, 2023 · Updated 3 years ago
kermitt2 / datastet
Finding mentions and citations to named and implicit research datasets from within the academic literature
☆31 · Jun 14, 2025 · Updated last year
AkariAsai / unanswerable_qa
The official implementation for ACL 2021 "Challenges in Information Seeking QA: Unanswerable Questions and Paragraph Retrieval".
☆28 · Jun 19, 2021 · Updated 5 years ago
zhangir-azerbayev / proof-pile
Scripts for downloading and pre-processing the `proof-pile`, a high quality dataset of mathematical text and code.
☆22 · Nov 26, 2022 · Updated 3 years ago
kermitt2 / biblio_glutton_harvester
Open Access PDF harvester
☆42 · May 3, 2024 · Updated 2 years ago
saltudelft / CD4Py
CD4Py: Code De-Duplication for Python
☆23 · Dec 13, 2020 · Updated 5 years ago
kailums / flash-attention-rocm
Fast and memory-efficient exact attention ported to rocm
☆14 · Dec 1, 2023 · Updated 2 years ago
google-research-datasets / c4repset
C4RepSet: Representative Subset from C4 data for Training Pre-trained LMs
☆11 · Jan 13, 2023 · Updated 3 years ago
EleutherAI / the-pile
☆1,670 · Apr 27, 2023 · Updated 3 years ago
gchhablani / multilingual-vqa
Repository for Multilingual-VQA task created during HuggingFace JAX/Flax community week.
☆33 · Jul 27, 2021 · Updated 5 years ago
howisonlab / softcite-dataset
A gold-standard dataset of software mentions in research publications.
☆39 · Jul 27, 2023 · Updated 3 years ago
istex-archives / sisyphe
Sisyphe is a modulable NodeJS BIG-DATA analyser & transformer
☆12 · Oct 16, 2023 · Updated 2 years ago
ScienciaLAB / document-qa
Scientific Document Insight Q/A
☆37 · Jun 7, 2026 · Updated last month
LAION-AI / Anh
Anh - LAION's multilingual assistant datasets and models
☆28 · Apr 5, 2023 · Updated 3 years ago
NL2Code / CodeM
☆44 · Jun 2, 2024 · Updated 2 years ago
tlringer / proof-chat-fun
playing with gpt4
☆13 · Mar 17, 2023 · Updated 3 years ago
eth-easl / mixtera
A lightweight, user-friendly data-plane for LLM training.
☆40 · Sep 10, 2025 · Updated 10 months ago
saltudelft / type4py
Type4Py: Deep Similarity Learning-Based Type Inference for Python
☆67 · Sep 6, 2023 · Updated 2 years ago
EleutherAI / semantic-memorization
☆44 · Nov 17, 2024 · Updated last year
NUS-IDS / eacl23_soqg
☆15 · Mar 4, 2026 · Updated 4 months ago
zhangir-azerbayev / mathlib-semantic-search
☆15 · Apr 12, 2023 · Updated 3 years ago
thoppe / The-Pile-FreeLaw
Download, parse, and filter data from Court Listener, part of the FreeLaw projects. Data-ready for The-Pile.
☆16 · Jun 3, 2023 · Updated 3 years ago
EleutherAI / openwebtext2
☆94 · Jul 16, 2022 · Updated 4 years ago
euirim / goodwiki
Package and scripts used to build a dataset of Wikipedia articles in Markdown.
☆20 · Sep 11, 2023 · Updated 2 years ago
lfoppiano / material-parsers
Material parsers and other tools, scripts Initially developed for Grobid Superconductor
☆14 · Feb 21, 2025 · Updated last year
zaydzuhri / flame
Fork of Flame repo for training of some new stuff in development
☆20 · Jul 15, 2026 · Updated 2 weeks ago