thoppe/The-Pile-PhilPapers

thoppe / The-Pile-PhilPapers

Download, parse, and filter data from Phil Papers. Data-ready for The-Pile.

☆20

Alternatives and similar repositories for The-Pile-PhilPapers

Users who are interested in The-Pile-PhilPapers are comparing it to the repositories listed below.

sdtblck / youtube_subtitle_dataset
YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training
☆47 · Sep 22, 2020 · Updated 5 years ago
harsh19 / Structured-Adversary
"Learning Rhyming Constraints using Structured Adversaries. Jhamtani H., Mehta S., Carbonell J., Berg-Kirkpatrick T. EMNLP-IJCNLP (Short …
☆11 · Mar 17, 2020 · Updated 6 years ago
UCSB-AI / Discffusion
Official repo for the TMLR paper "Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners"
☆29 · Apr 27, 2024 · Updated 2 years ago
matthewjdenny / REmail
R package for Email Data Processing
☆15 · Mar 1, 2018 · Updated 8 years ago
xcratch / xcratch.github.io
Extendable Scratch3 Programming Environment
☆10 · Jul 6, 2026 · Updated 3 weeks ago
marian-nmt / amun
Fast stand-alone C++ decoder for RNN-based NMT models
☆31 · Dec 12, 2020 · Updated 5 years ago
TiagoVentura / workshop_big_data_conference
Workshop "Analyzing Social Media Data" at the Big Data and Development Conference
☆11 · Sep 11, 2023 · Updated 2 years ago
IQSS / cem
☆17 · Oct 8, 2022 · Updated 3 years ago
DAMO-NLP-SG / LLM-Multilingual-Knowledge-Boundaries
[ACL 2025] Analyzing LLMs' Multilingual Knowledge Boundary Cognition Across Languages Through the Lens of Internal Representations
☆19 · Oct 18, 2025 · Updated 9 months ago
whyr2021turkey / Konusmalar
Contains the abstracts, presentations, and video recordings of the work presented at the Why R? 2021 Turkey conference.
☆11 · Apr 26, 2021 · Updated 5 years ago
featherless-ai / featherless-cookbook
A collection of guides, notebooks and examples using the Featherless API
☆15 · Mar 17, 2026 · Updated 4 months ago
mithrendal / boostanista
Alternative remote for Lego Boost with Pythonista and iOS
☆10 · Aug 27, 2017 · Updated 8 years ago
sfeucht / footprints
https://footprints.baulab.info
☆17 · Oct 4, 2024 · Updated last year
liushulinle / MarsRL
MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism
☆18 · Nov 18, 2025 · Updated 8 months ago
blinkenrocket / hardware
Schematics, board layout and BOM
☆12 · May 1, 2019 · Updated 7 years ago
ETCBC / Tutorials
☆11 · Apr 6, 2021 · Updated 5 years ago
blinkenrocket / firmware
Firmware for blinkenrocket
☆18 · Sep 14, 2024 · Updated last year
suzgunmirac / belief-in-the-machine
Belief in the Machine: Investigating Epistemological Blind Spots of Language Models
☆35 · Apr 19, 2025 · Updated last year
tnhaider / DLK
Deutsches Lyrik Korpus (DLK) / German Poetry Corpus
☆20 · May 21, 2024 · Updated 2 years ago
bjoernpl / lm-evaluation-harness-de
A framework for few-shot evaluation of autoregressive language models.
☆13 · Feb 14, 2024 · Updated 2 years ago
lemurproject / ClueWeb22
☆17 · Dec 11, 2024 · Updated last year
harbor-framework / harbor-index
A compact high-signal benchmark for evaluating frontier agents
☆21 · Updated this week
JorgePe / ev3-mqtt-micropython
Using LEGO EV3 MicroPython with MQTT
☆12 · Apr 29, 2019 · Updated 7 years ago
unisonweb / unison-local-ui
The Codebase UI that ships with UCM
☆21 · May 20, 2026 · Updated 2 months ago
EleutherAI / pile-pubmedcentral
A script for collecting the PubMed Central dataset in a language modelling friendly format.
☆26 · Feb 16, 2021 · Updated 5 years ago
princetonvisualai / pointingqa
Code for paper "Point and Ask: Incorporating Pointing into Visual Question Answering"
☆19 · Oct 4, 2022 · Updated 3 years ago
lancedb / ragged
☆22 · Oct 14, 2024 · Updated last year
SapienzaNLP / mcl-wic
Semeval-2021 Multilingual and Cross-lingual Word-in-Context Task
☆18 · May 27, 2021 · Updated 5 years ago
goodfire-ai / sae-manifold
Code for 'Do Sparse Autoencoders Capture Concept Manifolds?'
☆20 · May 21, 2026 · Updated 2 months ago
simonw / pge-outages
Tracking PG&E power outages
☆24 · Updated this week
mkremins / praxish
Partial reconstruction of Versu's Praxis language
☆19 · Jun 30, 2026 · Updated 3 weeks ago
oxocard / offline-editor
Offline version of the NanoPy Editor. No server needed.
☆14 · Mar 7, 2025 · Updated last year
mmosleh / minfo-exposure
☆25 · Dec 19, 2022 · Updated 3 years ago
cisnlp / MEXA
[ACL 2025] 🔍 Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment
☆11 · Apr 6, 2025 · Updated last year
ev3dev / lms-hacker-tools
Tools for reverse engineering LEGO MINDSTORMS and related products.
☆16 · Oct 11, 2019 · Updated 6 years ago
sail-sg / regmix
[ICLR 2025] 🧬 RegMix: Data Mixture as Regression for Language Model Pre-training (Spotlight)
☆195 · Feb 17, 2025 · Updated last year
laihuiyuan / multilingual-tst
Multilingual Pre-training with Language and Task Adaptation for Multilingual Text Style Transfer (ACL 2022)
☆10 · Sep 22, 2022 · Updated 3 years ago
Liebeck / IWNLP-py
Python port for IWNLP.Lemmatizer
☆19 · Apr 13, 2026 · Updated 3 months ago
DerekKane / YouTube-Tutorials
This is my repository for all of my R code as described in the YouTube lectures
☆25 · Jun 24, 2024 · Updated 2 years ago