Extracts plain text, language identification and more metadata from WARC records
☆23Apr 16, 2026Updated 2 weeks ago
Alternatives and similar repositories for warc2text
Users that are interested in warc2text are comparing it to the libraries listed below.
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.☆160Jun 18, 2024Updated last year
- Tool to fix bitexts and tag near-duplicates for removal☆35Sep 4, 2025Updated 7 months ago
- Terminal tool that converts file encodings to UTF-8☆10Oct 5, 2019Updated 6 years ago
- Library for fast text representation and classification.☆31Jan 9, 2024Updated 2 years ago
- [WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages☆17Apr 14, 2026Updated 2 weeks ago
- Game Boy Clock Accuracy Challenge☆13Mar 30, 2023Updated 3 years ago
- Line shuffler for huge text file which does not fit in memory☆13Dec 1, 2022Updated 3 years ago
- Web-site mirroring tool for archive.org☆24Mar 17, 2026Updated last month
- Automagically ignore all notifications related to work when you are on vacation☆21Aug 21, 2020Updated 5 years ago
- Tutorial on running a Keras model in C++ and Python TensorFlow☆11Oct 30, 2018Updated 7 years ago
- Train a SmolLM-style LLM on fineweb-edu in JAX/Flax with an assortment of optimizers.☆19Jul 24, 2025Updated 9 months ago
- Poor man's simple harvester for arXiv resources☆14Jul 14, 2023Updated 2 years ago
- Heatmap of multiclass confusion matrix☆11Sep 11, 2019Updated 6 years ago
- ACL style for Typst☆21Jan 27, 2026Updated 3 months ago
- A polite and user-friendly downloader for Common Crawl data☆79Apr 24, 2026Updated last week
- 🎮Backup, restore save, dump ROM, and program Flashcarts via GBA link port☆21Sep 17, 2025Updated 7 months ago
- Logic Pro X MIDI FX Plugin for polychaining multiple instruments☆17Feb 20, 2018Updated 8 years ago
- This is a modified version of the AM29F016 or AM29F032 flash memory adapter board to easily DIY a Game Boy flash cartridge from J.Rodrigo…☆12Jun 20, 2022Updated 3 years ago
- A browser extension providing Open Access bibliographical services☆18Dec 9, 2022Updated 3 years ago
- MIDict (Multi-Index Dict) can be indexed by any "keys" or "values", suitable as a bidirectional/inverse dict or a multi-key/multi-value d…☆14May 19, 2016Updated 9 years ago
- Tool for manual evaluation of parallel sentences.☆15Jan 26, 2026Updated 3 months ago
- Precise type-checker for JavaScript☆11Oct 23, 2025Updated 6 months ago
- Myanmar and Thai Language Resources☆10Jul 18, 2022Updated 3 years ago
- Reverse-engineered schematics for the IR3R53 audio amplifier chip used in the Game Boy Color☆14May 30, 2020Updated 5 years ago
- Lossless normalization of uppercase characters☆11Jul 3, 2023Updated 2 years ago
- A collection of datasets for language model pretraining, including scripts for downloading, preprocessing, and sampling.☆64Jul 29, 2024Updated last year
- ☆11Jul 23, 2023Updated 2 years ago
- ☆18Apr 6, 2021Updated 5 years ago
- A friendly companion that helps you edit and improve digital score encodings in MEI format. Extension to the Atom editor.☆11Jul 6, 2022Updated 3 years ago
- A social media open post web archiving tool☆26Feb 4, 2026Updated 2 months ago
- ChatGPT solutions for the MLE interview☆14Dec 9, 2022Updated 3 years ago
- ParaNames: A multilingual resource for parallel names☆40May 20, 2024Updated last year
- The definitive collection of is* functions for runtime type checking. Lodash-compatible, tree-shakable, with types.☆17Jan 25, 2025Updated last year
- ☆13Jun 26, 2020Updated 5 years ago
- An evaluation suite for Retrieval-Augmented Generation (RAG).☆23Apr 26, 2025Updated last year
- An abstract, safe, and concise color conversion library for Rust nightly. This requires the feature adt_const_params☆12Nov 18, 2022Updated 3 years ago
- [ICCV 2023] Going Beyond Nouns With Vision & Language Models Using Synthetic Data☆13Sep 30, 2023Updated 2 years ago
- Rababa, the diacritization library for Arabic and Hebrew (Abjad scripts in general)☆12May 1, 2025Updated last year
- A Python module for evaluating NERC and NEL system performances as defined in the HIPE shared tasks (formerly CLEF-HIPE-2020-scorer).☆15Jun 4, 2024Updated last year