Extracts plain text, language identification and more metadata from WARC records
☆23Oct 1, 2025Updated 5 months ago
Alternatives and similar repositories for warc2text
Users that are interested in warc2text are comparing it to the libraries listed below
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.☆160Jun 18, 2024Updated last year
- Tool to fix bitexts and tag near-duplicates for removal☆34Sep 4, 2025Updated 6 months ago
- Targeted language identifier, based on FastText and Hunspell.☆38Sep 4, 2025Updated 6 months ago
- Support for writing WARC files with Scrapy☆24Dec 21, 2019Updated 6 years ago
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he…☆31Aug 25, 2023Updated 2 years ago
- Web-site mirroring tool for archive.org☆24Updated this week
- Google Coral containers☆12Apr 28, 2022Updated 3 years ago
- Automagically ignore all notifications related to work when you are on vacation☆21Aug 21, 2020Updated 5 years ago
- A tool for collecting archival slivers of the web and web archives☆17Feb 18, 2025Updated last year
- Tutorial on running keras model in C++ and python tensorflow☆11Oct 30, 2018Updated 7 years ago
- Create and edit WARC and WACZ files☆24Dec 6, 2024Updated last year
- Hidden Engrams: Long Term Memory for Transformer Model Inference☆35Jun 26, 2021Updated 4 years ago
- Poor man's simple harvester for arXiv resources☆13Jul 14, 2023Updated 2 years ago
- GC4LM: A Colossal (Biased) language model for German☆13May 2, 2021Updated 4 years ago
- ACL style for Typst☆22Jan 27, 2026Updated last month
- 🎮Backup, restore save, dump ROM, and program Flashcarts via GBA link port☆21Sep 17, 2025Updated 6 months ago
- Analyze standard numbers like ARK, DOI, EAN, GTIN, IBAN, ISAN, ISBN, ISMN, ISNI, ISSN, ISTC, ISWC, ORCID, PPN, SICI, UPC, ZDB with Elasti…☆24Jul 5, 2016Updated 9 years ago
- Logic Pro X MIDI FX Plugin for polychaining multiple instruments☆17Feb 20, 2018Updated 8 years ago
- This is a modified version of the AM29F016 or AM29F032 flash memory adapter board to easily DIY a Game Boy flash cartridge from J.Rodrigo…☆12Jun 20, 2022Updated 3 years ago
- A Java implementation of error correcting codes similar to Reed Solomon codes for my "Algorithms in the Real World" class☆17Mar 1, 2010Updated 16 years ago
- MIDict (Multi-Index Dict) can be indexed by any "keys" or "values", suitable as a bidirectional/inverse dict or a multi-key/multi-value d…☆14May 19, 2016Updated 9 years ago
- Tool for manual evaluation of parallel sentences.☆15Jan 26, 2026Updated last month
- Precise type-checker for JavaScript☆11Oct 23, 2025Updated 4 months ago
- Reverse-engineered schematics for the IR3R53 audio amplifier chip used in the Game Boy Color☆13May 30, 2020Updated 5 years ago
- Dataset collection and preprocessing framework for NLP extreme multitask learning☆193Jul 9, 2025Updated 8 months ago
- Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…☆28Apr 17, 2024Updated last year
- ParaNames: A multilingual resource for parallel names☆39May 20, 2024Updated last year
- The definitive collection of is* functions for runtime type checking. Lodash-compatible, tree-shakable, with types.☆17Jan 25, 2025Updated last year
- An evaluation suite for Retrieval-Augmented Generation (RAG).☆23Apr 26, 2025Updated 10 months ago
- Public domain songs for the seasons☆43Dec 20, 2014Updated 11 years ago
- [ICCV 2023] Going Beyond Nouns With Vision & Language Models Using Synthetic Data☆13Sep 30, 2023Updated 2 years ago
- Rababa, the diacritization library for Arabic and Hebrew (Abjad scripts in general)☆13May 1, 2025Updated 10 months ago
- Common tools for data processing☆22Dec 8, 2025Updated 3 months ago
- Java library for reading and writing WARC files with a typed API☆55Feb 26, 2026Updated 3 weeks ago