Extracts plain text, language identification and more metadata from WARC records
☆23Apr 16, 2026Updated last month
Alternatives and similar repositories for warc2text
Users that are interested in warc2text are comparing it to the libraries listed below.
- ☆20Apr 26, 2026Updated last month
- Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.☆160Jun 18, 2024Updated last year
- English Resource Grammar☆29May 22, 2026Updated 2 weeks ago
- Object Resource Stream and CDXJ Drafts☆15Nov 28, 2018Updated 7 years ago
- Tool to fix bitexts and tag near-duplicates for removal☆35Sep 4, 2025Updated 9 months ago
- Library for fast text representation and classification.☆31Jan 9, 2024Updated 2 years ago
- Line shuffler for huge text file which does not fit in memory☆13Dec 1, 2022Updated 3 years ago
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he…☆31Aug 25, 2023Updated 2 years ago
- Web-site mirroring tool for archive.org☆26Mar 17, 2026Updated 2 months ago
- Google Coral containers☆12Apr 28, 2022Updated 4 years ago
- Automagically ignore all notifications related to work when you are on vacations☆21Aug 21, 2020Updated 5 years ago
- Tutorial on running a Keras model in C++ and Python TensorFlow☆11Oct 30, 2018Updated 7 years ago
- Hidden Engrams: Long Term Memory for Transformer Model Inference☆35Jun 26, 2021Updated 4 years ago
- Poor man's simple harvester for arXiv resources☆14Jul 14, 2023Updated 2 years ago
- GC4LM: A Colossal (Biased) language model for German☆13May 2, 2021Updated 5 years ago
- Heatmap of multiclass confusion matrix☆11Sep 11, 2019Updated 6 years ago
- ☆14Feb 9, 2022Updated 4 years ago
- Portable Unicode library for Common Lisp☆66Nov 18, 2023Updated 2 years ago
- Analyze standard numbers like ARK, DOI, EAN, GTIN, IBAN, ISAN, ISBN, ISMN, ISNI, ISSN, ISTC, ISWC, ORCID, PPN, SICI, UPC, ZDB with Elasti…☆24Jul 5, 2016Updated 9 years ago
- A polite and user-friendly downloader for Common Crawl data☆82May 4, 2026Updated last month
- A tool for collecting archival slivers of the web and web archives☆19Jun 1, 2026Updated last week
- Open-source Chrome extension for injecting and overriding HTTP request headers☆15Jul 4, 2024Updated last year
- This is a modified version of the AM29F016 or AM29F032 flash memory adapter board to easily DIY a Game Boy flash cartridge from J.Rodrigo…☆12Jun 20, 2022Updated 3 years ago
- A browser extension providing Open Access bibliographical services☆18Dec 9, 2022Updated 3 years ago
- Software to Manipulate Different Flavors of Semantic Graphs☆53Mar 5, 2026Updated 3 months ago
- MIDict (Multi-Index Dict) can be indexed by any "keys" or "values", suitable as a bidirectional/inverse dict or a multi-key/multi-value d…☆14May 19, 2016Updated 10 years ago
- ☆11Aug 26, 2021Updated 4 years ago
- Precise type-checker for JavaScript☆11Oct 23, 2025Updated 7 months ago
- Myanmar and Thai Language Resources☆10Jul 18, 2022Updated 3 years ago
- Reverse-engineered schematics for the IR3R53 audio amplifier chip used in the Game Boy Color☆14May 30, 2020Updated 6 years ago
- Lossless normalization of uppercase characters: Go, C++ & JavaScript☆11Updated this week
- A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling.☆65Jul 29, 2024Updated last year
- ☆11Jul 23, 2023Updated 2 years ago
- This repository provides a starter code for using tensorboard via tensorflow for visualising embeddings☆14Apr 4, 2018Updated 8 years ago
- A friendly companion that helps you edit and improve digital score encodings in MEI format. Extension to the Atom editor.☆11Jul 6, 2022Updated 3 years ago
- ParaNames: A multilingual resource for parallel names☆40May 20, 2024Updated 2 years ago
- The definitive collection of is* functions for runtime type checking. Lodash-compatible, tree-shakable, with types.☆17Jan 25, 2025Updated last year
- ☆13Jun 26, 2020Updated 5 years ago
- Public domain songs for the seasons☆43Dec 20, 2014Updated 11 years ago