Extracts plain text, language identification and more metadata from WARC records
☆23Apr 16, 2026Updated 2 months ago
Alternatives and similar repositories for warc2text
Users that are interested in warc2text are comparing it to the libraries listed below.
- ☆20Apr 26, 2026Updated 2 months ago
- DELPH-IN Documentation☆33Jun 24, 2026Updated last week
- Docker Compose based system for running remote browsers (including Flash and Java support) connected to web archives☆16Jun 10, 2021Updated 5 years ago
- Hosts text-to-speech corpus and speech synthesizers for African languages.☆18May 31, 2023Updated 3 years ago
- Object Resource Stream and CDXJ Drafts☆15Nov 28, 2018Updated 7 years ago
- Arabic Transliteration in Python☆36Aug 19, 2013Updated 12 years ago
- Tool to fix bitexts and tag near-duplicates for removal☆35Sep 4, 2025Updated 9 months ago
- Terminal tool that converts files encoding to UTF-8☆10Oct 5, 2019Updated 6 years ago
- Library for fast text representation and classification.☆31Jan 9, 2024Updated 2 years ago
- Targeted language identifier, based on FastText and Hunspell.☆38Sep 4, 2025Updated 9 months ago
- A pretty simple HTTP client as handy as `curl` command☆44Nov 12, 2016Updated 9 years ago
- Code from Bellingcat's guide☆11Dec 8, 2022Updated 3 years ago
- Tools for content datamining and NLP at scale☆45Jun 20, 2024Updated 2 years ago
- JSON with biographical and political data of Austrian Members of Parliament (Nationalrat/first Chamber) since 1920.☆13Dec 23, 2021Updated 4 years ago
- Game Boy Clock Accuracy Challenge☆13Mar 30, 2023Updated 3 years ago
- Support for writing WARC files with Scrapy☆24Dec 21, 2019Updated 6 years ago
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he…☆31Aug 25, 2023Updated 2 years ago
- Decoders for weather sensor data from RTL SDR.☆18Apr 27, 2025Updated last year
- Google Coral containers☆12Apr 28, 2022Updated 4 years ago
- Turn your Game Boy Advance into a Bluetooth Gamepad.☆18May 2, 2026Updated last month
- Automagically ignore all notifications related to work when you are on vacation☆21Aug 21, 2020Updated 5 years ago
- This repository provides German documentation relating to the text recognition software Tesseract. The documentation was created in the c…☆15Sep 6, 2022Updated 3 years ago
- Tutorial on running a Keras model in C++ and Python TensorFlow☆11Oct 30, 2018Updated 7 years ago
- Create and edit WARC and WACZ files☆29Dec 6, 2024Updated last year
- Train a SmolLM-style LLM on fineweb-edu in JAX/Flax with an assortment of optimizers.☆19Jul 24, 2025Updated 11 months ago
- Hidden Engrams: Long Term Memory for Transformer Model Inference☆35Jun 26, 2021Updated 5 years ago
- Poor man's simple harvester for arXiv resources☆14Jul 14, 2023Updated 2 years ago
- GC4LM: A Colossal (Biased) language model for German☆13May 2, 2021Updated 5 years ago
- ACL style for Typst☆23Jan 27, 2026Updated 5 months ago
- Heatmap of multiclass confusion matrix☆11Sep 11, 2019Updated 6 years ago
- Analyze standard numbers like ARK, DOI, EAN, GTIN, IBAN, ISAN, ISBN, ISMN, ISNI, ISSN, ISTC, ISWC, ORCID, PPN, SICI, UPC, ZDB with Elasti…☆24Jul 5, 2016Updated 9 years ago
- A polite and user-friendly downloader for Common Crawl data☆84Jun 16, 2026Updated 2 weeks ago
- A tool for collecting archival slivers of the web and web archives☆19Jun 1, 2026Updated 3 weeks ago
- Command line tool to convert page layout files to the latest PAGE XML format. It supports all previous versions of the PAGE format as wel…☆24Jan 30, 2021Updated 5 years ago
- A Java implementation of error correcting codes similar to Reed Solomon codes for my "Algorithms in the Real World" class☆17Mar 1, 2010Updated 16 years ago
- This is a modified version of the AM29F016 or AM29F032 flash memory adapter board to easily DIY a Game Boy flash cartridge from J.Rodrigo…☆12Jun 20, 2022Updated 4 years ago
- A browser extension providing Open Access bibliographical services☆18Dec 9, 2022Updated 3 years ago
- MIDict (Multi-Index Dict) can be indexed by any "keys" or "values", suitable as a bidirectional/inverse dict or a multi-key/multi-value d…☆14May 19, 2016Updated 10 years ago
- ☆11Aug 26, 2021Updated 4 years ago