Extracts plain text, language identification and more metadata from WARC records
☆23Oct 1, 2025Updated 6 months ago
Alternatives and similar repositories for warc2text
Users that are interested in warc2text are comparing it to the libraries listed below.
- Neural Semantic Graph Parser☆29Mar 14, 2018Updated 8 years ago
- Docker Compose based system for running remote browsers (including Flash and Java support) connected to web archives☆16Jun 10, 2021Updated 4 years ago
- Hosts text-to-speech corpus and speech synthesizers for African languages.☆18May 31, 2023Updated 2 years ago
- Object Resource Stream and CDXJ Drafts☆15Nov 28, 2018Updated 7 years ago
- Terminal tool that converts files encoding to UTF-8☆10Oct 5, 2019Updated 6 years ago
- Library for fast text representation and classification.☆31Jan 9, 2024Updated 2 years ago
- ☆11Nov 21, 2025Updated 4 months ago
- A pretty simple HTTP client as handy as `curl` command☆43Nov 12, 2016Updated 9 years ago
- Sources is a web application that allows your team to store, manage and annotate your sources and to make them easily available to your r…☆13Jun 2, 2022Updated 3 years ago
- Code from Bellingcat's guide☆11Dec 8, 2022Updated 3 years ago
- Tools for content datamining and NLP at scale☆45Jun 20, 2024Updated last year
- 🕸 GlotWeb: Web Indexing for Minority Languages (WWW 2026)☆17Feb 27, 2026Updated last month
- This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he…☆31Aug 25, 2023Updated 2 years ago
- Decoders for weather sensor data from RTL SDR.☆18Apr 27, 2025Updated 11 months ago
- Automagically ignore all work-related notifications while you are on vacation☆21Aug 21, 2020Updated 5 years ago
- This repository provides German documentation relating to the text recognition software Tesseract. The documentation was created in the c…☆14Sep 6, 2022Updated 3 years ago
- Tutorial on running keras model in C++ and python tensorflow☆11Oct 30, 2018Updated 7 years ago
- Create and edit WARC and WACZ files☆25Dec 6, 2024Updated last year
- Train a SmolLM-style llm on fineweb-edu in JAX/Flax with an assortment of optimizers.☆19Jul 24, 2025Updated 8 months ago
- Hidden Engrams: Long Term Memory for Transformer Model Inference☆35Jun 26, 2021Updated 4 years ago
- ☆14Feb 9, 2022Updated 4 years ago
- Open-source Chrome extension for injecting and overriding HTTP request headers☆15Jul 4, 2024Updated last year
- A Java implementation of error correcting codes similar to Reed Solomon codes for my "Algorithms in the Real World" class☆17Mar 1, 2010Updated 16 years ago
- A browser extension providing Open Access bibliographical services☆18Dec 9, 2022Updated 3 years ago
- ☆11Aug 26, 2021Updated 4 years ago
- Precise type-checker for JavaScript☆11Oct 23, 2025Updated 5 months ago
- Myanmar and Thai Language Resources☆10Jul 18, 2022Updated 3 years ago
- Datasets collection and preprocessings framework for NLP extreme multitask learning☆193Jul 9, 2025Updated 9 months ago
- This repository provides a starter code for using tensorboard via tensorflow for visualising embeddings☆14Apr 4, 2018Updated 8 years ago
- Training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https:/…☆29Apr 17, 2024Updated last year
- ParaNames: A multilingual resource for parallel names☆40May 20, 2024Updated last year
- The definitive collection of is* functions for runtime type checking. Lodash-compatible, tree-shakable, with types.☆17Jan 25, 2025Updated last year
- An evaluation suite for Retrieval-Augmented Generation (RAG).☆23Apr 26, 2025Updated 11 months ago
- Public domain songs for the seasons☆43Dec 20, 2014Updated 11 years ago
- An abstract, safe, and concise color conversion library for Rust nightly. This requires the feature adt_const_params.☆12Nov 18, 2022Updated 3 years ago
- Smatch tool: evaluation of AMR semantic structures☆71May 25, 2022Updated 3 years ago
- Simple script for running interactive masked language model with pre-trained BERT models.☆18May 3, 2020Updated 5 years ago
- API of Austrian election results.☆17Nov 14, 2020Updated 5 years ago
- Rababa, the diacritization library for Arabic and Hebrew (Abjad scripts in general)☆13May 1, 2025Updated 11 months ago