bitextor/warc2text

Extracts plain text, language identification and more metadata from WARC records

☆23

Alternatives and similar repositories for warc2text

Users who are interested in warc2text are comparing it to the repositories listed below.

bitextor / bicleaner
Bicleaner is a parallel corpus classifier/cleaner that aims at detecting noisy sentence pairs in a parallel corpus.
☆160 · Jun 18, 2024 · Updated 2 years ago
webrecorder / pywb-remote-browsers
Docker Compose based system for running remote browsers (including Flash and Java support) connected to web archives
☆16 · Jun 10, 2021 · Updated 5 years ago
oduwsdl / ORS
Object Resource Stream and CDXJ Drafts
☆15 · Nov 28, 2018 · Updated 7 years ago
bitextor / bifixer
Tool to fix bitexts and tag near-duplicates for removal
☆35 · Sep 4, 2025 · Updated 10 months ago
amasad / arabish
Arabic Transliteration in Python
☆36 · Aug 19, 2013 · Updated 12 years ago
mbanon / fastspell
Targeted language identifier, based on FastText and Hunspell.
☆38 · Sep 4, 2025 · Updated 10 months ago
cisnlp / GlotWeb
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17 · Apr 14, 2026 · Updated 3 months ago
gitcordier / bellingcat
Code from Bellingcat's guide
☆11 · Dec 8, 2022 · Updated 3 years ago
ctylim / rhuffle
Line shuffler for huge text files that do not fit in memory
☆13 · Dec 1, 2022 · Updated 3 years ago
caarlos0-graveyard / github-vacations
Automagically ignore all notifications related to work when you are on vacation
☆21 · Aug 21, 2020 · Updated 5 years ago
facebookresearch / lss_eval
This is a new metric that can be used to evaluate faithfulness of text generated by LLMs. The work behind this repository can be found he…
☆31 · Aug 25, 2023 · Updated 2 years ago
skaringa / weather-sdr-decode
Decoders for weather sensor data from RTL SDR.
☆18 · Apr 27, 2025 · Updated last year
jg23497 / Header-Inject
Open-source Chrome extension for injecting and overriding HTTP request headers
☆15 · Jul 4, 2024 · Updated 2 years ago
kermitt2 / arxiv_harvester
Poor man's simple harvester for arXiv resources
☆14 · Jul 14, 2023 · Updated 3 years ago
UB-Mannheim / Tesseract_Dokumentation
This repository provides German documentation relating to the text recognition software Tesseract. The documentation was created in the c…
☆16 · Sep 6, 2022 · Updated 3 years ago
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 · May 2, 2021 · Updated 5 years ago
WebarchivCZ / Seeder
Seeder - Czech webarchive curating tool and public site
☆17 · Feb 12, 2026 · Updated 5 months ago
evanatyourservice / llm-jax
Train a SmolLM-style LLM on fineweb-edu in JAX/Flax with an assortment of optimizers.
☆19 · Jul 24, 2025 · Updated 11 months ago
jkmackie / confusion_matrix_visualization
Heatmap of multiclass confusion matrix
☆11 · Sep 11, 2019 · Updated 6 years ago
thunderpoot / scdx
A simple tool for querying the Common Crawl CDX
☆16 · Jan 10, 2026 · Updated 6 months ago
kermitt2 / biblio-glutton-extension
A browser extension providing Open Access bibliographical services
☆18 · Dec 9, 2022 · Updated 3 years ago
Harry-Chan / seq2seqlm-on-qg
☆13 · Feb 9, 2022 · Updated 4 years ago
ShenggaoZhu / midict
MIDict (Multi-Index Dict) can be indexed by any "keys" or "values", suitable as a bidirectional/inverse dict or a multi-key/multi-value d…
☆14 · May 19, 2016 · Updated 10 years ago
iPieter / llmq
A Scheduler for Batched LLM Inference
☆19 · Oct 5, 2025 · Updated 9 months ago
hipster-philology / nlp-pie-taggers
Extension for pie to include taggers with their models and pre/postprocessors
☆11 · Jun 23, 2026 · Updated 3 weeks ago
paracrawl / keops
Tool for manual evaluation of parallel sentences.
☆15 · Jan 26, 2026 · Updated 5 months ago
alvations / myth
Myanmar and Thai Language Resources
☆10 · Jul 18, 2022 · Updated 4 years ago
tatHi / optok
☆10 · Aug 26, 2021 · Updated 4 years ago
alasdairforsythe / capcode
Lossless normalization of uppercase characters: Go, C++ & JavaScript
☆11 · Jul 7, 2026 · Updated 2 weeks ago
npk48 / rwkv_cuda
☆11 · Jul 23, 2023 · Updated 2 years ago
malteos / llm-datasets
A collection of datasets for language model pretraining including scripts for downloading, preprocessing, and sampling.
☆66 · Jul 29, 2024 · Updated last year
kermitt2 / xpdf-4.00
☆19 · Apr 6, 2021 · Updated 5 years ago
sileod / tasksource
Dataset collection and preprocessing framework for NLP extreme multitask learning
☆195 · Jul 9, 2025 · Updated last year
cisnlp / multypo
A Multilingual Keyboard Layout-Based Typo Generator
☆17 · Nov 23, 2025 · Updated 7 months ago
lampts / chatgpt-mle-interview
ChatGPT solutions for the MLE interview
☆14 · Dec 9, 2022 · Updated 3 years ago
peterk / munin-indexer
A social media open post web archiving tool
☆26 · Feb 4, 2026 · Updated 5 months ago
elastic / elasticsearch-transport-wares
Servlet transport for Elasticsearch
☆41 · Aug 8, 2024 · Updated last year
dvrensk / single_instance
Ruby Gem that makes sure that only a single instance of a code block is running.
☆16 · Mar 13, 2013 · Updated 13 years ago
fajri91 / minangNLP
Minangkabau NLP corpus. PACLIC 2020
☆11 · Jun 7, 2021 · Updated 5 years ago