chatnoir-eu/chatnoir-resiliparse

A robust web archive analytics toolkit

☆144

Alternatives and similar repositories for chatnoir-resiliparse

Users that are interested in chatnoir-resiliparse are comparing it to the libraries listed below.

commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Jun 19, 2026 · Updated last month
keirp / OpenWebMath
☆173 · May 2, 2024 · Updated 2 years ago
adbar / courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
☆176 · Updated this week
iipc / jwarc
Java library for reading and writing WARC files with a typed API
☆60 · Jun 27, 2026 · Updated 3 weeks ago
webis-de / lightning-ir
One-stop shop for running and fine-tuning transformer-based language models for retrieval
☆65 · Jul 9, 2026 · Updated last week
informagi / GeeseDB
Graph Engine for Exploration and Search
☆42 · Jan 26, 2024 · Updated 2 years ago
iipc / webarchive-commons
Common web archive utility code.
☆65 · Jul 3, 2026 · Updated 2 weeks ago
archivesunleashed / notebooks
Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive…
☆26 · Dec 5, 2022 · Updated 3 years ago
allenai / bff
☆39 · Apr 17, 2024 · Updated 2 years ago
p-lambda / dsir
DSIR large-scale data selection framework for language model training
☆275 · Apr 7, 2024 · Updated 2 years ago
rjagerman / shoelace
Neural Learning to Rank using Chainer
☆31 · Jun 29, 2020 · Updated 6 years ago
adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM…
☆6,318 · Updated this week
capreolus-ir / diffir
Tool for comparing two ranked lists (TREC run files)
☆20 · Nov 9, 2022 · Updated 3 years ago
andrewyates / profane
A library for creating complex experimental pipelines
☆12 · Jul 25, 2022 · Updated 3 years ago
allenai / dolma
Data and tools for generating and inspecting OLMo pre-training data.
☆1,526 · Nov 5, 2025 · Updated 8 months ago
huggingface / datatrove
Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.
☆3,214 · Updated this week
tbowers / python-podcastindex-org-example
Python example of how to engage with the https://podcastindex.org/ APIs
☆13 · Sep 12, 2020 · Updated 5 years ago
kakao / kanana-2
☆23 · Jun 30, 2026 · Updated 3 weeks ago
commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆457 · Mar 26, 2026 · Updated 3 months ago
oscar-project / ungoliant
The pipeline for the OSCAR corpus
☆178 · Nov 9, 2025 · Updated 8 months ago
UAlbanyArchives / describingWebArchives
Automating description for Web Archives in ArchivesSpace using the Archive-It CDX and Partner Data APIs
☆11 · Aug 10, 2018 · Updated 7 years ago
eugene-yang / tarexp
An opensource TAR framework for experiments and applications
☆18 · Mar 18, 2024 · Updated 2 years ago
osirrc / jig
Jig for the Open-Source IR Replicability Challenge (OSIRRC)
☆13 · Dec 8, 2022 · Updated 3 years ago
leogao2 / commoncrawl_downloader
☆33 · May 23, 2023 · Updated 3 years ago
archivesunleashed / auk
Rails application for the Archives Unleashed Cloud.
☆11 · Jun 30, 2021 · Updated 5 years ago
edsu / memento-cli
A command line utility for listing and searching snapshots in web archives
☆20 · Jun 4, 2026 · Updated last month
terrierteam / pyterrier_t5
☆17 · Apr 30, 2026 · Updated 2 months ago
osirrc / ciff
Common Index File Format to support interoperability between open-source IR engines
☆40 · Sep 19, 2024 · Updated last year
webis-de / ir_axioms
↕️ Intuitive axiomatic retrieval experimentation.
☆31 · Jun 15, 2026 · Updated last month
facebookresearch / CCQA
CCQA: A New Web-Scale Question Answering Dataset for Model Pre-Training
☆33 · Jul 20, 2022 · Updated 4 years ago
bigscience-workshop / data-preparation
Code used for sourcing and cleaning the BigScience ROOTS corpus
☆318 · Mar 20, 2023 · Updated 3 years ago
datatogether / warc
Golang WARC (Web ARChive) Library
☆30 · Aug 6, 2019 · Updated 6 years ago
miso-belica / jusText
Heuristic based boilerplate removal tool
☆818 · Feb 25, 2025 · Updated last year
shjwudp / c4-dataset-script
Inspired by google c4, here is a series of colossal clean data cleaning scripts focused on CommonCrawl data processing. Including Chinese…
☆136 · Jun 7, 2023 · Updated 3 years ago
maturban / WARCMerge
Merging WARCs into a single WARC file
☆15 · Aug 29, 2014 · Updated 11 years ago
commoncrawl / cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
☆208 · Jun 24, 2026 · Updated 3 weeks ago
HarrieO / 2021-SIGIR-plackett-luce
☆32 · Jul 4, 2022 · Updated 4 years ago
hscells / pybool_ir
Toolkit for domain-specific information retrieval experimentation
☆19 · May 18, 2026 · Updated 2 months ago
huggingface / cosmopedia
☆572 · Nov 20, 2024 · Updated last year