ikreymer/cdx-index-client

A command-line tool for using the CommonCrawl Index API at http://index.commoncrawl.org/

☆203

Alternatives and similar repositories for cdx-index-client

Users who are interested in cdx-index-client are comparing it to the libraries listed below.

commoncrawl / cdx_toolkit
A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine
☆208 · Jun 24, 2026 · Updated last month
ikreymer / webarchive-indexing
Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.
☆46 · Dec 4, 2017 · Updated 8 years ago
commoncrawl / cc-pyspark
Process Common Crawl data with Python and Spark
☆457 · Mar 26, 2026 · Updated 3 months ago
commoncrawl / cc-index-table
Index Common Crawl archives in tabular format
☆132 · Updated this week
rossf7 / elasticrawl
Launch AWS Elastic MapReduce jobs that process Common Crawl data.
☆49 · Feb 15, 2017 · Updated 9 years ago
reprozip-news-apps / reprozip-web
ReproZip for the Preservation of Web Applications
☆17 · May 6, 2024 · Updated 2 years ago
webrecorder / warcio
Streaming WARC/ARC library for fast web archive IO
☆462 · Jun 10, 2026 · Updated last month
commoncrawl / cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
☆38 · Jun 30, 2026 · Updated 3 weeks ago
commoncrawl / news-crawl
News crawling with StormCrawler - stores content as WARC
☆375 · Updated this week
NationalLibraryOfNorway / warchaeology
Command line tool for digging into WARC files
☆50 · Jul 17, 2026 · Updated last week
cmeister2 / dauntless
Tools for analysing the forward DNS data set published at https://scans.io/study/sonar.fdns_v2
☆17 · May 9, 2026 · Updated 2 months ago
paxan / ccooo
Common Crawl One-Oh-One (aka "A Common Crawl Experiment")
☆26 · Oct 31, 2014 · Updated 11 years ago
helgeho / Web2Warc
An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)
☆26 · Oct 9, 2017 · Updated 8 years ago
internetarchive / surt
Sort-friendly URI Reordering Transform (SURT) python module
☆45 · Sep 11, 2025 · Updated 10 months ago
iproduct-database / vpm-filter-spark
Virtual patent marking crawler at iproduct.epfl.ch
☆15 · Sep 13, 2017 · Updated 8 years ago
CI-Research / KeywordAnalysis
Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends
☆57 · Jan 28, 2024 · Updated 2 years ago
GLAM-Workbench / web-archives
☆27 · May 5, 2023 · Updated 3 years ago
commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆86 · Jul 13, 2026 · Updated last week
maxrousseau / pynoter
Convert PowerPoint (pptx) files into raw text org or LaTeX files
☆15 · Aug 28, 2018 · Updated 7 years ago
ukwa / webarchive-discovery
Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t…
☆133 · Nov 21, 2025 · Updated 8 months ago
vinaygoel / ars-workshop
Archive Research Services Workshop
☆31 · Sep 29, 2017 · Updated 8 years ago
webrecorder / oembed.link
A Cloudflare Worker to render embeds on a single page using oEmbed
☆25 · Nov 17, 2022 · Updated 3 years ago
tberg12 / murphy
Support library for NLP and machine learning.
☆27 · May 11, 2017 · Updated 9 years ago
nla / outbackcdx
Web archive index server based on RocksDB
☆43 · Jul 9, 2026 · Updated 2 weeks ago
nlnwa / gowarcserver
☆17 · Mar 31, 2025 · Updated last year
apache / stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
☆986 · Updated this week
maxdotio / mighty-batch
Highly concurrent and fast content processing for Mighty Inference Server
☆10 · Feb 6, 2023 · Updated 3 years ago
commoncrawl / example-warc-java
☆50 · Feb 22, 2017 · Updated 9 years ago
facebookresearch / cc_net
Tools to download and cleanup Common Crawl data
☆1,047 · Apr 25, 2023 · Updated 3 years ago
cldellow / real-estate-prices-cc
Source real estate prices from the Common Crawl.
☆27 · Oct 22, 2018 · Updated 7 years ago
N0taN3rd / simplechrome
Webrecorder's DevTools Protocol Automation Library
☆18 · Oct 18, 2022 · Updated 3 years ago
andychu / webpipe
Bridge the terminal and browser
☆18 · Jul 28, 2023 · Updated 2 years ago
unt-libraries / py-wasapi-client
A client for the Archive-It and Webrecorder WASAPI Data Transfer API
☆16 · Oct 18, 2019 · Updated 6 years ago
timoschick / one-token-approximation
This repository contains the code for applying One-Token Approximation to a pretrained language model using subword-level tokenization.
☆12 · May 7, 2020 · Updated 6 years ago
iipc / jwarc
Java library for reading and writing WARC files with a typed API
☆60 · Jun 27, 2026 · Updated 3 weeks ago
internetarchive / wayback-diff
React components to render differences between captures at the Wayback Machine
☆43 · Jul 6, 2026 · Updated 2 weeks ago
superkojiman / dc416-exploitdev-intro
Files for the Defcon Toronto Introduction to 64-bit Linux Exploitation
☆15 · Feb 23, 2018 · Updated 8 years ago
iai-group / arXivDigest
☆26 · Feb 20, 2026 · Updated 5 months ago
ElecDeb60To16 / Dataset
This project hosts an annotated dataset of 39 transcripts of United States presidential election debates annotated with argument compone…
☆12 · Jun 3, 2019 · Updated 7 years ago