commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆47 Updated last month
Alternatives and similar repositories for cc-downloader
Users who are interested in cc-downloader are comparing it to the libraries listed below
- A tool for detecting viruses and NSFW material in WARC files ☆15 Updated 9 months ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 Updated 2 months ago
- Command line tool for digging into WARC files ☆40 Updated last week
- Command-line tool and Rust library for handling Web ARChive (WARC) files ☆19 Updated last week
- Small Python package to measure OCR quality and other related metrics. ☆22 Updated last year
- WARC and ARC indexing and discovery tools. ☆124 Updated 2 months ago
- Streaming WARC/ARC library for fast web archive IO ☆415 Updated 5 months ago
- Web application for distributed compute analysis of Archive-It web archive collections. ☆18 Updated 2 months ago
- Search interface for scholarly works ☆85 Updated 10 months ago
- Centralised repository for WARC usage specifications. ☆111 Updated 6 months ago
- Command line tools and libraries for handling and manipulating WARC files (and HTTP contents) ☆161 Updated last week
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆174 Updated 5 months ago
- Tools to construct and process Common Crawl webgraphs ☆91 Updated last week
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. ☆40 Updated last month
- Command line tool to convert a file in the WARC format to a file in the ZIM format ☆58 Updated 2 months ago
- An index of PDF-centric corpora ☆128 Updated 2 months ago
- Library for fast text representation and classification. ☆30 Updated last year
- Sort-friendly URI Reordering Transform (SURT) Python module ☆42 Updated 10 months ago
- Libraries, Archives and Museums (LAM) ☆84 Updated 2 years ago
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆11 Updated 2 years ago
- Index Common Crawl archives in tabular format ☆122 Updated 3 weeks ago
- Code for collecting, processing, and preparing datasets for the Common Pile ☆27 Updated this week
- A tool for collecting archival slivers of the web and web archives ☆13 Updated 3 months ago
- A helper library full of URL-related heuristics. ☆69 Updated this week
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆42 Updated last month
- Terminal tool that converts file encodings to UTF-8 ☆10 Updated 5 years ago
- ☆67 Updated last year
- A set of utilities for processing MediaWiki XML dump data. ☆53 Updated 3 months ago
- A Memento Aggregator CLI and Server in Go ☆65 Updated 3 months ago
- CDXJ Indexing of WARC/ARCs ☆25 Updated 5 months ago