commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆41 Updated this week
Alternatives and similar repositories for cc-downloader:
Users who are interested in cc-downloader are comparing it to the libraries listed below.
- A tool for detecting viruses and NSFW material in WARC files ☆14 Updated 8 months ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆39 Updated last week
- Library for fast text representation and classification. ☆28 Updated last year
- Libraries, Archives and Museums (LAM) ☆82 Updated 2 years ago
- Command line tool for digging into WARC files ☆39 Updated 3 weeks ago
- Search interface for scholarly works ☆85 Updated 8 months ago
- ☆67 Updated last year
- Web application for distributed compute analysis of Archive-It web archive collections. ☆18 Updated last month
- Extracts plain text, language identification and more metadata from WARC records ☆21 Updated last month
- Small Python package to measure OCR quality and other related metrics. ☆21 Updated last year
- Repo to hold code and track issues for the collection of permissively licensed data ☆23 Updated 2 weeks ago
- An experimental implementation of Burrows' Delta in Python 3 ☆21 Updated 3 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 Updated 7 years ago
- A tool for collecting archival slivers of the web and web archives ☆13 Updated 2 months ago
- Statistics of Common Crawl monthly archives mined from URL index files ☆177 Updated 2 weeks ago
- PhD Dissertation "Automated Extraction and Curation of Materials Information from Scientific Literature" ☆9 Updated last year
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive… ☆26 Updated 2 years ago
- Code for SaGe subword tokenizer (EACL 2023) ☆24 Updated 4 months ago
- Tool for showing Freebase and Google Knowledge Graph entries ☆19 Updated last year
- BPE modification that removes intermediate tokens during tokenizer training. ☆25 Updated 5 months ago
- WARC and ARC indexing and discovery tools. ☆123 Updated last month
- An experiment replicating part of "Why Literary Time is Measured in Minutes" with GPT-4. ☆32 Updated 2 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆169 Updated 3 months ago
- Command Line Interface for running 🤗 Transformers Image Classification locally ☆19 Updated 2 weeks ago
- Faster, modernized fork of the language identification tool langid.py ☆55 Updated 5 months ago
- Experimental proxy and wrapper for safely embedding Web Archives (warc, warc.gz, wacz) into web pages. ☆31 Updated last month
- Sort-friendly URI Reordering Transform (SURT) Python module ☆42 Updated 8 months ago
- Tools to construct and process Common Crawl webgraphs ☆90 Updated 3 weeks ago
- Centralised repository for WARC usage specifications. ☆110 Updated 5 months ago
- CDXJ Indexing of WARC/ARCs ☆25 Updated 4 months ago