commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆51 · Updated 3 weeks ago
Alternatives and similar repositories for cc-downloader
Users who are interested in cc-downloader are comparing it to the libraries listed below.
- Statistics of Common Crawl monthly archives mined from URL index files ☆184 · Updated this week
- Tools to construct and process Common Crawl webgraphs ☆92 · Updated this week
- Code for collecting, processing, and preparing datasets for the Common Pile ☆213 · Updated last week
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 · Updated 3 months ago
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆12 · Updated 2 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆179 · Updated 7 months ago
- Small Python package to measure OCR quality and other related metrics. ☆25 · Updated last year
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆47 · Updated this week
- Common Crawl extractor ☆78 · Updated last year
- Libraries, Archives and Museums (LAM) ☆84 · Updated 2 years ago
- Streaming WARC/ARC library for fast web archive IO ☆425 · Updated 7 months ago
- An index of PDF-centric corpora ☆134 · Updated last month
- A set of utilities for processing MediaWiki XML dump data. ☆57 · Updated 5 months ago
- Library for fast text representation and classification. ☆30 · Updated last year
- Fast and robust date extraction from web pages, with Python or on the command-line ☆136 · Updated this week
- ☆67 · Updated last year
- 🔢 Work with static vector models ☆28 · Updated 3 months ago
- A robust web archive analytics toolkit ☆112 · Updated 4 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆142 · Updated 7 months ago
- A tool for detecting viruses and NSFW material in WARC files ☆15 · Updated 11 months ago
- Extracts plain text, language identification and more metadata from WARC records ☆23 · Updated last week
- Efficiently computing & storing token n-grams from large corpora ☆26 · Updated 9 months ago
- WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections. ☆257 · Updated 5 months ago
- Faster, modernized fork of the language identification tool langid.py ☆56 · Updated 8 months ago
- Contextualized per-token embeddings ☆27 · Updated 2 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆46 · Updated 7 years ago
- Search interface for scholarly works ☆86 · Updated last year
- A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-test… ☆69 · Updated last month
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t… ☆127 · Updated last week
- A simple tool for splitting up an ebook into its chapters. Works well with Project Gutenberg texts. May also be used to clean up books fo… ☆109 · Updated 6 years ago