commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆36 Updated last week
Alternatives and similar repositories for cc-downloader:
Users who are interested in cc-downloader are comparing it to the libraries listed below:
- A tool for detecting viruses and NSFW material in WARC files ☆11 Updated 7 months ago
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆38 Updated this week
- Generate a SQLite database from Wikipedia & Wikidata dumps. ☆33 Updated last year
- Libraries, Archives and Museums (LAM) ☆82 Updated 2 years ago
- Small python package to measure OCR quality and other related metrics. ☆21 Updated last year
- Tools to construct and process webgraphs from Common Crawl data ☆87 Updated this week
- Repo to hold code and track issues for the collection of permissively licensed data ☆23 Updated this week
- Library for fast text representation and classification. ☆28 Updated last year
- Extracts plain text, language identification and more metadata from WARC records ☆21 Updated 3 weeks ago
- Loadable spellfix1 extension for sqlite as python package ☆26 Updated 11 months ago
- ☆12 Updated 3 months ago
- 🌸 Train floret vectors ☆18 Updated last year
- ☆67 Updated last year
- A tool for collecting archival slivers of the web and web archives ☆13 Updated last month
- Web application for distributed compute analysis of Archive-It web archive collections. ☆16 Updated 2 weeks ago
- An experimental implementation of Burrows' Delta in Python 3 ☆21 Updated 3 years ago
- search interface for scholarly works ☆84 Updated 8 months ago
- Terminal tool that converts files encoding to UTF-8 ☆10 Updated 5 years ago
- ☆50 Updated last month
- Statistics of Common Crawl monthly archives mined from URL index files ☆175 Updated 2 weeks ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆44 Updated 7 years ago
- A text analysis library for relevance and subtheme detection ☆16 Updated last month
- Tool to apply Legal Matter Specification Standard (LMSS) to documents ☆13 Updated 7 months ago
- Python IMage MIning ☆13 Updated 2 weeks ago
- Code for SaGe subword tokenizer (EACL 2023) ☆24 Updated 4 months ago
- Flask Interface to Thompson's Motif Index ☆18 Updated 5 years ago
- ☆14 Updated last year
- ☆23 Updated last year
- Command line tool for digging into WARC files ☆39 Updated this week
- image-to-text model for PDF.js ☆36 Updated 2 weeks ago