commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆67 · Updated 5 months ago
Alternatives and similar repositories for cc-downloader
Users who are interested in cc-downloader are comparing it to the libraries listed below.
- Statistics of Common Crawl monthly archives mined from URL index files ☆208 · Updated this week
- Common crawl extractor ☆84 · Updated last year
- Small python package to measure OCR quality and other related metrics. ☆26 · Updated last year
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆197 · Updated last week
- Tools to construct and process Common Crawl webgraphs ☆104 · Updated last month
- An index of PDF-centric corpora ☆158 · Updated 6 months ago
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆68 · Updated 3 weeks ago
- search interface for scholarly works ☆85 · Updated last year
- A sentence segmentation library with wide language support optimized for speed and utility. ☆84 · Updated 2 weeks ago
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆12 · Updated 2 years ago
- Streaming WARC/ARC library for fast web archive IO ☆445 · Updated last year
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 · Updated 8 years ago
- Code for collecting, processing, and preparing datasets for the Common Pile ☆249 · Updated 4 months ago
- LLM plugin for embeddings using sentence-transformers ☆74 · Updated 9 months ago
- Please note that the warc-indexer tool & code is now supported by NetArchiveSuite. The 'warc-indexer' directory and code that exists in t… ☆132 · Updated 2 months ago
- A set of utilities for processing MediaWiki XML dump data. ☆61 · Updated 11 months ago
- Simplified version of a common crawl fetcher ☆17 · Updated last month
- A tool for detecting viruses and NSFW material in WARC files ☆17 · Updated last month
- image-to-text model for PDF.js ☆50 · Updated 10 months ago
- Faster, modernized fork of the language identification tool langid.py ☆60 · Updated last year
- Next-generation Punkt sentence boundary detection with zero dependencies ☆27 · Updated 2 months ago
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ☆28 · Updated last year
- Data cleaning and validation functions for names, languages, identifiers, etc. ☆51 · Updated this week
- Fast and robust date extraction from web pages, with Python or on the command-line ☆145 · Updated 2 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆158 · Updated last month
- Command line tool for digging into WARC files ☆50 · Updated 2 weeks ago
- Datasette plugin for uploading CSV files and converting them to database tables ☆27 · Updated 2 months ago
- A Memento Aggregator CLI and Server in Go ☆76 · Updated 10 months ago
- Efficiently computing & storing token n-grams from large corpora ☆26 · Updated last year
- 🔢 Work with static vector models ☆36 · Updated 9 months ago