commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆64 · Updated 4 months ago
Alternatives and similar repositories for cc-downloader
Users who are interested in cc-downloader compare it to the repositories listed below.
- Statistics of Common Crawl monthly archives mined from URL index files ☆206 · Updated last week
- An index of PDF-centric corpora ☆154 · Updated 6 months ago
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆68 · Updated this week
- A sentence segmentation library with wide language support optimized for speed and utility. ☆82 · Updated last month
- Common Crawl extractor ☆84 · Updated last year
- Tools to construct and process Common Crawl webgraphs ☆103 · Updated 2 weeks ago
- Code for collecting, processing, and preparing datasets for the Common Pile ☆248 · Updated 3 months ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆26 · Updated last month
- Benchmark scripts for comparing different tokenizers and sentence segmenters of German ☆12 · Updated 2 years ago
- Search interface for scholarly works ☆85 · Updated last year
- Libraries, Archives and Museums (LAM) ☆88 · Updated 3 years ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆142 · Updated 2 months ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆195 · Updated last week
- Small Python package to measure OCR quality and other related metrics. ☆25 · Updated last year
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆156 · Updated 3 weeks ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 · Updated 8 years ago
- Library for fast text representation and classification. ☆31 · Updated 2 years ago
- Faster, modernized fork of the language identification tool langid.py ☆61 · Updated last year
- Efficiently computing & storing token n-grams from large corpora ☆26 · Updated last year
- Streaming WARC/ARC library for fast web archive IO ☆442 · Updated last year
- 📜 Dehyphenation of broken text (mainly German), i.e., extracted from a PDF ☆39 · Updated 3 years ago
- ☆67 · Updated last year
- An experimental implementation of Burrows' Delta in Python 3 ☆21 · Updated 4 years ago
- WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections. ☆267 · Updated 10 months ago
- LLM plugin for embeddings using sentence-transformers ☆74 · Updated 8 months ago
- Fast Text Classification with Compressors dictionary ☆150 · Updated 2 years ago
- A Python package to simulate typographical errors. ☆38 · Updated 2 years ago
- 🌸 Train floret vectors ☆18 · Updated 2 years ago
- Python wrapper for the MediaWiki API to access and parse data from Wikipedia ☆42 · Updated last month
- Index Common Crawl archives in tabular format ☆124 · Updated 2 weeks ago