commoncrawl / cc-downloader
A polite and user-friendly downloader for Common Crawl data
☆50, updated last week
Alternatives and similar repositories for cc-downloader
Users who are interested in cc-downloader are comparing it to the repositories listed below.
- Statistics of Common Crawl monthly archives mined from URL index files (☆184, updated last week)
- Next-generation Punkt sentence boundary detection with zero dependencies (☆17, updated 3 months ago)
- Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … (☆46, updated this week)
- Small Python package to measure OCR quality and other related metrics. (☆24, updated last year)
- A sentence segmentation library with wide language support optimized for speed and utility. (☆65, updated 2 weeks ago)
- A set of utilities for processing MediaWiki XML dump data. (☆56, updated 5 months ago)
- Tools to construct and process Common Crawl webgraphs (☆92, updated last week)
- A tool for detecting viruses and NSFW material in WARC files (☆15, updated 10 months ago)
- Common Crawl extractor (☆77, updated last year)
- LLM plugin for embeddings using sentence-transformers (☆68, updated 2 months ago)
- Code for collecting, processing, and preparing datasets for the Common Pile (☆178, updated 3 weeks ago)
- Search interface for scholarly works (☆85, updated 11 months ago)
- (☆67, updated last year)
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine (☆178, updated 6 months ago)
- Transform Unstructured Data into Synthetic Datasets (☆27, updated 10 months ago)
- A helper library full of URL-related heuristics. (☆70, updated last month)
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters (☆142, updated 6 months ago)
- Extracts plain text, language identification and more metadata from WARC records (☆23, updated 4 months ago)
- Streaming WARC/ARC library for fast web archive IO (☆421, updated 7 months ago)
- Web application for distributed compute analysis of Archive-It web archive collections. (☆19, updated 3 months ago)
- WARC + AI - Experimental Retrieval Augmented Generation Pipeline for Web Archive Collections. (☆256, updated 5 months ago)
- Download and attach provenance to public datasets (☆33, updated 3 months ago)
- Legal document classification with EuroVoc descriptors on 22 languages. (☆26, updated 2 years ago)
- Command line tool for digging into WARC files (☆43, updated 2 weeks ago)
- Efficiently computing & storing token n-grams from large corpora (☆24, updated 9 months ago)
- Libraries, Archives and Museums (LAM) (☆84, updated 2 years ago)
- An experimental implementation of Burrows' Delta in Python 3 (☆21, updated 3 years ago)
- Index Common Crawl archives in tabular format (☆122, updated last month)
- 🔢 Work with static vector models (☆28, updated 2 months ago)
- Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archive… (☆26, updated 2 years ago)