hynky1999 / CmonCrawl
Common crawl extractor
☆78 Updated last year
Alternatives and similar repositories for CmonCrawl
Users who are interested in CmonCrawl are comparing it to the libraries listed below
- Statistics of Common Crawl monthly archives mined from URL index files ☆184 Updated last week
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆142 Updated 7 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆136 Updated last week
- A News Article Collection Library ☆22 Updated 2 years ago
- Blazing fast fuzzy text search for Python. ☆45 Updated 3 months ago
- Spider ported to Python ☆89 Updated 6 months ago
- Various Jupyter notebooks about Common Crawl data ☆55 Updated 4 months ago
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆139 Updated 4 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆320 Updated last year
- Index Common Crawl archives in tabular format ☆123 Updated last week
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆291 Updated 2 months ago
- Completion After Prompt Probability. Make your LLM make a choice ☆80 Updated 9 months ago
- GPU-Powered Topic Modelling ☆70 Updated 2 years ago
- Query language for blending SQL and LLMs across structured + unstructured data, with type constraints. ☆107 Updated last week
- Utility for OpenAI GPT Functions ☆14 Updated 2 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆33 Updated 2 years ago
- A polite and user-friendly downloader for Common Crawl data ☆51 Updated last month
- Demo example of consumer goods categorization ☆28 Updated last year
- This repository serves as a collection of scrapers procuring and structuring various legal datasets ☆17 Updated 2 years ago
- The Architecture of a Web Crawler: Building a Google-Inspired Distributed Web Crawler ☆121 Updated 7 months ago
- Python API for https://vespa.ai, the open big data serving engine ☆135 Updated this week
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆17 Updated last year
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆178 Updated 7 months ago
- A research Python package for detecting, categorizing, and assessing the severity of personally identifiable information (PII) ☆89 Updated 2 years ago
- Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI ☆115 Updated last year
- Efficient few-shot learning with cross-encoders. ☆56 Updated last year
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆50 Updated 10 months ago
- 📚 Datasets and models for instruction-tuning ☆238 Updated last year
- Small python package to measure OCR quality and other related metrics. ☆25 Updated last year
- A robust web archive analytics toolkit ☆112 Updated 4 months ago