hynky1999 / CmonCrawl
Common crawl extractor
☆82 Updated last year
Alternatives and similar repositories for CmonCrawl
Users that are interested in CmonCrawl are comparing it to the libraries listed below
- Extract web archive data using Wayback Machine and Common Crawl ☆161 Updated last year
- Statistics of Common Crawl monthly archives mined from URL index files ☆199 Updated this week
- Tools to construct and process Common Crawl webgraphs ☆101 Updated last week
- Blazing fast fuzzy text search for Python. ☆47 Updated 7 months ago
- A News Article Collection Library ☆22 Updated 2 years ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆150 Updated 3 weeks ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆142 Updated 2 weeks ago
- Spider ported to Python ☆97 Updated 9 months ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆296 Updated 6 months ago
- Index Common Crawl archives in tabular format ☆122 Updated 2 weeks ago
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆49 Updated last year
- Article extraction benchmark: dataset and evaluation scripts ☆339 Updated last month
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆144 Updated 7 months ago
- This is the repo for the container that holds the models for the text2vec-transformers module ☆57 Updated last week
- GraphER: A Structure-aware Text-to-Graph Model for Entity and Relation Extraction ☆81 Updated last year
- A polite and user-friendly downloader for Common Crawl data ☆59 Updated 3 months ago
- 🖍️ Highlight text in documents ☆109 Updated 6 months ago
- LLM prompt language based on Jinja. Banks provides tools and functions to build prompts text and chat messages from generic blueprints. I… ☆116 Updated 4 months ago
- Various Jupyter notebooks about Common Crawl data ☆59 Updated last week
- 📚 Datasets and models for instruction-tuning ☆237 Updated 2 years ago
- A research python package for detecting, categorizing, and assessing the severity of personal identifiable information (PII) ☆94 Updated last month
- Small python package to measure OCR quality and other related metrics. ☆25 Updated last year
- GPU-Powered Topic Modelling ☆69 Updated 2 years ago
- Completion After Prompt Probability. Make your LLM make a choice ☆81 Updated last year
- Python API for https://vespa.ai, the open big data serving engine ☆147 Updated this week
- Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI ☆113 Updated 2 years ago
- Pre-train Static Word Embeddings ☆90 Updated 2 months ago
- Detect and redact PII locally with SOTA performance ☆82 Updated 7 months ago
- Demo example of consumer goods categorization ☆30 Updated last year
- Efficient few-shot learning with cross-encoders. ☆59 Updated last year