hynky1999 / CmonCrawl
Common Crawl extractor
☆78 Updated last year
Alternatives and similar repositories for CmonCrawl
Users who are interested in CmonCrawl compare it to the libraries listed below.
- Statistics of Common Crawl monthly archives mined from URL index files ☆188 Updated last week
- Blazing fast fuzzy text search for Python. ☆46 Updated 4 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆143 Updated 8 months ago
- Spider ported to Python ☆89 Updated 7 months ago
- A polite and user-friendly downloader for Common Crawl data ☆53 Updated last week
- Python API for https://vespa.ai, the open big data serving engine ☆137 Updated this week
- A News Article Collection Library ☆22 Updated 2 years ago
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆139 Updated 4 months ago
- This is the repo for the container that holds the models for the text2vec-transformers module ☆54 Updated 3 weeks ago
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆49 Updated 11 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆138 Updated 3 weeks ago
- NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, … ☆84 Updated 9 months ago
- Voyage AI Official Python Library ☆71 Updated last month
- Demo example of consumer goods categorization ☆28 Updated last year
- LLM prompt language based on Jinja. Banks provides tools and functions to build prompts text and chat messages from generic blueprints. I… ☆114 Updated last month
- Query language for blending SQL and LLMs across structured + unstructured data, with type constraints. ☆108 Updated last week
- Article extraction benchmark: dataset and evaluation scripts ☆321 Updated last year
- A Pythonic library providing a lightweight interface to LLMs ☆128 Updated 3 months ago
- Utility for OpenAI GPT Functions ☆14 Updated 2 years ago
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆17 Updated last year
- 🖍️ Highlight text in documents ☆108 Updated 4 months ago
- Various Jupyter notebooks about Common Crawl data ☆54 Updated 4 months ago
- Utilities for loading and running text embeddings with ONNX ☆44 Updated last week
- S3 vector database for LLM Agents and RAG. ☆48 Updated 2 years ago
- ☆12 Updated last week
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆292 Updated 3 months ago
- Pre-train Static Word Embeddings ☆84 Updated 2 months ago
- A framework for converting natural language text inputs to corresponding Pandas, MongoDB, Kusto and Neo4j (Cypher) queries. ☆88 Updated last year
- Small Python package to measure OCR quality and other related metrics. ☆25 Updated last year
- ☆19 Updated last year