hynky1999 / CmonCrawl
Common Crawl extractor
☆77 · Updated last year
Alternatives and similar repositories for CmonCrawl
Users who are interested in CmonCrawl are comparing it to the libraries listed below.
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆137 · Updated 3 months ago
- Demo example of consumer goods categorization ☆28 · Updated last year
- Statistics of Common Crawl monthly archives mined from URL index files ☆184 · Updated last week
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆17 · Updated last year
- Tools to construct and process Common Crawl webgraphs ☆92 · Updated last week
- This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Larg… ☆23 · Updated 4 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆142 · Updated 6 months ago
- Python API for https://vespa.ai, the open big data serving engine ☆127 · Updated last week
- ☆34 · Updated 5 months ago
- Pre-train Static Word Embeddings ☆84 · Updated last month
- A personal knowledge base that I can dump information to and help me learn ☆24 · Updated last month
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆50 · Updated 9 months ago
- A CLI tool for managing OpenAI batch processing jobs with ease. ☆37 · Updated 2 months ago
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆43 · Updated last year
- A News Article Collection Library ☆22 · Updated 2 years ago
- Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI ☆115 · Updated last year
- Detect and redact PII locally with SOTA performance ☆58 · Updated 3 months ago
- Efficient few-shot learning with cross-encoders. ☆54 · Updated last year
- DSPy program/pipeline inspector widget for Jupyter/VSCode Notebooks. ☆36 · Updated last year
- 🖍️ Highlight text in documents ☆109 · Updated 2 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆133 · Updated 6 months ago
- LLM prompt language based on Jinja. Banks provides tools and functions to build prompts text and chat messages from generic blueprints. I… ☆105 · Updated 2 weeks ago
- Vector Database with support for late interaction and token level embeddings. ☆55 · Updated 3 weeks ago
- ☆32 · Updated last year
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆108 · Updated 3 months ago
- Various Jupyter notebooks about Common Crawl data ☆54 · Updated 3 months ago
- A framework for converting natural language text inputs to corresponding Pandas, MongoDB, Kusto and Neo4j (Cypher) queries. ☆83 · Updated last year
- H&M Fashion Image similarity search with Weaviate and DocArray ☆43 · Updated last year
- Voyage AI Official Python Library ☆61 · Updated 2 weeks ago
- A robust web archive analytics toolkit ☆111 · Updated 3 months ago