hynky1999 / CmonCrawl
Common Crawl extractor
☆84 · Updated last year
Alternatives and similar repositories for CmonCrawl
Users who are interested in CmonCrawl are comparing it to the libraries listed below.
- Tools to construct and process Common Crawl webgraphs ☆103 · Updated last week
- Statistics of Common Crawl monthly archives mined from URL index files ☆205 · Updated last week
- Blazing fast fuzzy text search for Python. ☆50 · Updated 8 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆154 · Updated last week
- Fast and robust date extraction from web pages, with Python or on the command-line ☆142 · Updated last month
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆49 · Updated last year
- ☆20 · Updated last year
- A News Article Collection Library ☆22 · Updated 2 years ago
- 📚 Datasets and models for instruction-tuning ☆238 · Updated 2 years ago
- Spider ported to Python ☆100 · Updated 11 months ago
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆144 · Updated 8 months ago
- This repository is designed for deploying and managing server processes that handle embeddings using the Infinity Embedding model or Larg… ☆26 · Updated 9 months ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆297 · Updated 7 months ago
- Query language for blending SQL and LLMs across structured + unstructured data, with type constraints. ☆124 · Updated this week
- Completion After Prompt Probability. Make your LLM make a choice ☆82 · Updated last year
- This is the repo for the container that holds the models for the text2vec-transformers module ☆58 · Updated last month
- A robust web archive analytics toolkit ☆126 · Updated 2 months ago
- Demo example of consumer goods categorization ☆30 · Updated 2 years ago
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆17 · Updated 2 years ago
- Python API for https://vespa.ai, the open big data serving engine ☆154 · Updated last week
- A framework for converting natural language text inputs to corresponding Pandas, MongoDB, Kusto and Neo4j (Cypher) queries. ☆92 · Updated last year
- 🖍️ Highlight text in documents ☆110 · Updated 8 months ago
- Multi-threaded matrix multiplication and cosine similarity calculations for dense and sparse matrices. Appropriate for calculating the K … ☆86 · Updated last year
- Pre-train Static Word Embeddings ☆94 · Updated 3 months ago
- Ready-to-go containerized RAG service. Implemented with text-embedding-inference + Qdrant/LanceDB. ☆73 · Updated last year
- Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI ☆112 · Updated 2 years ago
- Retrieval of fully structured data made easy. Use LLMs or custom models. Specialized on PDFs and HTML files. Extensive support of tabular… ☆79 · Updated this week
- Efficient few-shot learning with cross-encoders. ☆60 · Updated last year
- Detect and redact PII locally with SOTA performance ☆87 · Updated 9 months ago
- LLM prompt language based on Jinja. Banks provides tools and functions to build prompts text and chat messages from generic blueprints. I… ☆119 · Updated 2 weeks ago