hynky1999 / CmonCrawl
Common Crawl extractor
☆79 · Updated last year
Alternatives and similar repositories for CmonCrawl
Users that are interested in CmonCrawl are comparing it to the libraries listed below
- Statistics of Common Crawl monthly archives mined from URL index files ☆192 · Updated 2 weeks ago
- Blazing fast fuzzy text search for Python. ☆47 · Updated 5 months ago
- A News Article Collection Library ☆22 · Updated 2 years ago
- Tools to construct and process Common Crawl webgraphs ☆96 · Updated 3 weeks ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆146 · Updated 8 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆140 · Updated last month
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆49 · Updated last year
- Index Common Crawl archives in tabular format ☆122 · Updated last month
- 🖍️ Highlight text in documents ☆109 · Updated 4 months ago
- A polite and user-friendly downloader for Common Crawl data ☆57 · Updated last month
- Various Jupyter notebooks about Common Crawl data ☆58 · Updated 5 months ago
- Detect and redact PII locally with SOTA performance ☆72 · Updated 5 months ago
- Python API for https://vespa.ai, the open big data serving engine ☆141 · Updated this week
- Article extraction benchmark: dataset and evaluation scripts ☆322 · Updated last year
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆142 · Updated 5 months ago
- Completion After Prompt Probability. Make your LLM make a choice ☆80 · Updated 10 months ago
- Spider ported to Python ☆91 · Updated 7 months ago
- Demo example of consumer goods categorization ☆28 · Updated last year
- A robust web archive analytics toolkit ☆116 · Updated 5 months ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆293 · Updated 4 months ago
- NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, … ☆85 · Updated 9 months ago
- Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI ☆114 · Updated 2 years ago
- Pre-train Static Word Embeddings ☆85 · Updated last week
- 💭 Build autonomous agents, retrieval augmented generation (RAG) processes and language model powered chat applications ☆297 · Updated 4 months ago
- Query language for blending SQL and LLMs across structured + unstructured data, with type constraints. ☆109 · Updated last week
- Use AWS Lambda functions as a proxy pool to scrape web pages. ☆137 · Updated last year
- A research Python package for detecting, categorizing, and assessing the severity of personally identifiable information (PII) ☆89 · Updated 3 weeks ago
- This repository serves as a collection of scrapers procuring and structuring various legal datasets ☆18 · Updated 2 years ago
- Lego AI Parser is an open-source application that uses OpenAI to parse visible text of HTML elements. ☆236 · Updated last year
- ☆12 · Updated last week