hynky1999 / CmonCrawl
Common Crawl extractor
☆74 Updated 8 months ago
Alternatives and similar repositories for CmonCrawl:
Users who are interested in CmonCrawl compare it to the libraries listed below:
- Extract web archive data using Wayback Machine and Common Crawl ☆150 Updated 3 months ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆133 Updated last month
- Index Common Crawl archives in tabular format ☆110 Updated 2 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆122 Updated last month
- Statistics of Common Crawl monthly archives mined from URL index files ☆170 Updated last week
- Tools to construct and process webgraphs from Common Crawl data ☆85 Updated 2 weeks ago
- Spider ported to Python ☆66 Updated 2 weeks ago
- Various Jupyter notebooks about Common Crawl data ☆50 Updated 2 weeks ago
- Python API for https://vespa.ai, the open big data serving engine ☆113 Updated this week
- A robust web archive analytics toolkit ☆98 Updated 2 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆300 Updated 9 months ago
- Remote web browser automation. ☆20 Updated 7 months ago
- Get structured output from LLMs ☆32 Updated last year
- UniSim is a package for efficient similarity computation, fuzzy matching, and clustering of data. ☆129 Updated last month
- Efficient few-shot learning with cross-encoders. ☆48 Updated 11 months ago
- This is the repo for the container that holds the models for the text2vec-transformers module ☆48 Updated 2 weeks ago
- Improve prompts for e.g. GPT3 and GPT-J using templates and hyperparameter optimization. ☆41 Updated 2 years ago
- Train a model, and detect gibberish strings with it. ☆60 Updated 2 years ago
- A flexible, adaptive classification system for dynamic text classification ☆65 Updated this week
- One Line To Build Zero-Data Classifiers in Minutes ☆36 Updated 4 months ago
- Python client for txtai ☆11 Updated this week
- LLM plugin for embeddings using sentence-transformers ☆46 Updated this week
- NLP with Rust for Python 🦀🐍 ☆61 Updated 8 months ago
- A python utility for downloading Common Crawl data ☆232 Updated last year
- NLP Cloud serves high performance pre-trained or custom models for NER, sentiment-analysis, classification, summarization, paraphrasing, … ☆78 Updated 2 months ago
- A visual tool to interpret and understand PyTorch machine learning models ☆16 Updated last year
- Transform Unstructured Data into Synthetic Datasets ☆25 Updated 5 months ago
- pyppeteer stealth plugin, attempts to look like a normal browser ☆18 Updated 4 months ago
- Chrome Extension for exploring Hugging Face datasets 🔎 ☆49 Updated 4 months ago
- Search google, bing, yahoo, and other search engines with python ☆55 Updated 3 years ago