rflynn / regroup
Generate a regular expression that describes a set of strings.
☆31Updated 3 years ago
Alternatives and similar repositories for regroup
Users that are interested in regroup are comparing it to the libraries listed below
- A classifier for detecting soft 404 pages☆17Updated 3 years ago
- Common Crawl Index Server☆71Updated 9 months ago
- Adaptive crawler which uses Reinforcement Learning methods☆168Updated 7 years ago
- Find strings/words in text; convenience and C speed☆127Updated 3 years ago
- Textpipe: clean and extract metadata from text☆302Updated 4 years ago
- Python code and data for the post "Word Segmentation, or Makingsenseofthis"☆17Updated 3 years ago
- A Python library to detect and extract listing data from HTML pages.☆108Updated 8 years ago
- ☆16Updated last year
- CoCrawler is a versatile web crawler built using modern tools and concurrency.☆191Updated 3 years ago
- Fast multi-keyword search engine for text strings☆258Updated last year
- Simhash and near-duplicate detection☆421Updated 2 years ago
- Web Content Extraction Through Machine Learning☆185Updated 11 years ago
- Detect and classify pagination links☆104Updated 2 months ago
- Lightning Fast Language Prediction 🚀☆167Updated 3 months ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts☆81Updated 7 years ago
- A classifier for detecting soft 404 pages☆57Updated 2 months ago
- Search relevance evaluation toolkit☆74Updated 3 years ago
- Extract Unique Word Lists From Wikipedia Database☆13Updated 5 years ago
- Extract text from HTML☆135Updated 5 years ago
- Tools and other things for people who work on search relevance & information retrieval☆87Updated 2 years ago
- A generic crawler☆78Updated 7 years ago
- Formasaurus tells you the type of an HTML form and its fields using machine learning☆119Updated last year
- SuperMinHash: A New Minwise Hashing Algorithm for Jaccard Similarity Estimation, Simhash and SimhashIndex☆19Updated 3 years ago
- A small tool which uses the CommonCrawl URL Index to download documents with certain file types or mime-types. This is used for mass-test…☆72Updated 2 weeks ago
- NER toolkit for HTML data☆259Updated last year
- An efficient simhash implementation for python☆126Updated 6 years ago
- Automatically extracts and normalizes an online article or blog post publication date☆117Updated 2 years ago
- Script and sample dataset of all urban dictionary entry names (around 1.4 million total)☆95Updated 3 years ago
- Extract the difference between two HTML pages☆32Updated 7 years ago
- English word segmentation, written in pure-Python, and based on a trillion-word corpus.☆378Updated 2 years ago