rflynn / regroup
Generate a regular expression that describes a set of strings.
☆30 Updated 2 years ago
Alternatives and similar repositories for regroup
Users who are interested in regroup are comparing it to the libraries listed below.
- Common Crawl Index Server ☆70 Updated 5 months ago
- A classifier for detecting soft 404 pages ☆17 Updated 2 years ago
- Fast multi-keyword search engine for text strings ☆256 Updated 11 months ago
- Python code and data for the post "Word Segmentation, or Makingsenseofthis" ☆17 Updated 2 years ago
- Find strings/words in text; convenience and C speed ☆127 Updated 2 years ago
- English word segmentation, written in pure-Python, and based on a trillion-word corpus. ☆376 Updated 2 years ago
- Package to facilitate URL clustering ☆69 Updated 9 years ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆81 Updated 7 years ago
- An efficient simhash implementation for python ☆126 Updated 5 years ago
- Creates github index for similar repositories discovery ☆193 Updated 9 years ago
- Simple heuristic for measuring web page similarity (& data set) ☆91 Updated 7 years ago
- Python module to generate all regular expression matches ☆185 Updated 8 months ago
- A natural language semantic parser ☆112 Updated 7 years ago
- Frontera backend to guide a crawl using PageRank, HITS or other ranking algorithms based on the link structure of the web graph, even whe… ☆55 Updated last year
- a pure python MurmurHash3 implementation. ☆69 Updated 5 years ago
- Simhash and near-duplicate detection ☆417 Updated 2 years ago
- Automatically extracts and normalizes an online article or blog post publication date ☆117 Updated 2 years ago
- An index data structure for approximate string search. ☆23 Updated 6 years ago
- Extract Unique Word Lists From Wikipedia Database ☆13 Updated 5 years ago
- A pure python implementation of locality sensitive hashing for text documents ☆85 Updated 9 years ago
- Homoglyphs: get similar letters, convert to ASCII, detect possible languages and UTF-8 group. ☆82 Updated 4 years ago
- Implementation of perceptual image hash calculation in Python ☆133 Updated last year
- Extracts the top level domain (TLD) from the URL given. ☆181 Updated 2 months ago
- A component that tries to avoid downloading duplicate content ☆27 Updated 7 years ago
- Non-Overlapping Aho-Corasick Python extension, for Python 2 (str and unicode) and Python 3 ☆51 Updated 10 years ago
- Memory-based shallow parser for Python ☆74 Updated 6 years ago
- Nostril: Nonsense String Evaluator ☆196 Updated 3 years ago
- Locality-sensitive hashing algorithm for text similarity comparisons ☆58 Updated 4 months ago
- Show summary of a large number of URLs in a Jupyter Notebook ☆17 Updated 4 years ago
- Compare html similarity using structural and style metrics ☆213 Updated 2 years ago