domanchi / gibberish-detector
Train a model, and detect gibberish strings with it.
☆64Updated 3 years ago
Alternatives and similar repositories for gibberish-detector
Users who are interested in gibberish-detector are comparing it to the libraries listed below
- A Python utility for downloading Common Crawl data☆243Updated 2 years ago
- 80x faster and 95% accurate language identification with Fasttext☆162Updated last year
- 🐍 A CPython extension for the Hyperscan regular expression matching library.☆183Updated 3 weeks ago
- Fast and robust date extraction from web pages, with Python or on the command-line☆138Updated 3 weeks ago
- Pythonic search engine based on PyLucene.☆129Updated last week
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters☆145Updated 8 months ago
- Python3 bindings for the Compact Language Detector v3 (CLD3)☆154Updated 2 years ago
- 🚂 Fine-tune OpenAI models for text classification, question answering, and more☆16Updated 2 years ago
- Multi-Language Identification☆28Updated last year
- Fuzzy matching and more functionality for spaCy.☆257Updated last year
- Statistics of Common Crawl monthly archives mined from URL index files☆188Updated last week
- 🖍️ Highlight text in documents☆108Updated 4 months ago
- A fully customisable language detection pipeline for spaCy☆93Updated 6 years ago
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine☆183Updated 7 months ago
- A fast Python implementation of the SimHash algorithm.☆27Updated 3 years ago
- Article extraction benchmark: dataset and evaluation scripts☆321Updated last year
- Hashformers is a framework for hashtag segmentation with Transformers and Large Language Models (LLMs).☆73Updated last year
- A Python-based HTML to text conversion library, command line client and Web service.☆315Updated 3 weeks ago
- Process Common Crawl data with Python and Spark☆440Updated this week
- A research Python package for detecting, categorizing, and assessing the severity of personally identifiable information (PII)☆89Updated last week
- Targeted language identifier, based on FastText and Hunspell.☆37Updated 6 months ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models.☆110Updated last year
- Curated list of awesome software and resources for Senzing, The First Real-Time AI for Entity Resolution.☆61Updated this week
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy☆63Updated last year
- Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further lang…☆126Updated last year