alan-turing-institute / ReadabiliPy
A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-Python mode.
☆252 Updated last month
Alternatives and similar repositories for ReadabiliPy:
Users who are interested in ReadabiliPy are comparing it to the libraries listed below.
- A Python-based HTML to text conversion library, command line client and Web service. ☆281 Updated last week
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆132 Updated 2 weeks ago
- Python port of Boilerpipe library ☆86 Updated 4 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆296 Updated 8 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆121 Updated 2 weeks ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built in metadata extraction all in one pac… ☆253 Updated last year
- Python bindings for Tantivy ☆308 Updated this week
- Full text search in your Pandas dataframe ☆211 Updated last month
- Heuristic based boilerplate removal tool ☆744 Updated 8 months ago
- Multilingual syllable annotation pipeline component for spaCy ☆39 Updated last year
- Parse numbers written in natural language ☆109 Updated 2 months ago
- hnsqlite integrates hnswlib and sqlite for simple text embedding search ☆157 Updated last year
- Search for words, documents, images, videos, news and maps using the Brave search engine. Downloading files and images to a local hard dr… ☆48 Updated 8 months ago
- 📄 ⚙️ ETL processes for medical and scientific papers ☆372 Updated last week
- Extract text from HTML ☆133 Updated 4 years ago
- https://verdad.app ☆78 Updated 2 weeks ago
- Parse natural language time and date expressions in Python ☆195 Updated 10 months ago
- Python3 bindings for the Compact Language Detector v3 (CLD3) ☆149 Updated last year
- Information extraction from English and German texts based on predicate logic ☆135 Updated last year
- Ultimate Website Sitemap Parser ☆189 Updated this week
- Find the Python code for specified symbols ☆250 Updated last year
- A Python wrapper to extract text from images on a Mac system. Uses the Vision framework from Apple. ☆311 Updated 2 months ago
- Easy-to-Use Apple Vision wrapper for text extraction, scalar representation and clustering using K-means. ☆89 Updated 11 months ago
- Fast Python port of arc90's readability tool, updated to match latest readability.js! ☆2,699 Updated this week
- A spaCy wrapper for GliNER ☆101 Updated 6 months ago
- A Pythonic library providing a lightweight interface to LLMs ☆122 Updated 2 months ago
- This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-s… ☆211 Updated last month
- 80x faster and 95% accurate language identification with Fasttext ☆143 Updated 11 months ago
- ✨ Bootstrap annotation with zero- & few-shot learning via OpenAI GPT-3 ☆320 Updated last year
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx like syntax. ☆57 Updated 8 months ago