sammyer / BoilerPy
Python port of Boilerpipe library
☆15 Updated 6 years ago
Alternatives and similar repositories for BoilerPy:
Users interested in BoilerPy are comparing it to the libraries listed below.
- Python's missing statistical Swiss Army knife ☆15 Updated 9 years ago
- Extract the difference between two HTML pages ☆32 Updated 6 years ago
- Traptor -- A distributed Twitter feed ☆26 Updated 2 years ago
- Common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text ☆35 Updated 8 years ago
- A component that tries to avoid downloading duplicate content ☆27 Updated 6 years ago
- Semanticizest: dump parser and client ☆20 Updated 8 years ago
- Paginating the web ☆37 Updated 11 years ago
- A python3 library for efficiently storing massive integers (stands for gzipped-integer). ☆41 Updated 4 years ago
- ☆18 Updated 8 years ago
- WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi… ☆48 Updated 3 years ago
- Wikipedia API wrapper for humans and elk. (en.wikipedia.org/w/api.php, get it?) ☆36 Updated 10 years ago
- A Python library for the Ion format ☆12 Updated 7 years ago
- Find which links on a web page are pagination links ☆29 Updated 8 years ago
- Python's extensions to thumbor. These are used to generate safe urls among others. ☆62 Updated last year
- Django feeds provides an extensive database model for RSS feeds and a fault tolerant parser. ☆31 Updated 12 years ago
- An attempt at creating a silver/gold standard dataset for backtesting yesterday & today's content-extractors ☆34 Updated 9 years ago
- Twitter crawler ☆11 Updated 10 years ago
- Python bindings for CLD2. ☆16 Updated 6 years ago
- Read natural language interactive queries. Great for bots. ☆18 Updated 8 years ago
- Scrapy middleware for the autologin ☆37 Updated 6 years ago
- This is a Python binding to the tokenizer Ucto. Tokenisation is one of the first steps in almost any Natural Language Processing task, yet… ☆29 Updated 2 months ago
- Detect and classify pagination links ☆15 Updated 4 years ago
- A classifier for detecting soft 404 pages ☆57 Updated last year
- Modularly extensible semantic metadata validator ☆83 Updated 9 years ago
- A library for ranking collection ☆38 Updated 4 years ago
- A backport of the `yield from` semantics from Python 3.x to Python 2.7 ☆21 Updated 5 years ago
- Stanford Tregex-inspired language for rule-based dependency tree manipulation. ☆21 Updated 7 years ago
- WSGI Profiling Middleware - capture cProfiles with request data. ☆14 Updated 10 years ago
- Document fingerprint generator ☆29 Updated 2 years ago
- Lightweight, multilingual natural language processing ☆63 Updated 11 years ago