cdimascio / essence
Automatically extract the main text content (and more) from an HTML document
☆116 · Updated 2 years ago
Related projects
Alternatives and complementary repositories for essence
- A set of reusable Java components that implement functionality common to any web crawler ☆237 · Updated last week
- Article extraction benchmark: dataset and evaluation scripts ☆289 · Updated 6 months ago
- extJWNL (Extended Java WordNet Library) is a Java API for creating, reading and updating dictionaries in WordNet format. ☆126 · Updated 8 months ago
- Readability clone in Java ☆461 · Updated 4 years ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆234 · Updated 10 months ago
- Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm ☆64 · Updated 3 years ago
- A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-Python mode. ☆230 · Updated 2 months ago
- A Python-based HTML to text conversion library, command line client and Web service. ☆277 · Updated 8 months ago
- Lego AI Parser is an open-source application that uses OpenAI to parse visible text of HTML elements. ☆230 · Updated 5 months ago
- Boilerplate Removal using Deep Learning ☆82 · Updated 2 years ago
- Index Common Crawl archives in tabular format ☆106 · Updated 2 weeks ago
- Java library for reading and writing WARC files with a typed API ☆47 · Updated this week
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆80 · Updated 6 years ago
- Distributed crawling infrastructure running on top of serverless computation, cloud storage (such as S3) and sophisticated queues. ☆415 · Updated last year
- Various utilities regarding Levenshtein transducers. (Java) ☆56 · Updated 2 years ago
- The Sweble Wikitext Components module provides a parser for MediaWiki's wikitext and an engine trying to emulate the behavior of a MediaW… ☆70 · Updated 7 months ago
- Python code for building a GPT-3-based technical blog post optimizer. ☆84 · Updated 2 years ago
- Cloud crawler functions for scrapeulous ☆44 · Updated 3 years ago
- A natural language event parser for Java and Android. ☆103 · Updated 4 years ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆121 · Updated last week
- Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema.org microdata and JSO… ☆150 · Updated last year
- Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18 ☆167 · Updated 3 years ago
- estela, an elastic web scraping cluster 🕸 ☆172 · Updated 2 weeks ago
- A port of the arclabs 'readability' package to Java ☆72 · Updated 12 years ago
- Repackaging of Boilerpipe published on Maven Central Repository. ☆53 · Updated 10 months ago
- PDF parser and converter to HTML ☆83 · Updated last month
- A language detection library for the JVM ☆36 · Updated last year
- Hunspell library for Java based on JNA ☆62 · Updated last year
- Easily crawl news portals or blog sites using Storm Crawler. ☆20 · Updated last year