cdimascio / essence
Automatically extract the main text content (and more) from an HTML document
☆117Updated 2 years ago
Alternatives and similar repositories for essence
Users that are interested in essence are comparing it to the libraries listed below
- Crux offers a flexible plugin-based API & implementation to extract interesting information from Web pages.☆240Updated 2 months ago
- A Kotlin port of Mozilla's Readability. It extracts a website's relevant content and removes all clutter from it.☆160Updated 3 years ago
- Kotlin/Java library and cli tool for scraping posts and media from various sources with neither authorization nor full page rendering (Fa…☆290Updated last week
- A set of reusable Java components that implement functionality common to any web crawler☆244Updated 2 weeks ago
- Life and collaboration assistant.☆36Updated last week
- Article extraction benchmark: dataset and evaluation scripts☆317Updated last year
- A Directory of Online Newspaper Sources for 70+ Languages☆32Updated 4 years ago
- ShoeBox is a Kotlin library for persistent data storage that supports views and the observer pattern. While often used with Kweb, it doe…☆49Updated 4 years ago
- SimpleDNN is a machine learning lightweight open-source library written in Kotlin designed to support relevant neural network architectur…☆99Updated 5 years ago
- Module that extracts structured information from a rendered html site and outputs JSON. HTML to JSON.☆70Updated 4 years ago
- Java library to extract links (URLs, email addresses) from plain text; fast, small and smart☆209Updated 2 weeks ago
- A Kotlin and Java library for RSS podcast feeds☆26Updated last year
- Cloud crawler functions for scrapeulous☆45Updated 4 years ago
- A language detection Web Service☆53Updated 8 years ago
- NeuralParser is a very simple to use dependency parser, based on the Latent Syntactic Structure encoding.☆21Updated 5 years ago
- Program used to split text into segments☆27Updated 7 months ago
- A Kotlin/Java API for generating .ts source files.☆47Updated last year
- A web crawling framework written in Kotlin☆130Updated 3 years ago
- Scraper for downloading the entire ebooks repository of project Gutenberg☆150Updated this week
- Multiplatform Kotlin Hello World (Android/iOS/Java/JavaScript/Native)☆78Updated 11 months ago
- Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18☆169Updated 3 years ago
- extJWNL (Extended Java WordNet Library) is a Java API for creating, reading and updating dictionaries in WordNet format.☆129Updated last year
- Index Common Crawl archives in tabular format☆122Updated last month
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine☆177Updated 5 months ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/☆193Updated 6 years ago
- Bindings to Google's Compact Language Detector 3 to JVM Based Languages☆22Updated last year
- Boilerplate Removal using Deep Learning☆82Updated 3 years ago
- Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm☆67Updated 4 years ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac…☆284Updated last month
- A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode.☆324Updated 6 months ago