mohaps / xtractor

XTractor is an algorithmic web-page text extractor written in Java. It builds on the "commonly used web design practices" approach (from readability.js, goose and snacktory) to create a set of heuristics for fast article text extraction. It adds several features like paragraph preservation, better image detection heuristics, sibling sco…
44 · Updated 10 years ago

Alternatives and similar repositories for xtractor

Users that are interested in xtractor are comparing it to the libraries listed below
