mohaps/xtractor

XTractor is an algorithmic text extractor from web pages written in Java. It builds upon the "commonly used web design practices" approach (from readability.js, goose and snacktory) to create a set of heuristics for fast article text extraction. It adds several features like paragraph preservation, better image detection heuristics, sibling sco…

☆46

Alternatives and similar repositories for xtractor

Users who are interested in xtractor are comparing it to the libraries listed below.

mohaps / tldrzr
Algorithmic summarizer for RSS/Atom feeds, web URLs, and arbitrary text. Codebase for the application deployed at http://tldrzr.herokuapp.…
☆54 · Sep 4, 2016 · Updated 9 years ago
srijiths / readabilityBUNDLE
A bundle of HTML content extraction algorithms
☆121 · Mar 27, 2015 · Updated 11 years ago
ogrodnek / java_fathom
Java library to measure the readability of English text
☆15 · May 13, 2026 · Updated 2 months ago
EbookFoundation / zimgutenberg
Scraper for downloading the entire ebook repository of Project Gutenberg
☆18 · May 15, 2020 · Updated 6 years ago
serge-hulne / go_iter
Go iter tools (for iterating, mapping, filtering, and reducing streams represented as channels)
☆20 · Oct 13, 2022 · Updated 3 years ago
thfrei / infinite-drawing-canvas
Infinite canvas that allows drawing with a pen and pinch zoom. It should feel more or less like OneNote drawing.
☆15 · Mar 5, 2023 · Updated 3 years ago
PraecantatioLabs / Asclepius
Open Price Comparison for US Hospitals
☆21 · Apr 26, 2018 · Updated 8 years ago
apache / datasketches-pig
Sketch adaptors for Pig.
☆10 · May 15, 2026 · Updated 2 months ago
fiatjaf / pf
A framework for turning written sentences into structured data with simple parsers.
☆18 · Dec 13, 2017 · Updated 8 years ago
mediacloud / feed_seeker
Find RSS, Atom, XML, and RDF feeds on webpages
☆31 · Nov 6, 2025 · Updated 8 months ago
GalkonLtd / JProxyChecker
A free multithreaded proxy checking program written in Java. Load a proxy list and check each proxy to verify it's alive to create a new …
☆11 · Nov 5, 2015 · Updated 10 years ago
zhangw / phantomjs_search_weibo
Search topics on Sina Weibo with PhantomJS
☆11 · Dec 20, 2015 · Updated 10 years ago
villeristi / koa-api-boilerplate
A boilerplate for modern APIs with Koa
☆10 · Aug 24, 2017 · Updated 8 years ago
manateelazycat / css-sort
An Emacs extension that sorts CSS attributes automatically.
☆14 · Nov 22, 2018 · Updated 7 years ago
izerui / spring-boot-actuator-monitor
A monitoring platform based on Spring Boot
☆11 · Jun 17, 2015 · Updated 11 years ago
openva / video-indexer
A process that allows video to be OCRed, as used on Richmond Sunlight.
☆36 · Dec 14, 2013 · Updated 12 years ago
mariuszs / hessian-boot-example
Spring Boot Web with Hessian
☆11 · Jul 2, 2014 · Updated 12 years ago
mykite / pan-search
Cloud-drive file search built on top of search engines
☆12 · Nov 15, 2018 · Updated 7 years ago
ysc / HtmlExtractor
HtmlExtractor is a Java component for template-based, precise extraction of structured information from web pages.
☆154 · Aug 27, 2018 · Updated 7 years ago
joshlong-attic / the-operationalized-application
Code accompanying my talk on building applications that are easily operationalized once in production
☆24 · Dec 4, 2015 · Updated 10 years ago
0b01 / bodine
Finds the best synonyms from Google Books when you press a hotkey
☆30 · Dec 24, 2014 · Updated 11 years ago
lublak / pdfdataextract
Extract data from a PDF with pure JavaScript
☆31 · Mar 29, 2025 · Updated last year
rdgarce / bq
A blazing-fast, MT-safe, lock-free, and branchless circular byte buffer for SPSC in 50 LOC
☆13 · Sep 16, 2025 · Updated 10 months ago
jeanetienne / CollectionViewDragDrop
Issue with NSCollectionView's default drag and drop implementation
☆12 · May 3, 2018 · Updated 8 years ago
liangyangtao / ubk_weixinbysogou
A program that collects WeChat official accounts via Sogou WeChat search
☆16 · Nov 12, 2015 · Updated 10 years ago
stefbehl / hawkbit-101
Material for hawkBit 101
☆16 · Oct 30, 2019 · Updated 6 years ago
jgm / standalone-html
Incorporates external dependencies into an HTML file using the data: URI scheme
☆21 · Nov 17, 2011 · Updated 14 years ago
scrapinghub / autopager
Detect and classify pagination links
☆15 · Sep 9, 2020 · Updated 5 years ago
bencampion / reverse-country-code
Java library for converting latitude and longitude coordinates into ISO 3166-1 two-letter country codes.
☆36 · Aug 19, 2019 · Updated 6 years ago
dpdearing / nlp
NLP Sandbox
☆14 · Nov 26, 2016 · Updated 9 years ago
s-kostyaev / go-fill-struct
Fill Go structs in Emacs
☆20 · Mar 8, 2023 · Updated 3 years ago
appleton / elixir-crawler
A web crawler - my first Elixir project
☆13 · Jul 8, 2015 · Updated 11 years ago
jdormit / ob-graphql
GraphQL execution backend for org-babel
☆19 · Dec 22, 2020 · Updated 5 years ago
revisitors / readimage
Read a JPG, PNG, or GIF image into a standard binary format in memory.
☆14 · Sep 1, 2014 · Updated 11 years ago
tsiki / connectednotes
A note-taking app based on Zettelkasten.
☆27 · Jun 3, 2022 · Updated 4 years ago
taboola / async-profiler-actuator-endpoint
☆39 · Feb 18, 2025 · Updated last year
Zeryther / country-to-aws-region
JS library that helps get the closest AWS region from a country code
☆12 · Jan 14, 2023 · Updated 3 years ago
Immortalin / Simulacra
Simple and Ideal Circuit Simulation
☆13 · Dec 4, 2017 · Updated 8 years ago
gip / resque-telework
A Resque plugin for managing workers on remote hosts
☆24 · Mar 25, 2014 · Updated 12 years ago