nik0spapp/sdalg

Web page segmentation and noise removal

☆55

Alternatives and similar repositories for sdalg

Users who are interested in sdalg are comparing it to the repositories listed below.

asanoja / segmentations
Tools for web page segmentation. In development
☆17 · Nov 7, 2018 · Updated 7 years ago
liaocyintl / web-segment
Segment an HTML document into structural data
☆12 · Jan 15, 2019 · Updated 7 years ago
asanoja / web-segmentation-evaluation
Tools for web page segmentation evaluation
☆13 · Nov 6, 2019 · Updated 6 years ago
tpopela / vips_java
Implementation of the Vision Based Page Segmentation algorithm in Java
☆107 · Oct 25, 2019 · Updated 6 years ago
wushuartgaro / VipsPython
Implementation of Microsoft Vips algorithm in Python
☆19 · Oct 9, 2019 · Updated 6 years ago
openpreserve / pagelyzer
Suite of tools for detecting changes in web pages and their rendering
☆56 · Dec 17, 2023 · Updated 2 years ago
nik0spapp / wmil
Weighted multiple-instance learning algorithm
☆18 · Oct 9, 2018 · Updated 7 years ago
linuxlizard / page_segmentation
Page Segmentation Code. I'm working with OCRopus and the UW-III data set to test how the page segmentation algorithms work with smaller s…
☆20 · Feb 23, 2013 · Updated 13 years ago
rkrzr / dataset-popular
A dataset of popular pages (taken from <dir.yahoo.com>) with manually marked up semantic blocks.
☆15 · Feb 9, 2014 · Updated 12 years ago
superisaac / pycetr
Python implementation of CETR: Content Extraction via Tag Ratios
☆13 · Jan 18, 2012 · Updated 14 years ago
nickstanisha / CIFAR-10_data
Python 2/3 compatible .npz CIFAR-10 dataset
☆10 · Mar 1, 2017 · Updated 9 years ago
lqtri / WebPage-Segmentation--WPS-
Webpage segmentation using DBSCAN
☆13 · Apr 4, 2023 · Updated 3 years ago
webis-de / cikm20-web-page-segmentation-revisited-evaluation-framework-and-dataset
Code for "Web Page Segmentation Revisited: Evaluation Framework and Dataset", accepted as a resources paper at CIKM 2020
☆14 · Jan 13, 2023 · Updated 3 years ago
luyug / MORES
☆10 · Apr 16, 2021 · Updated 5 years ago
TeamHG-Memex / soft404
A classifier for detecting soft 404 pages
☆65 · Apr 8, 2026 · Updated 3 months ago
ziyan / spider
Web Content Extraction Through Machine Learning
☆185 · Apr 4, 2014 · Updated 12 years ago
onnovalkering / vscode-singularity
Provides syntax highlighting for Apptainer/Singularity definition files.
☆10 · Dec 24, 2025 · Updated 6 months ago
siddhantgoel / flask-filealchemy
YAML-formatted plain-text file based models for Flask backed by Flask-SQLAlchemy
☆23 · Jan 14, 2025 · Updated last year
scrapinghub / mdr
A Python library to detect and extract listing data from HTML pages.
☆110 · May 5, 2017 · Updated 9 years ago
nikitautiu / learnhtml
Web content extraction using machine learning
☆34 · Mar 3, 2021 · Updated 5 years ago
rafaelcapucho / scrapy-eagle
Scrapy Eagle is a tool that allows us to run any Scrapy-based project in a distributed fashion and monitor how it is going on and how many…
☆24 · Sep 4, 2020 · Updated 5 years ago
rmax / databrewer
The missing datasets manager. Like Homebrew but for datasets. CLI tool to search and discover datasets!
☆41 · May 29, 2017 · Updated 9 years ago
scrapinghub / product-extraction-benchmark
☆16 · Apr 10, 2026 · Updated 3 months ago
ppasupat / web-entity-extractor-ACL2014
☆13 · Jun 14, 2016 · Updated 10 years ago
TeamHG-Memex / url-summary
Show a summary of a large number of URLs in a Jupyter Notebook
☆19 · Apr 8, 2026 · Updated 3 months ago
commonsearch / gumbocy
Python binding for gumbo-parser using Cython
☆14 · Aug 16, 2016 · Updated 9 years ago
blaze / datafabric
A distributed in-memory fabric based on shared-memory blocks and datashape. Any language can operate on the data.
☆13 · Feb 12, 2016 · Updated 10 years ago
scrapinghub / page_finder
Find which links on a web page are pagination links
☆29 · Jan 12, 2017 · Updated 9 years ago
xtannier / WebAnnotator
WebAnnotator is a tool for annotating Web pages. WebAnnotator is implemented as a Firefox extension (https://addons.mozilla.org/en-US/fi…
☆48 · Dec 17, 2021 · Updated 4 years ago
seagatesoft / webdext
Intelligent Web Data Extractor
☆74 · Dec 5, 2022 · Updated 3 years ago
dragnet-org / dragnet_data
Code and data used to build a training dataset for dragnet models
☆10 · Nov 29, 2020 · Updated 5 years ago
TeamHG-Memex / arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
☆162 · Apr 8, 2026 · Updated 3 months ago
commonsearch / urlparse4
Faster replacement for Python's urlparse module
☆46 · Apr 13, 2026 · Updated 3 months ago
trec-kba / streamcorpus
Common data interchange format for document processing pipelines that apply natural language processing tools to large streams of text
☆35 · Sep 30, 2016 · Updated 9 years ago
bhavishya235 / Web-Classification
This project deals with hierarchical classification of web pages based on the DMOZ dataset.
☆14 · Apr 10, 2014 · Updated 12 years ago
rsling / texrex
texrex web page cleaning & ClaraX random walk crawler
☆11 · Dec 13, 2021 · Updated 4 years ago
rmax / databrewer-recipes
DataBrewer Recipes Repository.
☆21 · Jul 5, 2016 · Updated 10 years ago
zycdev / L2R2
PyTorch implementation of L2R2 in SIGIR 2020
☆17 · Jun 12, 2023 · Updated 3 years ago
clips / yarn
Disambiguating biomedical and clinical concepts with word embeddings
☆15 · Apr 17, 2018 · Updated 8 years ago