dalab/web2text

Source code for the paper "Web2Text: Deep Structured Boilerplate Removal", full paper @ ECIR'18

☆169

Alternatives and similar repositories for web2text

Users that are interested in web2text are comparing it to the libraries listed below.

nikitautiu / learnhtml
Web content extraction using machine learning
☆34 · Mar 3, 2021 · Updated 5 years ago
FeiSun / ContentExtraction
Content Extraction via Text Density (SIGIR11)
☆24 · Sep 21, 2015 · Updated 10 years ago
dragnet-org / dragnet
Just the facts -- web page content extraction
☆1,274 · Jul 8, 2025 · Updated last year
rsling / texrex
texrex web page cleaning & ClaraX random walk crawler
☆11 · Dec 13, 2021 · Updated 4 years ago
seomoz / dragnet_data
Training/test data for Dragnet
☆42 · Jan 29, 2015 · Updated 11 years ago
gogartom / TextMaps
☆91 · Jun 2, 2016 · Updated 10 years ago
scrapinghub / article-extraction-benchmark
Article extraction benchmark: dataset and evaluation scripts
☆376 · May 29, 2026 · Updated last month
hpclab / efficient-query-expansion
Official repository of "Efficient and Effective Query Expansion for Web Search", Short Paper @ CIKM 2018
☆15 · Nov 17, 2019 · Updated 6 years ago
seagatesoft / webdext
Intelligent Web Data Extractor
☆74 · Dec 5, 2022 · Updated 3 years ago
miso-belica / jusText
Heuristic based boilerplate removal tool
☆818 · Feb 25, 2025 · Updated last year
iai-group / webtables-tutorial
Tutorial on Web Table Extraction, Retrieval and Augmentation
☆11 · Mar 28, 2020 · Updated 6 years ago
xnancy / russ
☆16 · Apr 9, 2021 · Updated 5 years ago
dbmdz / deep-eos
General-Purpose Neural Networks for Sentence Boundary Detection
☆74 · Mar 27, 2023 · Updated 3 years ago
currentslab / extractnet
A fork of Dragnet that also extract author, headline, date, keywords from context, as well as built in metadata extraction all in one pac…
☆299 · May 19, 2025 · Updated last year
heshenghuan / ContextFeatureExtractor
A neural text process python lib for context-based feature extraction on Seq-Tagging data.
☆10 · Jul 27, 2018 · Updated 7 years ago
cisnlp / semi-markov-crf
Code for paper "Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging"
☆16 · May 31, 2019 · Updated 7 years ago
wjbmattingly / gliner-finetune
A package for generating synthetic data and fine-tuning a gliner model.
☆14 · Jun 5, 2024 · Updated 2 years ago
duytinvo / acl2016
Don't Count, Predict! An Automatic Approach to Learning Sentiment Lexicons for Short Text
☆13 · Jul 20, 2016 · Updated 10 years ago
hiroshi-manabe / CRFSegmenter
A multi-language segmenter using high-order CRF.
☆17 · Feb 27, 2020 · Updated 6 years ago
bjut-hz / py-mate-tools
python interface for mate tools
☆17 · Jan 23, 2018 · Updated 8 years ago
abenton / deep-dmr
Implementation of Deep Dirichlet Multinomial Regression in python + cython.
☆16 · Mar 7, 2018 · Updated 8 years ago
xflows / textflows
TextFlows is an open-source online platform for composition, execution, and sharing of interactive text mining and natural language proce…
☆19 · Dec 1, 2017 · Updated 8 years ago
areejokaili / topic_labelling
☆21 · Jun 12, 2023 · Updated 3 years ago
dkpro / dkpro-c4corpus
DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate…
☆53 · Jun 12, 2020 · Updated 6 years ago
microsoft / LeadingConversationalSearchbySuggestingUsefulQuestions
WebConf 2020 paper Leading Conversational Search by Suggesting Useful Questions
☆33 · May 4, 2020 · Updated 6 years ago
misja / python-boilerpipe
Python interface to Boilerpipe, Boilerplate Removal and Fulltext Extraction from HTML pages
☆542 · Jul 17, 2021 · Updated 5 years ago
richardpaulhudson / holmes-extractor
Information extraction from English and German texts based on predicate logic
☆144 · Jun 6, 2023 · Updated 3 years ago
sebastian-hofstaetter / neural-ir-explorer
Neural-IR-Explorer: A Content-Focused Tool to Explore Neural Re-Ranking Results
☆31 · Dec 13, 2019 · Updated 6 years ago
pydepta / pydepta
A python implementation of DEPTA
☆84 · Jan 14, 2017 · Updated 9 years ago
allenai / pybart
Converter from UD-trees to BART representation
☆35 · Mar 6, 2024 · Updated 2 years ago
eXascaleInfolab / TRank
Ranking Entity Types using the Web of Data
☆30 · Nov 22, 2016 · Updated 9 years ago
TeamHG-Memex / autologin-middleware
Scrapy middleware for the autologin
☆36 · Apr 8, 2026 · Updated 3 months ago
lucidworks / simple-category-extraction-component
Simple FieldCache based query introspection Solr Search Component - solves the 'red sofa' problem
☆11 · Jan 27, 2025 · Updated last year
google-research-datasets / ccpe
A dataset consisting of 502 English dialogs with 12,000 annotated utterances between a user and an assistant discussing movie preferences…
☆28 · Jan 20, 2021 · Updated 5 years ago
scrapinghub / product-extraction-benchmark
☆16 · Apr 10, 2026 · Updated 3 months ago
zezhix / html-extractor
Optimization of a general-purpose web page main-text extraction algorithm based on the line-block distribution function, implemented in Python
☆61 · Feb 17, 2020 · Updated 6 years ago
nickvosk / acl2015-dataset-learning-to-explain-entity-relationships
Dataset for the ACL 2015 paper: Learning to Explain Entity Relationships in Knowledge Graphs
☆11 · Oct 22, 2015 · Updated 10 years ago
cantab / patentscope
Gem to allow easy access to data from the WIPO PATENTSCOPE Web Service
☆19 · Jun 11, 2026 · Updated last month
TeamHG-Memex / url-summary
Show summary of a large number of URLs in a Jupyter Notebook
☆19 · Apr 8, 2026 · Updated 3 months ago