chatnoir-eu/web-content-extraction-benchmark

Web Content Extraction Benchmark

☆27

Alternatives and similar repositories for web-content-extraction-benchmark

Users who are interested in web-content-extraction-benchmark are comparing it to the libraries listed below.

omwn / omwn.github.io
The Open Multilingual Wordnet Project Page
☆18 · Jun 3, 2026 · Updated last month
webis-de / ir_axioms
↕️ Intuitive axiomatic retrieval experimentation.
☆31 · Jun 15, 2026 · Updated 3 weeks ago
irlabamsterdam / iKAT
☆21 · Jul 25, 2025 · Updated 11 months ago
webis-de / acl22-identifying-the-human-values-behind-arguments
Machine Learning scripts for the identification of human values behind arguments.
☆24 · Mar 12, 2024 · Updated 2 years ago
commoncrawl / ia-web-commons
Web archiving utility library
☆11 · Jun 19, 2026 · Updated 2 weeks ago
EleutherAI / pilev2
☆13 · Jan 20, 2023 · Updated 3 years ago
jason9693 / ETA4LLMs
Calculating Expected Time for training LLM.
☆39 · Apr 17, 2023 · Updated 3 years ago
OpenMatch / NeuScraper
[ACL 2024] This is the code repo for our ACL’24 paper "Cleaner Pretraining Corpus Curation with Neural Web Scraping".
☆229 · Aug 28, 2024 · Updated last year
H-TayyarMadabushi / SemEval_2022_Task2-idiomaticity
Data and preprocessing scripts for SemEval 2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding
☆16 · Feb 3, 2022 · Updated 4 years ago
myeomans / doc2concrete
Detecting Concreteness in Natural Language
☆16 · Jan 25, 2024 · Updated 2 years ago
sczzz3 / EHRDiff
An official implementation of EHRDiff [TMLR]
☆33 · Jun 25, 2024 · Updated 2 years ago
Nardien / KALA
Official Code Repository for the paper "KALA: Knowledge-Augmented Language Model Adaptation" (NAACL 2022)
☆35 · Oct 17, 2023 · Updated 2 years ago
ArthurSpirling / JustDoOLS
☆13 · Dec 16, 2024 · Updated last year
decred / dcrtimegui
Timestamp files with blockchain
☆14 · Sep 2, 2025 · Updated 10 months ago
lemurproject / ClueWeb22
☆17 · Dec 11, 2024 · Updated last year
sczzz3 / SDI
[npj Digital Medicine'25] Continuous sleep depth index annotation with deep learning yields novel digital biomarkers for sleep health
☆16 · Apr 13, 2025 · Updated last year
thu-unicorn / Doctor-R1
This is the official repository for our paper "Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning" pu…
☆47 · Apr 11, 2026 · Updated 2 months ago
shammur / SemEval2022Task3
The PreTENS shared task hosted at SemEval 2022 aims at focusing on semantic competence with specific attention on the evaluation of langu…
☆12 · Feb 5, 2022 · Updated 4 years ago
graphrag / ms-graphrag
A modular graph-based Retrieval-Augmented Generation (RAG) system
☆16 · May 28, 2026 · Updated last month
mneunhoe / ds3_gan
☆12 · Aug 3, 2022 · Updated 3 years ago
Selbosh / index0
Zero-based indexing in R
☆16 · Dec 6, 2021 · Updated 4 years ago
chkla / NLP2CSS-Tutorial
Tutorial on Transformers 🤖, HuggingFace 🤗 and Social Science Applications 👥 @ IC2S2
☆17 · Aug 8, 2021 · Updated 4 years ago
lancopku / SAPO
C# code for "Towards Easier and Faster Sequence Labeling for Natural Language Processing: A Search-based Probabilistic Online Learning Fr…
☆13 · Nov 19, 2018 · Updated 7 years ago
lancopku / nndep
Transition-based Dependency Parser with neural networks and hybrid oracle
☆13 · May 14, 2018 · Updated 8 years ago
VegB / iNLG
Implementation of "Visualize Before You Write: Imagination-Guided Open-Ended Text Generation".
☆17 · Feb 3, 2023 · Updated 3 years ago
BCHSI / philter-deidstable1_mirror
UCSF Philter for UC
☆15 · Jul 8, 2024 · Updated 2 years ago
pqpq17 / Awesome-LLM-Reasoning-on-Medicine
The Official Repo for Paper: Aligning Clinical Needs and AI Capabilities: A Survey on LLMs for Medical Reasoning
☆24 · Apr 7, 2026 · Updated 3 months ago
carverauto / threadr
🌎 OSS Real-time AI Data Analysis with GraphDB integration. 🔍
☆23 · Mar 10, 2026 · Updated 3 months ago
TimotheeMickus / codwoe
The CODWOE shared task invites you to compare two types of semantic descriptions: dictionary glosses and word embedding representations. …
☆12 · Jul 13, 2022 · Updated 3 years ago
benvancalster / PerfMeasuresOverview
R code and predictions for the case study from Van Calster et al (Validation Studies of Predictive AI for Use in Medical Practice: Overv…
☆22 · Dec 15, 2025 · Updated 6 months ago
keirp / OpenWebMath
☆173 · May 2, 2024 · Updated 2 years ago
Yuanhy1997 / HyPe
HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation Perturbation [ACL 2023]
☆14 · Jul 11, 2023 · Updated 2 years ago
Wang-ML-Lab / TSDA
[ICML 2023] Taxonomy-Structured Domain Adaptation
☆12 · Oct 6, 2023 · Updated 2 years ago
lancopku / ChineseNER
Code for "Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media: A Unified Model"
☆18 · Feb 14, 2022 · Updated 4 years ago
lancopku / Decode-CRF
Conditional Random Fields with Decode-based Learning
☆14 · May 15, 2018 · Updated 8 years ago
Keiji-AI / TrialPanorama
TrialPanorama: Developing Large Language Models Using One Million Clinical Trials
☆27 · Jul 1, 2026 · Updated last week
HenryPengZou / ImplicitAVE
[ACL 2024] Dataset and Code of "ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction…
☆17 · Jun 10, 2024 · Updated 2 years ago
lancopku / ACA4NMT
Code of a novel model for NMT
☆11 · May 13, 2018 · Updated 8 years ago
CASIA-LM / ChineseWebText
☆186 · Nov 13, 2023 · Updated 2 years ago