adbar/htmldate

Fast and robust date extraction from web pages, with Python or on the command-line

☆154

Alternatives and similar repositories for htmldate

Users that are interested in htmldate are comparing it to the libraries listed below.

adbar / courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
☆176 · Updated this week
adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM…
☆6,318 · Updated this week
zentrum-lexikographie / dwdsmor
SFST/SMOR/DWDS-based German Morphology
☆21 · Jun 25, 2026 · Updated 3 weeks ago
originell / smaz-py3
Small string compression using smaz compression algorithm. Fast, because it's in C. Supports Python 3+
☆13 · Oct 18, 2025 · Updated 9 months ago
adbar / simplemma
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
☆209 · Updated this week
jmriebold / BoilerPy3
Python port of Boilerpipe library
☆96 · Aug 20, 2024 · Updated last year
ybracke / transnormer
A lexical normalizer for historical spelling variants using a transformer architecture.
☆10 · Mar 12, 2025 · Updated last year
alea-institute / nupunkt
Next-generation Punkt sentence boundary detection with zero dependencies
☆32 · Nov 18, 2025 · Updated 8 months ago
rsling / texrex
texrex web page cleaning & ClaraX random walk crawler
☆11 · Dec 13, 2021 · Updated 4 years ago
harvard-lil / WARC-diff-tools
Comparing warc files
☆17 · Feb 21, 2019 · Updated 7 years ago
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 · May 2, 2021 · Updated 5 years ago
yuzhaouoe / pretraining-data-packing
[ACL'24 Oral] Analysing The Impact of Sequence Composition on Language Model Pre-Training
☆24 · Aug 18, 2024 · Updated last year
cceyda / common-voice-explorer
Common Voice Dataset explorer
☆27 · Jul 4, 2022 · Updated 4 years ago
Kungbib / kblab
KB data lab
☆10 · Dec 8, 2020 · Updated 5 years ago
performant-software / faircopy
FairCopy is a word processor for the humanities scholar.
☆16 · May 26, 2026 · Updated last month
ubbdst / elasticsearch-rdf-river
RDF river plugin for harvesting metadata from Jena TDB, SPARQL endpoints or plain RDF files into Elasticsearch
☆10 · May 20, 2022 · Updated 4 years ago
cdpierse / pyinfer
Pyinfer is a model agnostic tool for ML developers and researchers to benchmark the inference statistics for machine learning models or f…
☆25 · Feb 19, 2021 · Updated 5 years ago
aws-samples / amazon-automated-forecast
☆10 · Apr 16, 2021 · Updated 5 years ago
oneai-nlp / oneai-python
Python SDK for One AI APIs. One AI is an NLP-as-a-service platform. Our APIs enable language comprehension in context, transforming text…
☆38 · Aug 24, 2023 · Updated 2 years ago
plkumjorn / GrASP
An implementation of GrASP (Shnarch et al., 2017)
☆24 · Aug 29, 2022 · Updated 3 years ago
gregjasonroberts / NLP_EquityMarkets_10K
Applying NLP framework to 10-K filings in equity markets
☆15 · Jul 26, 2021 · Updated 4 years ago
NorskRegnesentral / NeuralTextSanitizer
Neural models for detecting and masking personal information from texts
☆16 · Nov 25, 2022 · Updated 3 years ago
kotartemiy / topic-labeled-news-dataset
100k+ topic labeled news articles published from thousands of news websites
☆19 · Aug 18, 2020 · Updated 5 years ago
akoumjian / datefinder
Find dates inside text using Python and get back datetime objects
☆663 · Mar 25, 2026 · Updated 3 months ago
ChrisHayduk / QLoRA-for-MLM
QLoRA for Masked Language Modeling
☆23 · Sep 11, 2023 · Updated 2 years ago
SkBlaz / rakun2
RaKUn 2.0 - A fast keyword detection algorithm
☆73 · Aug 5, 2025 · Updated 11 months ago
ninoseki / arq-dashboard
A dashboard for ARQ built with FastAPI
☆44 · Dec 15, 2023 · Updated 2 years ago
CyberZHG / wiki-dump-reader
Extract corpora from Wikipedia dumps
☆26 · Mar 26, 2019 · Updated 7 years ago
goose3 / goose3
A Python 3 compatible version of goose http://goose3.readthedocs.io/en/latest/index.html
☆913 · Jun 22, 2026 · Updated 3 weeks ago
bucky2177 / dRiftDM
dRiftDM
☆15 · Jun 6, 2026 · Updated last month
midas-research / bhaav
Dataset of sentences from Hindi stories tagged with different emotion tags
☆11 · Nov 26, 2019 · Updated 6 years ago
thustorage / ccnvme
ccNVMe: crash consistent non-volatile memory express
☆14 · Aug 17, 2021 · Updated 4 years ago
prob140 / prob140
A Berkeley library for probability theory.
☆15 · Jan 14, 2025 · Updated last year
UniversalDependencies / UD_German-HDT
☆14 · May 29, 2026 · Updated last month
performant-software / juxta-desktop
Juxta Desktop Application
☆24 · May 20, 2022 · Updated 4 years ago
a-l-e-x-k / data_clustering_contest
Solution for the 2nd place in Telegram Data Clustering Contest (https://contest.com/docs/data_clustering2).
☆12 · Nov 19, 2020 · Updated 5 years ago
NationalLibraryOfNorway / warchaeology
Command line tool for digging into WARC files
☆50 · Updated this week
NbAiLab / nostram
Norwegian Speech Transformer Models
☆19 · Mar 26, 2026 · Updated 3 months ago
KorAP / Koral
Translation of query languages to serialized KoralQuery protocol
☆15 · Jul 8, 2026 · Updated last week