adbar/courlan

Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters

☆177

Alternatives and similar repositories for courlan

Users who are interested in courlan are comparing it to the libraries listed below.

adbar / htmldate
Fast and robust date extraction from web pages, with Python or on the command-line
☆154 · Updated this week
adbar / trafilatura
Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM…
☆6,334 · Updated this week
chatnoir-eu / chatnoir-resiliparse
A robust web archive analytics toolkit
☆144 · Updated this week
adbar / py3langid
Faster, modernized fork of the language identification tool langid.py
☆63 · Nov 22, 2024 · Updated last year
hizkifw / bong
ChatGPT with access to the internet
☆25 · Jun 16, 2023 · Updated 3 years ago
chatnoir-eu / web-content-extraction-benchmark
Web Content Extraction Benchmark
☆28 · Dec 16, 2025 · Updated 7 months ago
KorAP / Krill
A Corpus Data Retrieval Index using Lucene for Look-Ups
☆20 · Updated this week
zytedata / clear-html
Remove DIVs, style stuff and normalize HTML preserving structure information
☆14 · Oct 24, 2025 · Updated 9 months ago
alea-institute / nupunkt
Next-generation Punkt sentence boundary detection with zero dependencies
☆32 · Nov 18, 2025 · Updated 8 months ago
weblyzard / inscriptis
A Python-based HTML-to-text conversion library, command-line client and web service.
☆345 · Updated this week
cisnlp / GlotWeb
[WWW 2026] 🕸 GlotWeb: Web Indexing for Minority Languages
☆17 · Apr 14, 2026 · Updated 3 months ago
liao961120 / concordancer
Searching in-memory corpus with Corpus Query Language (CQL)
☆19 · Dec 2, 2024 · Updated last year
Intsights / PyDomainExtractor
A blazingly fast domain extraction library written in Rust
☆68 · Aug 11, 2025 · Updated 11 months ago
chrisstiles / PublishDateBot
A Reddit bot that finds original publish dates on linked articles.
☆10 · Nov 30, 2024 · Updated last year
mbanon / fastspell
Targeted language identifier, based on FastText and Hunspell.
☆38 · Sep 4, 2025 · Updated 10 months ago
scrapinghub / article-extraction-benchmark
Article extraction benchmark: dataset and evaluation scripts
☆376 · May 29, 2026 · Updated last month
saintzema / legal-ai-agent
An AI solution that interprets legal documents such as Contract Review, Legal Research, Risk Assessment, Compliance Check built with pyth…
☆15 · Dec 23, 2024 · Updated last year
telekom / lazy-imports
Python tool to support lazy imports.
☆31 · Jun 9, 2025 · Updated last year
flyingcircusio / pytest-patterns
pytest-patterns is a plugin for pytest that provides a pattern matching engine optimized for testing.
☆30 · Oct 23, 2024 · Updated last year
snat-s / m
☆22 · Feb 6, 2026 · Updated 5 months ago
internetarchive / sandcrawler
Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki
☆28 · Jul 31, 2024 · Updated last year
stefan-it / gc4lm
GC4LM: A Colossal (Biased) language model for German
☆13 · May 2, 2021 · Updated 5 years ago
zzstoatzz / raggy
scraping and querying documents for LLMs
☆24 · Oct 6, 2025 · Updated 9 months ago
Lucaterre / spacyfishing
A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata
☆173 · Nov 7, 2022 · Updated 3 years ago
ubbdst / elasticsearch-rdf-river
RDF river plugin for harvesting metadata from Jena TDB, SPARQL endpoints or plain RDF files into Elasticsearch
☆10 · May 20, 2022 · Updated 4 years ago
UKPLab / eacl2024-lagonn
Source code and data for Like a Good Nearest Neighbor
☆30 · Jan 12, 2025 · Updated last year
oneai-nlp / oneai-python
Python SDK for One AI APIs. One AI is an NLP-as-a-service platform. Our APIs enable language comprehension in context, transforming text…
☆38 · Aug 24, 2023 · Updated 2 years ago
scrapinghub / extruct
Extract embedded metadata from HTML markup
☆966 · Apr 1, 2026 · Updated 3 months ago
arangodb / python-arango-async
The official ArangoDB async Python driver
☆14 · Jun 21, 2026 · Updated last month
NorskRegnesentral / NeuralTextSanitizer
Neural models for detecting and masking personal information from texts
☆16 · Nov 25, 2022 · Updated 3 years ago
flairNLP / fabricator
[EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models.
☆110 · May 16, 2024 · Updated 2 years ago
allenai / smashed
SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi…
☆35 · May 24, 2024 · Updated 2 years ago
networkdynamics / seldonite
A News Article Collection Library
☆22 · Mar 31, 2023 · Updated 3 years ago
sanders41 / meilisearch-tui
A TUI for Managing and Searching with Meilisearch
☆20 · Aug 26, 2025 · Updated 10 months ago
appeler / clean-names
Deduplicate and parse a list of 'dirty names'
☆22 · Nov 4, 2020 · Updated 5 years ago
dwillis / shot-scraper-nicar24
☆15 · Mar 11, 2024 · Updated 2 years ago
dataesr / works-magnet
Works-magnet: Retrieve and promote the scholarly works of your institution.
☆28 · Jun 12, 2026 · Updated last month
rcarmo / newsfeed-corpus
A Dockerized RSS feed fetcher for NLP work, using asyncio
☆19 · Sep 16, 2022 · Updated 3 years ago
kailas-v / human-ai-interactions
☆11 · Oct 28, 2022 · Updated 3 years ago