Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
☆161 · Dec 19, 2025 · Updated 3 months ago
Alternatives and similar repositories for courlan
Users that are interested in courlan are comparing it to the libraries listed below.
- Fast and robust date extraction from web pages, with Python or on the command-line ☆146 · Nov 4, 2025 · Updated 4 months ago
- Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM… ☆5,517 · Sep 12, 2025 · Updated 6 months ago
- A robust web archive analytics toolkit ☆134 · Oct 15, 2025 · Updated 5 months ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆29 · Nov 18, 2025 · Updated 4 months ago
- ChatGPT with access to the internet ☆26 · Jun 16, 2023 · Updated 2 years ago
- Remove DIVs, style stuff and normalize HTML preserving structure information ☆14 · Oct 24, 2025 · Updated 4 months ago
- A Python-based HTML to text conversion library, command line client and Web service. ☆339 · Feb 27, 2026 · Updated 3 weeks ago
- A helper library full of URL-related heuristics. ☆76 · Feb 11, 2026 · Updated last month
- Structured outputs from DSPy and Jinja2 ☆27 · Jun 27, 2025 · Updated 8 months ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆300 · May 19, 2025 · Updated 10 months ago
- ☆26 · May 31, 2024 · Updated last year
- A reddit bot that finds original publish dates on linked articles. ☆10 · Nov 30, 2024 · Updated last year
- Targeted language identifier, based on FastText and Hunspell. ☆38 · Sep 4, 2025 · Updated 6 months ago
- texrex web page cleaning & ClaraX random walk crawler ☆11 · Dec 13, 2021 · Updated 4 years ago
- Article extraction benchmark: dataset and evaluation scripts ☆356 · Mar 1, 2026 · Updated 3 weeks ago
- Python library for converting HTML to markup or plain text ☆18 · Aug 30, 2025 · Updated 6 months ago
- Replicate interface for IF ☆10 · May 17, 2023 · Updated 2 years ago
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ☆189 · Jun 6, 2025 · Updated 9 months ago
- SDK to access ZenRows API directly from Python. We handle proxy rotation, headless browsers and CAPTCHAs for you. ☆18 · Jan 22, 2026 · Updated 2 months ago
- GC4LM: A Colossal (Biased) language model for German ☆13 · May 2, 2021 · Updated 4 years ago
- Source code and data for Like a Good Nearest Neighbor ☆30 · Jan 12, 2025 · Updated last year
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆169 · Nov 7, 2022 · Updated 3 years ago
- Extract embedded metadata from HTML markup ☆956 · Oct 1, 2025 · Updated 5 months ago
- scraping and querying documents for LLMs ☆24 · Oct 6, 2025 · Updated 5 months ago
- CAMeL Dataset ☆15 · Apr 15, 2025 · Updated 11 months ago
- The official ArangoDB async Python driver ☆13 · Feb 11, 2026 · Updated last month
- ☆11 · Mar 18, 2024 · Updated 2 years ago
- 💬 Language Identification with Support for More Than 2000 Labels -- EMNLP 2023 ☆191 · Mar 12, 2026 · Updated last week
- Migrated to: https://codeberg.org/openculinary/knowledge-graph ☆11 · Aug 21, 2025 · Updated 7 months ago
- A News Article Collection Library ☆22 · Mar 31, 2023 · Updated 2 years ago
- docker:dind with NVIDIA GPU support via NVIDIA container toolkit ☆13 · Mar 4, 2026 · Updated 2 weeks ago
- The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens. ☆14 · Mar 30, 2024 · Updated last year
- A TUI for Managing and Searching with Meilisearch ☆20 · Aug 26, 2025 · Updated 6 months ago
- Music generation using Elementary Cellular Automata. ☆13 · Nov 23, 2015 · Updated 10 years ago
- Neural models for detecting and masking personal information from texts ☆16 · Nov 25, 2022 · Updated 3 years ago
- How-to guides on web-crawling or scraping ☆27 · Apr 26, 2025 · Updated 10 months ago
- A fast Python implementation of the SimHash algorithm. ☆27 · Oct 27, 2021 · Updated 4 years ago
- news-please - an integrated web crawler and information extractor for news that just works ☆2,401 · Sep 21, 2025 · Updated 6 months ago
- 🧹 Python package for text cleaning ☆1,003 · Jan 28, 2026 · Updated last month