Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
☆172 Jun 1, 2026 Updated last week
Alternatives and similar repositories for courlan
Users that are interested in courlan are comparing it to the libraries listed below.
- Fast and robust date extraction from web pages, with Python or on the command-line ☆150 Jun 1, 2026 Updated last week
- Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM… ☆6,087 Updated this week
- Next-generation Punkt sentence boundary detection with zero dependencies ☆31 Nov 18, 2025 Updated 6 months ago
- Faster, modernized fork of the language identification tool langid.py ☆62 Nov 22, 2024 Updated last year
- A Python-based HTML to text conversion library, command line client and Web service. ☆342 May 4, 2026 Updated last month
- A Python scraping module that extracts text from articles found in RSS feeds. Uses SQLite as database. ☆20 Jul 5, 2024 Updated last year
- A helper library full of URL-related heuristics. ☆77 Feb 11, 2026 Updated 4 months ago
- Searching in-memory corpus with Corpus Query Language (CQL) ☆19 Dec 2, 2024 Updated last year
- Targeted language identifier, based on FastText and Hunspell. ☆38 Sep 4, 2025 Updated 9 months ago
- texrex web page cleaning & ClaraX random walk crawler ☆11 Dec 13, 2021 Updated 4 years ago
- Now included in rigour ☆150 Nov 24, 2025 Updated 6 months ago
- Article extraction benchmark: dataset and evaluation scripts ☆373 May 29, 2026 Updated 2 weeks ago
- ☆22 Feb 6, 2026 Updated 4 months ago
- pytest-patterns is a plugin for pytest that provides a pattern matching engine optimized for testing. ☆27 Oct 23, 2024 Updated last year
- Python tool to support lazy imports. ☆31 Jun 9, 2025 Updated last year
- Comparing warc files ☆17 Feb 21, 2019 Updated 7 years ago
- Homebrew tap for my repos ☆40 Jun 1, 2026 Updated last week
- Adds read support for Excel files (xls and xlsx) to agate. ☆18 Updated this week
- GC4LM: A Colossal (Biased) language model for German ☆13 May 2, 2021 Updated 5 years ago
- This collection of general purpose python magic was too good to keep for ourselves! ☆20 May 19, 2026 Updated 3 weeks ago
- Source code and data for Like a Good Nearest Neighbor ☆30 Jan 12, 2025 Updated last year
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆172 Nov 7, 2022 Updated 3 years ago
- scraping and querying documents for LLMs ☆24 Oct 6, 2025 Updated 8 months ago
- Extract embedded metadata from HTML markup ☆966 Apr 1, 2026 Updated 2 months ago
- The official ArangoDB async Python driver ☆13 Jun 1, 2026 Updated last week
- A tiny library for Python text normalisation. Useful for ad-hoc text processing. ☆157 Mar 8, 2026 Updated 3 months ago
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆110 May 16, 2024 Updated 2 years ago
- ☆15 Mar 11, 2024 Updated 2 years ago
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆35 May 24, 2024 Updated 2 years ago
- Deduplicate and parse list of 'dirty names' ☆22 Nov 4, 2020 Updated 5 years ago
- A Drafts App URL action to create gists. ☆20 Jan 27, 2013 Updated 13 years ago
- Python SDK for One AI APIs. One AI is an NLP-as-a-service platform. Our APIs enable language comprehension in context, transforming text… ☆38 Aug 24, 2023 Updated 2 years ago
- [EMNLP 2023] 💬 Language Identification with Support for More Than 2000 Labels ☆207 Apr 15, 2026 Updated last month
- A News Article Collection Library ☆22 Mar 31, 2023 Updated 3 years ago
- Neural models for detecting and masking personal information from texts ☆16 Nov 25, 2022 Updated 3 years ago
- Incorporates external dependencies into HTML file using data: URI scheme ☆21 Nov 17, 2011 Updated 14 years ago
- Implementation of NodeJS FS interface using Amazon Simple Storage Service (S3). ☆18 Sep 29, 2021 Updated 4 years ago
- ☆13 May 29, 2026 Updated 2 weeks ago
- The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03. ☆25 Jul 2, 2024 Updated last year