Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
☆169 · Dec 19, 2025 · Updated 4 months ago
Alternatives and similar repositories for courlan
Users that are interested in courlan are comparing it to the libraries listed below.
- Fast and robust date extraction from web pages, with Python or on the command-line ☆148 · Nov 4, 2025 · Updated 5 months ago
- Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XM… ☆5,807 · Sep 12, 2025 · Updated 7 months ago
- Remove DIVs, style stuff and normalize HTML preserving structure information ☆14 · Oct 24, 2025 · Updated 6 months ago
- A Python-based HTML to text conversion library, command line client and Web service. ☆341 · Feb 27, 2026 · Updated 2 months ago
- Alternative robots parser module for Python ☆22 · Apr 8, 2026 · Updated 3 weeks ago
- A helper library full of URL-related heuristics. ☆76 · Feb 11, 2026 · Updated 2 months ago
- Searching in-memory corpus with Corpus Query Language (CQL) ☆19 · Dec 2, 2024 · Updated last year
- Targeted language identifier, based on FastText and Hunspell. ☆38 · Sep 4, 2025 · Updated 7 months ago
- texrex web page cleaning & ClaraX random walk crawler ☆11 · Dec 13, 2021 · Updated 4 years ago
- ☆22 · Feb 6, 2026 · Updated 2 months ago
- pytest-patterns is a plugin for pytest that provides a pattern matching engine optimized for testing. ☆27 · Oct 23, 2024 · Updated last year
- Backend, IA-specific tools for crawling and processing the scholarly web. Content ends up in https://fatcat.wiki ☆28 · Jul 31, 2024 · Updated last year
- encode and decode between polylines and geojson ☆13 · Dec 27, 2025 · Updated 4 months ago
- Adds read support for Excel files (xls and xlsx) to agate. ☆18 · Mar 27, 2026 · Updated last month
- GC4LM: A Colossal (Biased) language model for German ☆13 · May 2, 2021 · Updated 5 years ago
- Source code and data for Like a Good Nearest Neighbor ☆30 · Jan 12, 2025 · Updated last year
- A spaCy wrapper of Entity-Fishing (component) for named entity disambiguation and linking on Wikidata ☆170 · Nov 7, 2022 · Updated 3 years ago
- scraping and querying documents for LLMs ☆24 · Oct 6, 2025 · Updated 6 months ago
- Extract embedded metadata from HTML markup ☆962 · Apr 1, 2026 · Updated last month
- Provides painless access to namespaced environment variables. ☆13 · Apr 20, 2021 · Updated 5 years ago
- A tiny library for Python text normalisation. Useful for ad-hoc text processing. ☆157 · Mar 8, 2026 · Updated last month
- [EMNLP 2023 Demo] fabricator - annotating and generating datasets with large language models. ☆110 · May 16, 2024 · Updated last year
- SMASHED is a toolkit designed to apply transformations to samples in datasets, such as fields extraction, tokenization, prompting, batchi… ☆35 · May 24, 2024 · Updated last year
- A Drafts App URL action to create gists. ☆20 · Jan 27, 2013 · Updated 13 years ago
- Python SDK for One AI APIs. One AI is an NLP-as-a-service platform. Our APIs enable language comprehension in context, transforming text… ☆38 · Aug 24, 2023 · Updated 2 years ago
- A TUI for Managing and Searching with Meilisearch ☆20 · Aug 26, 2025 · Updated 8 months ago
- Music generation using Elementary Cellular Automata. ☆13 · Nov 23, 2015 · Updated 10 years ago
- Scalable task execution orchestrator for CodeOcean. ☆10 · Mar 1, 2026 · Updated 2 months ago
- Neural models for detecting and masking personal information from texts ☆16 · Nov 25, 2022 · Updated 3 years ago
- news-please - an integrated web crawler and information extractor for news that just works ☆2,443 · Apr 14, 2026 · Updated 2 weeks ago
- 🧹 Python package for text cleaning ☆1,010 · Jan 28, 2026 · Updated 3 months ago
- ☆13 · Apr 10, 2026 · Updated 3 weeks ago
- The CleanCoNLL dataset from our EMNLP 2023 paper where we corrected annotation errors and inconsistencies in CoNLL-03. ☆25 · Jul 2, 2024 · Updated last year
- a collection of scripts to maintain a server running rtorrent, autodl-irssi, sonarr, radarr, rclone upload to Google Drive, plexdrive, Pl… ☆12 · Aug 26, 2018 · Updated 7 years ago
- Build applications that make decisions (chatbots, agents, simulations, etc...). Monitor, trace, persist, and execute on your own infrastr… ☆1,980 · Updated this week
- Curated list of apps and tools that not only use the new ChatGPT API, but also allow users to configure their own API keys, enabling free… ☆34 · May 17, 2023 · Updated 2 years ago
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆17 · Sep 21, 2023 · Updated 2 years ago
- Semantic Search + Keyword Search + Hybrid Search + Filtering + Faceting on 300K HN Comments ☆57 · Dec 16, 2024 · Updated last year
- Simple GUI to load a PDF/Docx/txt file and have LM Studio Answer based off of it. ☆14 · Jul 31, 2024 · Updated last year