adbar / courlan
Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
☆141 · Updated 5 months ago
Alternatives and similar repositories for courlan
Users who are interested in courlan are comparing it to the libraries listed below.
- Fast and robust date extraction from web pages, with Python or on the command-line ☆130 · Updated 5 months ago
- A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-Python mode. ☆324 · Updated 6 months ago
- 💥 Use Hugging Face text and token classification pipelines directly in spaCy ☆63 · Updated last year
- A Python-based HTML to text conversion library, command-line client and web service. ☆311 · Updated 3 weeks ago
- Article extraction benchmark: dataset and evaluation scripts ☆317 · Updated last year
- Python port of Boilerpipe library ☆88 · Updated 10 months ago
- A component orchestration engine ☆28 · Updated last year
- Search for words, documents, images, videos, news and maps using the Brave search engine. Downloading files and images to a local hard dr… ☆62 · Updated last year
- 🔢 Work with static vector models ☆28 · Updated 2 months ago
- SpaCyEx allows the creation of spaCy Matcher patterns with RegEx like syntax. ☆59 · Updated last year
- 🦦 weasel: A small and easy workflow system ☆84 · Updated 11 months ago
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆284 · Updated last month
- Pinecone text client library ☆62 · Updated 3 months ago
- Next-generation Punkt sentence boundary detection with zero dependencies ☆17 · Updated 2 months ago
- A Python utility for downloading Common Crawl data ☆240 · Updated 2 years ago
- Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further lang… ☆124 · Updated last year
- Parse numbers written in natural language ☆117 · Updated 8 months ago
- 80x faster and 95% accurate language identification with Fasttext ☆157 · Updated last year
- Efficient few-shot learning with cross-encoders. ☆53 · Updated last year
- spaCy-wrap is a wrapper library for spaCy for including fine-tuned transformers from Huggingface in your spaCy pipeline allowing you to i… ☆46 · Updated last year
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 · Updated 5 years ago
- An AI extension for IPython that makes it work like Cursor ☆67 · Updated 5 months ago
- This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-s… ☆216 · Updated 5 months ago
- An open-source package for Python to clean raw text data ☆70 · Updated last year
- 🖍️ Highlight text in documents ☆109 · Updated 2 months ago
- Information extraction from English and German texts based on predicate logic ☆137 · Updated 2 years ago
- This repository provides usage examples for the Python module Newspaper3k. ☆147 · Updated last year
- ☆55 · Updated last year
- Extract text from HTML ☆134 · Updated 4 years ago
- Index Common Crawl archives in tabular format ☆122 · Updated last month