DBeath / feedsearch-crawler
Crawl sites for RSS, Atom, and JSON feeds.
☆87 Updated last week
Alternatives and similar repositories for feedsearch-crawler
Users that are interested in feedsearch-crawler are comparing it to the libraries listed below
- LLM plugin for embeddings using sentence-transformers ☆74 Updated 9 months ago
- A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode. ☆352 Updated last year
- Unofficial wrapper for Substack's API ☆159 Updated 2 months ago
- This is a proof-of-concept of using an LLM to find and extract meaningful data without parsing the html too much. ☆30 Updated 2 years ago
- Find rss, atom, xml, and rdf feeds on webpages ☆31 Updated 2 months ago
- Add website scraping abilities to Datasette ☆66 Updated 2 years ago
- Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more ☆20 Updated 7 years ago
- 🥐 Open-source LLM-friendly Markdown/JSON generator ☆94 Updated 3 weeks ago
- https://verdad.app ☆84 Updated last week
- A fork of Dragnet that also extracts author, headline, date, keywords from context, as well as built-in metadata extraction all in one pac… ☆298 Updated 8 months ago
- A Firefox and Google Chrome extension to clip websites and download them into a readable markdown file. ☆41 Updated 7 years ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆158 Updated last month
- Search for words, documents, images, videos, news and maps using the Brave search engine. Downloading files and images to a local hard dr… ☆78 Updated 6 months ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆47 Updated 8 years ago
- Extract text from HTML ☆134 Updated this week
- Scrape HN to track links from specific domains ☆72 Updated this week
- Automated behaviors that run in browser to interact with complex sites automatically. Used by ArchiveWeb.page and Browsertrix Crawler. ☆55 Updated 2 months ago
- Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative) ☆205 Updated last year
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 6 years ago
- Readable YouTube Transcripts using Gemini 1.5 Flash 8B ☆64 Updated 8 months ago
- Wikidata's QRank as a SQLite DB. ☆28 Updated 2 years ago
- Dataset of approximately 10,000 podcasts from iTunes. ☆91 Updated 7 years ago
- A Python script for downloading new MP3s from given RSS channels ☆141 Updated 10 months ago
- The little things give you away... A collection of various small helper stuff – Mirror repo only, no longer kept in sync, refer to gitea.… ☆24 Updated 5 years ago
- This repository provides usage examples for the Python module Newspaper3k. ☆150 Updated 2 years ago
- Tools for running enrichments against data stored in Datasette ☆26 Updated 2 months ago
- A collective list of free APIs ☆18 Updated 3 years ago
- A helper library full of URL-related heuristics. ☆73 Updated 4 months ago
- LLM access to pplx-api ☆39 Updated last month
- Use OpenAI Embeddings to visualize Kindle Highlights from Readwise! ☆30 Updated last year