DBeath / feedsearch-crawler
Crawl sites for RSS, Atom, and JSON feeds.
☆67 · Updated 7 months ago
Alternatives and similar repositories for feedsearch-crawler:
Users who are interested in feedsearch-crawler are comparing it to the libraries listed below.
- Find rss, atom, xml, and rdf feeds on webpages ☆30 · Updated 3 months ago
- Search sites for RSS, Atom, and JSON feeds. ☆18 · Updated 2 years ago
- Extract text from HTML ☆133 · Updated 4 years ago
- This Python package can be used to systematically extract multiple data elements (e.g., title, keywords, text) from news sources around t… ☆32 · Updated last year
- ☆13 · Updated 5 years ago
- A News Article Collection Library ☆22 · Updated last year
- This is a proof-of-concept of using an LLM to find and extract meaningful data without parsing the html too much. ☆28 · Updated last year
- 📖👓🏷 Tag your getpocket.com articles automatically using natural language processing ☆43 · Updated 5 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆42 · Updated 5 years ago
- Python port of Boilerpipe library ☆86 · Updated 4 months ago
- Fast and robust date extraction from web pages, with Python or on the command-line ☆121 · Updated 2 weeks ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆132 · Updated 2 weeks ago
- Building a Job Dataset ☆21 · Updated 2 years ago
- A Scrapy middleware for scraping time series data from Archive.org's Wayback Machine. ☆111 · Updated 11 months ago
- Add website scraping abilities to Datasette ☆62 · Updated last year
- Gets your upvoted posts from Hacker News and imports them to raindrop.io ☆25 · Updated last year
- This repository provides usage examples for the Python module Newspaper3k. ☆144 · Updated last year
- Wikidata's QRank as a SQLite DB. ☆28 · Updated last year
- A simple HTML content extractor in Python. Can be run as a wrapper for Mozilla's Readability.js package or in pure-python mode. ☆253 · Updated last month
- Dockerfile to run n8n (automation) on Dokku (mini-Heroku) ☆19 · Updated this week
- API - extract a list of keywords from a text. ☆18 · Updated 7 years ago
- Easy extraction of keywords and engines from search engine results pages (SERPs). ☆90 · Updated 3 years ago
- Python package for converting xml and epubs to text files ☆34 · Updated 4 years ago
- News API - fetch news from CommonCrawl, parse with NewsPlease, enrich with pre-trained machine-learning models, to structured searchable … ☆28 · Updated 2 years ago
- Generate product descriptions, blogs, ads and more using GPT architecture with a single request to TextCortex API a.k.a Hemingwai ☆40 · Updated 2 years ago
- Datasette plugin for rendering Markdown ☆26 · Updated last year
- 📑 Scripts to repair, verify, OCR, compress, wrangle, crop (etc.) PDFs ☆63 · Updated 8 months ago
- Python code to scrape and collect data from the RSS feeds Facebook uses to augment its Trending Section ☆57 · Updated 6 years ago
- Tag news stories based on models trained on the NYT corpus. ☆42 · Updated last year