rcarmo / newsfeed-corpus
A Dockerized RSS feed fetcher for NLP work, using asyncio
☆20 Updated 3 years ago
Alternatives and similar repositories for newsfeed-corpus
Users who are interested in newsfeed-corpus are comparing it to the repositories listed below
- An OPML file with 22 of the top 25 US newspapers' RSS feeds ☆56 Updated 7 years ago
- Aviation-grade news article metadata extraction ☆36 Updated 2 years ago
- Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner") ☆109 Updated this week
- A dockerized, queued, high-fidelity web archiver based on Squidwarc ☆61 Updated last year
- A self-hosted news reader. ☆473 Updated this week
- Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema.org microdata and JSO… ☆153 Updated 2 years ago
- Your body's dashboard. ☆94 Updated 10 months ago
- Organize your meme image cluster in a better format using OCR from the meme to sort them using tesseract along with editing memes by segm … ☆80 Updated 2 years ago
- Personal Knowledge Management System. Capture your ideas using plain old text files. Make a journal that lasts 100 years. ☆29 Updated last year
- FeedCrunch.IO - Take RSS Feeds to the next level with personalized recommendations ☆15 Updated 3 years ago
- Lightweight web scraping toolkit for documents and structured data. ☆314 Updated last year
- DIY Atom feeds in times of social media and paywalls ☆85 Updated last year
- Python library for extracting HTML microdata ☆166 Updated 2 years ago
- WarcMiddleware lets users seamlessly download a mirror copy of a website when running a web crawl with the Python web crawler Scrapy. ☆47 Updated 7 years ago
- Web RSS aggregator and reader compatible with the Fever API ☆147 Updated last year
- Tag-based bookmark manager inspired by delicious and Pinboard ☆34 Updated 3 years ago
- Automatic text summarization ☆243 Updated 6 years ago
- Word analysis, by domain, on the Common Crawl data set for the purpose of finding industry trends ☆57 Updated last year
- CoCrawler is a versatile web crawler built using modern tools and concurrency. ☆189 Updated 3 years ago
- Primary LocalWiki backend server environment ☆47 Updated 7 years ago
- Wikipedia citation tool for Google Books, New York Times, ISBN, DOI and more ☆22 Updated 8 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 6 years ago
- Paginating the web ☆37 Updated 11 years ago
- Analyze topics and trends in news with NLP ☆48 Updated 2 years ago
- linkbak is a web page archiver: it reads a list of links and dumps the corresponding pages in HTML and PDF. ☆13 Updated 2 years ago
- Now included in rigour ☆152 Updated last month
- A list of things related to software, literature, and other content for 🕣 Memento ☆99 Updated last year
- 🗄 Bot powering the @LinkArchiver Twitter tool to send tweeted URLs to the Wayback Machine ☆46 Updated 8 years ago
- Web archiving using Google Chrome ☆47 Updated 5 years ago
- A component-based data flow framework with a drag-n-drop Web 2.0 interface. Based on Stackless Python and inspired by Yahoo! Pipes. ☆150 Updated 13 years ago