rcarmo / newsfeed-corpus
A Dockerized RSS feed fetcher for NLP work, using asyncio
☆20Updated 3 years ago
Alternatives and similar repositories for newsfeed-corpus
Users who are interested in newsfeed-corpus are comparing it to the libraries listed below:
- An OPML file with the RSS feeds of 22 of the top 25 US newspapers☆56Updated 7 years ago
- Save data from Google Takeout to a SQLite database☆118Updated 2 years ago
- DIY Atom feeds in times of social media and paywalls☆85Updated last year
- Check out https://github.com/webrecorder/webrecorder for the newer version matching https://webrecorder.io☆38Updated 10 years ago
- Tag-based bookmark manager inspired by delicious and Pinboard☆34Updated 3 years ago
- Python library for extracting HTML microdata☆167Updated 2 years ago
- A Python library to detect and extract listing data from an HTML page.☆108Updated 8 years ago
- Aviation grade news article metadata extraction☆36Updated 2 years ago
- Create and deploy a RESTful API with a few lines of YAML☆32Updated 7 years ago
- linkbak is a web page archiver: it reads a list of links and dumps the corresponding pages in HTML and PDF.☆13Updated 3 years ago
- An eBook tool to extract ISBNs or metadata from eBooks and rename them using an ISBN database and metadata☆29Updated 10 years ago
- A dockerized, queued high fidelity web archiver based on Squidwarc☆61Updated last year
- Personal Knowledge Management System. Capture your ideas using plain old text files. Make a journal that lasts 100 years.☆29Updated 2 years ago
- Your body's dashboard.☆94Updated last year
- Create a SQLite database containing data from your Pocket account☆107Updated 2 years ago
- CoCrawler is a versatile web crawler built using modern tools and concurrency.☆192Updated 3 years ago
- A simple Python wrapper for the archive.is capturing service☆210Updated 11 months ago
- An expandable and scalable OCR pipeline☆89Updated 8 years ago
- Code and data belonging to our CSCW 2019 paper: "Dark Patterns at Scale: Findings from a Crawl of 11K Shopping Websites".☆136Updated 6 years ago
- Automatically extracts and normalizes an online article or blog post publication date☆117Updated 2 years ago
- The missing datasets manager. Like Homebrew but for datasets. A CLI tool to search and discover datasets!☆41Updated 8 years ago
- A tiny library for Python text normalisation. Useful for ad-hoc text processing.☆157Updated 4 months ago
- A simple interface for extracting text from (almost) any URL☆53Updated 6 years ago
- 🗄 Bot powering the @LinkArchiver Twitter tool to send tweeted URLs to the Wayback Machine☆46Updated 8 years ago
- Lightweight web scraping toolkit for documents and structured data.☆315Updated 2 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.☆47Updated 8 years ago
- WarcMiddleware lets users seamlessly download a mirror copy of a website when running a web crawl with the Python web crawler Scrapy.☆47Updated 7 years ago
- Apify actor that opens a web page in headless Chrome and analyzes the HTML and JavaScript objects, looks for schema.org microdata and JSO…☆153Updated 2 years ago
- A queue-controlled browser automation tool for improving web crawl quality☆64Updated 5 months ago
- Grabbing all news.☆61Updated 6 years ago