daveshap / PlainTextWikipedia
Convert Wikipedia database dumps into plaintext files
☆311 Updated 3 years ago
Alternatives and similar repositories for PlainTextWikipedia:
Users who are interested in PlainTextWikipedia are comparing it to the libraries listed below.
- Nearly a thousand bash and python scripts I've written over the years. ☆120 Updated 3 weeks ago
- Sick of that "Save as PDF" link on Wikipedia? Why not just have Python do it for you? ☆27 Updated 5 years ago
- Dolores is a Python library designed to improve the developer experience when working with pretrained language models. Dolores provides p… ☆34 Updated 4 years ago
- A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches. ☆215 Updated last year
- Conversational text Analysis using various NLP techniques ☆181 Updated last year
- A Reddit bot that generates new context-aware comments using Markov chains trained from a set of given users or subreddits comments histo… ☆73 Updated 3 years ago
- Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet. ☆114 Updated 7 months ago
- Example scripts for the pushshift dump files ☆326 Updated last week
- The world's largest profanity list. ☆214 Updated 10 months ago
- Downloader for submissions to reddit.com. Supports both subreddits and users. ☆48 Updated 4 years ago
- An on-going dataset consisting of hashtags, n-gram counts and other misc NLP things for covid-19 analysis, stemming from over 100 000 000… ☆57 Updated 3 years ago
- Reddit image scraper made in Python ☆47 Updated 2 years ago
- The world's largest social media toxicity dataset. ☆177 Updated 2 years ago
- ☆79 Updated 6 years ago
- The subreddit archiver ☆173 Updated last year
- The reddit Data Extractor is a cross-platform GUI tool for downloading almost any content posted to reddit. Downloads from specific users… ☆234 Updated 2 months ago
- Workshop material for the AMLD 2020 workshop on "Meet your Artificial Self: Generate text that sounds like you" ☆81 Updated last year
- Reddit archiver ☆164 Updated last year
- Download subreddit comments ☆93 Updated 2 years ago
- Distributed crawler, database and web frontend for public directories indexing ☆139 Updated 5 years ago
- A python utility for downloading Common Crawl data ☆232 Updated last year
- Unreliable News Index (for Columbia Journalism Review) ☆56 Updated 3 years ago
- 📊 Semantic search for headlines and story text ☆359 Updated last year
- Code for the paper: "Large Language Models as Corporate Lobbyists" (2023). ☆171 Updated 2 years ago
- A broad family of utilities for organising files based on hierarchical tagging, from web server to a computer vision dataset creation pip… ☆97 Updated 4 years ago
- Offline Internet Archive project ☆281 Updated last year
- Data and information related to the Books3 dataset included as part of The Pile, and used to train Meta's LLaMA among others ☆26 Updated this week
- archive reddit data as offline friendly web pages ☆171 Updated 4 years ago
- A command-line interface to generate textual and conversational datasets with LLMs. ☆294 Updated last year
- Automation for the serious data hoarder that wants to have their data and use it ☆100 Updated 3 years ago