daveshap / PlainTextWikipedia
Convert Wikipedia database dumps into plaintext files
☆320 Updated 4 years ago
Alternatives and similar repositories for PlainTextWikipedia
Users that are interested in PlainTextWikipedia are comparing it to the libraries listed below
- Semantic search through a vectorized Wikipedia (SentenceBERT) with the Weaviate vector search engine ☆242 Updated 2 years ago
- Python code for building a GPT-3 based technical blog post optimizer. ☆85 Updated 2 years ago
- ☆81 Updated 6 years ago
- Index Common Crawl archives in tabular format ☆122 Updated last month
- A python utility for downloading Common Crawl data ☆240 Updated 2 years ago
- Reddit takeout: export your account data as JSON: comments, submissions, upvotes etc. 🦖 ☆171 Updated 7 months ago
- Code for the paper "Language Models are Unsupervised Multitask Learners" ☆108 Updated 3 years ago
- Download subreddit comments ☆94 Updated 3 years ago
- Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. Fo… ☆102 Updated last month
- Code for the paper: "Large Language Models as Corporate Lobbyists" (2023). ☆171 Updated 2 years ago
- A library to extract a publication date from a web page, along with a measure of the accuracy. ☆41 Updated 5 years ago
- Streaming WARC/ARC library for fast web archive IO ☆416 Updated 6 months ago
- Conversational text Analysis using various NLP techniques ☆180 Updated 2 years ago
- archive reddit data as offline friendly web pages ☆175 Updated 4 years ago
- Chat interface to gpt-j. Runs in Google Colab. ☆57 Updated last year
- Example scripts for the pushshift dump files ☆371 Updated last week
- Sick of that "Save as PDF" link on Wikipedia? Why not just have Python do it for you? ☆27 Updated 5 years ago
- GPT2Explorer brings an OpenAI GPT2 language model playground that runs locally on standard Windows computers. ☆29 Updated 2 years ago
- The subreddit archiver ☆178 Updated last year
- Stylometry library for Burrows' Delta method ☆42 Updated last year
- ☆60 Updated 2 years ago
- YT_subtitles - extracts subtitles from YouTube videos to raw text for Language Model training ☆43 Updated 4 years ago
- Python package to easily retrain OpenAI's GPT-2 text-generating model on new texts ☆18 Updated 4 years ago
- A multithread Pushshift.io API Wrapper for reddit.com comment and submission searches. ☆221 Updated 2 years ago
- Inference code for LLaMA models ☆188 Updated 2 years ago
- A Reddit bot that generates new context-aware comments using Markov chains trained from a set of given users or subreddits comments histo… ☆73 Updated 3 years ago
- Neural Search ☆332 Updated last year
- Python Pushshift.io API Wrapper (for comment/submission search) ☆361 Updated 2 years ago
- 🖼 A python package to download all the public posts of an Instagram account. (deprecated) ☆20 Updated 4 years ago
- Library of Alexandria (LoA in short) is a project that aims to collect and archive documents from the internet. ☆117 Updated 11 months ago