jon-edward / wiki_dump
A library that assists in traversing and downloading from Wikimedia Data Dumps and their mirrors.
☆11 Updated last year
Alternatives and similar repositories for wiki_dump
Users who are interested in wiki_dump are comparing it to the libraries listed below.
- The AI Knowledge Editor ☆184 Updated 3 years ago
- Libraries, Archives and Museums (LAM) ☆88 Updated 3 years ago
- Crowd-sourced lists of urls to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ … ☆69 Updated last month
- Coreference resolution for English, French, German and Polish, optimised for limited training data and easily extensible for further lang… ☆134 Updated last year
- 🧌 Parsing structured information from OCR outputs ☆20 Updated 2 years ago
- spaCy extension for Visual Studio Code ☆31 Updated 11 months ago
- Tag grants with MeSH and other tags ☆17 Updated 2 years ago
- Citron is an experimental quote extraction system created by BBC R&D ☆36 Updated 4 years ago
- Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health. ☆92 Updated 4 years ago
- Email Datasets can be found here ☆80 Updated last month
- Data and information related to the Books3 dataset included as part of The Pile, and used to train Meta's LLaMA among others ☆35 Updated 9 months ago
- Information extraction from English and German texts based on predicate logic ☆141 Updated 2 years ago
- My personal frontpage app ☆108 Updated last week
- A public release of TimelineBuilder for building personal digital data timelines. ☆370 Updated last year
- ☆12 Updated 3 weeks ago
- ☆51 Updated 7 months ago
- CAP database scripts. ☆192 Updated last year
- Code for collecting, processing, and preparing datasets for the Common Pile ☆249 Updated 5 months ago
- Pipeline to generate the Standardized Project Gutenberg Corpus ☆208 Updated 2 years ago
- Python SDK for Galileo's NLP and CV Studio. ☆17 Updated last week
- 🍏 Make Thinc faster on macOS by calling into Apple's native Accelerate library ☆103 Updated 7 months ago
- A spaCy wrapper for GliNER ☆129 Updated last year
- Statistics of Common Crawl monthly archives mined from URL index files ☆208 Updated last week
- Web-scale retrieval for knowledge-intensive NLP ☆554 Updated 3 years ago
- An on-going dataset consisting of hashtags, n-gram counts and other misc NLP things for covid-19 analysis, stemming from over 100 000 000… ☆59 Updated 3 years ago
- Find legal citations in any block of text ☆208 Updated 4 months ago
- GPU-Powered Topic Modelling ☆70 Updated 3 years ago
- Tools to construct and process Common Crawl webgraphs ☆105 Updated last week
- Code and data to support "Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4" ☆69 Updated 2 years ago
- Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters ☆158 Updated last month