spencermountain / dumpster-dive
roll a wikipedia dump into mongo
☆241 · Updated 7 months ago
Alternatives and similar repositories for dumpster-dive:
Users interested in dumpster-dive are comparing it to the libraries listed below.
- a pretty-committed wikipedia markup parser ☆793 · Updated 2 weeks ago
- varied english texts for modern NLP testing ☆75 · Updated 2 years ago
- 🎀 JavaScript API for spaCy with Python REST API ☆196 · Updated last year
- Expose Spacy nlp text parsing to Nodejs (and other languages) via socketIO ☆225 · Updated 2 years ago
- ⚙️ [Processor] A better English POS tagger written in JavaScript ☆53 · Updated 7 years ago
- English NLP for Node.js and the browser. ☆89 · Updated last year
- JS utils functions to query a Wikibase instance and simplify its results ☆328 · Updated 4 months ago
- Index Common Crawl archives in tabular format ☆110 · Updated 3 months ago
- LDA topic modeling for node.js ☆294 · Updated 6 months ago
- wpcorpus - NLP corpus based on Wikipedia's full article dump ☆97 · Updated 9 years ago
- NLP Functions for amplifying negations, managing elisions, creating ngrams, stems, phonetic codes to tokens and more. ☆125 · Updated 11 months ago
- WordNet in JSON format. ☆90 · Updated 4 years ago
- command-line tool to extract taxonomies from Wikidata ☆126 · Updated 5 years ago
- text mining utilities for Node.js ☆141 · Updated 2 years ago
- read and edit a Wikibase instance from the command line ☆230 · Updated this week
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework ☆166 · Updated 2 years ago
- fasttag part of speech tagger javascript implementation ☆279 · Updated 4 years ago
- Sentence Boundary Detection in javascript for node. http://tessmore.github.io/sbd/ ☆209 · Updated last year
- wordpos for the web/browser ☆43 · Updated 3 years ago
- CoreNLP @ NodeJS ☆65 · Updated 2 years ago
- Mechanical Turk on your own machine. ☆205 · Updated 3 months ago
- Json Wikipedia, contains code to convert the Wikipedia xml dump into a json/avro dump ☆253 · Updated last year
- spaCy REST API, wrapped in a Docker container. ☆266 · Updated 2 years ago
- ☆97 · Updated 3 years ago
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆188 · Updated 6 years ago
- A Wordnet API in pure JavaScript ☆108 · Updated 2 years ago
- displaCy-ent.js: An open-source named entity visualiser for the modern web ☆199 · Updated 6 years ago
- A client for the Stanford Part of Speech Tagger XMLRPC server. ☆72 · Updated 7 years ago
- Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head ☆171 · Updated 4 years ago
- FastText for Node.js ☆196 · Updated last year