spencermountain / dumpster-dive
roll a wikipedia dump into mongo
☆243 · Updated 9 months ago
Alternatives and similar repositories for dumpster-dive:
Users who are interested in dumpster-dive are comparing it to the libraries listed below.
- a pretty-committed wikipedia markup parser ☆805 · Updated 2 months ago
- Filter and format a newline-delimited JSON stream of Wikibase entities ☆97 · Updated 6 months ago
- command-line tool to extract taxonomies from Wikidata ☆126 · Updated 5 years ago
- Index Common Crawl archives in tabular format ☆117 · Updated last month
- 🎀 JavaScript API for spaCy with Python REST API ☆196 · Updated last year
- A command-line tool for using CommonCrawl Index API at http://index.commoncrawl.org/ ☆189 · Updated 6 years ago
- Scripts and microservice to feed an ElasticSearch with Wikidata and Inventaire entities, and keep those up-to-date ☆41 · Updated 4 years ago
- TextRank algorithm implementation in Javascript ☆41 · Updated 10 years ago
- creates a docker image with Virtuoso preloaded with the latest DBpedia dataset ☆126 · Updated 5 months ago
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework ☆166 · Updated 3 years ago
- Sentence Boundary Detection in javascript for node. http://tessmore.github.io/sbd/ ☆211 · Updated last year
- Expose Spacy nlp text parsing to Nodejs (and other languages) via socketIO ☆225 · Updated 2 years ago
- Multilingual tokenizer that automatically tags each token with its type ☆61 · Updated 2 years ago
- Creates a Neo4j graph of Wikipedia links. ☆255 · Updated 7 years ago
- varied english texts for modern NLP testing ☆75 · Updated 2 years ago
- NLP Functions for amplifying negations, managing elisions, creating ngrams, stems, phonetic codes to tokens and more. ☆126 · Updated last year
- AmbiverseNLU: A Natural Language Understanding suite by Max Planck Institute for Informatics ☆210 · Updated last year
- Visualize Wikidata items using d3.js ☆196 · Updated last week
- JS utils functions to query a Wikibase instance and simplify its results ☆328 · Updated 2 weeks ago
- Automatically extracts structured information from webpages ☆108 · Updated 2 years ago
- read and edit a Wikibase instance from the command line ☆230 · Updated last month
- Another next-generation event coding platform. ☆73 · Updated 6 years ago
- A modular annotation system that supports complex, interactive annotation graphs embedded on top of sequences of text. ☆95 · Updated 3 years ago
- Extraction of the journalistic five W and one H questions (5W1H) from news articles: who did what, when, where, why, and how? ☆518 · Updated 5 months ago
- CoreNLP @ NodeJS ☆66 · Updated 2 years ago
- Wikidata + GraphQL (Dream API for everything) ☆46 · Updated 2 years ago
- Wikidata client library for Python ☆353 · Updated 9 months ago
- Process Common Crawl data with Python and Spark ☆428 · Updated 2 months ago
- Json Wikipedia, contains code to convert the Wikipedia xml dump into a json/avro dump ☆253 · Updated last year
- Outputs a list of ranked DBpedia resources for a search string. ☆186 · Updated 3 years ago