spencermountain / dumpster-dive
roll a wikipedia dump into mongo
☆243 · Updated 11 months ago
Alternatives and similar repositories for dumpster-dive
Users interested in dumpster-dive are comparing it to the libraries listed below.
- a pretty-committed wikipedia markup parser (☆815, updated last month)
- tools for working with Princeton's lexical database WordNet (☆73, updated 6 years ago)
- command-line tool to extract taxonomies from Wikidata (☆126, updated 6 years ago)
- English NLP for Node.js and the browser (☆87, updated last year)
- Creates a Neo4j graph of Wikipedia links (☆255, updated 7 years ago)
- Index Common Crawl archives in tabular format (☆122, updated last month)
- NLP functions for amplifying negations, managing elisions, creating ngrams, stems, phonetic codes to tokens and more (☆131, updated last year)
- Streaming WARC/ARC library for fast web archive IO (☆416, updated 6 months ago)
- text mining utilities for Node.js (☆141, updated 2 years ago)
- LDA topic modeling for Node.js (☆297, updated 10 months ago)
- Multilingual tokenizer that automatically tags each token with its type (☆62, updated 2 years ago)
- Json Wikipedia: code to convert the Wikipedia XML dump into a JSON/Avro dump (☆253, updated last year)
- Node bindings for Annoy, an efficient Approximate Nearest Neighbors implementation written in C++ (☆82, updated last year)
- Word embeddings for the web (☆28, updated 2 years ago)
- Get n-grams from text (☆82, updated 2 years ago)
- A semi-unsupervised, language-independent morphological analyzer useful for stemming unknown-language text, or getting a rough estimate of… (☆21, updated 7 years ago)
- 🎀 JavaScript API for spaCy with Python REST API (☆197, updated last year)
- CLDR text segmentation for JavaScript (☆38, updated last year)
- WordNet database files (previously WNdb) (☆216, updated 5 years ago)
- A modular annotation system that supports complex, interactive annotation graphs embedded on top of sequences of text (☆95, updated 3 years ago)
- varied English texts for modern NLP testing (☆75, updated 3 years ago)
- Visualize Wikidata items using d3.js (☆198, updated 2 months ago)
- Filter and format a newline-delimited JSON stream of Wikibase entities (☆97, updated 2 weeks ago)
- Python package for Wikimedia dump processing (Wiktionary, Wikipedia etc.): wikitext parsing, template expansion, Lua module execution. Fo… (☆102, updated last month)
- Imports Wikidata JSON dumps into Neo4j in a meaningful way (☆66, updated 6 years ago)
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine (☆177, updated 5 months ago)
- Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium with or without a head (☆170, updated 5 years ago)
- English lemmatizer (☆67, updated 2 years ago)
- Demonstration of using Python to process the Common Crawl dataset with the mrjob framework (☆167, updated 3 years ago)
- Article extraction benchmark: dataset and evaluation scripts (☆317, updated last year)