mpacula / AutoCorpus
AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. AutoCorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.
☆37 Updated 12 years ago
Related projects
Alternatives and complementary repositories for AutoCorpus
- A visualizer for multi-dimensional semantic data ☆38 Updated 13 years ago
- Uses a distributed word representation to find words along the hyperchord of two input words. ☆101 Updated 4 years ago
- Generalized Language Modeling toolkit ☆51 Updated 2 years ago
- Hierarchical phrase-based machine translation system ☆32 Updated 9 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles. ☆38 Updated 6 years ago
- Standalone Semanticizer ☆32 Updated 9 years ago
- Basic dataset for the linguistic data collection. ☆15 Updated 7 years ago
- A Python library for learning from dimensionality reduction, supporting sparse and dense matrices. ☆78 Updated 7 years ago
- Compute association strength over semantic networks in a dimensionality-reduced form. ☆33 Updated 9 years ago
- A web application for exploring documents topically. ☆26 Updated 8 years ago
- Updates to Zope's keyphrase extractor (forked from 1.1.0) ☆67 Updated 7 years ago
- Speech modeling using code by Kratarth Goel http://dblp.uni-trier.de/pers/hd/g/Goel:Kratarth ☆9 Updated 9 years ago
- Read natural language interactive queries. Great for bots. ☆18 Updated 8 years ago
- rapid nlp prototyping ☆72 Updated 2 years ago
- Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps. ☆158 Updated 2 years ago
- Turning Javascript into a probabilistic programming language ☆58 Updated 7 years ago
- Semanticizest: dump parser and client ☆20 Updated 8 years ago
- Parsito: Fast non-projective transition-based dependency parser ☆14 Updated last year
- Random fun with statistical language models. ☆65 Updated 5 years ago
- This is a fork of the Stanford Named Entity Recognizer with added support for deploying in Java servlet mode. See github.com/dat/pyner fo… ☆90 Updated 11 years ago
- Open-source tools for morphological tagging, segmentation and stemming. ☆41 Updated 5 years ago
- A Recurrent Neural Network trained on all existing TED Talk Transcripts. The model outputs machine generated TED Talks. ☆50 Updated 6 years ago
- Topic Modelling the Enron Emails ☆22 Updated 12 years ago
- Grapheme to phoneme toolkit using joint-modelling + CRFs in java ☆13 Updated 6 years ago
- An implementation of word2vec applied to [stanford philosophy encyclopedia](http://plato.stanford.edu/) ☆35 Updated 8 years ago
- NLTK Contrib ☆166 Updated 8 months ago
- Lightweight, multilingual natural language processing ☆63 Updated 11 years ago
- Jitar HMM part of speech tagger ☆22 Updated 8 years ago
- ☆62 Updated 10 years ago
- bigram / trigram analysis of wikipedia; mainly mutual info ☆22 Updated 12 years ago