mpacula / AutoCorpus
AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. AutoCorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.
☆37 Updated 12 years ago
Related projects
Alternatives and complementary repositories for AutoCorpus
- A visualizer for multi-dimensional semantic data ☆38 Updated 13 years ago
- A web application for exploring documents topically. ☆26 Updated 8 years ago
- Uses a distributed word representation to find words along the hyperchord of two input words. ☆101 Updated 4 years ago
- rapid nlp prototyping ☆72 Updated 2 years ago
- Compute association strength over semantic networks in a dimensionality-reduced form. ☆33 Updated 9 years ago
- Visualization for hidden Markov model computations ☆14 Updated 9 years ago
- Random fun with statistical language models. ☆65 Updated 5 years ago
- Turning Javascript into a probabilistic programming language ☆58 Updated 7 years ago
- Generalized Language Modeling toolkit ☆51 Updated 2 years ago
- Standalone Semanticizer ☆32 Updated 9 years ago
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests ☆41 Updated 2 years ago
- Speech modeling using code by Kratarth Goel http://dblp.uni-trier.de/pers/hd/g/Goel:Kratarth ☆9 Updated 9 years ago
- Discussion Summarization is the process of condensing a text document which is a collection of discussion threads, using CBS (Cluster Bas… ☆12 Updated 10 years ago
- Updates to Zope's keyphrase extractor (forked from 1.1.0) ☆67 Updated 7 years ago
- code referenced in "Towards universal neural nets: Gibbs machines and ACE", Galin Georgiev, http://arxiv.org/abs/1508.06585 ☆14 Updated 9 years ago
- A fork of the sofia ml machine learning program ☆14 Updated 13 years ago
- Read natural language interactive queries. Great for bots. ☆18 Updated 8 years ago
- Turbo topics find significant multiword phrases in topics. ☆46 Updated 9 years ago
- Topic modeling with first-order logic (FOL) domain knowledge ☆33 Updated 12 years ago
- clone of https://code.google.com/p/splitta/ so it can be a git submodule ☆34 Updated 11 years ago
- The Community-enRiched Open WordNet (CROWN) ☆19 Updated 8 years ago
- A Recurrent Neural Network trained on all existing TED Talk Transcripts. The model outputs machine generated TED Talks. ☆51 Updated 6 years ago
- MiTextExplorer - interactive browser of text and document covariates. ☆24 Updated 9 years ago
- NLTK Contrib ☆166 Updated 8 months ago
- Basic dataset for the linguistic data collection. ☆15 Updated 7 years ago
- Lightweight, multilingual natural language processing ☆63 Updated 11 years ago
- Theano implementation of the Neural GPU ☆15 Updated 8 years ago
- Embedding data into immersive environments ☆24 Updated 7 years ago
- Apache Pig utilities to build training corpora for machine learning / NLP out of public Wikipedia and DBpedia dumps. ☆158 Updated 2 years ago