mpacula / AutoCorpus
AutoCorpus is a set of utilities that enable automatic extraction of language corpora and language models from publicly available datasets. AutoCorpus utilities follow the Unix design philosophy and integrate easily into custom data processing pipelines.
☆37Updated 13 years ago
Alternatives and similar repositories for AutoCorpus
Users who are interested in AutoCorpus are comparing it to the libraries listed below.
- finite-state toolkit, EM and Bayesian (Gibbs sampling) training for FST and context-free derivation forests☆41Updated 2 years ago
- Speech modeling using code by Kratarth Goel http://dblp.uni-trier.de/pers/hd/g/Goel:Kratarth☆9Updated 10 years ago
- A Recurrent Neural Network trained on all existing TED Talk Transcripts. The model outputs machine generated TED Talks.☆51Updated 7 years ago
- Generalized Language Modeling toolkit☆51Updated 3 years ago
- Speech Processing & Linguistic Analysis Tool☆11Updated 6 years ago
- NLTK Contrib☆166Updated last year
- This is the EllaVator project to build Ella the talking eleVator as part of a Saarland University software project class.☆17Updated 9 years ago
- The Community-enRiched Open WordNet (CROWN)☆18Updated 9 years ago
- Barista is an open-source framework for concurrent speech processing.☆36Updated 11 years ago
- Hadoop jobs for WikiReverse project. Parses Common Crawl data for links to Wikipedia articles.☆38Updated 6 years ago
- The Kyoto Language Modeling Toolkit☆27Updated 10 years ago
- Json Wikipedia: code to convert the Wikipedia XML dump into a JSON/Avro dump☆253Updated last year
- Fast Word Clustering Software☆78Updated 6 months ago
- Visualization for hidden Markov model computations☆14Updated 10 years ago
- Zurich Morphological Lexicon for German: a tool to extract a morphological lexicon from Wiktionary☆11Updated last year
- Utilities for manipulating finite state transducers with the OpenFst library.☆31Updated 7 years ago
- Excitement Open Platform for Recognizing Textual Entailments☆88Updated 7 years ago
- Uses a distributed word representation to find words along the hyperchord of two input words.☆102Updated 5 years ago
- pronunciation LEXicons for Any Low-resource Language☆21Updated 5 years ago
- Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic pr…☆70Updated last month
- English Dependency Relationship Extractor☆85Updated 7 months ago
- Bigram/trigram analysis of Wikipedia, mainly mutual information☆22Updated 13 years ago
- http://www.ark.cs.cmu.edu/ARKref/☆32Updated 11 years ago
- Compute the most likely permutation of a lattice given an LM☆10Updated 12 years ago
- A simple toolkit for speaker segmentation and identification☆30Updated 12 years ago
- Normalizes lexically ill-formed text to its most likely clean text, e.g. "c u thr 2nite!" -> "see you there tonight!".☆63Updated 9 years ago
- NLP tools developed by Emory University.☆60Updated 9 years ago
- Parsito: Fast non-projective transition-based dependency parser☆14Updated 2 years ago
- bilingual dictionary extractor from parallel corpora☆22Updated 11 years ago
- This is a fork of the Stanford Named Entity Recognizer with added support for deploying in Java servlet mode. See github.com/dat/pyner fo…☆90Updated 12 years ago