newca12 / dictionary-builder
Real-world example demonstrating advanced techniques for unmarshalling very large XML documents with a very low memory footprint.
☆58 Updated last year
Related projects
Alternatives and complementary repositories for dictionary-builder
- Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. Fo… ☆94 Updated this week
- an approximate string matching or fuzzy-matching system for spelling correction, normalisation or post-OCR correction ☆31 Updated last month
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆61 Updated 6 months ago
- Generation of bilingual dictionaries from Wiktionary/dbnary data for the WikDict project ☆44 Updated 3 weeks ago
- Sort-friendly URI Reordering Transform (SURT) python module ☆40 Updated 3 months ago
- Put together a multilingual corpus from a variety of sources. Used for wordfreq and word embeddings. ☆52 Updated 3 years ago
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ☆144 Updated this week
- An LL parser for extracting information from Wiki text, particularly Wiktionary. ☆48 Updated last year
- Lexical data at Unicode ☆66 Updated 2 months ago
- A part-of-speech tagger with support for domain adaptation and external resources. ☆22 Updated 2 years ago
- German part-of-speech dictionary ☆43 Updated last year
- Python package for harvesting records from OAI-PMH provider(s). ☆62 Updated 2 years ago
- Archived Python/Rust hybrid codebase - see divvun/kbdgen for v3 ☆26 Updated 2 years ago
- A cloud-based, open-source system for writing and publishing dictionaries. ☆86 Updated 10 months ago
- Fast PDF generation and compression. Deals with millions of pages daily. ☆102 Updated 3 months ago
- A set of workflows for corpus building through OCR, post-correction and normalisation ☆48 Updated 2 years ago
- Java Wiktionary Library ☆57 Updated 2 years ago
- command-line tool to extract taxonomies from Wikidata ☆125 Updated 5 years ago
- Command line tool for digging into WARC files ☆34 Updated 3 weeks ago
- Search engine benchmark (Tantivy, Lucene, PISA, ...) ☆79 Updated last month
- Comparing warc files ☆15 Updated 5 years ago
- This is a new backend implementation of the ANNIS linguistic search and visualization system. ☆17 Updated last month
- Neural syntax annotator, supporting sequence labeling, lemmatization, and dependency parsing. ☆71 Updated last year
- PurePos is an open source hybrid morphological tagger. ☆15 Updated 4 years ago
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆42 Updated 6 years ago
- A tool to analyse, browse and query Wikidata ☆84 Updated last month
- The code, training pipeline, and models that power Firefox Translations ☆155 Updated this week
- Pure Rust port of CRFsuite: a fast implementation of Conditional Random Fields (CRFs) ☆29 Updated 3 weeks ago
- A database of languages and their Wikidata id, Wikimedia language code, ISO 639-1, ISO 639-2, ISO 639-3, ISO 639-6 codes ☆16 Updated 4 months ago
- Link Wikidata items to large catalogs ☆96 Updated 8 months ago