newca12 / dictionary-builder
Real-world example demonstrating advanced techniques for unmarshalling very large XML documents with a very low memory footprint.
☆58 · Updated last year
Alternatives and similar repositories for dictionary-builder:
Users who are interested in dictionary-builder are comparing it to the libraries listed below:
- Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. Fo… ☆97 · Updated last week
- Simple multilingual lemmatizer for Python, especially useful for speed and efficiency ☆151 · Updated 2 months ago
- ☆44 · Updated 2 years ago
- Put together a multilingual corpus from a variety of sources. Used for wordfreq and word embeddings. ☆51 · Updated 3 years ago
- German part-of-speech dictionary ☆43 · Updated last year
- Java Wiktionary Library ☆57 · Updated 2 years ago
- An advanced, extensible web front-end for the Manatee-open corpus search engine ☆63 · Updated this week
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆61 · Updated 8 months ago
- CLI tool for importing entities from Wikidata / Wikibase ☆23 · Updated 2 years ago
- Generation of bilingual dictionaries from Wiktionary/dbnary data for the WikDict project ☆45 · Updated 2 months ago
- Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic pr… ☆66 · Updated last month
- Search engine benchmark (Tantivy, Lucene, PISA, ...) ☆79 · Updated 2 months ago
- Pure Rust port of CRFsuite: a fast implementation of Conditional Random Fields (CRFs) ☆29 · Updated 2 months ago
- A Rust library for reading and writing WARC files ☆46 · Updated last month
- Helsinki Finite-State Technology (library and application suite) ☆125 · Updated this week
- This is a new backend implementation of the ANNIS linguistic search and visualization system. ☆17 · Updated this week
- ISO 639 library for Python ☆32 · Updated 4 months ago
- Offline bilingual dictionaries made using data from Wiktionary ☆52 · Updated 9 years ago
- Python package for harvesting records from OAI-PMH provider(s). ☆62 · Updated 2 years ago
- Context-sensitive word embeddings with subwords. In Rust. ☆86 · Updated last year
- 📂 Additional lookup tables and data resources for spaCy ☆99 · Updated last year
- A set of utilities for processing MediaWiki XML dump data. ☆49 · Updated 5 months ago
- Loadable spellfix1 extension for sqlite as python package ☆25 · Updated 8 months ago
- Rust crate for entity parsing ☆16 · Updated 2 years ago
- An intelligent reading agent that understands text and translates it into Wikidata statements. ☆113 · Updated 8 years ago
- A general purpose processing framework for corpora of scientific documents ☆58 · Updated 8 months ago
- wabac.js - Web Archive Browsing Augmentation Client ☆104 · Updated 2 weeks ago
- Stand-off Text Annotation Model (STAM) is a data model for stand-off-text annotation where any information on a text is represented as an… ☆17 · Updated 2 months ago
- The Global WordNet Association Collaborative Inter-Lingual Index ☆41 · Updated 2 months ago
- Wiktionary parser tool for many language editions. ☆53 · Updated 2 years ago