commoncrawl / language-detection-cld2
Natural language detection, Java bindings for CLD2
☆14 · Updated last week
Related projects
Alternatives and complementary repositories for language-detection-cld2
- Lightning Fast Language Prediction ☆165 · Updated 5 years ago
- Lightning fast spell correction / fuzzy search library based on SymSpell by Commerce-Experts ☆80 · Updated 6 years ago
- fastText + Bloom embeddings for compact, full-coverage vectors with spaCy ☆287 · Updated last year
- Java port of SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm ☆65 · Updated 3 years ago
- Named Entity Recognition data for Europeana Newspapers ☆173 · Updated last year
- DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate… ☆50 · Updated 4 years ago
- Search relevance evaluation toolkit ☆73 · Updated 2 years ago
- Set of Jupyter notebooks demonstrating Learning to Rank integrated with Solr and Elasticsearch ☆165 · Updated 2 months ago
- Indra is a Web Service which allows easy access to different distributional semantics models in several languages. ☆47 · Updated 3 years ago
- Json Wikipedia, contains code to convert the Wikipedia xml dump into a json dump. Questions? https://gitter.im/idio-opensource/Lobby ☆17 · Updated 2 years ago
- Dice.com repo to accompany the dice.com 'Vectors in Search' talk by Simon Hughes, from the Activate 2018 search conference, and the 'Sear… ☆85 · Updated 3 years ago
- Index Common Crawl archives in tabular format ☆106 · Updated this week
- Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system. ☆42 · Updated 6 years ago
- CuVS integration for Lucene ☆29 · Updated 5 months ago
- Java implementation of the TextRank algorithm by Mihalcea, et al. ☆75 · Updated 3 years ago
- Performance evaluation of nearest neighbor search using Vespa, Elasticsearch and Open Distro for Elasticsearch K-NN ☆116 · Updated 3 years ago
- Colibri core is an NLP tool as well as a C++ and Python library for working with basic linguistic constructions such as n-grams and skipg… ☆124 · Updated this week
- Source code for the split annotations project. ☆53 · Updated last year
- GSDMM: Short text clustering (Rust implementation) ☆23 · Updated last year
- A toolkit for CDX indices such as Common Crawl and the Internet Archive's Wayback Machine ☆159 · Updated last month
- A fast and simple JavaScript library specifically targeted at collecting search and search related browser events. ☆41 · Updated 3 months ago
- Solr Dictionary Annotator (Microservice for Spark) ☆70 · Updated 4 years ago
- Program used to split text into segments ☆25 · Updated 3 weeks ago
- Open-Source Information Retrieval Reproducibility Challenge ☆50 · Updated 8 years ago
- Search relevance evaluation toolkit ☆30 · Updated 2 years ago
- ☆16 · Updated 3 years ago
- FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (inclu… ☆61 · Updated 6 months ago
- An unsupervised compound splitter ☆40 · Updated 5 years ago
- Faster, modernized fork of the language identification tool langid.py ☆48 · Updated this week
- A machine learning tool for fishing entities ☆249 · Updated last week