Maximax67 / Words-CEFR-Dataset
A dataset mapping English words to CEFR levels based on the CEFR-J dataset, word lemmas, stems, parts of speech (POS), and frequency data from the N-Gram Google dataset. Ideal for NLP tasks, language proficiency assessment, and linguistic research.
☆48 Updated last year
Alternatives and similar repositories for Words-CEFR-Dataset
Users who are interested in Words-CEFR-Dataset are comparing it to the repositories listed below
- NLP system for predicting the reading difficulty level of a text in terms of its CEFR level. ☆76 Updated last year
- Simple package to extract text with coordinates from programmatic PDFs ☆236 Updated this week
- Open language modeling toolkit based on PyTorch ☆173 Updated this week
- This is a project that translates a .pdf file, preserving the original layout of that .pdf file. [UPDATED] We have achieved the Second Pr… ☆110 Updated last year
- Enhancing Translation with RAG-Powered Large Language Models ☆89 Updated last month
- Docling core data types and transformations ☆223 Updated last week
- Repository for CEFR-SP corpus and sentence level assessment ☆56 Updated last year
- Open Language Profiles: English profile datasets from CEFR-J ☆167 Updated 5 years ago
- SmolDocling OCR App built using SmolDocling 256M Model and Streamlit. ☆233 Updated 10 months ago
- Parse PDFs into markdown using Vision LLMs ☆456 Updated 3 months ago
- ⚡️ 80x faster Fasttext language detection out of the box | Split text by language ☆283 Updated 4 months ago
- [NAACL'25] TEaR framework for paper "TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement" ☆50 Updated last year
- Toolkit for training/converting LibreTranslate compatible language models 🚂 ☆77 Updated 7 months ago
- A repo for the Formula Recognition Model (im2latex) based on Vision Encoder Decoder Model ☆19 Updated last year
- Multilingual sentence alignment using sentence embeddings ☆139 Updated last year
- Liquid Audio - Speech-to-Speech audio models by Liquid AI ☆382 Updated last week
- ☆159 Updated 9 months ago
- A RAG system designed to process documents with multimodal content. It can generate factual, context-aware answers to user queries, based… ☆26 Updated last year
- Repo housing the open sourced code for the ai2 scholar qa app and also the corresponding library ☆253 Updated this week
- Synthetic Data Generator for Machine Learning Pipelines ☆32 Updated 5 months ago
- State-of-the-art LLM-based translation models. ☆576 Updated 9 months ago
- Python Implementation of MUVERA (Multi-Vector Retrieval via Fixed Dimensional Encodings) ☆394 Updated last month
- Deep Reasoning Translation (DRT) Project ☆240 Updated 5 months ago
- Extract tables from PDFs using LLMWhisperer and extract structured information from those tables using Langchain ☆49 Updated last year
- A Docker-powered service for PDF document layout analysis. This service provides a powerful and flexible PDF analysis service. The servic… ☆1,068 Updated 3 weeks ago
- Query Expansion for Better Query Embedding using LLMs ☆64 Updated 11 months ago
- ☆681 Updated last month
- ArabicaQA: Comprehensive Dataset for Arabic Question Answering accepted at SIGIR 2024 ☆18 Updated last year
- A high-precision RAG framework leveraging Baidu ERNIE and Milvus. Features hybrid search and reranking algorithms for accurate PDF parsin… ☆48 Updated last month
- Generates a quiz from a URL. You can play the quiz, or let the LLM play it. ☆68 Updated 7 months ago