umarbutler / semchunk

A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.

☆222

Alternatives and similar repositories for semchunk:

Users who are interested in semchunk are comparing it to the libraries listed below.

aurelio-labs / semantic-chunkers
☆196 Updated last month
jina-ai / late-chunking
Code for explaining and evaluating late chunking (chunked pooling)
☆307 Updated 3 weeks ago
mixedbread-ai / baguetter
Baguetter is a flexible, efficient, and hackable search engine library implemented in Python. It's designed for quickly benchmarking, imp…
☆170 Updated 4 months ago
stephenleo / llm-structured-output-benchmarks
Benchmark various LLM Structured Output frameworks: Instructor, Mirascope, Langchain, LlamaIndex, Fructose, Marvin, Outlines, etc on task…
☆139 Updated 3 months ago
MinishLab / semhash
Fast Semantic Text Deduplication
☆280 Updated this week
castorini / rank_llm
RankLLM is a Python toolkit for reproducible information retrieval research using rerankers, with a focus on listwise reranking.
☆388 Updated 2 weeks ago
Unstructured-IO / unstructured-inference
☆167 Updated this week
hitz-zentroa / GoLLIE
Guideline-following Large Language Model for Information Extraction
☆328 Updated 2 months ago
huggingface / text-clustering
Easily embed, cluster and semantically label text datasets
☆488 Updated 9 months ago
lightonai / pylate
Late Interaction Models Training & Retrieval
☆223 Updated this week
brandonstarxel / chunking_evaluation
This package, developed as part of our research detailed in the Chroma Technical Report, provides tools for text chunking and evaluation.…
☆204 Updated 3 months ago
chrisammon3000 / dspy-neo4j-knowledge-graph
LLM-driven automated knowledge graph construction from text using DSPy and Neo4j.
☆161 Updated 9 months ago
KarelDO / xmc.dspy
In-Context Learning for eXtreme Multi-Label Classification (XMC) using only a handful of examples.
☆401 Updated 11 months ago
davanstrien / awesome-synthetic-datasets
awesome synthetic (text) datasets
☆253 Updated 2 months ago
CYQIQ / MultiCoT
Repository to demonstrate Chain of Table reasoning with multiple tables powered by LangGraph
☆145 Updated 9 months ago
chentong0 / factoid-wiki
Dense X Retrieval: What Retrieval Granularity Should We Use?
☆141 Updated last year
agamm / semantic-split
A Python library to chunk/group your texts based on semantic similarity.
☆90 Updated 6 months ago
sarthakrastogi / graph-rag
☆258 Updated 6 months ago
illuin-tech / vidore-benchmark
Vision Document Retrieval (ViDoRe): Benchmark. Evaluation code for the ColPali paper.
☆164 Updated last month
jackboyla / GLiREL
Generalist and Lightweight Model for Relation Extraction (Extract any relationship types from text)
☆169 Updated last week
IAAR-Shanghai / Meta-Chunking
Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception
☆103 Updated last month
tonywu71 / colpali-cookbooks
Recipes for learning, fine-tuning, and adapting ColPali to your multimodal RAG use cases. 👨🏻‍🍳
☆246 Updated last month
dswang2011 / DocLLM
DocLLM: A layout-aware generative language model for multimodal document understanding
☆119 Updated last year
IBM / fastfit
FastFit ⚡ When LLMs are Unfit Use FastFit ⚡ Fast and Effective Text Classification with Many Classes
☆181 Updated 3 months ago
TIGER-AI-Lab / LongRAG
Official repo for "LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs".
☆209 Updated 4 months ago
puppetm4st3r / baai_m3_simple_server
This code sets up a simple yet robust server using FastAPI for handling asynchronous requests for embedding generation and reranking task…
☆57 Updated 8 months ago
xhluca / bm25s
Fast lexical search implementing BM25 in Python using NumPy, Numba and SciPy
☆970 Updated this week
cohere-ai / DiskVectorIndex
☆206 Updated 6 months ago
AnswerDotAI / rerankers
A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models.
☆1,238 Updated last month
denser-org / denser-retriever
An enterprise-grade AI retriever designed to streamline AI integration into your applications, ensuring cutting-edge accuracy.
☆279 Updated this week