isaacus-dev/semchunk

A fast, lightweight and easy-to-use Python library for splitting text into semantically meaningful chunks.

☆660

Alternatives and similar repositories for semchunk

Users who are interested in semchunk are comparing it to the libraries listed below.

isaacus-dev / text2markdown
text2markdown is a Python library for intelligently converting plain text into Markdown.
☆19 · Jun 1, 2026 · Updated last month
benbrandt / text-splitter
Split text into semantic chunks, up to a desired chunk size. Supports calculating length by characters and tokens, and is callable from R…
☆621 · Updated this week
isaacus-dev / open-australian-legal-corpus-creator
The code used to create and update the Open Australian Legal Corpus, the first and only multijurisdictional open corpus of Australian leg…
☆122 · May 26, 2025 · Updated last year
feyninc / chonkie
🦛 CHONK docs with Chonkie ✨ — The lightweight ingestion library for fast, efficient and robust RAG pipelines
☆4,606 · Updated this week
isaacus-dev / cookbooks
Guides and code illustrating how to use Isaacus AI models in practice to solve real problems.
☆36 · Jun 30, 2026 · Updated 3 weeks ago
isaacus-dev / mleb
The code used to evaluate embedding models on the Massive Legal Embedding Benchmark (MLEB).
☆39 · Feb 24, 2026 · Updated 5 months ago
JustlyAI / lmss_entity_extractor
Tool to apply Legal Matter Specification Standard (LMSS) to documents
☆12 · Aug 15, 2024 · Updated last year
hamelsmu / ft-drift
Check for data drift between two OpenAI multi-turn chat jsonl files.
☆39 · Apr 11, 2024 · Updated 2 years ago
speedyk-005 / chunklet-py
One library to split them all: Sentence, Code, Docs. Chunk smarter, not harder — built for LLMs, RAG pipelines, and beyond.
☆82 · Updated this week
xhluca / bm25s
Fast BM25 search in Python, powered by Numpy and Numba
☆1,751 · Jul 22, 2026 · Updated last week
AnswerDotAI / rerankers
A lightweight, low-dependency, unified API to use all common reranking and cross-encoder models.
☆1,625 · Dec 20, 2025 · Updated 7 months ago
MinishLab / semhash
Fast Multimodal Semantic Deduplication & Filtering
☆954 · May 24, 2026 · Updated 2 months ago
qdrant / fastembed
Fast, Accurate, Lightweight Python library to make State of the Art Embedding
☆3,113 · Jul 22, 2026 · Updated last week
dottxt-ai / outlines
Structured Outputs
☆15,419 · Updated this week
Unstructured-IO / unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean…
☆15,210 · Updated this week
MinishLab / model2vec
Fast State-of-the-Art Static Embeddings
☆2,167 · Jun 6, 2026 · Updated last month
stanfordnlp / dspy
DSPy: The framework for programming—not prompting—language models
☆36,460 · Updated this week
alea-institute / kl3m-data
KL3M training data collection and preprocessing
☆22 · Apr 14, 2025 · Updated last year
D-Star-AI / dsRAG
High-performance retrieval engine for unstructured data
☆1,589 · Nov 10, 2025 · Updated 8 months ago
aurelio-labs / semantic-chunkers
☆255 · Jun 10, 2025 · Updated last year
docling-project / docling
Get your documents ready for gen AI
☆63,950 · Updated this week
AnswerDotAI / RAGatouille
Easily use and train state of the art late-interaction retrieval methods (ColBERT) in any RAG pipeline. Designed for modularity and ease-…
☆3,944 · May 17, 2025 · Updated last year
ucbepic / docetl
A system for agentic LLM-powered data processing and ETL
☆3,950 · Jul 21, 2026 · Updated last week
jina-ai / late-chunking
Code for explaining and evaluating late chunking (chunked pooling)
☆533 · Dec 23, 2024 · Updated last year
michaelfeil / infinity
Infinity is a high-throughput, low-latency serving engine for text-embeddings, reranking models, clip, clap and colpali
☆2,899 · Mar 24, 2026 · Updated 4 months ago
Knowledgator / FlashDeBERTa
Truly flash implementation of DeBERTa disentangled attention mechanism.
☆90 · Feb 10, 2026 · Updated 5 months ago
confident-ai / deepeval
The LLM Evaluation Framework
☆17,260 · Updated this week
neuml / txtai
💡 All-in-one AI framework for semantic search, LLM orchestration and language model workflows
☆12,765 · Updated this week
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆3,346 · Updated this week
alea-institute / FOLIO
FOLIO: Federated Open Legal Information Ontology
☆41 · May 27, 2026 · Updated 2 months ago
567-labs / instructor
structured outputs for llms
☆13,650 · Updated this week
davidberenstein1957 / dataset-viber
Dataset Viber is your chill repo for data collection, annotation and vibe checks.
☆47 · Sep 5, 2024 · Updated last year
mixedbread-ai / batched
The Batched API provides a flexible and efficient way to process multiple requests in a batch, with a primary focus on dynamic batching o…
☆161 · Jul 14, 2025 · Updated last year
noslegal / taxonomy
noslegal taxonomy facets and release notes
☆44 · May 29, 2026 · Updated 2 months ago
urchade / GLiNER
Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts)
☆3,462 · Updated this week
google / langextract
A Python library for extracting structured information from unstructured text using LLMs with precise source grounding and interactive vi…
☆37,920 · Updated this week
AnswerDotAI / toolslm
Tools to make language models a bit easier to use
☆67 · Updated this week
Beomi / exbert-transformers
exBERT on Transformers🤗
☆10 · Jun 14, 2021 · Updated 5 years ago
denser-org / denser-retriever
An enterprise-grade AI retriever designed to streamline AI integration into your applications, ensuring cutting-edge accuracy.
☆295 · Jun 26, 2025 · Updated last year