AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆11 · Updated 6 months ago
Related projects
Alternatives and complementary repositories for setu
- Data extraction with LLM on CPU ☆66 · Updated last year
- OpenMindedChatbot is a Proof Of Concept that leverages the power of Open source Large Language Models (LLM) with Function Calling capabil… ☆28 · Updated 11 months ago
- Official repo for the paper PHUDGE: Phi-3 as Scalable Judge. Evaluate your LLMs with or without custom rubric, reference answer, absolute… ☆48 · Updated 4 months ago
- ☆31 · Updated 8 months ago
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ☆33 · Updated 5 months ago
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ☆53 · Updated 3 weeks ago
- Build reliable, secure, and production-ready AI apps easily. ☆46 · Updated this week
- Low latency, High Accuracy, Custom Query routers for Co-pilots and Agents. Built by Prithivi Da ☆52 · Updated this week
- Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23) ☆74 · Updated 2 months ago
- Code interpreter support for o1 ☆31 · Updated 2 months ago
- ☆19 · Updated 3 months ago
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆62 · Updated 2 weeks ago
- Routing on Random Forest (RoRF) ☆84 · Updated last month
- Example implementation of Iteration of Thought - Give a star if you like the project ☆33 · Updated this week
- Writing Blog Posts with Generative Feedback Loops! ☆43 · Updated 8 months ago
- Dynamic Metadata based RAG Framework ☆71 · Updated 3 months ago
- DSPy program/pipeline inspector widget for Jupyter/VSCode Notebooks. ☆28 · Updated 9 months ago
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆33 · Updated 8 months ago
- Repository containing awesome resources regarding Hugging Face tooling. ☆43 · Updated 10 months ago
- Dataset Viber is your chill repo for data collection, annotation and vibe checks. ☆44 · Updated 2 months ago
- Experimental Code for StructuredRAG: Structured Outputs in Retrieval-Augmented Generation ☆94 · Updated this week
- ☆49 · Updated this week
- Verbosity control for AI agents ☆59 · Updated 6 months ago
- Seamless Voice Interactions with LLMs ☆11 · Updated last year
- A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages ☆90 · Updated last month
- Docutron Toolkit: detection and segmentation analysis for legal data extraction over documents. ☆26 · Updated last year
- Self-host LLMs with vLLM and BentoML ☆74 · Updated last week
- LlamaWorksDB is a Retrieval Augmented Generation (RAG) product designed to interact with the documentation of various products such as Ll… ☆15 · Updated 6 months ago
- ☆18 · Updated this week
- Super performant RAG pipeline for AI apps. ☆15 · Updated 8 months ago