AI4Bharat / setu
Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
☆16 Updated last year
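The four stages named in the description can be sketched as a minimal pipeline. This is an illustrative sketch in plain Python, not Setu's actual code or API: it does not use Apache Spark, and every function name, the `min_words` threshold, and the exact-hash deduplication strategy are assumptions made for the example.

```python
import hashlib
import re


def prepare(raw_docs):
    """Stage 1 (document preparation): wrap raw strings in records."""
    return [{"id": i, "text": t} for i, t in enumerate(raw_docs)]


def clean_and_analyze(doc):
    """Stage 2 (cleaning and analysis): normalize whitespace, count words."""
    text = re.sub(r"\s+", " ", doc["text"]).strip()
    return {**doc, "text": text, "word_count": len(text.split())}


def flag_and_filter(docs, min_words=3):
    """Stage 3 (flagging and filtering): drop documents that are too short.

    The min_words threshold is a hypothetical quality rule.
    """
    return [d for d in docs if d["word_count"] >= min_words]


def deduplicate(docs):
    """Stage 4 (deduplication): keep the first copy of each exact duplicate.

    Uses a SHA-256 hash of the lowercased text; a real pipeline may use
    fuzzy matching instead.
    """
    seen = set()
    unique = []
    for d in docs:
        digest = hashlib.sha256(d["text"].lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(d)
    return unique


def run_pipeline(raw_docs, min_words=3):
    """Run the four stages in order on a list of raw strings."""
    docs = prepare(raw_docs)
    docs = [clean_and_analyze(d) for d in docs]
    docs = flag_and_filter(docs, min_words)
    return deduplicate(docs)
```

Calling `run_pipeline` on a list of strings returns the surviving records in their original order, each with its cleaned text and word count.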
Alternatives and similar repositories for setu
Users who are interested in setu compare it to the libraries listed below.
- Code for evaluating with Flow-Judge-v0.1 - an open-source, lightweight (3.8B) language model optimized for LLM system evaluations. Crafte… ☆83 Updated last year
- Writing Blog Posts with Generative Feedback Loops! ☆50 Updated last year
- A lightweight evaluation suite tailored specifically for assessing Indic LLMs across a diverse range of tasks ☆38 Updated last year
- Supervised instruction finetuning for LLM with HF trainer and Deepspeed ☆36 Updated 2 years ago
- Experimental Code for StructuredRAG: JSON Response Formatting with Large Language Models ☆114 Updated 9 months ago
- ☆21 Updated last year
- ☆53 Updated 11 months ago
- 💙 Unstructured Data Connectors for Haystack 2.0 ☆17 Updated 2 years ago
- Solving data for LLMs - Create quality synthetic datasets! ☆151 Updated last year
- High level library for batched embeddings generation, blazingly-fast web-based RAG and quantized indexes processing ⚡ ☆69 Updated 2 months ago
- Data extraction with LLM on CPU ☆68 Updated 2 years ago
- ☆31 Updated last year
- Agentic RAG to help you build a startup🚀 ☆55 Updated 9 months ago
- Own your AI, search the web with it🌐😎 ☆94 Updated last year
- Simple Graph Memory for AI applications ☆90 Updated 8 months ago
- ☆55 Updated 5 months ago
- A personal knowledge base that I can dump information to and help me learn ☆24 Updated 8 months ago
- Verbosity control for AI agents ☆66 Updated last year
- Quick Notebook Tutorials ☆36 Updated 6 months ago
- ☆125 Updated 11 months ago
- Explore the use of DSPy for extracting features from PDFs 🔎 ☆52 Updated last year
- Median is an open-source flashcard application that leverages the power of spaced repetition and artificial intelligence to transform the… ☆22 Updated last year
- Recipes and resources for building, deploying, and fine-tuning generative AI with Fireworks. ☆133 Updated last week
- Implements LLMOps as taught in the DeepLearning.AI course ☆49 Updated this week
- Official homepage for "Self-Harmonized Chain of Thought" (NAACL 2025) ☆92 Updated last year
- ☆210 Updated 7 months ago
- ☆30 Updated last year
- Testing speed and accuracy of RAG with, and without Cross Encoder Reranker. ☆50 Updated 2 years ago
- A multimodal RAG application that enables semantic search on multimedia sources like audio, video and images ☆41 Updated 2 years ago
- Docutron Toolkit: detection and segmentation analysis for legal data extraction over documents. ☆26 Updated 2 years ago