A minimal, optimized Python package for deduplication. It uses RapidFuzz internally to detect approximate duplicates within a dataset quickly and with minimal configuration.
☆18 · Apr 2, 2025 · Updated 11 months ago
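To illustrate what approximate duplicate detection means here, the following is a minimal sketch and not fast-dedupe's actual API. It swaps in the standard library's `difflib.SequenceMatcher` for RapidFuzz so that it runs without extra dependencies. The names `similarity` and `dedupe`, and the threshold of 85, are hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score between 0 and 100."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dedupe(items: list[str], threshold: float = 85.0) -> tuple[list[str], dict[str, list[str]]]:
    """Greedy approximate dedup (hypothetical helper, not the package's API).

    The first occurrence of each group is kept as its representative;
    later items that score at or above the threshold against a kept
    representative are recorded as its duplicates.
    """
    kept: list[str] = []
    duplicates: dict[str, list[str]] = {}
    for item in items:
        for rep in kept:
            if similarity(item, rep) >= threshold:
                duplicates.setdefault(rep, []).append(item)
                break
        else:
            # No sufficiently similar representative found: keep this item.
            kept.append(item)
    return kept, duplicates


names = ["Apple iPhone 12", "Apple iPhone12", "Samsung Galaxy", "Samsung Galaxy"]
unique, dupes = dedupe(names)
print(unique)
print(dupes)
```

This greedy pass compares each item against every kept representative, so it is quadratic in the worst case. A library built on RapidFuzz would presumably rely on its much faster scorers, but the exact strategy fast-dedupe uses is not described above.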
Alternatives and similar repositories for fast-dedupe
Users who are interested in fast-dedupe are comparing it to the libraries listed below.
- EmbedDB is an ultra-lightweight vector database designed for rapid prototyping of semantic search and RAG applications. The entire implem… ☆21 · Mar 24, 2025 · Updated 11 months ago
- synthetic data for ml ☆25 · Jan 30, 2025 · Updated last year
- ☆10 · Nov 12, 2024 · Updated last year
- ☆15 · May 12, 2025 · Updated 9 months ago
- Code for COLING 2022 accepted paper titled "MuCDN: Mutual Conversational Detachment Network for Emotion Recognition in Multi-Party Conver… ☆10 · Jul 21, 2023 · Updated 2 years ago
- The tool to visualise architecture of python packages ☆10 · Aug 16, 2023 · Updated 2 years ago
- An AI-powered literature review assistant for researchers ☆22 · Apr 18, 2025 · Updated 10 months ago
- RAG-based Chatbot that helps answer questions around healthy eating & lifestyle choices, based on 1200+ science-backed blog posts of Nutr… ☆13 · Sep 15, 2025 · Updated 5 months ago
- Fast search index for SPLADE sparse retrieval models implemented in Python using Numpy and Numba ☆36 · Oct 16, 2025 · Updated 4 months ago
- ☆12 · Feb 24, 2026 · Updated last week
- Notes on how to set up your backend instance ☆12 · May 29, 2024 · Updated last year
- Example files used in the DuckDB - Unity Catalog blog ☆10 · Dec 6, 2024 · Updated last year
- ☆12 · Dec 29, 2021 · Updated 4 years ago
- A code-free AutoML pipeline with AutoGluon, Amazon SageMaker, and AWS Lambda. ☆11 · Aug 5, 2021 · Updated 4 years ago
- 🎈 A series of lightweight GPT models featuring TinyGPT Base (~51M params) and TinyGPT-MoE (~85M params), TinyGPT2 (~95M params). Fast, c… ☆15 · Feb 21, 2026 · Updated 2 weeks ago
- CodeRepoQA dataset ☆15 · Feb 19, 2025 · Updated last year
- Apache Arrow Guide ☆17 · Oct 10, 2021 · Updated 4 years ago
- ☆12 · Apr 22, 2024 · Updated last year
- ☆11 · Dec 22, 2022 · Updated 3 years ago
- LUMIN: Your data analysis companion that turns natural language questions into powerful insights through AI-driven visualizations and cle… ☆15 · Nov 11, 2024 · Updated last year
- Course Scheduling Management LMS - Low level design with standard design patterns using Java. ☆11 · Jul 27, 2022 · Updated 3 years ago
- R and Python solutions to applied exercises in An Introduction to Statistical Learning with Applications in R (corrected 7th printing) ☆16 · Jun 4, 2025 · Updated 9 months ago
- GlotEval: a unified evaluation toolkit designed to benchmark multilingual Large Language Models (LLMs) in a language-specific way ☆18 · Nov 4, 2025 · Updated 4 months ago
- Multi-Agent Deep RAG ☆39 · Feb 25, 2026 · Updated last week
- Dataset of sentences from Hindi stories tagged with different emotion tags ☆11 · Nov 26, 2019 · Updated 6 years ago
- ☆18 · Dec 6, 2024 · Updated last year
- ☆19 · Oct 1, 2025 · Updated 5 months ago
- ☆17 · Apr 19, 2024 · Updated last year
- Playing with Python Bluesky SDK ☆15 · Nov 18, 2024 · Updated last year
- Table detection with Florence. ☆15 · Jul 11, 2024 · Updated last year
- ☆22 · Jan 13, 2025 · Updated last year
- Examples of demo deployment using Gradio. Image Classification, Live Webcam Segmentation, APIs, Tunneling etc. ☆17 · Oct 17, 2022 · Updated 3 years ago
- Resources to learn data processing with GPT and other language models ☆21 · Dec 10, 2024 · Updated last year
- Analyze, Detect and Remove Gender Stereotyping from Bollywood Movie Trailers. ☆13 · Mar 27, 2018 · Updated 7 years ago
- ☆25 · Jun 10, 2025 · Updated 8 months ago
- SynthTextEval: A Toolkit for Generating and Evaluating Synthetic Data For High-Stakes Domains (EMNLP 2025 System Demonstration) ☆26 · Nov 3, 2025 · Updated 4 months ago
- ☆21 · Jun 12, 2024 · Updated last year
- Making of cuda kernel ☆17 · May 27, 2025 · Updated 9 months ago
- Python implementation of METEOR ☆16 · Nov 20, 2018 · Updated 7 years ago