OpenMatch / NeuScraper
[ACL 2024] This is the code repo for our ACL’24 paper "Cleaner Pretraining Corpus Curation with Neural Web Scraping".
☆224Updated 7 months ago
Alternatives and similar repositories for NeuScraper:
Users who are interested in NeuScraper are comparing it to the repositories listed below
- A lightweight script for converting HTML pages to Markdown format, with support for code blocks☆79Updated 11 months ago
- Official Repo for "Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale"☆232Updated last month
- This is the code repo for our paper "Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents".☆104Updated 5 months ago
- Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception☆149Updated 3 weeks ago
- DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought☆212Updated 3 months ago
- AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark☆133Updated 3 months ago
- Official repository for RAGViz: Diagnose and Visualize Retrieval-Augmented Generation [EMNLP 2024]☆82Updated 2 months ago
- [Preprint] Learning to Filter Context for Retrieval-Augmented Generation☆191Updated last year
- ☆51Updated 8 months ago
- ☆148Updated last week
- 🚢 Data Toolkit for Sailor Language Models☆88Updated last month
- Large Language Models Are Reasoning Teachers (ACL 2023)☆329Updated last month
- The official repository for the paper: Evaluation of Retrieval-Augmented Generation: A Survey.☆148Updated 6 months ago
- [ICLR 2025] The official implementation of paper "ToolGen: Unified Tool Retrieval and Calling via Generation"☆133Updated 2 weeks ago
- Open replication of DeepSeek R1 for text-to-graph extraction.☆92Updated 2 months ago
- [ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.☆238Updated 5 months ago
- ☆142Updated 9 months ago
- [EMNLP 2024: Demo Oral] RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation☆293Updated 5 months ago
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation☆72Updated 2 weeks ago
- Light local website for displaying performances from different chat models.☆86Updated last year
- Leveraging passage embeddings for efficient listwise reranking with large language models.☆39Updated 4 months ago
- [NeurIPS D&B 2024] Generative AI for Math: MathPile☆410Updated last week
- We aim to provide the best references to search, select, and synthesize high-quality and large-quantity data for post-training your LLMs.☆54Updated 6 months ago
- Imitate OpenAI with Local Models☆88Updated 7 months ago
- Evaluation for AI apps and agents☆36Updated last year
- ☆122Updated last year
- ☆109Updated 8 months ago
- Reformatted Alignment☆115Updated 6 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024)☆132Updated 5 months ago
- ☆97Updated 10 months ago