OpenMatch / NeuScraper
[ACL 2024] This is the code repo for our ACL’24 paper "Cleaner Pretraining Corpus Curation with Neural Web Scraping".
☆226 Updated 9 months ago
Alternatives and similar repositories for NeuScraper
Users who are interested in NeuScraper are comparing it to the libraries listed below.
- A lightweight script for converting HTML pages to Markdown format, with support for code blocks ☆79 Updated last year
- [ICML 2025] Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale ☆248 Updated 3 weeks ago
- [Preprint] Learning to Filter Context for Retrieval-Augmented Generation ☆192 Updated last year
- This is the code repo for our paper "Autonomously Knowledge Assimilation and Accommodation through Retrieval-Augmented Agents". ☆106 Updated 7 months ago
- [EMNLP 2024: Demo Oral] RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation ☆299 Updated 7 months ago
- ☆51 Updated 10 months ago
- LongEmbed: Extending Embedding Models for Long Context Retrieval (EMNLP 2024) ☆135 Updated 6 months ago
- Official repo for "LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs". ☆231 Updated 9 months ago
- Dense X Retrieval: What Retrieval Granularity Should We Use? ☆157 Updated last year
- [NeurIPS D&B 2024] Generative AI for Math: MathPile ☆412 Updated 2 months ago
- [ACL 2024] LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement ☆183 Updated last year
- ☆171 Updated 2 months ago
- Open replication of DeepSeek R1 for text-to-graph extraction. ☆94 Updated 4 months ago
- Deep Reasoning Translation via Reinforcement Learning (arXiv preprint 2025); DRT: Deep Reasoning Translation via Long Chain-of-Thought (a… ☆224 Updated last week
- Meta-Chunking: Learning Efficient Text Segmentation via Logical Perception ☆171 Updated this week
- Evaluation for AI apps and agents ☆41 Updated last year
- ☆142 Updated 11 months ago
- Official repository for RAGViz: Diagnose and Visualize Retrieval-Augmented Generation [EMNLP 2024] ☆83 Updated 4 months ago
- AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark ☆142 Updated 5 months ago
- Code for KaLM-Embedding models ☆78 Updated 2 months ago
- The official repository for the paper: Evaluation of Retrieval-Augmented Generation: A Survey. ☆158 Updated last month
- [ICLR 2025] The official implementation of paper "ToolGen: Unified Tool Retrieval and Calling via Generation" ☆142 Updated 2 months ago
- ☆99 Updated last year
- Qwen GRPO Graph Extraction RL Finetune ☆49 Updated 2 months ago
- This is a repository of RALM surveys containing a summary of state-of-the-art RAG and other technologies ☆202 Updated 11 months ago
- This is the repository for our paper "INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning" ☆203 Updated 5 months ago
- Compress your input to ChatGPT or other LLMs, to let them process 2x more content and save 40% memory and GPU time. ☆383 Updated last year
- ☆94 Updated 6 months ago
- We aim to provide the best references to search, select, and synthesize high-quality and large-quantity data for post-training your LLMs. ☆55 Updated 8 months ago
- A pipeline for LLM knowledge distillation ☆104 Updated 2 months ago