[ACL 2024] This is the code repo for our ACL’24 paper "Cleaner Pretraining Corpus Curation with Neural Web Scraping".
☆230Aug 28, 2024Updated last year
Alternatives and similar repositories for NeuScraper
Users who are interested in NeuScraper are comparing it to the libraries listed below.
- This is the code repo for the paper "RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards".☆24Oct 28, 2024Updated last year
- [ACL 2024 Oral] This is the code repo for our ACL’24 paper "MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Mo…☆39Jun 30, 2024Updated last year
- This is the code repo for our paper "Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression".☆12Feb 27, 2024Updated 2 years ago
- [CIKM 2023 Oral] This is the code repo for our CIKM’23 paper "Text Matching Improves Sequential Recommendation by Reducing Popularity Bia…☆39Mar 17, 2024Updated last year
- This is the code repo for our paper "Learning More Effective Representations for Dense Retrieval through Deliberate Thinking Before Searc…☆27Mar 2, 2025Updated 11 months ago
- This is the code repo for our paper "Enhancing Knowledge Integration and Utilization of Large Language Models via Constructivist Cognitio…☆110Oct 9, 2025Updated 4 months ago
- ☆16Dec 11, 2024Updated last year
- ☆30Dec 27, 2024Updated last year
- [ACL 2025 Findings] Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts (As Huggingface Daily Papers: …☆90Nov 23, 2025Updated 3 months ago
- [ICML 2024] Selecting High-Quality Data for Training Language Models☆201Dec 8, 2025Updated 2 months ago
- An Open-Source Package for Information Retrieval☆168Updated this week
- ☆13Jul 13, 2023Updated 2 years ago
- ☆18Mar 23, 2025Updated 11 months ago
- Language models scale reliably with over-training and on downstream tasks☆100Apr 2, 2024Updated last year
- ☆64Apr 9, 2024Updated last year
- [EMNLP 2022] This is the code repo for our EMNLP’22 paper "Dimension Reduction for Efficient Dense Retrieval via Conditional Autoencoder"…☆12Oct 20, 2022Updated 3 years ago
- Source code for paper "ExpandR: Teaching Dense Retrievers Beyond Queries with LLM Guidance"☆39Aug 13, 2025Updated 6 months ago
- HIPPO: Enhancing the Table Understanding Capability of Large Language Models through Hybrid-Modal Preference Optimization☆17May 29, 2025Updated 9 months ago
- Source code for paper: INTERVENOR : Prompt the Coding Ability of Large Language Models with the Interactive Chain of Repairing☆29Nov 25, 2024Updated last year
- Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]☆149Oct 27, 2024Updated last year
- [ACL2025 Findings] Benchmarking Multihop Multimodal Internet Agents☆48Feb 27, 2025Updated last year
- Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.☆2,903Updated this week
- Code repo for SIGIR 2021 paper "Few-Shot Conversational Dense Retrieval"☆42Dec 9, 2021Updated 4 years ago
- ☆39Jul 25, 2024Updated last year
- Codebase for ICML submission "DOGE: Domain Reweighting with Generalization Estimation"☆21Feb 29, 2024Updated 2 years ago
- Official repository for the paper "COAST: Enhancing the Code Debugging Ability of LLMs through Communicative Agent Based Data Synthesis".☆18Feb 19, 2025Updated last year
- This is the code repo for our paper "Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts".☆44Sep 27, 2025Updated 5 months ago
- ☆566Nov 20, 2024Updated last year
- In-Context Learning for eXtreme Multi-Label Classification (XMC) using only a handful of examples.☆449Feb 13, 2024Updated 2 years ago
- This repository provides an original implementation of Detecting Pretraining Data from Large Language Models by *Weijia Shi, *Anirudh Aji…☆242Nov 3, 2023Updated 2 years ago
- [ICML 2024] LESS: Selecting Influential Data for Targeted Instruction Tuning☆512Oct 20, 2024Updated last year
- The collection of building blocks for building fine-tunable metric learning models☆36Jan 5, 2026Updated last month
- A framework for evaluating Machine Translation models.☆12May 26, 2025Updated 9 months ago
- ☆22Dec 11, 2025Updated 2 months ago
- A platform aimed at creating websites that perform self-optimization☆12May 4, 2024Updated last year
- [NAACL 2024 Outstanding Paper] Source code for the NAACL 2024 paper entitled "R-Tuning: Instructing Large Language Models to Say 'I Don't…☆130Jul 10, 2024Updated last year
- ☆167May 2, 2024Updated last year
- [NeurIPS D&B 2024] Generative AI for Math: MathPile☆418Apr 4, 2025Updated 10 months ago
- Code for the paper "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"☆316Dec 20, 2023Updated 2 years ago