EternityYW / TRAM-BenchmarkLinks
TRAM: Benchmarking Temporal Reasoning for Large Language Models (Findings of ACL 2024)
☆26Updated last year
Alternatives and similar repositories for TRAM-Benchmark
Users that are interested in TRAM-Benchmark are comparing it to the libraries listed below
Sorting:
- AbstainQA, ACL 2024☆27Updated 9 months ago
- ☆46Updated 8 months ago
- Code for reproducing the ACL'23 paper: Don't Generate, Discriminate: A Proposal for Grounding Language Models to Real-World Environments☆74Updated 2 months ago
- Methods and evaluation for aligning language models temporally☆29Updated last year
- [ICLR 2023] Code for our paper "Selective Annotation Makes Language Models Better Few-Shot Learners"☆108Updated 2 years ago
- Code for the ACL-2022 paper "Knowledge Neurons in Pretrained Transformers"☆170Updated last year
- Supporting code for ReCEval paper☆29Updated 10 months ago
- Source code for InBedder, an instruction-following text embedder☆27Updated 9 months ago
- ☆45Updated last year
- [EMNLP 2023] MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions☆114Updated 10 months ago
- [EMNLP-2022 Findings] Code for paper “ProGen: Progressive Zero-shot Dataset Generation via In-context Feedback”.☆27Updated 2 years ago
- [ICML 2023] Code for our paper “Compositional Exemplars for In-context Learning”.☆102Updated 2 years ago
- ☆99Updated last year
- WikiWhy is a new benchmark for evaluating LLMs' ability to explain between cause-effect relationships. It is a QA dataset containing 9000…☆47Updated last year
- ☆27Updated 2 years ago
- ☆28Updated last year
- Github repository for "FELM: Benchmarking Factuality Evaluation of Large Language Models" (NeurIPS 2023)☆59Updated last year
- ☆50Updated last year
- Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model☆68Updated 2 years ago
- The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning (NeurIPS 2022)☆15Updated 2 years ago
- ☆87Updated 2 years ago
- [ICLR'24 Spotlight] "Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts"☆70Updated last year
- Evaluate the Quality of Critique☆36Updated last year
- Code and Data for NeurIPS2021 Paper "A Dataset for Answering Time-Sensitive Questions"☆72Updated 3 years ago
- ☆178Updated 11 months ago
- Data and code for the ICLR 2023 paper "Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning".☆154Updated last year
- [EMNLP 2024] Official implementation of "Hierarchical Deconstruction of LLM Reasoning: A Graph-Based Framework for Analyzing Knowledge Ut…☆21Updated 7 months ago
- [NAACL'25 Oral] Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering☆61Updated 7 months ago
- ☆86Updated 2 years ago
- ☆29Updated last year