FaStfact: Faster, Stronger Long-Form Factuality Evaluations in LLMs

FaStfact is a multi-agent pipeline for evaluating the factuality of long-form generation. Among existing baselines, it achieves the highest alignment with human evaluation and the best time and token efficiency. An annotated benchmark, FaStfact-bench, is also open-sourced.

Yingjia Wan, Haochen Tan, Xiao Zhu, Xinyu Zhou, Zhiwei Li, Qingsong Lv, Changxuan Sun, Jiaqi Zeng, Yi Xu, Jianqiao Lu, Yinhong Liu, Zhijiang Guo

SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas

SATBench is a benchmark for evaluating LLMs’ logical reasoning through logical puzzles derived from Boolean satisfiability (SAT) problems.

Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken

MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs

MR-Ben is a comprehensive process-based benchmark for evaluating advanced ‘meta-reasoning’ skills, in which models are asked to locate and analyse errors in provided CoT solutions. It comprises 5,975 multi-domain samples with annotated ground truths.

Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia

AutoPSV: Automated Process-Supervised Verifier

AutoPSV proposes a simple, effective, and efficient method to automatically annotate reasoning steps, even without ground-truth answers.

Jianqiao Lu, Zhiyang Dou, Hongru Wang, Zeyu Cao, Jianbo Dai, Yingjia Wan, Yinya Huang, Zhijiang Guo

Reading-While-Listening vs. Reading-Only in A Second Language at Different Language Proficiencies: an Eye-Tracking Study

Reading-while-listening (R/L) has a facilitation effect on second language (L2) reading comprehension after longitudinal R/L training …

Yingjia Wan, Matthew Wallace

Last updated on July 7, 12127

Pedagogy in a Pandemic: College Instructor Perspectives on Online Instruction during COVID-19 at Universities in USA and China

Higher education institutions globally saw a collective mandate to move classes online, where afforded, at the onset of the COVID-19 …

Sarah Stilwell, Anjli Narwani, Jessica Pelton, Xi Zhang, Qi Zeng, Qi Zhao, Yingjia Wan, Kevin Miller

Last updated on July 7, 12127