Publications

FormalAlign: Automated Alignment Evaluation for Autoformalization
Process-Driven Autoformalization in Lean 4
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs

Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely owing to step-by-step chain-of-thought reasoning. However, evaluating the reasoning capability of LLMs has become increasingly challenging: existing outcome-based benchmarks are beginning to saturate and are no longer sufficient to monitor progress. To this end, we present MR-Ben, a process-based benchmark that demands meta-reasoning skill: LLMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-Ben is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering subjects such as physics, chemistry, logic, and coding. Using metrics designed to assess meta-reasoning on this benchmark, we identify notable limitations and weaknesses of current LLMs, both open-source and closed-source. For example, open-source models appear comparable to GPT-4 on outcome-based benchmarks, but they lag far behind on our benchmark, revealing an underlying gap in reasoning capability. Our dataset and code are available at this https URL.
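The abstract does not give the exact scoring formulas, but the core evaluation it describes — checking whether a model correctly judges a machine-generated reasoning chain and, when the chain is flawed, points to the first erroneous step — can be sketched roughly as follows. All class, field, and function names here are illustrative assumptions, not the MR-Ben codebase:

```python
from dataclasses import dataclass

@dataclass
class Judgement:
    """A model's verdict on one machine-generated solution (illustrative)."""
    says_correct: bool     # model claims the reasoning chain is error-free
    first_error_step: int  # step index the model blames (-1 if none)

@dataclass
class GoldLabel:
    """Expert annotation for the same solution (illustrative)."""
    is_correct: bool
    first_error_step: int  # -1 when the chain is actually correct

def meta_reasoning_scores(preds, golds):
    """Two toy metrics: solution-level verdict accuracy and, on truly
    flawed chains only, accuracy of locating the first wrong step."""
    verdict_hits, locate_hits, flawed = 0, 0, 0
    for p, g in zip(preds, golds):
        if p.says_correct == g.is_correct:
            verdict_hits += 1
        if not g.is_correct:
            flawed += 1
            if not p.says_correct and p.first_error_step == g.first_error_step:
                locate_hits += 1
    return {
        "verdict_acc": verdict_hits / len(golds),
        "error_locate_acc": locate_hits / flawed if flawed else None,
    }

# Toy run: one correct chain, two flawed chains.
preds = [Judgement(True, -1), Judgement(False, 2), Judgement(False, 3)]
golds = [GoldLabel(True, -1), GoldLabel(False, 2), GoldLabel(False, 1)]
scores = meta_reasoning_scores(preds, golds)
```

Separating verdict accuracy from localization accuracy is what makes the benchmark process-based: a model can guess "wrong" correctly yet still fail to identify where the reasoning breaks.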

AutoPSV: Automated Process-Supervised Verifier
Multimodal Prompt Tuning for Cognition-Enhanced NLP

Human cognitive signals reflect how attention and neural activation are distributed over different parts of an input, and they are crucial for understanding the mechanisms behind human language processing. Computational linguistics research aims to optimize language models toward human-like performance on natural language processing (NLP) tasks, ideally in an accountable fashion, which makes integrating human cognitive signals into language models an intriguing research direction. Previous work exploring how cognition data can enhance NLP tasks has suffered from limitations such as weak accuracy gains, heavy engineering bias, and the limited generalizability of conclusions drawn from experiments on outdated models. This thesis addresses these issues with a novel approach based on prompt-based fine-tuning, proposing two methods. (1) Inspired by 'hard prompting', Method 1 uses gaze and electroencephalography (EEG) features as discrete prompt tokens to modify model behaviour during training. (2) Drawing on 'soft prompting', Method 2 designs a multimodal prompting framework called 'CogMAP' (Cognition Mapping And Prompting), which employs these cognition features as multidimensional prompting vectors projected into the continuous embedding space of language models. On ternary sentiment classification, results were consistently superior when either gaze or EEG data were incorporated as prompts in both methods (p < 0.001), across encoder-only BERT-based models and decoder-only GPT-2-based models. This study advances cognition-inspired NLP research, addressing existing limitations while providing a robust and effective paradigm for future work on bridging the gap between human cognition and artificial language processing to improve the performance and understanding of language models.
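The abstract describes, but does not specify, how CogMAP maps cognition features into the model's embedding space. A minimal sketch of the general soft-prompting idea — a linear projection from a cognition-feature vector to a prompt embedding prepended to the token embeddings — might look like this. The dimensions, weights, and function names are illustrative assumptions, not the thesis implementation:

```python
import random

def linear_project(features, weights, bias):
    """Project a cognition-feature vector (length F) into the model's
    embedding space (length D) with a plain matrix-vector product."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]

def prepend_soft_prompt(token_embeddings, cognition_features, weights, bias):
    """Turn EEG/gaze features into one soft-prompt vector and place it
    before the token embeddings, as in generic soft prompting."""
    prompt_vec = linear_project(cognition_features, weights, bias)
    return [prompt_vec] + token_embeddings

# Toy sizes: F=3 cognition features, D=4 embedding dims, 2 tokens.
random.seed(0)
F, D = 3, 4
weights = [[random.uniform(-1, 1) for _ in range(F)] for _ in range(D)]
bias = [0.0] * D
tokens = [[0.1] * D, [0.2] * D]   # stand-in token embeddings
eeg = [0.5, -0.3, 0.9]            # stand-in cognition features

seq = prepend_soft_prompt(tokens, eeg, weights, bias)
# The sequence grows by one position: the soft prompt comes first,
# and the original token embeddings are unchanged.
```

In a real system the projection weights would be learned jointly with the downstream task, and one would typically emit several prompt vectors rather than one; the point here is only the shape of the mechanism.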

Reading-While-Listening vs. Reading-Only in a Second Language at Different Language Proficiencies: An Eye-Tracking Study

Empirical studies have shown that reading-while-listening (R/L) facilitates second language (L2) reading comprehension after longitudinal R/L training. However, most of this evidence offers limited insight into how auditory input affects readers' language processing. When R/L has been examined with eye-movement metrics, a hindrance effect was reported for L1 readers, and the facilitation effect on comprehension disappeared for advanced-level L2 readers (Conklin et al., 2020). To study the effect of R/L on less adept L2 learners, this study compared the comprehension accuracy and eye movements of intermediate- and elementary-level L2 readers of English in reading-only (R/O) versus R/L modes. Twenty-two university students in Macao completed a vocabulary test and reading comprehension tasks, and were assigned to an intermediate-level group (n = 11) or an elementary-level group (n = 11) based on vocabulary test performance. Both groups completed the tasks while their eye movements were recorded by a Tobii eye tracker. Results showed no significant difference in comprehension between R/L and R/O for either group. Mixed-model analyses of variance revealed significant main effects of reading mode (R/L vs. R/O) on total fixation durations and total visit durations, suggesting that R/L facilitated processing of the text at both proficiency levels. Significant interactions between reading mode and language level showed that the facilitation was greater for elementary-level L2 readers. Hence, we offer preliminary support for a continuum model that summarizes the differing effects of auditory input on readers across language proficiency levels.
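The reported mode × proficiency interaction amounts to comparing how much each group's fixation time drops under R/L relative to R/O. A toy computation of that interaction contrast from cell means is shown below; the numbers are invented for illustration and are not the study's data:

```python
from statistics import mean

# Hypothetical total fixation durations (seconds) per participant,
# keyed by (proficiency level, reading mode). All values are invented.
data = {
    ("elementary", "R/O"):   [95, 100, 105],
    ("elementary", "R/L"):   [70, 75, 80],
    ("intermediate", "R/O"): [80, 85, 90],
    ("intermediate", "R/L"): [72, 77, 82],
}

cell_means = {k: mean(v) for k, v in data.items()}

def mode_effect(level):
    """Drop in mean fixation time from R/O to R/L for one group;
    a larger positive value means R/L eased processing more."""
    return cell_means[(level, "R/O")] - cell_means[(level, "R/L")]

# The interaction contrast: does R/L help the elementary group more?
interaction = mode_effect("elementary") - mode_effect("intermediate")
```

A nonzero contrast in this direction is the descriptive pattern the study's mixed-model ANOVA tests for significance; the real analysis additionally models participant-level random effects rather than bare cell means.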