Ziyi Liu

PhD Student

University of Southern California

About me

Hi there 👋, I am a second-year PhD student at the University of Southern California, where I am fortunate to be advised by Prof. Jieyu Zhao in the LIME Lab. Previously, I earned my master's degree at USC and worked as a Research Assistant in USC ISI's INK Lab for two years under the guidance of Prof. Xiang Ren. My research focuses on social reasoning and trustworthy NLP, particularly on evaluating LLM behavior and aligning LLM values with human values in human-LLM interaction. My work is driven by two key questions:

  • How can we make interactions between models and humans more seamless?
  • How can we ensure the faithfulness of LLMs and avoid hallucinations during interactions?

I am open to collaboration! If you are a master's or undergraduate student at USC, please fill out this form before contacting me. If you are a PhD student at another university, feel free to drop me an email!

I am looking for an internship for Summer 2026! I would appreciate any help!

News

Oct 2025
Our paper ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation (work done during my internship at Microsoft) was released on arXiv! 🎉
Mar 2025
I will join Microsoft this summer as an intern, working on social intelligence and agent research.
Sept 2024
Our paper InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context was accepted at EMNLP 2024. See you in Miami! 🎉
Sept 2024
Our paper Self-Contradictory Reasoning Evaluation and Detection was accepted to Findings of EMNLP 2024. 🎉
May 2023
Our paper Are Machine Rationales (Not) Useful to Humans? was accepted at ACL 2023.

Featured Publications and Preprints

Ziyi Liu, Bahar Sarrafzadeh, Pei Zhou, Longqi Yang, Jieyu Zhao, Ashish Sharma

Oct 29, 2025 · arXiv Preprint

ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation

ProMediate is the first framework designed to evaluate proactive AI mediator agents in complex, multi-topic, multi-party negotiation. It consists of a simulation testbed and a socio-cognitive evaluation framework with new metrics to measure consensus change, intervention latency, and mediator effectiveness. Results show a socially intelligent mediator increases consensus change and responds faster than a generic baseline.


Ziyi Liu, Priyanka Dey, Jen-tse Huang, Zhenyu Zhao, Bowen Jiang, Rahul Gupta, Yang Liu, Yao Du, Jieyu Zhao

Apr 2025 · arXiv Preprint

Can LLMs Grasp Implicit Cultural Values? Benchmarking LLMs' Cultural Intelligence with CQ-Bench

CQ-Bench introduces a benchmark for evaluating large language models' cultural intelligence by testing their ability to infer implicit cultural values from natural, multi-character conversations. Built from World Values Survey and GlobalOpinions data, CQ-Bench includes three tasks (attitude detection, value selection, and value extraction) and is generated via a rigorous validation pipeline achieving 94.5% human–model agreement. Results show that while frontier models approach human performance in value selection, they still struggle with nuanced attitude inference, and that targeted fine-tuning on small, culturally rich datasets can yield substantial gains.

Jingyuan Huang, Jen-tse Huang, Ziyi Liu, Xiaoyuan Liu, Wenxuan Wang, Jieyu Zhao

Jul 2025 · In the proceedings of ACL 2025

AI Sees Your Location, But With A Bias Toward The Wealthy World

Vision-Language Models (VLMs) demonstrate geographic recognition capabilities from images but exhibit significant regional biases: they perform better on developed, densely populated areas than on less developed, sparsely populated regions. This benchmark study also highlights the privacy concerns raised by such strong geographic inference performance.

Ziyi Liu, Abhishek Anand, Pei Zhou, Jen-tse Huang, Jieyu Zhao

Jun 2024 · In the proceedings of EMNLP 2024

InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context

We introduce InterIntent, a framework for evaluating large language models’ social intelligence by testing their ability to understand and manage intentions within interactive game settings. The paper proposes four dimensions of social intelligence—situational awareness, self-regulation, self-awareness, and theory of mind—each linked to specific game tasks such as intention selection, following, summarization, and guessing. Results show that models perform well on intention selection but lag behind humans in inference tasks, highlighting areas for improvement in assessing social reasoning in LLMs.

Ziyi Liu, Soumya Sanyal, Isabelle Lee, Yongkang Du, Rahul Gupta, Yang Liu, Jieyu Zhao

Nov 2024 · Findings of EMNLP 2024

Self-Contradictory Reasoning Evaluation and Detection

This work investigates self-contradictory reasoning in large language models (LLMs), where the model's internal reasoning fails to support its answers. The authors define and measure the Self-Contra rate across multiple datasets and identify finer-grained categories of contradiction. Results show that models often produce correct answers via reasoning shortcuts or by ignoring contextual evidence, compromising reliability. They further evaluate GPT-4's ability to detect self-contradictory reasoning and find that even with aided detection, performance (~52.2% F1) lags behind humans (~66.7% F1), underscoring limitations in current LLM reasoning robustness.