DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues

Kyochul Jang1, Donghyeon Lee3,4, Kyusik Kim2, Dongseok Heo1, Taewhoo Lee3,4, Woojeong Kim5, Bongwon Suh1,2
1IPAI, Seoul National University, 2Department of Intelligence and Information, Seoul National University
3Korea University, 4AIGEN Sciences, 5Cornell University
[Figure: DICE-BENCH banner]

DICE-BENCH is the first benchmark to evaluate LLMs' ability to use tools in realistic multi-round, multi-party group chat scenarios where function-related information is dispersed across multiple speakers.

Abstract

Existing function-calling benchmarks focus on single-turn interactions and overlook the complexity of real-world scenarios. To quantify how well existing benchmarks reflect practical applications, we introduce DICE-SCORE, a metric that measures how tool-related information, such as function names and parameter values, is dispersed throughout a dialogue.

Analyzing existing benchmarks with DICE-SCORE reveals notably low scores, highlighting the need for more realistic scenarios. To address this gap, we present DICE-BENCH, a framework that constructs practical function-calling datasets by synthesizing conversations through a tool graph, which maintains dependencies across rounds, and a multi-agent system with distinct personas, which enhances dialogue naturalness.

The final dataset comprises 1,607 high-DICE-SCORE instances. Our experiments on 19 LLMs with DICE-BENCH show that significant advances are still required before such models can be deployed effectively in real-world settings. Even GPT-4o achieves only 64% exact match accuracy on average, with performance degrading as the number of rounds or participants increases.

Data Generation Pipeline

DICE-BENCH employs a three-stage pipeline: (1) Tool Graph Construction to model inter-tool dependencies, (2) Scenario Configuration to set dialogue parameters and personas, and (3) Dialogue Simulation using multi-agent systems to generate realistic conversations.

[Figure: DICE-BENCH data generation pipeline]
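
As a rough illustration of stage (1), the sketch below builds a toy tool graph and samples a chain of dependent tools, one per round. The graph, the sampling procedure, and the tool names book_table and request_late_checkout are our own hypothetical stand-ins, not the benchmark's implementation; only find_restaurant and book_hotel come from the dataset example further down.

import random

# Toy tool graph for stage (1): an edge u -> v means tool v can consume
# an output of tool u. book_table and request_late_checkout are
# hypothetical names used only for this sketch.
TOOL_GRAPH = {
    "find_restaurant": ["book_table"],
    "book_hotel": ["request_late_checkout"],
    "book_table": [],
    "request_late_checkout": [],
}

def sample_tool_chain(graph, rounds, seed=None):
    """Pick one tool per round by walking the graph, so each round's
    call depends on the previous round's output."""
    rng = random.Random(seed)
    chain = [rng.choice(sorted(graph))]
    for _ in range(rounds - 1):
        successors = graph[chain[-1]]
        if not successors:  # dead end: nothing depends on the last tool
            break
        chain.append(rng.choice(successors))
    return chain

# Prints a chain of up to two dependent tools for a two-round dialogue.
print(sample_tool_chain(TOOL_GRAPH, rounds=2, seed=0))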

DICE-SCORE Metric

DICE-SCORE quantifies how challenging it is to perform a function call by measuring the dispersion of tool-related information across dialogue turns. Higher scores indicate greater difficulty as critical information becomes more scattered.

$$\text{DICE}(S,T)=\frac{\min(|S_{\neq 0}|,T)\cdot\sqrt{|S|\cdot T}}{\sum_{i\in S}\ln\left(1+\alpha S_i\right)}$$
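
Read directly off the formula, a minimal Python sketch might look as follows. The interpretation of the symbols is our assumption: S holds per-turn counts of tool-related information items, T is the number of items the target call requires, and α is a scaling constant whose default value is chosen arbitrarily here.

import math

def dice_score(S, T, alpha=1.0):
    """DICE(S, T) exactly as written above.

    Assumed reading: S[i] counts the tool-related information items in
    turn i, T is the number of items the target call needs, and alpha
    is a scaling constant (its value is assumed here).
    """
    nonzero = sum(1 for s in S if s != 0)              # |S_{!=0}|
    denom = sum(math.log(1 + alpha * s) for s in S)    # sum_i ln(1 + alpha * S_i)
    if denom == 0:
        return 0.0  # no tool-related information anywhere in the dialogue
    return min(nonzero, T) * math.sqrt(len(S) * T) / denom

# Four turns, three of which carry tool-related information:
print(dice_score([2, 0, 1, 1], T=4))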

Our experiments show a strong negative correlation (r ≈ -0.984) between DICE-SCORE and human performance on the task, validating its effectiveness as a difficulty metric: the higher the score, the harder the call is to reconstruct.

Rigorous Validation Pipeline

We employ a comprehensive three-stage filtering process to ensure high-quality data:

Stage 1: Automatic Evaluation

We use G-Eval with GPT-4o to assess six criteria: Coherence, Consistency, Fluency, Human-likeness, Persona Consistency, and Relevance. Dialogues with an average score below 4.0 are removed.

Stage 2: Rule-Based Filtering

We remove dialogues that contain GPT refusals or that never explicitly address the AI/Assistant. The authors manually review ambiguous cases.

Stage 3: Human Validation

Experts evaluate each dialogue against 15 sub-criteria covering Conversation Quality, Functional Integration, and Real-World Applicability. Only dialogues scoring at least 10 of 15 are retained.

This rigorous process filtered out 193 dialogues from the initial 1,800, resulting in 1,607 high-quality instances.
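
The sketch below condenses the three filters into code, assuming per-dialogue G-Eval and human scores are already available. The field names, refusal markers, and the substring check for AI/Assistant addressing are crude stand-ins for the actual rules; only the thresholds come from the description above.

# Field names, refusal markers, and the addressing check are assumed;
# the 4.0 and 10/15 thresholds are the ones stated above.
GEVAL_CRITERIA = ["coherence", "consistency", "fluency",
                  "human_likeness", "persona_consistency", "relevance"]
REFUSAL_MARKERS = ("I'm sorry", "I cannot", "As an AI")  # assumed markers

def passes_stage1(geval_scores):
    """Stage 1: keep dialogues whose average G-Eval score is >= 4.0."""
    avg = sum(geval_scores[c] for c in GEVAL_CRITERIA) / len(GEVAL_CRITERIA)
    return avg >= 4.0

def passes_stage2(turns):
    """Stage 2: drop refusals and dialogues that never address the AI."""
    text = " ".join(t["content"] for t in turns)
    if any(marker in text for marker in REFUSAL_MARKERS):
        return False
    return "AI" in text or "Assistant" in text  # rough proxy for addressing

def passes_stage3(human_score):
    """Stage 3: keep dialogues scoring at least 10 of the 15 sub-criteria."""
    return human_score >= 10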

Dataset Format

The JSON below illustrates one dialogue instance of a two-round, two-party scenario in DICE-BENCH.

{
  "diag_id": 1,
  "metadata": {
    "diag_id": 1,
    "user_personas": {
      "agent_a": "Marketing specialist with optimistic personality",
      "agent_b": "Skeptical engineer who prefers concise answers"
    },
    "functions": ["find_restaurant", "book_hotel", ...],
    "params_ret_val": [
      {
        "function": "find_restaurant",
        "parameters": {
          "location": "San Francisco",
          "cuisine": "Thai"
        },
        "domain": "Inquiry_and_Information_Seeking",
        "return_value": {
          "restaurant_name": "Thai Palace"
        },
        "returned_nl": "I found a great Thai place called Thai Palace in San Francisco!"
      }
    ],
    "category": "Inquiry_and_Information_Seeking",
    "task": "multi_round",
    "round_num": 2,
    "agent_num": 2
  },
  "conversation": [
    {"role": "user", "content": "Any good Thai food nearby?"},
    {"role": "assistant", "content": "Sure, let me look that up for you."},
    ...
  ]
}

See the dataset card for the complete schema description.
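
As a quick usage sketch, the snippet below reads one instance in this format and prints its gold function calls. The file name is hypothetical; the field names follow the example above.

import json

# The file name is hypothetical; the schema follows the example above.
with open("dice_bench_instance.json") as f:
    instance = json.load(f)

for call in instance["metadata"]["params_ret_val"]:
    args = ", ".join(f"{k}={v!r}" for k, v in call["parameters"].items())
    print(f'{call["function"]}({args})')
# -> find_restaurant(location='San Francisco', cuisine='Thai')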

BibTeX

@misc{jang2025dicebenchevaluatingtoolusecapabilities,
  title={DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues},
  author={Kyochul Jang and Donghyeon Lee and Kyusik Kim and Dongseok Heo and Taewhoo Lee and Woojeong Kim and Bongwon Suh},
  year={2025},
  eprint={2506.22853},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.22853},
}