JT-GUIAgent-V1: A Planner-Grounder Agent for Reliable GUI Interaction



Abstract

JT-GUIAgent-V1, developed by China Mobile's Jiutian, is a GUI agent built on a multimodal large language model and a two-phase collaborative framework. It splits the decision-making of an autonomous mobile agent into global planning (the Planner) and local grounding (the Grounder). Separating the choice of what to do from the question of where to act on screen keeps task execution systematic, which reduces confusion on complex tasks and improves overall accuracy.

The Planner and Grounder operate independently, so each module can be optimized and evaluated separately. This modular design lets the agent adapt quickly to new application scenarios or task types without extensive architectural changes, which makes the framework flexible and scalable.


Figure 1: JT-GUIAgent-V1 Workflow
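
The two-phase split can be illustrated with a minimal control loop. This is only a sketch under assumptions: the Step schema, the callable signatures, and the helper names (run_episode, get_screenshot, execute) are hypothetical and are not taken from the JT-GUIAgent-V1 code. It shows how a Planner that picks the next action and a Grounder that turns a target description into screen coordinates can sit behind separate interfaces.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One high-level step emitted by the Planner (hypothetical schema)."""
    action: str                   # e.g. "click", "type", "done"
    target: Optional[str] = None  # natural-language description of a UI element
    text: Optional[str] = None    # text to enter for "type" actions


# Hypothetical interfaces: each phase is a plain callable, so either one
# can be replaced or evaluated without touching the other.
Planner = Callable[[str, bytes, List[Step]], Step]  # (goal, screenshot, history) -> next step
Grounder = Callable[[bytes, str], Tuple[int, int]]  # (screenshot, target) -> (x, y)


def run_episode(goal, planner, grounder, get_screenshot, execute, max_steps=20):
    """Alternate planning and grounding until the Planner reports 'done'."""
    history: List[Step] = []
    for _ in range(max_steps):
        screenshot = get_screenshot()
        step = planner(goal, screenshot, history)   # phase 1: decide what to do
        if step.action == "done":
            break
        # Phase 2: locate the element only when the step refers to one.
        point = grounder(screenshot, step.target) if step.target else None
        execute(step, point)
        history.append(step)
    return history
```

Because the loop depends only on the two callables, a stub Grounder that returns fixed coordinates is enough to test a Planner in isolation, and the reverse holds as well.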

Key Components

Planner: High-Level Task Orchestration

The Planner leverages a structured prompt template that incorporates:

  • Action space selection
  • Task decomposition logic
  • Execution guidelines
  • Chain-of-thought reasoning

This design significantly enhances the multi-step planning capability of the general-purpose multimodal large model in complex environments.
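
As an illustration of what such a structured template might look like, the sketch below assembles the four listed components into a single prompt. The section titles, wording, and the build_planner_prompt helper are assumptions made for this example. The actual JT-GUIAgent-V1 prompt is not reproduced here.

```python
from string import Template

# Hypothetical template: one section per component listed above.
PLANNER_PROMPT = Template("""\
You are the planning module of a mobile GUI agent.

## Action space
$action_space

## Task decomposition
Break the user goal into ordered sub-goals and state which one you are on.

## Execution guidelines
$guidelines

## Reasoning
Think step by step about the current screen before choosing one action.

## User goal
$goal
""")


def build_planner_prompt(goal, actions, guidelines):
    """Fill the template with a goal, the allowed actions, and guidelines."""
    return PLANNER_PROMPT.substitute(
        action_space="\n".join(f"- {a}" for a in actions),
        guidelines="\n".join(f"- {g}" for g in guidelines),
        goal=goal,
    )
```

Keeping the action space and guidelines as parameters means the same template can be reused when a new app or task type needs a different set of actions.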

Grounder: Precise Element Localization

The Grounder employs an enhanced training strategy, utilizing:

  • Extensive open-source datasets of screenshots and UI element icons
  • Custom synthetic data simulating Chinese app interfaces for improved localization

Fine-tuned from a general-purpose multimodal large model, the Grounder achieves the following results on the ScreenSpot-V2 benchmark:

  • 97.2% accuracy in mobile text component recognition
  • 82.5% accuracy in mobile icon component recognition
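
Grounding benchmarks of this kind are commonly scored by checking whether the predicted click point falls inside the target element's bounding box. Assuming that convention applies to the figures above, the metric can be sketched as follows. The function names and the (left, top, right, bottom) box layout are choices made for this example.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Iterable[Point], boxes: Iterable[Box]) -> float:
    """Fraction of predicted points that land inside their target boxes."""
    pairs = list(zip(predictions, boxes))
    if not pairs:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in pairs)
    return hits / len(pairs)
```

Computing the score separately for text and icon targets would give per-category numbers like the two reported above.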

Performance & Real-World Applications

  • 60% end-to-end task success rate in the AndroidWorld simulation environment
  • Reliable execution of real-world tasks (e.g., video playing, flight booking, hotel reservation, train ticket purchase) in Chinese apps

Case Study

AndroidWorld

[easy] AudioRecorderRecordAudio

User Goal: Record an audio clip using Audio Recorder app and save it.

[medium] SimpleSmsSend

User Goal: Send a text message using Simple SMS Messenger to +15132327939 with message: The night is dark and full of terrors.

[hard] SimpleCalendarAddOneEvent

User Goal: In Simple Calendar Pro, create a calendar event on 2023-10-17 at 3h with the title 'Catch up on Annual Report' and the description 'We will prepare for software updates. Looking forward to productive discussions.'. The event should last for 60 mins.

Chinese Apps

VideoPlay

User Goal: Open Bilibili and play episode 2 of "Solitary Gourmet Season 10" (《孤独的美食家第10季》).

TicketOrder

User Goal: Open 12306 and buy a ticket for the earliest train from Beijing to Xi'an on May 20. Other requirements: second-class seat, window seat.

Current Limitations of the Two-Phase Framework

While the two-phase framework demonstrates strengths in hierarchical task decomposition, its real-time execution faces latency challenges.

To address these limitations, we propose:

  1. Inference Acceleration - Leverage parallel computing, model pruning, and quantization techniques
  2. Lightweight Model Development - Optimize hyperparameters and integrate advanced algorithms
  3. Task Scheduling Enhancement - Implement efficient planning algorithms and dynamic resource allocation
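
Before applying any of these measures, it can help to know which phase dominates the per-step latency. The helper below is a small, hypothetical timing utility that is not part of JT-GUIAgent-V1. It accumulates wall-clock time per named phase so that Planner and Grounder calls can be compared.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall-clock time per named phase (e.g. 'planner', 'grounder')."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def measure(self, phase):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped call raises.
            self.totals[phase] += time.perf_counter() - start
            self.counts[phase] += 1

    def mean_seconds(self, phase):
        """Average duration of one call in the given phase, or 0.0 if unseen."""
        n = self.counts[phase]
        return self.totals[phase] / n if n else 0.0
```

Wrapping each model call in `with timer.measure("planner"):` or `with timer.measure("grounder"):` gives a per-phase average that indicates where acceleration effort is likely to pay off first.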