JT-GUIAgent-V1: A Planner-Grounder Agent for Reliable GUI Interaction

Planner: High-Level Task Orchestration

The Planner leverages a structured prompt template that incorporates:

Action space selection
Task decomposition logic
Execution guidelines
Chain-of-thought reasoning

This design improves the multi-step planning of the underlying general-purpose multimodal large model in complex environments.
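To make the four components listed above concrete, here is a minimal sketch of how such a structured prompt could be assembled. It is not the project's actual template: the `PlannerPrompt` class, its field names, the action names, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlannerPrompt:
    """Hypothetical container for the four prompt components."""
    action_space: List[str]          # actions the Planner may emit
    decomposition_rules: List[str]   # how to split a goal into sub-tasks
    execution_guidelines: List[str]  # constraints to follow while acting
    chain_of_thought: bool = True    # ask the model to reason before acting

    def render(self, goal: str, history: List[str]) -> str:
        def bullets(items: List[str]) -> str:
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        sections = [
            f"Goal:\n{goal}",
            f"Available actions:\n{bullets(self.action_space)}",
            f"Task decomposition:\n{bullets(self.decomposition_rules)}",
            f"Execution guidelines:\n{bullets(self.execution_guidelines)}",
            f"Previous steps:\n{bullets(history)}",
        ]
        if self.chain_of_thought:
            sections.append(
                "First reason step by step about the current screen, "
                "then output exactly one action."
            )
        return "\n\n".join(sections)


# Assumed example values; the real action space is not given in this document.
prompt = PlannerPrompt(
    action_space=["click(target)", "type(text)", "scroll(direction)", "done()"],
    decomposition_rules=["Split the goal into ordered sub-tasks."],
    execution_guidelines=["Emit one action per step."],
).render(
    goal="Record an audio clip using Audio Recorder app and save it.",
    history=[],
)
print(prompt)
```

Keeping each component in its own field makes it easy to swap the action space per platform without touching the reasoning instructions.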

Grounder: Precise Element Localization

The Grounder employs an enhanced training strategy, utilizing:

Extensive open-source datasets of screenshots and UI element icons
Custom synthetic data simulating Chinese app interfaces for improved localization

Fine-tuned from a general-purpose multimodal large model, the Grounder achieves the following on the ScreenSpot-V2 benchmark:

97.2% accuracy in mobile text component recognition
82.5% accuracy in mobile icon component recognition
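For readers unfamiliar with how grounding accuracy is typically scored, the sketch below assumes the click-in-box criterion commonly used by ScreenSpot-style benchmarks: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The function names and the `(left, top, right, bottom)` box format are assumptions for illustration, not this project's evaluation code.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed format: (left, top, right, bottom)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True when the point lies inside the box (edges included)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted points that land inside their target box."""
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not boxes:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return hits / len(boxes)


# Toy data: the first and third predictions hit their boxes, the second misses.
predictions = [(50, 50), (10, 10), (200, 120)]
boxes = [(0, 0, 100, 100), (20, 20, 40, 40), (150, 100, 250, 140)]
accuracy = grounding_accuracy(predictions, boxes)
```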

Performance & Real-World Applications

60% end-to-end task success rate in the AndroidWorld simulation environment
Reliable execution of real-world tasks (e.g., video playing, flight booking, hotel reservation, train ticket purchase) in Chinese apps

AndroidWorld

[easy] AudioRecorderRecordAudio

User Goal: Record an audio clip using Audio Recorder app and save it.

[medium] SimpleSmsSend

User Goal: Send a text message using Simple SMS Messenger to +15132327939 with message: The night is dark and full of terrors.

[hard] SimpleCalendarAddOneEvent

User Goal: In Simple Calendar Pro, create a calendar event on 2023-10-17 at 3h with the title 'Catch up on Annual Report' and the description 'We will prepare for software updates. Looking forward to productive discussions.'. The event should last for 60 mins.

Chinese App

VideoPlay

User Goal: Open Bilibili and play episode 2 of "Solitary Gourmet" (孤独的美食家), Season 10.

TicketOrder

User Goal: Open 12306 and buy a ticket for the earliest train from Beijing to Xi'an on May 20. Other requirements: second-class seat, window seat.

Current Limitations of the Two-Phase Framework

While the two-phase framework demonstrates strengths in hierarchical task decomposition, its real-time execution faces latency challenges.
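One way to see where the latency comes from is to time each phase separately. The sketch below assumes a loop in which every step calls the Planner and then, for actions that target a UI element, the Grounder. The `run_episode` function, its callable parameters, and the action dictionary format are hypothetical and are not the project's actual interface.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

Action = Dict[str, Any]


def run_episode(
    goal: str,
    plan_step: Callable[[str, Any, List[Action]], Action],  # Planner call (assumed)
    ground: Callable[[Any, str], Tuple[int, int]],          # Grounder call (assumed)
    get_screenshot: Callable[[], Any],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> Tuple[List[Action], Dict[str, float]]:
    """Run a plan-then-ground loop and accumulate time spent in each phase."""
    timings = {"planner": 0.0, "grounder": 0.0}
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = get_screenshot()

        # Phase 1: the Planner picks the next high-level action.
        start = time.perf_counter()
        action = plan_step(goal, screenshot, history)
        timings["planner"] += time.perf_counter() - start
        if action["type"] == "done":
            break

        # Phase 2: the Grounder resolves a described element to coordinates.
        if "target" in action:
            start = time.perf_counter()
            action["point"] = ground(screenshot, action["target"])
            timings["grounder"] += time.perf_counter() - start

        execute(action)
        history.append(action)
    return history, timings
```

Because the two phases run sequentially within a step, their times add up, so per-phase totals indicate which model is the better target for acceleration.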

To address these limitations, we propose:

Inference Acceleration - Leverage parallel computing, model pruning, and quantization techniques
Lightweight Model Development - Optimize hyperparameters and integrate advanced algorithms
Task Scheduling Enhancement - Implement efficient planning algorithms and dynamic resource allocation
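As an illustration of the quantization idea mentioned under Inference Acceleration, the following is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is a generic textbook scheme chosen for clarity, not the method the authors plan to use, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Fall back to a scale of 1.0 when all weights are zero or the list is empty.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
# Each restored value differs from the original by at most half a scale step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing 8-bit integers plus one scale per tensor reduces memory traffic relative to 32-bit floats, at the cost of the bounded rounding error computed above.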