Key Components
Planner: High-Level Task Orchestration
The Planner leverages a structured prompt template that incorporates:
- Action space selection
- Task decomposition logic
- Execution guidelines
- Chain-of-thought reasoning
This design improves the multi-step planning ability of the underlying general-purpose multimodal large model in complex environments.
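A minimal sketch of how the four template parts listed above could be assembled into a single prompt. The class name `PlannerPrompt`, its fields, the action names, and the exact wording are all hypothetical and only illustrate the structure; the project's actual template is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class PlannerPrompt:
    """Hypothetical container for the template parts (not the project's real class)."""
    goal: str
    actions: list[str]                                    # action space selection
    guidelines: list[str] = field(default_factory=list)   # execution guidelines

    def render(self) -> str:
        # Render each list as a bulleted block, then join the four sections.
        action_lines = "\n".join(f"- {a}" for a in self.actions)
        guideline_lines = "\n".join(f"- {g}" for g in self.guidelines)
        return (
            f"Goal: {self.goal}\n\n"
            f"Available actions:\n{action_lines}\n\n"
            "Task decomposition: split the goal into ordered sub-tasks.\n\n"
            f"Execution guidelines:\n{guideline_lines}\n\n"
            # Chain-of-thought cue placed last so reasoning precedes the action.
            "Think step by step, then output the next action."
        )


prompt = PlannerPrompt(
    goal="Record an audio clip using Audio Recorder app and save it.",
    actions=["click", "long_press", "type", "scroll", "done"],  # assumed action names
    guidelines=["Perform one action per step."],
)
text = prompt.render()
print(text)
```

Keeping each part as a separate field makes it easy to swap the action space per device or app without touching the rest of the template.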
Grounder: Precise Element Localization
The Grounder employs an enhanced training strategy, utilizing:
- Extensive open-source datasets of screenshots and UI element icons
- Custom synthetic data simulating Chinese app interfaces for improved localization
Fine-tuned from a general-purpose multimodal large model, the Grounder achieves the following on the ScreenSpot-V2 benchmark:
- 97.2% accuracy in mobile text component recognition
- 82.5% accuracy in mobile icon component recognition
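For readers unfamiliar with grounding metrics, the sketch below shows one common way such an accuracy is computed: a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box of the target element. This is an assumption about the scoring rule, not a statement of how the numbers above were produced, and the function names are illustrative.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(predictions: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted click points that land in their target box."""
    if len(predictions) != len(boxes):
        raise ValueError("predictions and boxes must have the same length")
    if not boxes:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, boxes))
    return hits / len(boxes)


# Two toy samples: the first click hits its target, the second misses.
accuracy = grounding_accuracy(
    [(5.0, 5.0), (50.0, 50.0)],
    [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)],
)
```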
Performance & Real-World Applications
- 60% end-to-end task success rate in the AndroidWorld simulation environment
- Reliable execution of real-world tasks (e.g., video playing, flight booking, hotel reservation, train ticket purchase) in Chinese apps
Case Study
AndroidWorld
[easy] AudioRecorderRecordAudio
User Goal: Record an audio clip using Audio Recorder app and save it.
[medium] SimpleSmsSend
User Goal: Send a text message using Simple SMS Messenger to +15132327939 with message: The night is dark and full of terrors.
[hard] SimpleCalendarAddOneEvent
User Goal: In Simple Calendar Pro, create a calendar event on 2023-10-17 at 3h with the title 'Catch up on Annual Report' and the description 'We will prepare for software updates. Looking forward to productive discussions.'. The event should last for 60 mins.
Chinese App
VideoPlay
User Goal: Open Bilibili (b站) and play episode 2 of "Solitary Gourmet" (孤独的美食家) Season 10.
TicketOrder
User Goal: Open 12306 and buy a ticket for the earliest train from Beijing to Xi'an on May 20. Other requirements: second-class seat, window seat.
Current Limitations of the Two-Phase Framework
While the two-phase framework demonstrates strengths in hierarchical task decomposition, its real-time execution faces latency challenges.
To address these limitations, we propose:
- Inference Acceleration: leverage parallel computing, model pruning, and quantization techniques
- Lightweight Model Development: optimize hyperparameters and integrate advanced algorithms
- Task Scheduling Enhancement: implement efficient planning algorithms and dynamic resource allocation
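To see where the latency comes from, it helps to time each phase separately. The sketch below runs one planner call followed by one grounder call and records how long each takes. The `planner` and `grounder` callables are hypothetical stand-ins for the two models, and the stub return values are made up for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_step(
    goal: str,
    screenshot: bytes,
    planner: Callable[[str, bytes], str],
    grounder: Callable[[str, bytes], Tuple[int, int]],
) -> Dict[str, Any]:
    """One two-phase step: plan an instruction, then ground it to a screen point."""
    instruction, plan_s = timed(planner, goal, screenshot)
    point, ground_s = timed(grounder, instruction, screenshot)
    return {
        "instruction": instruction,
        "point": point,
        "plan_s": plan_s,      # time spent in the planning phase
        "ground_s": ground_s,  # time spent in the grounding phase
    }


# Stub models standing in for the real Planner and Grounder (assumptions).
def stub_planner(goal: str, screenshot: bytes) -> str:
    return "tap the record button"


def stub_grounder(instruction: str, screenshot: bytes) -> Tuple[int, int]:
    return (540, 1200)


result = run_step("Record an audio clip", b"", stub_planner, stub_grounder)
```

Because the two calls are sequential, the per-step latency is roughly the sum of both phases, which is why accelerating either model shortens every step of a task.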