Mouse (AI Vision) API Endpoints

Overview

The Mouse (AI Vision) API category provides AI-powered endpoints that control the mouse by description rather than by exact coordinates: you perform mouse operations (clicking, double-clicking, right-clicking, and dragging) by describing the visual appearance of the target UI element.

This approach is particularly useful when you don't know the exact coordinates of elements, or when elements might appear at different positions depending on the state of the application or screen. The AI vision system analyzes a screenshot and determines the most likely location of the described element.

Note: Each AI vision operation consumes 50-100 Smooth Operator API tokens, depending on the complexity of identifying the described element. More complex or ambiguous element descriptions may require more tokens.

If you already know the exact coordinates of elements, consider the Mouse (Coordinates) category instead; those operations are faster because no AI vision analysis step is needed.

Available Endpoints

Click Element by Description

POST /tools-api/mouse/click-by-description

Uses AI vision to find and click a UI element based on its description.

View Details
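As a rough sketch, a request to this endpoint can be issued with Python's standard library. The base URL, port, and the `description` field name are assumptions for illustration; check the endpoint's View Details page for the actual request schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:54321"  # hypothetical server address and port


def build_click_request(description: str) -> dict:
    """Build the JSON body; the "description" field name is an assumption."""
    return {"description": description}


def click_by_description(description: str) -> dict:
    """POST the request and return the parsed JSON response."""
    body = json.dumps(build_click_request(description)).encode("utf-8")
    req = urllib.request.Request(
        BASE_URL + "/tools-api/mouse/click-by-description",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# A specific description ("the red X button in the top-right corner of the
# dialog") resolves more reliably than a generic one ("the button"):
# result = click_by_description("the red X button in the top-right corner of the dialog")
```

The double-click and right-click endpoints below accept the same kind of description-based body, so the same pattern applies with a different URL path.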

Double Click Element by Description

POST /tools-api/mouse/doubleclick-by-description

Uses AI vision to find and double-click a UI element based on its description.

View Details

Right Click Element by Description

POST /tools-api/mouse/rightclick-by-description

Uses AI vision to find and right-click a UI element based on its description.

View Details

Drag Element by Description

POST /tools-api/mouse/drag-by-description

Uses AI vision to find a UI element and drag it to a target location.

View Details
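A drag needs both a source and a target. The sketch below assumes the body carries two description fields (`startElementDescription` and `endElementDescription`); both names are assumptions, so verify them against the View Details page before relying on them.

```python
def build_drag_request(start_description: str, end_description: str) -> dict:
    """Build a drag-by-description body.

    Both field names below are assumed, not taken from the official schema:
    one description locates the element to pick up, the other the drop target.
    """
    return {
        "startElementDescription": start_description,
        "endElementDescription": end_description,
    }


# e.g. drag a file onto a folder, with both endpoints located by AI vision:
body = build_drag_request(
    "the report.pdf file icon on the desktop",
    "the Archive folder in the left sidebar",
)
```

Since the AI vision system must resolve two descriptions here, expect drag operations to sit at the higher end of the token range noted above.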

Usage Notes

When using these endpoints, the quality of the element description is crucial for accurate results. Descriptive, unique identifying features work best. For example, "the red X button in the top-right corner of the dialog" is better than just "the button".

Each endpoint can optionally accept a base64-encoded screenshot in the request. If not provided, the system will automatically take a screenshot before processing. Providing your own screenshot can be useful if you want to ensure the system operates on a specific state of the screen.
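Attaching your own screenshot is a matter of base64-encoding the image bytes into the request body. In this sketch the `imageBase64` field name is an assumption, and the placeholder bytes stand in for a real PNG capture (how you capture the screen is platform-specific and up to you):

```python
import base64


def attach_screenshot(payload: dict, screenshot_png: bytes) -> dict:
    """Add a base64-encoded screenshot to a request body.

    The "imageBase64" field name is an assumption; check the endpoint's
    View Details page for the actual schema.
    """
    payload = dict(payload)  # copy so the caller's dict is untouched
    payload["imageBase64"] = base64.b64encode(screenshot_png).decode("ascii")
    return payload


# Placeholder bytes standing in for a real PNG capture:
body = attach_screenshot({"description": "the Save button"}, b"\x89PNG...")
```

Passing a screenshot you captured yourself pins the operation to that exact screen state, which is useful when the UI may change between your capture and the API call.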

These operations are powered by an AI vision model specifically trained to understand desktop user interfaces. While the model is highly accurate, it may occasionally misidentify elements, especially if the description is ambiguous or if there are multiple similar-looking elements on the screen.