Drag by Description

POST
/tools-api/mouse/drag-by-description

Description

Uses AI vision to locate UI elements from natural language descriptions and performs a drag-and-drop operation from one element to another. This endpoint leverages computer vision AI to identify interface elements from descriptive text, allowing more intuitive interaction with the computer.

Token Usage: Each invocation of this endpoint consumes 50-100 Smooth Operator API tokens, depending on the complexity of identifying both the source and target elements. Complex or ambiguous descriptions may require more tokens.

Request Format

The request must include descriptions of both the source element to drag and the target destination.

{
    "startElementDescription": "string",  // Description of the UI element to drag from
    "endElementDescription": "string",    // Description of the destination element or location
    "imageBase64": "string",              // Optional base64-encoded screenshot image
    "mechanism": "string"                 // Optional mechanism to use for vision detection
}

Parameters

startElementDescription (string, required)
    A natural language description of the UI element to drag from. For example, "the file icon", "the image", "the blue folder".

endElementDescription (string, required)
    A natural language description of the destination element or location. For example, "the folder icon", "the trash can", "the document area".

imageBase64 (string, optional)
    Base64-encoded screenshot image. If not provided, the system will automatically take a screenshot.

mechanism (string, optional)
    Mechanism to use for vision detection. Currently, only "screengrasp2" is used regardless of the value provided in this parameter.
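Since the last two fields are optional, a request body can omit them entirely. As a minimal sketch (assuming the server treats an omitted optional field the same as one sent as null), a helper that builds the JSON body:

```python
import json

def build_drag_request(start_desc, end_desc, image_base64=None, mechanism=None):
    """Build the JSON body for /tools-api/mouse/drag-by-description.

    Optional fields that were not provided are dropped rather than
    sent as null.
    """
    payload = {
        "startElementDescription": start_desc,
        "endElementDescription": end_desc,
        "imageBase64": image_base64,
        "mechanism": mechanism,
    }
    # Keep only the fields the caller actually supplied
    payload = {key: value for key, value in payload.items() if value is not None}
    return json.dumps(payload)
```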

Response Format

The response indicates whether the drag operation was successful and provides details about the operation. If the operation fails, the message will indicate which element was not found.

Note: The drag operation does not animate the mouse movement; it simply teleports the mouse to the start point, clicks, and then teleports to the end point before releasing.

{
    "success": boolean,    // Whether the operation was successful
    "message": "string",   // Result message or error description
    "timestamp": "string"  // ISO timestamp of the operation
}

Response Fields

success (boolean)
    Indicates whether the operation was successful.

message (string)
    A message describing the result of the operation. On success, it reports the coordinates of the drag operation; on failure, it indicates which element (start or end) could not be found.

timestamp (string)
    ISO 8601 timestamp indicating when the operation was performed.

Example Response

{
    "success": true,
    "message": "Dragged from (450, 300) to (700, 450)",
    "timestamp": "2023-09-15T14:32:10.123Z"
}
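The coordinates embedded in a success message can be recovered with a small parser. This is a sketch that assumes the message follows the "Dragged from (x, y) to (x, y)" pattern shown above; the exact wording is not guaranteed by the API, so treat a non-matching message as unparseable:

```python
import re

def parse_drag_message(message):
    """Extract ((x1, y1), (x2, y2)) from a success message such as
    'Dragged from (450, 300) to (700, 450)'.

    Returns None when the message does not match the assumed pattern.
    """
    match = re.search(r"from \((\d+),\s*(\d+)\) to \((\d+),\s*(\d+)\)", message)
    if not match:
        return None
    x1, y1, x2, y2 = map(int, match.groups())
    return (x1, y1), (x2, y2)
```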
            

Code Examples

import requests
import json
import base64
from PIL import ImageGrab
import io

def drag_by_description(start_element_description, end_element_description, image_base64=None, mechanism=None):
    """
    Performs a drag operation from one element to another using AI vision.
    
    Args:
        start_element_description (str): Description of the UI element to drag from
        end_element_description (str): Description of the destination element or location
        image_base64 (str, optional): Base64-encoded screenshot image. If None, takes a screenshot
        mechanism (str, optional): Mechanism to use for vision detection
        
    Returns:
        dict: Response from the server
    """
    # Base URL of the Tools Server
    base_url = "http://localhost:54321"
    
    # Endpoint
    endpoint = "/tools-api/mouse/drag-by-description"
    
    # If no image is provided, take a screenshot
    if image_base64 is None:
        # Capture the screen
        screenshot = ImageGrab.grab()
        
        # Convert to base64
        buffer = io.BytesIO()
        screenshot.save(buffer, format="PNG")
        image_base64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    
    # Prepare the request
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY"
    }
    
    payload = {
        "startElementDescription": start_element_description,
        "endElementDescription": end_element_description,
        "imageBase64": image_base64,
        "mechanism": mechanism
    }
    # Omit optional fields that were not provided
    payload = {key: value for key, value in payload.items() if value is not None}
    
    # Send the request
    response = requests.post(
        base_url + endpoint,
        headers=headers,
        data=json.dumps(payload)
    )
    
    # Raise on HTTP errors, then return the parsed JSON body
    response.raise_for_status()
    return response.json()

# Example usage
result = drag_by_description(
    "the file icon", 
    "the folder icon",
    mechanism="screengrasp2"
)

print(f"Success: {result['success']}")
print(f"Message: {result['message']}")
print(f"Timestamp: {result['timestamp']}")
import axios from 'axios';

/**
 * Performs a drag operation from one element to another using AI vision
 * 
 * @param startElementDescription - Description of the UI element to drag from
 * @param endElementDescription - Description of the destination element or location
 * @param imageBase64 - Optional base64-encoded screenshot image
 * @param mechanism - Optional mechanism to use for vision detection
 * @returns Promise with the response data
 */
async function dragByDescription(
    startElementDescription: string,
    endElementDescription: string,
    imageBase64?: string,
    mechanism?: string
): Promise<{
    success: boolean;
    message: string;
    timestamp: string;
}> {
    // Base URL of the Tools Server
    const baseUrl = 'http://localhost:54321';
    
    // Endpoint
    const endpoint = '/tools-api/mouse/drag-by-description';
    
    // Request headers
    const headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer YOUR_API_KEY'
    };
    
    // Request payload
    const payload = {
        startElementDescription,
        endElementDescription,
        imageBase64, // If undefined, server will take a screenshot
        mechanism
    };
    
    try {
        // Send the request
        const response = await axios.post(
            baseUrl + endpoint,
            payload,
            { headers }
        );
        
        // Return the response data
        return response.data;
    } catch (error) {
        // Handle any errors
        console.error('Error performing drag operation:', error);
        throw error;
    }
}

// Example usage
async function example() {
    try {
        const result = await dragByDescription(
            'the file icon',
            'the folder icon',
            undefined,
            'screengrasp2'
        );
        
        console.log(`Success: ${result.success}`);
        console.log(`Message: ${result.message}`);
        console.log(`Timestamp: ${result.timestamp}`);
    } catch (error) {
        console.error('Failed to perform drag operation:', error);
    }
}

example();
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public class ToolsServerClient
{
    private readonly HttpClient _httpClient;
    private readonly string _baseUrl;
    private readonly string _apiKey;
    
    public ToolsServerClient(string baseUrl = "http://localhost:54321", string apiKey = "YOUR_API_KEY")
    {
        _baseUrl = baseUrl;
        _apiKey = apiKey;
        
        _httpClient = new HttpClient();
        _httpClient.DefaultRequestHeaders.Add("Authorization", $"Bearer {_apiKey}");
    }
    
    /// <summary>
    /// Performs a drag operation from one element to another using AI vision
    /// </summary>
    /// <param name="startElementDescription">Description of the UI element to drag from</param>
    /// <param name="endElementDescription">Description of the destination element or location</param>
    /// <param name="imageBase64">Optional base64-encoded screenshot image</param>
    /// <param name="mechanism">Optional mechanism to use for vision detection</param>
    /// <returns>Action response indicating success or failure</returns>
    public async Task<ActionResponse> DragByDescriptionAsync(
        string startElementDescription,
        string endElementDescription,
        string imageBase64 = null,
        string mechanism = null)
    {
        // Endpoint
        var endpoint = "/tools-api/mouse/drag-by-description";
        
        // Create request object
        var request = new DragByDescriptionRequest
        {
            StartElementDescription = startElementDescription,
            EndElementDescription = endElementDescription,
            ImageBase64 = imageBase64, // If null, server will take a screenshot
            Mechanism = mechanism
        };
        
        // Serialize with camelCase property names so the JSON field names
        // match the API ("startElementDescription", not "StartElementDescription")
        var jsonOptions = new JsonSerializerOptions
        {
            PropertyNamingPolicy = JsonNamingPolicy.CamelCase
        };
        var jsonContent = JsonSerializer.Serialize(request, jsonOptions);
        var content = new StringContent(jsonContent, Encoding.UTF8, "application/json");
        
        // Send the request
        var response = await _httpClient.PostAsync(_baseUrl + endpoint, content);
        
        // Ensure success status code
        response.EnsureSuccessStatusCode();
        
        // Deserialize the camelCase JSON response into the PascalCase model
        var jsonResponse = await response.Content.ReadAsStringAsync();
        return JsonSerializer.Deserialize<ActionResponse>(jsonResponse, jsonOptions);
    }
    }
}

// Request and response models
public class DragByDescriptionRequest
{
    public string StartElementDescription { get; set; }
    public string EndElementDescription { get; set; }
    public string ImageBase64 { get; set; }
    public string Mechanism { get; set; }
}

public class ActionResponse
{
    public bool Success { get; set; }
    public string Message { get; set; }
    public DateTime Timestamp { get; set; }
}

// Example usage
class Program
{
    static async Task Main()
    {
        var client = new ToolsServerClient();
        
        try
        {
            var result = await client.DragByDescriptionAsync(
                "the file icon",
                "the folder icon",
                mechanism: "screengrasp2"
            );
            
            Console.WriteLine($"Success: {result.Success}");
            Console.WriteLine($"Message: {result.Message}");
            Console.WriteLine($"Timestamp: {result.Timestamp}");
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}

Usage Notes

  • Provide clear, specific descriptions of both the source and target elements for best results
  • Describe visual characteristics like color, size, position, or text content to help the AI identify elements accurately
  • If the operation fails, try providing more detailed descriptions or using the coordinate-based endpoint /tools-api/mouse/drag instead
  • For dragging files, ensure both the file and the destination folder are clearly visible on screen
  • The system may take a moment to process the request as it analyzes the screen with computer vision
  • If a specific element is not found, the error message will indicate whether it was the start or end element that could not be located
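The last note above can be turned into a small triage helper for deciding which description to refine before retrying. The exact failure wording is not specified by this documentation, so the keyword matching below ("start"/"end" in the message text) is an assumption based on the statement that the message names the missing element:

```python
def failed_element(response):
    """Given a parsed response dict, guess which element was not found.

    Returns None on success, "start" or "end" when the (assumed) message
    wording names the missing element, and "unknown" otherwise.
    """
    if response.get("success"):
        return None
    message = response.get("message", "").lower()
    if "start" in message:
        return "start"
    if "end" in message:
        return "end"
    return "unknown"
```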

Examples

Example 1: Dragging a file to a folder

{
  "startElementDescription": "the Excel spreadsheet file on the desktop",
  "endElementDescription": "the Documents folder"
}

Example 2: Dragging an image in a design application

{
  "startElementDescription": "the blue circle in the top-left corner of the canvas",
  "endElementDescription": "the center of the design area"
}

Example 3: Using with a screenshot

{
  "startElementDescription": "the red button labeled 'Delete'",
  "endElementDescription": "the trash bin icon at the bottom right",
  "imageBase64": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."
}
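Example 3 shows the imageBase64 value with a "data:image/png;base64," prefix, while the Python example above sends raw base64. A helper for producing the prefixed form shown here (whether the server accepts both forms equally is an assumption worth verifying against your Tools Server version):

```python
import base64

def image_file_to_data_uri(path):
    """Read an image file and return it as a data URI string in the
    form used by Example 3 above."""
    with open(path, "rb") as f:
        raw = base64.b64encode(f.read()).decode("ascii")
    return "data:image/png;base64," + raw
```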