Building Agents with Vision in Astreus
Create agents that can see and understand images. Analyze screenshots, diagrams, photos, and visual content with multimodal AI capabilities.
Modern AI agents can process images just as naturally as they process text. Astreus makes it simple to build vision-enabled agents that analyze visual content, from UI screenshots to data visualizations.
Getting Started
First, install Astreus and set up your environment. You'll need an OpenAI API key with access to vision-capable models.
```bash
npm install @astreus-ai/astreus
```
Create a .env file with your configuration:
```
OPENAI_API_KEY=sk-your-openai-api-key-here
DB_URL=sqlite://./astreus.db
```
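Before creating an agent, it can help to fail fast when a setting is missing. The sketch below is one way to do that, assuming the two variables shown above are the ones your setup needs. `missingEnvVars` is a hypothetical helper, not part of Astreus.

```javascript
// Hypothetical helper (not an Astreus API): returns the names of required
// settings that are absent or blank in the given environment object.
function missingEnvVars(env, required = ['OPENAI_API_KEY', 'DB_URL']) {
  return required.filter((name) => !env[name] || env[name].trim() === '');
}

// Usage, once at startup after your environment is loaded:
// const missing = missingEnvVars(process.env);
// if (missing.length > 0) {
//   throw new Error(`Missing configuration: ${missing.join(', ')}`);
// }
```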
Creating a Vision Agent
Enable vision capabilities by setting the vision flag and specifying a vision-capable model. The gpt-4o model provides strong multimodal performance.
```javascript
import { Agent } from '@astreus-ai/astreus';

const agent = await Agent.create({
  name: 'VisionBot',
  model: 'gpt-4o',
  visionModel: 'gpt-4o',
  vision: true,
  systemPrompt: 'You can analyze and describe images in detail.'
});
```
The visionModel parameter specifies which model handles image processing. Using the same model for both text and vision keeps behavior consistent across the two kinds of input.
Analyzing Images
Pass images to your agent using the attachments parameter. The agent processes both the text prompt and the image together.
The agent examines the image and provides a detailed description. You can ask specific questions to focus the analysis on particular aspects.
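The call itself can be sketched as follows. This is a minimal sketch, assuming the same `agent.ask(prompt, { attachments })` shape used in the chart and comparison examples in this post. `describeImage` is a hypothetical wrapper, and the file paths are placeholders.

```javascript
// Sketch only: `agent` is a vision-enabled agent created as shown earlier.
// describeImage is a hypothetical wrapper, not an Astreus function.
async function describeImage(agent, imagePath, question = 'Describe this image in detail.') {
  // The prompt and the image travel together in one request.
  return agent.ask(question, {
    attachments: [{ type: 'image', path: imagePath }]
  });
}

// Usage (requires a real agent and an API key):
// const description = await describeImage(agent, './screenshot.png');
// const focused = await describeImage(agent, './screenshot.png', 'What text appears in the header?');
```

Passing a narrower question as the third argument is how you would focus the analysis on one aspect of the image.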
UI Design Review
Vision agents excel at evaluating interface designs. They spot inconsistencies, accessibility issues, and usability problems in mockups and screenshots.
The system prompt guides how the agent approaches visual analysis. Tailor it to your specific use case for more relevant feedback.
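As an illustration, a design-review setup might pair a review-oriented system prompt with a focused question. The prompt wording, the `reviewDesign` helper, and the file path below are assumptions made for this sketch; only the `ask` call shape comes from the other examples in this post.

```javascript
// Illustrative system prompt for a design reviewer. Pass it as `systemPrompt`
// when creating the agent, alongside `vision: true`.
const DESIGN_REVIEW_PROMPT =
  'You are a UI/UX reviewer. Evaluate layouts for visual consistency, ' +
  'accessibility (contrast, text size, touch targets), and usability.';

// Hypothetical helper: `agent` is a vision agent created with the prompt above.
async function reviewDesign(agent, mockupPath) {
  return agent.ask(
    'Review this mockup. List concrete issues and suggest a fix for each.',
    { attachments: [{ type: 'image', path: mockupPath }] }
  );
}

// Usage (requires a real agent and an API key):
// const feedback = await reviewDesign(agent, './checkout-mockup.png');
```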
Visual Debugging
Share error screenshots instead of transcribing stack traces. The agent reads error messages, identifies issues, and suggests fixes.
This natural workflow accelerates debugging. The agent examines the entire error context visible in the screenshot, often catching details you might miss when manually transcribing.
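A debugging request along these lines could look like the sketch below. `debugFromScreenshot` is a hypothetical helper and the prompt text is only a suggestion; the optional `context` argument lets you add what you were doing when the error appeared.

```javascript
// Hypothetical helper: sends an error screenshot plus optional context.
// `agent` is a vision-enabled agent created as shown earlier.
async function debugFromScreenshot(agent, screenshotPath, context = '') {
  const prompt = [
    'This screenshot shows an error from my application.',
    context, // dropped below when empty
    'Read the error message, explain the likely cause, and suggest a fix.'
  ].filter(Boolean).join(' ');

  return agent.ask(prompt, {
    attachments: [{ type: 'image', path: screenshotPath }]
  });
}

// Usage (requires a real agent and an API key):
// const advice = await debugFromScreenshot(agent, './error.png', 'It happens right after login.');
```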
Data Extraction
Extract structured data from invoices, receipts, and forms without complex OCR pipelines. Vision agents understand document layout contextually.
The agent recognizes field types and relationships, handling variations in format. It returns clean structured data ready for processing in your application.
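One way to get structured output is to ask for JSON and parse the reply defensively. The sketch below assumes `agent.ask` resolves to the reply text, which is not confirmed here. The field names, `parseJsonReply`, and `extractInvoice` are hypothetical and chosen only for illustration.

```javascript
// Hypothetical helper: pulls the outermost {...} span out of a model reply
// and parses it. Returns null when no valid JSON object is found.
function parseJsonReply(reply) {
  const start = reply.indexOf('{');
  const end = reply.lastIndexOf('}');
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(reply.slice(start, end + 1));
  } catch {
    return null;
  }
}

// Hypothetical helper: asks for specific invoice fields as JSON.
// Assumes agent.ask resolves to the reply text.
async function extractInvoice(agent, invoicePath) {
  const reply = await agent.ask(
    'Extract vendor, invoice_number, date, and total from this invoice. ' +
    'Respond with a single JSON object and no other text.',
    { attachments: [{ type: 'image', path: invoicePath }] }
  );
  return parseJsonReply(String(reply));
}

// Usage (requires a real agent and an API key):
// const invoice = await extractInvoice(agent, './invoice.png');
// if (invoice === null) { /* retry or fall back to manual review */ }
```

Treat the parsed values as unverified input: validate types and totals before writing them to a database.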
Chart Analysis
Analyze data visualizations to extract insights and trends. The agent interprets visual encodings like color, position, and size.
```javascript
const analysis = await agent.ask(
  'Summarize the key trends shown in this sales chart.',
  { attachments: [{ type: 'image', path: './sales-chart.png' }] }
);
```
This works even when underlying data isn't available. The agent reads values from axes, identifies patterns, and highlights outliers directly from the visualization.
Comparing Multiple Images
Process multiple images simultaneously for side-by-side comparison. Pass multiple attachments in a single request.
```javascript
const comparison = await agent.ask(
  'Compare these two designs. Which has better visual hierarchy?',
  {
    attachments: [
      { type: 'image', path: './design-v1.png' },
      { type: 'image', path: './design-v2.png' }
    ]
  }
);
```
The agent analyzes both images together, identifying specific differences and their impact. This enables sophisticated before-after analysis and A/B testing evaluation.
Running Your Agent
Once you've built your vision agent, run it in your development environment:
```bash
npm run dev
```
The complete example repository is available at astreus-ai/agent-with-vision on GitHub. Clone it to explore the full implementation and experiment with different use cases.
Key Configuration Options
Understanding the configuration options helps you optimize your vision agents:
- name: Agent identifier for tracking and debugging
- model: Primary language model (gpt-4o recommended for vision)
- visionModel: Vision-specific model, typically matches the primary model
- vision: Boolean flag enabling image processing capabilities
- systemPrompt: Instructions that guide agent behavior and analysis approach
- attachments: Array of image references with type and path properties
Image Input Methods
Astreus supports multiple ways to provide images. Use local file paths for the most straightforward approach:
```javascript
attachments: [{ type: 'image', path: '/absolute/path/to/image.png' }]
```
Relative paths work too, resolved from your project directory. Choose the method that fits your workflow and file organization.
Crafting Effective Prompts
Specific prompts produce better results. Provide context about the image type and what aspects you want analyzed.
```javascript
// Generic (less effective)
await agent.ask('What do you see?', { attachments: [...] });

// Specific (more effective)
await agent.ask(
  'This is a mobile checkout screen. Identify any usability issues that might prevent users from completing their purchase.',
  { attachments: [...] }
);
```
Frame questions around specific concerns or goals. Mention the target audience or use case to help the agent apply appropriate criteria in its analysis.
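If you write many such prompts, a small builder can keep the pieces (image type, goal, audience) explicit. `buildVisionPrompt` is a hypothetical helper written for this sketch, not something Astreus provides.

```javascript
// Hypothetical helper: assembles a prompt from the context described above.
// `goal` is required; `imageType` and `audience` are optional.
function buildVisionPrompt({ imageType, goal, audience }) {
  const parts = [];
  if (imageType) parts.push(`This is ${imageType}.`);
  parts.push(goal);
  if (audience) parts.push(`The target audience is ${audience}.`);
  return parts.join(' ');
}

const checkoutPrompt = buildVisionPrompt({
  imageType: 'a mobile checkout screen',
  goal: 'Identify any usability issues that might prevent users from completing their purchase.',
  audience: 'first-time shoppers'
});

// Usage (requires a real agent and an API key):
// await agent.ask(checkoutPrompt, { attachments: [{ type: 'image', path: './checkout.png' }] });
```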
Use Cases
Vision-enabled agents unlock powerful workflows across many domains:
Design & UX: Automated design review, accessibility audits, consistency checking across pages and components.
Development: Visual debugging from screenshots, code review from presentation slides, architecture diagram analysis.
Data Processing: Invoice and receipt processing, form data extraction, chart and graph interpretation.
Quality Assurance: Visual regression testing, screenshot comparison, UI compliance verification.
Performance Considerations
Image processing consumes more tokens than text-only interactions. Start with moderate resolution images and increase quality only when the agent misses important details. Balance quality with cost efficiency for your specific use case.
The gpt-4o model provides strong vision capabilities with reasonable token usage. Monitor your usage patterns to optimize for your workload.
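To get a feel for how resolution affects cost, the sketch below approximates the tile-based rule OpenAI has published for high-detail images with gpt-4o: fit the image within 2048 x 2048, scale the shortest side down to 768 px, then charge 170 tokens per 512 px tile plus an 85-token base. Treat this as a rough estimate under those assumptions. The rule may change, and how Astreus sends images to the model (detail level, resizing) is not covered here. `estimateImageTokens` is a hypothetical helper.

```javascript
// Rough estimate only, based on OpenAI's published high-detail tile rule for
// gpt-4o. Not an Astreus API; actual billing may differ.
function estimateImageTokens(width, height) {
  // Step 1: fit within a 2048 x 2048 square (downscale only).
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: downscale so the shortest side is at most 768 px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;

  // Step 3: count 512 px tiles; 170 tokens per tile plus an 85-token base.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateImageTokens(1024, 1024)); // 765
console.log(estimateImageTokens(2048, 4096)); // 1105
```

Under this rule, a 512 x 512 crop of the region you care about costs a single tile, which is why cropping often saves more than uniformly shrinking the whole image.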
Building Specialized Agents
Create domain-specific agents by tailoring system prompts. This focuses analysis on relevant criteria for your use case.
Specialized agents provide more relevant insights because they apply domain-appropriate evaluation criteria. The system prompt acts as their expertise and guides their analytical approach.
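A small factory can make this pattern repeatable. The sketch below reuses the configuration fields from the VisionBot example. The `specialistConfig` helper, the agent names, and the prompt wording are assumptions for illustration.

```javascript
// Hypothetical factory: returns a configuration object for Agent.create,
// using the same fields as the VisionBot example.
function specialistConfig(name, expertise) {
  return {
    name,
    model: 'gpt-4o',
    visionModel: 'gpt-4o',
    vision: true,
    systemPrompt:
      `You are an expert in ${expertise}. ` +
      'Analyze images using the standards and terminology of that field.'
  };
}

const accessibilityAuditor = specialistConfig('AccessibilityAuditor', 'web accessibility');
const chartAnalyst = specialistConfig('ChartAnalyst', 'data visualization');

// Usage (requires Astreus and an API key):
// const agent = await Agent.create(accessibilityAuditor);
```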
Next Steps
Start with simple tasks like image description to build intuition. Experiment with prompt phrasing to understand how different approaches affect output quality and focus.
As you gain experience, combine vision with other Astreus capabilities. Build specialized agents for your specific visual analysis needs. The example repository at astreus-ai/agent-with-vision provides a solid foundation to explore and extend.
Vision capabilities open up entirely new interaction patterns. Agents that can see bridge the gap between human visual communication and AI processing, enabling more natural and powerful workflows.
This experiment is written for Astreus v0.5.37. Please ensure you are using a compatible version.