Amazon is addressing a critical challenge in image-to-text artificial intelligence with the release of four new multimodal large language model (MLLM)-as-a-Judge evaluators within its Strands Evals software development kit. These evaluators, designed to assess Overall Quality, Correctness, Faithfulness, and Instruction Following, aim to solve the problem of AI-generated captions inventing details not present in the original image. Gartner predicts that by 2026, 80% of enterprise software will be multimodal, a significant jump from less than 10% currently, underscoring the growing need for reliable evaluation methods. The new system sends the image directly to the judge model alongside the query and response, enabling grounded scoring and debugging; as Amazon explains, a text-only evaluator “cannot tell you whether a caption faithfully describes an image.”
Define the Case and evaluators
The growth of multimodal software calls for robust evaluation of image-to-text systems, particularly as applications such as visual shopping, image understanding, and chart analysis become more common. The evaluators target one specific failure: captions or summaries that contain details not actually present in the original image. They address it by passing the image itself into the evaluation, so the judge model can score and debug responses against the visual context rather than against text alone. The core of Amazon’s approach is the “Case” structure within the Strands Evals framework, which encapsulates both the image and the instruction. A single Case therefore supports assessing both how well the system interprets the image and how accurately it responds to the prompt.
Providing an expected_output activates reference-based judging for the evaluators that support it, enabling a more objective comparison between the AI’s response and a known correct answer. An example case, as defined by Amazon’s developers, presents a revenue chart image alongside the instruction, “Which region has the highest average revenue? State the region name and the dollar amount shown in the chart.” The expected output for this case is “U.S. and Canada has the highest at $.”, which gives a clear benchmark for assessing the AI’s performance. This methodology moves beyond text-only evaluation, reflecting the growing importance of multimodal understanding in modern AI applications. The evaluators are intended to improve the reliability and trustworthiness of image-to-text AI systems.
The evaluators themselves are designed to provide different types of feedback; the MultimodalOverallQualityEvaluator utilizes a Likert scale from 1 to 5, while the MultimodalCorrectnessEvaluator, MultimodalFaithfulnessEvaluator, and MultimodalInstructionFollowingEvaluator all provide binary assessments. This multifaceted approach allows developers to pinpoint specific areas for improvement, whether it’s overall response quality, factual accuracy, adherence to the source material, or the ability to follow instructions precisely. Amazon’s developers explain, “The Case wraps the image and instruction in a MultimodalInput,” and “providing expected_output activates reference-based judging for the evaluators that support it.” This emphasis on reference-based judging is crucial for establishing a reliable and consistent evaluation process, particularly as AI models become more complex and capable of generating increasingly nuanced responses. The release of these tools signals a broader industry trend toward more rigorous and comprehensive evaluation methodologies for multimodal AI, essential for building user trust and ensuring responsible deployment.
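The Case structure and the two score types described above can be pictured with a minimal sketch. The classes below are local stand-ins written for illustration, not the Strands Evals classes themselves; the field name instruction, the file name, the evaluator labels, and the 0/1 encoding of a binary assessment are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MultimodalInputSketch:
    """Local stand-in (not the SDK class): pairs an image with an instruction."""
    media: str        # hypothetical: a path or URI pointing at the image
    instruction: str


@dataclass(frozen=True)
class CaseSketch:
    """Local stand-in for a Case: the input plus an optional reference answer."""
    input: MultimodalInputSketch
    expected_output: Optional[str] = None

    @property
    def reference_based(self) -> bool:
        # Supplying expected_output is what switches on reference-based judging.
        return self.expected_output is not None


# Hypothetical labels for the three binary judges.
BINARY_KINDS = {"correctness", "faithfulness", "instruction_following"}


def is_valid_score(evaluator_kind: str, score: int) -> bool:
    """Overall quality uses a 1-5 Likert scale; the other judges are binary."""
    if evaluator_kind == "overall_quality":
        return 1 <= score <= 5
    if evaluator_kind in BINARY_KINDS:
        return score in (0, 1)  # assumption: binary encoded as 0 or 1
    raise ValueError(f"unknown evaluator kind: {evaluator_kind}")


case = CaseSketch(
    input=MultimodalInputSketch(
        media="revenue_chart.png",  # hypothetical file name
        instruction=(
            "Which region has the highest average revenue? State the region "
            "name and the dollar amount shown in the chart."
        ),
    ),
    # Copied as it appears in the article; the dollar amount is elided there.
    expected_output="U.S. and Canada has the highest at $.",
)
```

Keeping the reference answer optional mirrors the article's point that reference-based judging is something a Case opts into, not a requirement.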
Wire up the task and run the experiment
Industry observers expect wider deployment of multimodal evaluation tools this year, driven by the growing complexity of image-to-text models and the need for more robust performance metrics. The release reflects a move away from solely text-based evaluation toward methods that incorporate the visual data directly, on the premise that accurate image understanding is essential for applications ranging from visual shopping assistance to complex document analysis. Because the judge sees the image, its scoring is grounded in it, which helps developers pinpoint the source of errors and debug their systems with greater precision. Amazon’s developers have created a task function that receives each individual case, runs a vision model on the image together with the provided instruction, and returns the resulting response string for evaluation.
The agent, defined as Agent(callback_handler=None), processes the messages built for each case, which include both the image data in the appropriate format and the textual instruction. The evaluations are then executed with Experiment(cases=cases, evaluators=evaluators).run_evaluations_async(task=run_task, max_workers=1), which keeps the workflow for assessing multimodal AI performance compact. Together, these steps show how developers can integrate the tools into their existing pipelines. Experts anticipate more experimentation with ablation studies this year, which the Strands Evals SDK’s design facilitates. Developers can compare how their models score when evaluated with the full multimodal input, including the image, against a plain-string input that simply describes the image. The comparison shows directly whether the image modality is contributing to accuracy and understanding.
The ability to isolate the impact of the image component is crucial for optimizing model performance and ensuring that visual information is being processed effectively. The Amazon team explains, “Because each Case above carries a MultimodalInput with media, the four evaluators include the image in the judge prompt,” underscoring the seamless integration of visual data into the evaluation process. This methodology represents a significant advancement in the field of AI evaluation, moving beyond simplistic metrics to embrace a more holistic and nuanced understanding of model capabilities.
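The ablation described above amounts to scoring the same cases twice and comparing the averages. The following is a minimal sketch under stated assumptions: run_variant and image_contribution are hypothetical helpers, not part of Strands Evals, and the scores in the usage lines are placeholders, not measurements.

```python
from statistics import mean
from typing import Callable, Dict, List, Union

# Hypothetical input shapes: a dict carrying an image reference plus the
# instruction, or a plain string that only describes the image.
CaseInput = Union[Dict[str, str], str]


def run_variant(
    inputs: List[CaseInput],
    task: Callable[[CaseInput], str],
    judge: Callable[[CaseInput, str], float],
) -> List[float]:
    """Run the task on each input, then score each response with the judge."""
    return [judge(item, task(item)) for item in inputs]


def image_contribution(with_image: List[float], text_only: List[float]) -> float:
    """Mean score difference between the image variant and the text-only one."""
    return mean(with_image) - mean(text_only)


# Placeholder binary scores for four cases under each variant.
scores_with_image = [1, 1, 0, 1]
scores_text_only = [1, 0, 0, 1]
print(image_contribution(scores_with_image, scores_text_only))  # 0.25
```

A positive difference suggests the image is doing real work; a value near zero suggests the text description alone carries most of the signal for those cases.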
Inspect the Report
Amazon researchers are actively refining methods for evaluating the performance of multimodal large language models (MLLMs), specifically those tasked with interpreting images and generating text. Sangmin Woo, Haibo Ding, Sungyeon Kim, and Vinayak Arannil detailed a new suite of evaluators within the Strands Evals SDK, signaling a shift toward automated, image-grounded assessment of AI systems. Amazon’s solution isn’t to simply improve image recognition, but to fundamentally alter the evaluation process itself. A sample evaluation, using a chart depicting regional revenue, produced a transcript demonstrating the system’s capabilities. The system reported, “U.S. and Canada has the highest at $.” The detailed report accompanying the evaluation is crucial for debugging.
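One way to picture how such a report supports debugging is as a list of per-evaluator records that can be filtered for failures. The record shape below is an assumption made for illustration; the actual report fields in Strands Evals may differ, and the sample records are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class JudgementSketch:
    """Hypothetical record shape; the SDK's report fields may differ."""
    case_name: str
    evaluator: str
    score: float
    passed: bool
    reason: str


def failure_lines(judgements: List[JudgementSketch]) -> List[str]:
    """Format one log line per failed judgement, keeping the judge's reasoning."""
    return [
        f"{j.case_name} | {j.evaluator} | score={j.score} | {j.reason}"
        for j in judgements
        if not j.passed
    ]


# Placeholder records, not real evaluator output.
sample = [
    JudgementSketch("revenue_chart", "overall_quality", 4.0, True,
                    "placeholder reasoning for a passing score"),
    JudgementSketch("revenue_chart", "faithfulness", 0.0, False,
                    "placeholder reasoning for a failing score"),
]

for line in failure_lines(sample):
    print(line)
```

Printing the reasoning next to each failing score is what lets a CI log explain a failure without a re-run.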
Each assessment includes not only a score but also a reasoning string explaining the basis for that score. The researchers explain, “When a run fails in CI, you can see why without re-running,” emphasizing the practical benefit of this feature for continuous integration and development workflows. The team demonstrated that a single “Case” can be scored by four independent judges, one using a Likert scale and three employing binary assessments, within a single experiment, maintaining workflow consistency with existing text-only Strands Evals. For specialized criteria, the base class accepts an arbitrary rubric string, allowing for customization. A key question explored by the Amazon team was whether a multimodal judge, one that directly processes the image, is necessary. They compared MLLM-as-a-Judge (image plus text) against LLM-as-a-Judge, using both long and short image descriptions as input.
The results were conclusive. They found that “The multimodal judge aligned more closely with human scores than either text-only variant.” The text-only approach proved no more efficient, as it required an additional LLM call to generate the image description. They state, “If you have a multimodal judge available, use it directly.” This finding underscores the importance of leveraging the full potential of multimodal models for evaluating multimodal outputs. Selecting the appropriate model to serve as the judge was another critical consideration. After evaluating several MLLMs available on Amazon Bedrock, the team determined that Anthropic Claude Sonnet offered the best balance of accuracy, cost, and latency. They also observed that larger, reasoning-capable models generally performed more reliably, but that premium-priced models didn’t consistently outperform mid-tier options for this specific task. Prompt design also played a significant role.
The researchers found that asking the judge to explain its reasoning, rather than return a score alone, had the most substantial impact on alignment with human judgment: “Score-only output is cheaper and more self-consistent, but alignment with human scores drops noticeably.” They also recommended including a few diverse calibration examples and using a fine-grained, multi-dimensional rubric, so that a single vague score does not mask distinct failure modes. For content-grounded metrics like Overall Quality, Correctness, and Faithfulness, they suggest using reference answers, while for structural metrics like Instruction Following, references can be detrimental. They noted, “Adding reference content distracted the judge from checking structural constraints.” The four new MLLM-as-a-Judge evaluators in Strands Evals move image-to-text evaluation from expensive human review or unreliable text-only proxies to automated, image-grounded scoring.
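The guidance on reference answers can be captured as a small lookup that decides, per metric, whether the judge sees the reference. This is a sketch of the recommendation only; the metric labels and the reference_for helper are hypothetical and not part of the SDK.

```python
from typing import Dict, Optional

# Content-grounded metrics get the reference; the structural one does not.
USE_REFERENCE: Dict[str, bool] = {
    "overall_quality": True,
    "correctness": True,
    "faithfulness": True,
    "instruction_following": False,
}


def reference_for(metric: str, expected_output: Optional[str]) -> Optional[str]:
    """Return the reference to show the judge, or None to withhold it."""
    if USE_REFERENCE[metric]:
        return expected_output
    return None


print(reference_for("correctness", "reference answer"))            # reference answer
print(reference_for("instruction_following", "reference answer"))  # None
```

Withholding the reference for instruction following keeps the judge focused on structural constraints, which is the failure the researchers reported when reference content was added.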
In the team’s words, “Anthropic Claude Sonnet 4.6 on Amazon Bedrock offered the best accuracy-to-cost trade-off across our runs, and we use it as the default judge model for the multimodal evaluators.”