Let’s dive into a quick story.
Mike is a logistics manager for a big supply chain operation, overseeing shipments, inventory, and distribution schedules. One afternoon, an autonomous AI agent on his team flags an unexpected issue—a manufacturing delay at a supplier two levels down the chain.
Normally, Mike wouldn’t have caught the problem until shipments were already affected. But the AI agent doesn’t just dump data on his desk. Instead, it flags the risk like a sharp-eyed colleague, offering a recommendation.
“This supplier has a pattern of delays during peak season. Based on historical reliability and cost-effectiveness, I suggest switching to an alternate supplier with minimal disruption.”
Mike hesitates. Switching suppliers mid-cycle is a risk. He needs more confidence. Instead of making a knee-jerk decision, he turns back to the AI agent. “How accurate have your past disruption predictions been?”
The AI pulls up its track record. “Over the last 18 months, I’ve correctly identified disruptions 92% of the time.”
That helps, but Mike isn’t convinced yet. “What about supplier B? They’ve been reliable before.”
“Supplier B has a longer lead time and higher costs,” the AI replies. “However, if speed is your top priority, I can adjust my recommendation.”
Mike considers this. “What about supplier C? Any chance they can handle this?”
“Supplier C is an option, but they have a 15% failure rate on last-minute orders. If reducing risk is the goal, supplier A is still your best bet.”
Now, with a clear picture of his choices, Mike approves the switch to supplier A.
This is what a looped-in-human workflow looks like—where leaders like Mike don’t just use AI as a tool, they collaborate with it. They bounce solutions back and forth, refining decisions the way they would with a trusted colleague.
But trust and collaboration don’t happen overnight. They’re built over time, shaped by every interaction.
While Mike’s story is made up, the challenge is real.
As AI shifts from automation to autonomous collaboration, the way we interact with it has to evolve. In Designing Human-Machine Interactions in an Autonomous Agent World, I explored how AI isn’t just a tool but a collaborator, requiring new design principles to create meaningful human-agent interactions.
But designing better interactions is only half the equation—measuring their success is just as critical. At Outshift, we’re focused on exactly this: how do we build The Internet of Agents to work seamlessly with people and prove that agents are reliable partners? That’s why measuring trust and collaboration in AI isn’t just a technical challenge; it’s a fundamental shift in how we define success.
Would Mike rely on the agent next time? Would his colleagues? If supply chain teams were to track AI-assisted interventions, they wouldn’t just look at whether the AI flagged disruptions accurately—they’d also need to understand its impact on decision-making confidence, operational speed, and overall efficiency.
We’re used to tracking how often AI gets the right answer, but what about how it actually affects decision-making? Does it offer new insights that push people to think differently? Does it make experts more confident in their choices? These are the questions we should be asking.
Why Traditional AI Metrics Fall Short
Accuracy. Latency. Efficiency. Hallucination rate. Traditional AI metrics like these revolve around the model and its performance in isolation.
Measuring only model performance is like rating a chef just on how fast they chop onions. It’s one metric, but it hardly tells the whole story.
Did the AI actually improve the way people made decisions? Did it make teamwork smoother or lighten the mental load? AI isn’t just about getting the right answer—it’s about fitting into real workflows, making people more capable, and proving itself as a reliable partner over time.
So how do we measure what actually matters for The Internet of Agents?
Measuring What Matters: New Metrics for AI-Human Collaboration
If we really want AI to work as a teammate, not just a tool, we need a new way to measure its performance.
At Outshift, we’re thinking a lot about how to measure agentic collaboration in the Internet of Agents. Here are some key areas where we believe new kinds of metrics are needed:
Decision-Making Metrics
Decision Latency: How long does it take from problem identification to resolution when AI is involved?
Trust Reinforcement: How often do people validate or refine AI suggestions, and does trust in AI improve over time?
Consensus Speed: How quickly can a team (including AI) align on a decision?
Workflow & Efficiency Metrics
Task Hand-off Effectiveness: Does AI seamlessly transition tasks between humans and agents, or does it create bottlenecks?
Cognitive Load Reduction: Does AI meaningfully reduce the mental effort required to complete a task, or does it add more work?
Hybrid Intelligence Utilization: Is AI being used to elevate human expertise, rather than just automate tasks?
Collaboration & Interaction Metrics
Human Override Rate: How often do users reject or correct AI-generated actions? (A sketch of computing this and related metrics appears after this list.)
Agent Coordination Efficiency: How well do multiple AI agents collaborate on a shared task?
AI Adoption Rate: How frequently do users accept AI suggestions without modification?
Self-Correction Rate: How often does AI revise its own output before human intervention is required?
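To make these concrete, here is a minimal sketch of how a team might compute a few of these metrics from interaction logs. The event schema and field names are assumptions for illustration, not an existing Outshift API.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Interaction:
    """Hypothetical record of one AI suggestion and the human response to it."""
    suggested_at: datetime  # when the agent surfaced the suggestion
    resolved_at: datetime   # when a human accepted, modified, or rejected it
    outcome: str            # "accepted", "modified", or "rejected"

def decision_latency(events: list[Interaction]) -> float:
    """Average seconds from suggestion to human resolution (Decision Latency)."""
    return mean((e.resolved_at - e.suggested_at).total_seconds() for e in events)

def adoption_rate(events: list[Interaction]) -> float:
    """Share of suggestions accepted without modification (AI Adoption Rate)."""
    return sum(e.outcome == "accepted" for e in events) / len(events)

def override_rate(events: list[Interaction]) -> float:
    """Share of suggestions rejected or corrected (Human Override Rate)."""
    return sum(e.outcome in ("rejected", "modified") for e in events) / len(events)
```

Tracked week over week, the trend lines in numbers like these say more about trust than any single accuracy score.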
Moving Beyond the Thumbs Up
Today’s AI feedback mechanisms are basic at best. A thumbs-up/thumbs-down rating system isn’t enough to understand whether AI is truly working for us.
We need structured, contextual feedback loops that offer deeper insights (one possible feedback record is sketched after this list):
Contextual Feedback: Tracking how often AI contributions actually improve human decision-making, not just how quickly the system responds.
Collaboration Satisfaction Surveys: Gauging how well AI integrates into workflows from the user’s perspective.
Adaptive Feedback Loops: AI that learns from human revisions, pauses, and escalations to continuously improve its performance.
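As an illustration of what a richer signal could look like, here is a minimal sketch of a contextual feedback record; the schema and every field name are assumptions for this post, not a shipping format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class FeedbackEvent:
    """Hypothetical contextual feedback event, replacing a bare thumbs rating."""
    suggestion_id: str
    accepted: bool                      # the old thumbs-up/down bit, still captured
    human_edit: Optional[str] = None    # what the user changed, if anything
    escalated: bool = False             # whether the user routed it to a colleague
    deliberation_seconds: float = 0.0   # how long the user paused before acting
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

An adaptive loop can then treat edits, pauses, and escalations as learning signal instead of discarding everything except the thumb.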
Designing the Right Framework
It’s not just about picking the right metrics; it’s about designing AI systems that can be observed, evaluated, and improved based on those metrics. At Outshift, we're considering ways to do this, including:
Building an AI-Human Performance Dashboard that visualizes trust levels, task handoff success, and collaboration trends.
Identifying Friction Points in workflows to refine AI behaviors and reduce inefficiencies.
Developing Real-Time Observability Tools to capture AI-human interaction data in a structured way (one possible event format is sketched below).
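For instance, a minimal sketch of structured interaction logging might look like the following; the event fields and logger name are assumptions, not a description of our actual tooling.

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("agent_observability")

def log_interaction(agent_id: str, action: str, human_response: str,
                    confidence: float) -> None:
    """Emit one human-agent interaction as a structured JSON log event.

    Downstream, a dashboard can aggregate these events into trust trends,
    handoff success rates, and friction hotspots.
    """
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent_id": agent_id,
        "action": action,                  # e.g. "recommend_supplier_switch"
        "human_response": human_response,  # e.g. "approved", "overridden"
        "confidence": confidence,          # the agent's own stated confidence
    }
    logger.info(json.dumps(event))
```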
The Future of AI Measurement
Right now, AI is mostly graded on how fast and accurate it is—like a student acing multiple-choice tests but never working in a group project. To really achieve The Internet of Agents, we need to shift the focus to how well AI actually collaborates with people.
If we measure that, we’re measuring the future of work itself.
All of this points toward a world where AI agents can discover and authenticate one another, share complex information securely, and adapt to uncertainty while collaborating across different domains. Users will work with agents that pursue complex goals with limited direct supervision, acting autonomously on their behalf.
As a design team, we are actively shaping how we navigate this transformation. And one key question keeps emerging: How do we design AI experiences that empower human-machine teams, rather than just automate them?
The Agentic Teammate: Enhancing Knowledge Work
In this new world, AI agents become our teammates, offering powerful capabilities:
Knowledge Synthesis: Agents aggregate and analyze data from multiple sources, offering fresh perspectives on problems.
Scenario Simulation: Agents can create hypothetical scenarios and test them in a virtual environment, allowing knowledge workers to experiment and assess risks.
Constructive Feedback: Agents critically evaluate human-proposed solutions, identifying flaws and offering constructive feedback.
Collaboration Orchestration: Agents work with other agents to tackle complex problems, acting as orchestrators of a broader agentic ecosystem.
Addressing the Challenges: Gaps in Human-Agent Collaboration
All this autonomous help is great, sure – but it's not without its challenges.
Autonomous agents have fundamental gaps that we need to address to ensure successful collaboration.
The Solution: Five Design Principles for Human-Agent Collaboration
Put Humans in the Driver's Seat
Users should always have the final say, with clear boundaries and intuitive controls to adjust agent behavior. An example of this is Google Photos' Memories feature, which allows users to customize their slideshows or turn the feature off completely.
Make the Invisible Visible
The AI's reasoning and decision-making processes should be transparent and easy to understand, with confidence levels or uncertainty displayed to set realistic expectations. North Face's AI shopping assistant exemplifies this by guiding users through a conversational process and providing clear recommendations.
Prepare for the Unexpected
Anticipate edge cases to provide clear recovery steps, while empowering users to verify and adjust AI outcomes when needed. ServiceNow's Now Assist AI is designed to allow customer support staff to easily verify and adjust AI-generated insights and recommendations.
Collaborate, Don't Just Automate
Prioritize workflows that integrate human and AI capabilities, designing intuitive handoffs to ensure smooth collaboration. Aisera HR Agents demonstrate this by assisting with employee inquiries while escalating complex issues to human HR professionals.
Earn Trust Through Consistency
Build trust gradually with reliable results in low-risk use cases, making reasoning and actions transparent. ServiceNow's Case Summarization tool is an example of using AI in a low-risk scenario to gradually build user trust in the system's capabilities.
Designing Tomorrow's Human-Agent Collaboration At Outshift
These principles are the foundation for building effective partnerships between humans and AI at Outshift.
Empowering Users with Control
Establishing clear boundaries for AI agents to ensure they operate within a well-defined scope.
Building Confidence Through Clarity
Surface AI reasoning by displaying confidence levels, realistic expectations, and the extent of changes, so users can make informed decisions. (A minimal sketch of this appears below.)
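As an illustration (the names and thresholds below are assumptions, not a product spec), a recommendation surfaced to a user might carry its reasoning, confidence, and scope alongside the proposed action:

```python
from dataclasses import dataclass

@dataclass
class AgentRecommendation:
    """Hypothetical container for what an agent shows before a human decides."""
    action: str        # e.g. "Switch order #4521 to supplier A"
    reasoning: str     # why the agent is suggesting it
    confidence: float  # 0.0-1.0, the agent's stated confidence
    scope: str         # extent of changes, e.g. "affects 3 open shipments"

def render(rec: AgentRecommendation) -> str:
    """Format a recommendation so reasoning, confidence, and blast radius
    are visible before the user approves or overrides it."""
    level = ("high" if rec.confidence >= 0.8
             else "moderate" if rec.confidence >= 0.5 else "low")
    return (f"{rec.action}\n"
            f"  Why: {rec.reasoning}\n"
            f"  Confidence: {level} ({rec.confidence:.0%})\n"
            f"  Scope of change: {rec.scope}")
```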
Always Try To Amplify Human Potential
Actively collaborate through simulations and arrive at effective outcomes together.
Let Users Stay In Control When It Matters
Provide easy access to detailed logs and performance metrics for every agent action, enabling users to review decisions and workflows and ensure compliance. Include clear recovery steps for seamless continuity.
Take It One Interaction At A Time
See agent actions in context and observe how agent performance improves across the network.
As we refine our design principles and push the boundaries of innovation, integrating advanced AI capabilities comes with a critical responsibility. For AI to become a trusted collaborator—rather than just a tool—we must design with transparency, clear guardrails, and a focus on building trust. Ensuring AI agents operate with accountability and adaptability will be key to fostering effective human-agent collaboration. By designing with intention, we can shape a future where AI not only enhances workflows and decision-making but also empowers human potential in ways that are ethical, reliable, and transformative.
Because in the end, the success of AI won’t be measured by its autonomy alone—but by how well it works with us to create something greater than either humans or machines could achieve alone.