Collaborating with AI Agents: Teamwork, Productivity, and Performance

Evidence from 2,310 participants, 1,834 teams, 11,138 ads, and ~4.93M field impressions

Executive Summary

To understand how AI agents impact productivity and work processes, a large-scale experiment was conducted on the MindMeld platform, where 2,310 participants were assigned to either human-human or human-AI teams. The study found that collaborating with AI agents significantly increased communication and productivity per worker. Human-AI teams focused more on content generation and less on direct editing, producing higher-quality ad copy but lower-quality images compared to human-human teams. Field tests of the resulting ads, which garnered nearly 5 million impressions, showed that ads with high-quality text (from AI collaboration) and high-quality images (from human collaboration) performed best. Overall, the performance of ads from both team types was similar, suggesting AI agents can enhance teamwork, especially when their traits are tuned to complement their human partners.

The Experimental Procedure

1
AI Randomization
Participants were randomly assigned to collaborate with either another human participant (Human-Human) or an AI agent (Human-AI).
2
AI Personality Prompt Randomization
AI agents were assigned a personality profile based on the Big Five traits, with each trait randomly set to a high or low level.
3
Pre-Task Survey
Participants completed a 10-item survey to measure their own Big Five personality traits.
4
Real-Time Ad Creation Task
Teams collaborated for 40 minutes in the MindMeld workspace to produce as many high-quality ads as possible.
5
Post-Task Survey & Field Evaluation
Participants completed a teamwork quality survey, and the created ads were evaluated by human raters and tested in a live ad campaign.

MindMeld Experiment at a Glance

2,310
Participants
US-representative sample from Prolific
1,834
Teams
1,258 Human-AI and 576 Human-Human
11,138
Ads Created
Submitted in 40-minute sessions
183,691
Messages
Time-stamped chat records
1,960,095
Copy Edits
Fine-grained text edit logs
63,656
Image Edits
Visual workflow interactions
10,375
AI Images
Generated via Dall-E 3
4,932,373
Field Impressions
Live ads on X (formerly Twitter)
7,546
Field Clicks
Measured with unique DocSend links
400
Campaigns
5-ad split tests; ZIP-code isolated

Collaboration Outputs Logged on MindMeld

Across all teams, the platform captured 183,691 messages, 1,960,095 ad copy edits, 63,656 image edits, 10,375 AI-generated images, and 11,138 ad submissions. These volumes establish the study’s scale and provide a rich basis for analyzing teamwork dynamics.

Teams by Collaboration Mode

Of 1,834 total teams, 1,258 collaborated with an AI agent and 576 were human-only teams. Human-AI teams consist of one human paired with an AI agent, while Human-Human teams consist of two human collaborators.

Message Category Shifts with Human-AI Collaboration (Δ fraction)

Relative to Human-Human teams, Human-AI teams shifted communication toward task-oriented categories: more content (+0.036), process (+0.025), and feedback (+0.019), and away from social (−0.085) and emotional (−0.045) messages. This indicates reduced social coordination overhead and greater task focus.

Ad Quality Effects by Rater (Δ Likert points, 1–7)

Human raters scored Human-AI ads higher on text (+0.324) but lower on image quality (−0.134), with no meaningful change in click likelihood (−0.014). The AI rater likewise scored text (+0.122) and click likelihood (+0.068) higher, with essentially no change in image quality (−0.014). This highlights a text–image trade-off in Human-AI collaborations.

Productivity Effects on Submissions (Δ counts)

At the team level, Human-AI and Human-Human teams produced a similar number of ads (−0.406 difference). At the individual level, humans paired with AI produced substantially more ads (+2.341), underscoring higher per-worker productivity in Human-AI teams.

Copy Completion Uplift with Human-AI (Δ fraction)

Humans collaborating with AI had higher completion rates for all ad copy fields: headline (+0.189), primary text (+0.207), and description (+0.199). This suggests stronger task completion support in Human-AI settings.

Significant Personality Interactions on Collaboration Behaviors (coefficients)

Personality fit mattered. Messages increased when conscientious humans worked with conscientious AIs (+17.295). Copy edits dropped when agreeable humans worked with neurotic AIs (−435.239). Image selection activity rose when neurotic AIs were paired with open humans (+22.124). These heterogeneous effects indicate that tuning AI traits can change how teams work.

Key Insights

Human-AI teams communicated more and focused on content/process rather than social/emotional exchanges. Text quality improved while image quality declined for Human-AI outputs. Individual productivity rose sharply with AI collaboration, and field results showed broadly equivalent CPC/CTR to Human-Human teams—with outcomes driven by text and image quality rather than team type. Personality prompt interactions significantly shaped collaboration behaviors and some field metrics, suggesting AI agents can be tuned to complement human traits.

What the Study Shows

MindMeld enabled randomized Human-Human and Human-AI collaborations in a real-time, multimodal workspace. Human-AI teams sent more task-oriented messages and fewer social/emotional messages than Human-Human teams, indicating reduced social coordination costs. Despite similar team-level output, individual humans in Human-AI teams produced substantially more ads and achieved higher copy completion. Quality evaluations revealed a trade-off: Human-AI teams excelled at text but produced lower-rated images. In field tests across ~4.93M impressions and 400 campaigns, Human-AI and Human-Human ads performed similarly on CPC and CTR, with text quality predicting higher CTR and longer view durations. Randomized AI personality prompts (Big Five) interacted with human traits to shape communication, editing behavior, and some field outcomes—evidence that tailoring AI agents to complement human personalities can improve collaborative effectiveness.

How the Study Worked

1
Randomization and Pairing
Participants were randomized to Human-Human or Human-AI conditions; Human-AI used a simulated 1–5 second queue before pairing with the agent.
2
Pre-task Survey
A 10-item Big Five personality inventory (7-point Likert) was completed before collaboration.
3
Real-time Collaboration
Teams had 40 minutes to create ads with synchronized text/image editing and chat. The AI could message, edit copy, select images, and generate images (via Dall-E 3).
4
Post-task Survey
A 35-item teamwork quality survey plus 4 AI perception items; 1,964 participants completed this survey.
5
Quality Evaluations
Human raters (n≈1,195) and AI (gpt-4o-mini) evaluated Text, Image, and Click likelihood on 7-point scales using ad mockups.
6
Field Campaigns on X
2,000 ads ran in 400 campaigns with unique DocSend links, generating 4,932,373 impressions and 7,546 clicks.
7
Analysis
Mixed-effects and regression models estimated collaboration, quality, and personality-prompt effects, controlling for campaign spend and using campaign random effects where applicable.
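
As a rough illustration of this analysis step, the sketch below fits a campaign random-intercept model with Python's statsmodels, plus a simple interaction regression for personality fit. The file name and all column names (ctr, human_ai, text_quality, image_quality, spend, campaign_id, messages, and the trait indicators) are hypothetical stand-ins, not the study's actual variables.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per field ad, with its outcome, team type,
# rated quality, campaign spend, and campaign ID. Names are illustrative.
ads = pd.read_csv("field_ads.csv")

# Campaign random-intercept model: regress CTR on team type and quality
# scores while controlling for campaign spend.
mixed = smf.mixedlm(
    "ctr ~ human_ai + text_quality + image_quality + spend",
    data=ads,
    groups=ads["campaign_id"],
).fit()
print(mixed.summary())

# Personality-fit effects enter as interaction terms, e.g. a conscientious
# human paired with a conscientiousness-prompted AI, predicting messages.
inter = smf.ols(
    "messages ~ human_conscientious * ai_conscientious", data=ads
).fit()
print(inter.params["human_conscientious:ai_conscientious"])
```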

MindMeld Platform Capabilities

🕒
Real-time Collaboration
Synchronized text and image editing with chat (websockets).
🤖
Agent Action Parity
AI can chat, edit copy, select images, and generate images via Dall-E 3.
🧠
Full UI Context
Each AI call includes on-screen state, chat history, prior actions, and screenshots.
🎭
Personality Prompting
Randomized Big Five traits for AI agents (high/low levels); a prompt-construction sketch follows this list.
📈
Fine-grained Logging
Time-stamped messages, edits, API calls, and intermediate outputs.
🧩
Multimodal Models
Built on gpt-4o with image generation via Dall-E 3; AI ratings via gpt-4o-mini.
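
A minimal sketch of the personality-prompting step, assuming the official openai Python client: each Big Five trait is randomly set high or low and rendered into the agent's system prompt. The prompt wording and helper names are illustrative, not the study's exact implementation.

```python
import random
from openai import OpenAI  # assumes the official openai Python client

BIG_FIVE = ["openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism"]

def sample_personality() -> dict:
    """Randomly set each Big Five trait to a high or low level."""
    return {trait: random.choice(["high", "low"]) for trait in BIG_FIVE}

def personality_prompt(traits: dict) -> str:
    """Render sampled traits as a system prompt (wording is illustrative)."""
    lines = "\n".join(f"- {trait}: {level}" for trait, level in traits.items())
    return ("You are a collaborative ad-creation agent. "
            "Behave consistently with this personality profile:\n" + lines)

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": personality_prompt(sample_personality())},
        {"role": "user", "content": "Draft a headline for our travel ad."},
    ],
)
print(reply.choices[0].message.content)
```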

Communication Dynamics: A Shift in Focus

-60%
Direct Copy Edits
Decrease in direct text edits by humans in Human-AI teams, who shifted to instructing the AI instead.
-23%
Social Messages
Fewer social messages sent by humans when paired with an AI, indicating reduced social coordination cost.

Communication Volume by Team Type

This chart illustrates the significant difference in communication frequency. The report states that participants in Human-AI teams sent 45% more messages than those in Human-Human teams. This suggests that collaborating with an AI partner encourages more frequent communication, potentially to provide instructions, ask questions, and iterate on ideas.

Message Categories by Sender

This chart, based on Figure 5, shows the distribution of message types. Human-AI teams, including both the human and AI, sent a higher fraction of messages related to 'Content' and 'Process'. In contrast, Human-Human teams dedicated a larger portion of their communication to 'Social' and 'Emotional' messages, such as rapport building and expressing concern. This highlights a shift towards task-oriented communication when an AI is involved.

Key Insights

Collaboration with AI fundamentally alters teamwork by reducing the need for social and emotional coordination. This allows human participants to focus more intensely on task-related content and processes, leading to a more efficient, albeit less social, workflow.

Productivity and Output

+73%
Individual Productivity
Increase in the number of ads submitted per individual in Human-AI teams compared to Human-Human teams.
Comparable
Team-Level Output
Human-AI teams produced a similar number of ads as Human-Human teams, effectively achieving the same output with half the human labor.

Ad Copy Completion Rates per Individual

This chart, based on Figure 7, demonstrates that individuals in Human-AI teams had significantly higher completion rates for all ad copy components. This suggests that AI collaboration provides crucial support, helping individuals complete tasks more consistently, which is particularly beneficial for lower-performing participants.

Key Insights

AI agents act as a significant productivity multiplier at the individual level. While overall team output remains similar, the efficiency gain is substantial, suggesting that Human-AI teams can be considered near-substitutes for Human-Human teams while requiring fewer human resources.

Ad Quality: The Text vs. Image Trade-Off

Human Evaluations of Ad Quality (7-point scale)

Human raters found that ads created by Human-AI teams had significantly higher text quality. However, they rated the image quality of these same ads as lower than those from Human-Human teams. The estimated likelihood of clicking the ad was nearly identical between the two groups.

AI Evaluations of Ad Quality (7-point scale)

The AI model (gpt-4o-mini) rated the text quality and click likelihood of Human-AI ads as slightly higher, while rating image quality as nearly identical. This contrasts with human evaluations, highlighting a potential blind spot in the AI's ability to assess visual appeal in the same way humans do.
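
For concreteness, here is a minimal sketch of an AI rating call of this kind, assuming the official openai Python client and a local PNG mockup; the prompt wording and output format are illustrative, not the study's rating protocol.

```python
import base64
from openai import OpenAI  # assumes the official openai Python client

client = OpenAI()

def rate_ad(mockup_path: str) -> str:
    """Ask gpt-4o-mini for 1-7 ratings of an ad mockup image."""
    with open(mockup_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": ("Rate this ad on a 1-7 scale for text quality, "
                          "image quality, and how likely you would be to "
                          "click it. Answer with three integers.")},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

print(rate_ad("ad_mockup_001.png"))  # hypothetical file name
```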

Key Insights

There is a clear trade-off in Human-AI collaboration for multimodal tasks. While GPT models excel at text generation, leading to higher-quality copy, they are less effective at generating or selecting high-quality images. This suggests a need for specialized visual AI tools to complement language models in creative workflows.

Real-World Impact: Field Study Results

~4.9M
Total Ad Impressions
Generated during a 20-day live campaign on the social media platform X.
7,546
Total Clicks
Clicks generated from the 2,000 ads run in the field experiment.

Key Insights

The field experiment revealed that ad quality, rather than the type of team that created it, was the primary driver of performance. Higher image quality (more common in Human-Human ads) led to a lower Cost-Per-Click (CPC), while higher text quality (more common in Human-AI ads) led to a higher Click-Through Rate (CTR). Overall, the ads from both team types performed similarly in the real world.
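
Using only the totals reported above, the overall click-through rate works out to roughly 0.15%. The snippet below shows the arithmetic; per-campaign spend is not reported here, so CPC appears only as its definition.

```python
# Totals reported from the field study.
impressions = 4_932_373
clicks = 7_546

# CTR = clicks / impressions.
ctr = clicks / impressions
print(f"Overall CTR: {ctr:.4%}")  # ~0.1530%

def cpc(spend: float, clicks: int) -> float:
    """CPC = spend / clicks (spend figures are not reported here)."""
    return spend / clicks
```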

Message Categories Used in Analysis

Task-focused
1
Content
Information, facts, or deliverables directly related to the task.
Coordination
2
Process
Approach, prioritization, logistics, and planning for the work.
Interpersonal
3
Social
Rapport building and non-task social interactions.
Affect
4
Emotional
Expressions of feelings like concern, frustration, or satisfaction.
Evaluation
5
Feedback
Evaluations, constructive criticism, and judgment on work products.
Misc
6
Other
Messages outside the structured categories.
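
The study labeled messages with gpt-4o-mini; below is a minimal sketch of such a labeling call, assuming the official openai Python client. The classification prompt is illustrative, not the study's.

```python
from openai import OpenAI  # assumes the official openai Python client

CATEGORIES = ["Content", "Process", "Social", "Emotional", "Feedback", "Other"]

client = OpenAI()

def label_message(message: str) -> str:
    """Label one chat message with exactly one of the six categories."""
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {"role": "system",
             "content": ("Classify the chat message into exactly one of: "
                         + ", ".join(CATEGORIES)
                         + ". Reply with the category name only.")},
            {"role": "user", "content": message},
        ],
    )
    return response.choices[0].message.content.strip()

print(label_message("Let's lock the headline before touching the image."))
# e.g., "Process"
```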

Study- and field-scale metrics recorded during experimentation and deployment.

Key Counts and Breakdown

Individuals
2,310 (Human-AI: 1,258; Human-Human: 1,052)
Teams
1,834 (Human-AI: 1,258; Human-Human: 576)
Ads Created
11,138
Messages
183,691
Copy Edits
1,960,095
Image Edits
63,656
AI-Generated Images
10,375
Field Impressions
4,932,373
Field Clicks
7,546
Campaigns
400 (5-ad split tests)
Session Time
40 minutes
Attrition
7.6%
Payment
$9 plus two $100 bonuses

Ecosystem components enabling real-time, multimodal collaboration, evaluation, and field testing.

Models, Tools, and Platforms Used

gpt-4o-2024-08-06: Multimodal agent for collaboration
gpt-4o-mini-2024-07-18: AI ratings and message labeling
Dall-E 3: External image generation API
X (formerly Twitter): Field deployment of ads
DocSend: View-through tracking with unique links
Prolific: Participant recruitment
Google Cloud App Engine: Human rating platform
pusher.com: Realtime collaboration backend
tiptap.dev: Rich-text editing in the workspace

The Personality Factor: Tuning AI for Better Collaboration

The study randomized AI personality prompts (based on the Big Five traits) to see how they interacted with human collaborators. The results showed that 'fit' matters significantly:


**Synergy:** Pairing a conscientious AI with a conscientious human increased communication by 62%.
**Complementarity:** Pairing an open AI with a conscientious human improved image quality.
**Conflict:** Pairing a conscientious AI with an extraverted human reduced the quality of text, images, and clicks.
**Productivity Boost:** An agreeable AI increased the number of ads submitted when paired with either an extraverted or neurotic human.


These findings suggest that AI agents can be tuned to complement human personality traits, creating more effective and productive human-AI partnerships. Customizing AI behavior is a critical factor for optimizing collaborative success.

Interpreting Reported Percentage Changes

The abstract reports a 137% communication increase with AI agents, and detailed analyses report a 45% message increase for Human-AI teams compared to Human-Human teams. Copy-edit reductions are reported as 84% (team-level) and 60% (individual-level) in different contexts. These figures reflect different analytic levels; all are directly cited from the report and should be interpreted within their respective scopes.