Friend or Foe? The Gap Between Human and AI Social Intention Perception

Imagine a figure approaching in the distance. Before seeing their face or hearing their voice, you must instantly decide: friend or threat? While humans effortlessly read subtle body language to make this survival-critical judgment, artificial intelligence (AI) continues to struggle. Historically, AI research has focused on recognizing basic emotions (like happiness) or physical actions (like walking), overlooking social intention - the signals a person directs at others. For a service robot or AI agent, knowing whether a person poses a threat is far more important than simply identifying their emotion.

Now, researchers have established a new benchmark for "embodied social intention," uncovering how we signal threats and revealing a critical "alignment gap" between human cognition and AI.

To study how humans communicate these signals, researchers at Tohoku University recorded 160 motion-capture performances by 80 performers from Japan and Taiwan. The performers conveyed friendly or hostile intentions to an "imaginary alien" who had just landed on Earth and possessed no knowledge of human culture or language, forcing the performers to rely purely on non-verbal body language.

Common friendly actions conveyed to the alien included bending forward to show politeness and humility, and opening the arms in an open-body greeting. For hostile interactions, the performers used threatening behaviours such as throwing objects to drive the alien away.

The researchers also recruited 77 observers from Japan, Taiwan, and China, who watched all 160 videos and judged each one as friendly or hostile. Interestingly, Taiwanese performers tended to use big, forceful movements to show their hostility. Their fast, physically powerful motions made their hostility easy for all viewers to read. Japanese performances, however, were different.

Their hostile movements were smaller and more controlled - containing ten times less motion energy than the Taiwanese clips. Japanese viewers picked up on these subtle signals significantly more accurately (76% accuracy) than Taiwanese and Chinese viewers (69% and 65%, respectively).
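The release does not say how "motion energy" was computed. As a rough illustration only, one common proxy is the mean squared joint speed across a motion-capture clip. The sketch below assumes that definition; the function name `motion_energy`, the 30 fps frame rate, and the two toy clips are hypothetical and are not taken from the study.

```python
def motion_energy(frames, fps=30.0):
    """Mean squared joint speed over a clip (an assumed proxy, not the paper's metric).

    frames: list of frames; each frame is a list of (x, y, z) joint positions,
            with joints in the same order in every frame.
    fps:    capture rate in frames per second (assumed value).
    """
    total = 0.0
    count = 0
    # Compare each frame with the next one to estimate per-joint velocity.
    for prev, curr in zip(frames, frames[1:]):
        for (x0, y0, z0), (x1, y1, z1) in zip(prev, curr):
            vx = (x1 - x0) * fps
            vy = (y1 - y0) * fps
            vz = (z1 - z0) * fps
            total += vx * vx + vy * vy + vz * vz
            count += 1
    # A clip with fewer than two frames (or no joints) has no measurable motion.
    return total / count if count else 0.0


# Hypothetical one-joint clips: a large sweep and a small, controlled shift.
big_clip = [[(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)], [(0.2, 0.0, 0.0)]]
small_clip = [[(0.0, 0.0, 0.0)], [(0.01, 0.0, 0.0)], [(0.02, 0.0, 0.0)]]

print(motion_energy(big_clip), motion_energy(small_clip))
```

Because speed is squared, a gesture with a tenth of the amplitude carries about a hundredth of the energy under this proxy, which is one way a restrained movement can nearly vanish from a purely kinematic measure.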

When the researchers tested an AI model (ST-GCN, a spatial-temporal graph convolutional network), they found a critical blind spot. Although the AI achieved 69% accuracy, it still did not 'think' like a human (Figure 2). Human observers across the three cultures (Figure 3) showed high agreement with one another (correlations above 0.79), but the AI's judgments barely aligned with human perception (a correlation of just 0.26). Humans use cognitive "inverse planning" to infer the hidden mental goals behind an action. The AI, however, merely matched physical patterns, failing to register the heavy social meaning behind subtle, passive-aggressive motions. Consider someone standing very still, arms crossed tight, body turned slightly away. The AI sees almost no motion and treats the pose as harmless. A human reads it instantly as "back off." Simply put, the movements that confused human observers were completely different from the ones that confused the AI.
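To see how similar accuracy can coexist with poor alignment, consider the minimal sketch below. It assumes alignment is measured as a Pearson correlation over per-clip "hostile" scores, which the release does not confirm. All numbers are invented for illustration, and the names `pearson`, `accuracy`, `human_a`, `human_b`, and `model` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def accuracy(scores, labels, threshold=0.5):
    """Share of clips where thresholding the score matches the intended label."""
    hits = sum((s >= threshold) == bool(l) for s, l in zip(scores, labels))
    return hits / len(labels)


# Invented data: 1 = performer intended hostility, 0 = friendliness.
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical per-clip "hostile" scores from two observer groups and a model.
human_a = [0.90, 0.80, 0.40, 0.10, 0.20, 0.60]
human_b = [0.95, 0.70, 0.45, 0.15, 0.10, 0.55]
model = [0.60, 0.40, 0.90, 0.30, 0.70, 0.20]

print("accuracy:", accuracy(human_a, labels), accuracy(model, labels))
print("human-human r:", round(pearson(human_a, human_b), 2))
print("human-model r:", round(pearson(human_a, model), 2))
```

In this toy data the observer group and the model are right on the same number of clips, but they stumble on different ones, so their scores barely correlate while the two observer groups track each other closely. That is the pattern the "alignment gap" describes.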

This "alignment gap" presents a safety risk for human-machine interaction. A system that correctly classifies high-energy threats but remains blind to low-energy hostility may fail to de-escalate subtle conflicts. Bridging this gap will require AI that is not only accurate, but also perceptually aligned with human social cognition - capable of interpreting not just how people move, but what those movements mean.

https://www.tohoku.ac.jp/en/press/friend_or_foe.html

Title: Friend or Foe? Benchmarking Human Perception and ST-GCN Decoding of Embodied Social Intention

Authors: Miao Cheng, Zhan Dai, Victor Schneider, Kanta Ozawa, Yangyang Cai, Ken Fujiwara, Yoshifumi Kitamura, Chia-huei Tseng

Conference: 2026 International Conference on Automatic Face and Gesture Recognition (FG)

Attached files

This video captures the performers' friendly body movements. ©Tohoku University
This video captures the performer's hostile body movements. ©Tohoku University

28/05/2026 Tohoku University

Regions: Asia, Japan, China, Taiwan

Keywords: Applied science, Artificial Intelligence, Technology, Society, Psychology

