Researchers at the University of California San Diego say modern AI models can now pass the Turing test, a long-standing benchmark designed to measure whether machines can convincingly imitate human conversation.
In a study published in the Proceedings of the National Academy of Sciences, participants held simultaneous text conversations with both a human and an AI model before deciding which one was the real person. Nearly 500 people participated across multiple experiments involving university students and online users.
The strongest results came from OpenAI’s GPT-4.5 model, which was identified as human in 73% of conversations when researchers assigned it a detailed persona prompt, meaning participants picked it over the actual human more often than not. Meta’s LLaMa-3.1-405B achieved a 56% human-identification rate under the same conditions, close to the 50% expected if participants could not tell the two apart and statistically similar to the humans it was compared against.
Researchers also tested older or less capable systems, including GPT-4o and ELIZA, a rule-based chatbot originally developed in the 1960s. Those systems performed significantly worse, with human-identification rates of roughly 21% and 23%, respectively.
The study found that prompting played a major role in performance. Models given explicit instructions about personality, tone, communication style, and conversational behavior were substantially more convincing than models operating without role guidance. Without persona prompts, GPT-4.5’s human-identification rate dropped to 36%, while LLaMa-3.1-405B’s fell to 38%.
According to the researchers, the models succeeded less because of factual intelligence and more because they replicated human conversational imperfections, including humor, hesitation, directness, and occasional mistakes.
What It Means
The findings add to growing evidence that advanced language models are becoming increasingly difficult to distinguish from humans in online interactions. While AI systems have already demonstrated strong performance in reasoning and knowledge tasks, the study suggests conversational realism is improving rapidly as well.
Researchers said the results raise broader questions about trust and identity online, particularly as AI systems become integrated into customer service, social platforms, and digital communication tools. The ability of models to convincingly imitate human behavior could create new risks around fraud, manipulation, impersonation, and misinformation.
The study also reframes the meaning of the Turing test itself. Originally proposed by Alan Turing in 1950 as a measure of machine intelligence, the benchmark may now reflect something closer to “humanlikeness” than to raw reasoning ability.
The Road Here
The UC San Diego researchers built a dedicated online interface for the experiments that resembled a standard messaging platform. Participants were given either five-minute or 15-minute conversations before deciding which conversational partner was human.
The results also highlight how heavily modern AI systems depend on prompting techniques. Researchers noted that models often required explicit human-written instructions to adopt believable personalities and communication styles. That suggests current systems can simulate human behavior effectively when told how, but may not independently infer how humans naturally behave in conversation.