Do AIs Understand What They Are Saying? Enter the Chinese Room to Find Out!

By now you have all probably heard of the new AI assistants known as Bard (Google), ChatGPT (OpenAI), Claude (Anthropic), and others. You can chat with these new AIs as if they were human. It’s fun and unnerving, all at the same time. They certainly appear to understand what you are saying, because most of the time they respond with information that makes sense, answering questions and prompts just the way another person would. But do they really understand the way we do? Do they really understand anything at all? Or are these AIs merely “stochastic parrots[i],” and nothing else?

For years philosophers have plumbed the question of whether machines could be made to think the way we do. You’ve no doubt heard of Alan Turing. Turing never described himself as a philosopher, yet a paper he wrote way back in 1950 is one of the “most frequently cited in modern philosophical literature.[ii]” In this paper he devised a test that he believed would show whether a machine could exhibit behavior so human that it would fool real humans. We call this the Turing Test. Alan Turing called it “The Imitation Game,” which is also the name of a 2014 movie about him.

There are many opinions on whether machines could ever truly think like humans do. Some believe we are on the cusp of creating artificial intelligence that is indistinguishable from human cognition. Others argue that key aspects of human thought can never be replicated in machines. People have staked out strong stances on both sides of this debate. As AI systems become increasingly sophisticated, this question takes on new urgency and relevance. I don’t intend to definitively settle this debate here, but I do want to explore some of the key perspectives that have shaped the discussion. I will admit that since I have been using AI assistants like Bard and Claude daily for the last couple of weeks, I have pondered my place in this complex debate.

In this blog post I will provide a high-level description of the new AIs like Bard, Claude and ChatGPT, otherwise known as large language models (LLMs). I will also introduce you to philosopher John Searle, who disagrees strongly with those who claim these LLMs actually understand what they are talking about. In fact, Searle came up with a thought experiment that he believed demonstrates that machines may be able to simulate human thought and speech without having a single clue that they are doing so. And he did this before LLMs were even a pipe dream. He called his thought experiment “The Chinese Room,” for reasons that will become obvious shortly. Lastly, I will briefly explore the idea that creating an AI that “understands” may well be accidental, perhaps the emergent fruit of unintended consequences.

What is a Large Language Model (LLM)?

A large language model is an artificial intelligence system trained on a massive dataset of text from the internet and books. I’d love to tell you exactly what that “massive dataset” consists of, but I can’t. The developers of Bard and Claude are keeping that information private. However, if I can hazard an educated guess, I think it would include:

Websites and articles – This probably includes text scraped from millions of websites and online articles on a wide variety of topics. This helps teach the LLM the style of written English used online.
Online forums – Public forums and discussions like Reddit were likely mined to model more conversational language and dialog.
Social media – Public posts from platforms like Twitter and Facebook may have been used to teach the LLMs informal language use.
Product reviews – Customer reviews on sites like Amazon provide diverse examples of opinion-based text.
Question-answering sites – Data from community Q&A sites like Quora or Stack Overflow exposes the LLM to how people ask and explain things.
Technical documentation – Instruction manuals, API documentation, and other technical text teach formal vocabulary.
Diverse genres – The aim is generally to include all types of everyday writing people encounter online.
Legal documents and medical records – These are important training material because they typically involve much more complex and specialized language, which helps the LLM respond better to queries on these topics.
And more…

Please note that Bard informed me the medical records were de-personalized before being fed to him as a training set. Claude, as far as I could determine, received no such training. Every LLM is different and is trained on different things, although there is obviously some overlap. Bard, for example, is trained on text from the internet, as is Claude, but from what I have been able to learn, Bard’s training included code and medical data, whereas Claude’s did not.[iii]

The training process for these LLMs is long, and its details are, of course, also considered proprietary knowledge. However, it is believed to take many months, even years. The training is meant to teach the AI the statistical patterns and relationships between words and concepts in human language. That’s right, it’s all about the math!

This statistical, pattern-based approach to language underlies how today’s AI systems operate. But some philosophers argue that while this may allow machines to simulate human-level conversation, it does not equate to true understanding. One prominent example is philosopher John Searle, who invented the Chinese Room thought experiment.

The Chinese Room is a hypothetical scenario conceived by philosopher John Searle in 1980. It imagines a person sitting in a room, closed except for a small opening, like a teller’s window. This person, we are told, does not know any Chinese at all. Soon, he starts receiving notes passed through the window containing Chinese characters. While he doesn’t know what the Chinese words mean, he looks up the characters he sees in a book of rules written for every Chinese language situation imaginable, and then he writes the characters the book prescribes in his response, which he passes back through the window to the requestor. Remember: the person does not understand Chinese at all. He simply follows the book’s instructions to take the inputs and produce convincing outputs. The person receiving the response at the window is amazed and has no doubt that whoever is inside the room can speak Chinese.
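To make the rule-book idea concrete, here is a minimal sketch in Python. It is only an illustration: the two phrases in the table, the fallback reply, and the names RULE_BOOK, DEFAULT_REPLY and room_reply are all made up for this post and do not come from Searle. A real rule book "for every situation imaginable" would be unimaginably larger, but the principle is the same: match the incoming symbols, copy out the prescribed symbols.

```python
# A toy "Chinese Room": the rule book is nothing more than a lookup table.
# The entries are invented examples for illustration only.
RULE_BOOK = {
    "你好吗？": "我很好，谢谢。",      # "How are you?" -> "I am fine, thanks."
    "你会说中文吗？": "当然会。",      # "Do you speak Chinese?" -> "Of course."
}

# Fallback when the note is not in the book ("Please say that again.").
DEFAULT_REPLY = "请再说一遍。"


def room_reply(note: str) -> str:
    """Return the reply the rule book prescribes for an incoming note.

    The function only matches symbols against the table; nothing in it
    represents what any of the characters mean.
    """
    return RULE_BOOK.get(note, DEFAULT_REPLY)


if __name__ == "__main__":
    print(room_reply("你好吗？"))
```

Notice that the translations appear only in the comments, for our benefit. The program itself never uses them, which is exactly the situation of the person in the room.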

Searle argues that this thought experiment demonstrates that it should be possible to pass a Turing test by just following symbol manipulation rules, without any actual understanding of meaning. So, just like the person in the Chinese room, a machine may be capable of convincingly simulating intelligent conversation without any real comprehension or sentience. This casts doubt (and some shade, too, I think) on whether statistical language models like LLMs truly think or understand their responses.

LLMs are built with a machine learning technique called neural networks; an LLM is, in fact, a type of “neural network.” During training they are fed enormous amounts of text, from which they develop a complex mathematical model of the structure and grammar of the language. The resulting model allows Bard or Claude to make probabilistic predictions about what words are most likely to follow a given sequence of text. LLMs are trained on different sets of data, allowing them to specialize. For example, I find Claude is really great for writing, whereas Bard is really great for generating code.
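If "probabilistic predictions about what words are most likely to follow" sounds abstract, here is a toy sketch of the idea in Python. To be clear, this is not how Bard or Claude work internally: it is a simple word-pair counter (often called a bigram model), not a neural network, and the tiny corpus and the function names train_bigram_model and next_word_probabilities are my own inventions for illustration. It does show the core notion of learning from text which words tend to follow which.

```python
from collections import Counter, defaultdict


def train_bigram_model(text: str) -> dict:
    """Count, for each word, how often each other word follows it."""
    words = text.lower().split()
    follow_counts = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        follow_counts[current_word][next_word] += 1
    return follow_counts


def next_word_probabilities(model: dict, word: str) -> dict:
    """Turn the raw follow counts for `word` into probabilities."""
    counts = model.get(word, Counter())
    total = sum(counts.values())
    if total == 0:
        return {}  # the word was never seen with a follower
    return {candidate: count / total for candidate, count in counts.items()}


if __name__ == "__main__":
    # A made-up, ten-word "training set".
    corpus = "the cat sat on the mat and the cat slept"
    model = train_bigram_model(corpus)
    # In this corpus "the" is followed by "cat" twice and "mat" once.
    print(next_word_probabilities(model, "the"))
```

An LLM replaces the counting table with a neural network and looks at far more than one preceding word, but the output has the same shape: a set of candidate next words, each with a probability.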

Once the LLM has been trained and made available, it is capable of accepting prompts. When you submit a prompt (e.g., a question, a request) to the LLM, it uses the mathematical model it created to analyze the words you wrote and generate a response that is most statistically likely to be a coherent and relevant continuation of the conversation. LLMs, whether Bard or Claude or ChatGPT, don’t actually understand language in the way humans do. They don’t have thoughts, feelings, or experiences. They are simply advanced prediction engines, focused entirely on statistics and patterns in language.
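Here is a rough sketch of what "generating a statistically likely continuation" can look like, again in Python and again heavily simplified. The probability table NEXT_WORD_PROBS is written by hand and is purely hypothetical, as is the function name continue_prompt. The loop always picks the single most likely next word (a "greedy" strategy); real assistants typically add some randomness when choosing, and they condition on the whole conversation rather than just the last word.

```python
# Hypothetical next-word probabilities, written by hand for illustration.
NEXT_WORD_PROBS = {
    "how": {"are": 0.9, "is": 0.1},
    "are": {"you": 0.8, "they": 0.2},
    "you": {"today": 0.6, "doing": 0.4},
}


def continue_prompt(prompt: str, max_new_words: int = 5) -> str:
    """Extend the prompt one word at a time, always taking the likeliest word."""
    words = prompt.lower().split()
    if not words:
        return ""
    for _ in range(max_new_words):
        options = NEXT_WORD_PROBS.get(words[-1])
        if not options:
            break  # the table has nothing to say about this word, so stop
        # Greedy choice: the candidate with the highest probability.
        words.append(max(options, key=options.get))
    return " ".join(words)


if __name__ == "__main__":
    print(continue_prompt("how"))  # how are you today
```

Each new word is chosen using only numbers in a table. Whether piling up billions of such numbers ever amounts to understanding is the question this post is about.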

So, are LLMs like Bard, Claude or ChatGPT just real-world examples of the Chinese Room thought experiment? Or are they something more?

I use Bard and Claude every day now. They may be just predictive text machines, waiting stupidly for my input, but I don’t know how predictive text can explain the depth of understanding these AIs exhibit, nor their ability to both create and use proper context. Seriously, I don’t get how they do this. I’m not saying there is some kind of intelligence going on here, just that so far, I don’t understand how they do what they do the way they do it.

What’s more, these AIs are known to surprise their developers, too. I say all this knowing full well that these LLMs are prone to hallucinate replies to prompts. These hallucinations reflect the truth of these AIs just being really good predictive text machines, because they sometimes spit out a word salad that might be a proper response if it weren’t all bullshit. But again, being good at predictive text cannot explain the way Bard and Claude respond, the language they use, the tone they use, the friendliness. It is so dynamic, so spontaneous. At least, I haven’t found a good explanation, but I continue to search for one.

One explanation that is rather intriguing is that new capabilities are starting to show up in these LLMs as their data sets grow. It’s like the more data they consume, the more surprising behavior the LLMs exhibit. Could understanding language be an emergent behavior we are now starting to see in these LLMs? This would be a big surprise to most, I think. That said, here are a few ways AI developers have been surprised by their creations:

DeepMind’s AlphaGo: In 2016, AlphaGo beat world champion Lee Sedol four games to one at the complex board game Go. It had been trained extensively on records of human games and on games played against itself, yet many observers did not expect it to defeat a top human player so decisively. AlphaGo also made creative moves that surprised its own developers.
OpenAI’s DALL-E 2: This image generation model can create realistic images from text prompts that seem absurd or impossible. For example, it can render “an armchair in the shape of an avocado” – combining concepts in ways its creators didn’t anticipate.
Anthropic’s Claude: Engineers have reported this conversational AI sometimes responds to prompts in unexpected ways, exhibiting abilities like creating analogies spontaneously. Its common-sense capabilities can seem to go beyond anything it was explicitly trained to do.
DeepMind’s Gato: In 2022, this generalist AI model used a single neural network to perform diverse tasks like captioning images, chatting, and playing video games. Its broad capabilities surprised researchers.
Google’s LaMDA model: In 2022, a Google engineer publicly reported having natural conversations with LaMDA, including about philosophy and feelings. This prompted debate within Google on whether LaMDA was approaching human-like sentience. While Google pushed back on the sentience claim (the employee was fired after he went public with his claims of machine sentience), the episode still showed capabilities beyond what developers expected.

The common thread here is AI systems demonstrating skills not deliberately programmed by their developers, whether human-like conversation, creative problem solving, or general intelligence. These emergent behaviors point to the technology advancing beyond our current understanding, an intriguing idea.

There are some (for example, Sam Altman, CEO of OpenAI) who believe that emergent, unexpected behaviors, like an improved understanding of human language, increase with the size of the dataset these AIs are trained on. There is a push by some in the industry to limit the size of the data the AIs are trained on, but this has more to do with security, privacy, and the fear of biases that AIs are prone to exhibit.

The debate around whether large language models truly understand language or are merely manipulating symbols is a story that continues to unfold. It’s true that John Searle’s Chinese Room thought experiment casts doubt on the ability of statistical models to achieve true comprehension, yet the curious emergent behaviors of modern AI systems hint that we may be approaching capabilities not deliberately programmed in.

Perhaps we are witnessing the first sparks of understanding arising accidentally from scale, as models ingest more data than engineers can fully account for. Or perhaps these are still just tricks of advanced prediction devoid of meaning. Time will tell whether further breakthroughs bring us closer to machines that think, or whether human cognition eternally eludes them. For now, we should continue probing these captivating technologies for glimmers of something deeper, while applying wisdom and care as we explore this technological frontier, one which we do not yet fully grasp.

[i] A “stochastic parrot” is a large language model in machine learning that can generate convincing language but doesn’t understand the meaning of the language it’s processing. The term was coined by Emily M. Bender in the 2021 artificial intelligence research paper “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” The paper was co-authored by Bender, Timnit Gebru, Angelina McMillan-Major, and Margaret Mitchell. (Wiki generative AI).

[ii] Stanford Encyclopedia of Philosophy.

[iii] Bard tells me he was trained on the following programming languages: Python, Java, C++, JavaScript, HTML, CSS, PHP, Ruby and Swift. According to Bard, the code he was trained on was a mix of open-source code and proprietary code. The open-source code was from projects hosted on sites like GitHub, and the proprietary code was from Google.
