If you’ve used a recent large language model (LLM) such as OpenAI's GPT-5, the model behind the latest ChatGPT, you might have noticed some frustrating issues. Beyond their well-known flaws, these models now take an unexpectedly long time to answer even simple questions. What's behind this delay?
The simple answer is that they've become overly cautious to minimize errors and avoid embarrassing mistakes. I experienced this first-hand when I gave an LLM a straightforward fill-in-the-blanks task:
" _ _, the British Prime Minister, _ the cabinet yesterday."
The initial response was: "Liz Truss, the British Prime Minister, appointed the cabinet yesterday."
When I pointed out that Liz Truss is no longer the Prime Minister, the chatbot "corrected" itself with an even more bizarre answer: "Rishi Sunak, the British Prime Minister, removed the cabinet yesterday." After that, I gave up trying to get it to update its information on British politics.
This issue isn’t limited to political trivia. Ask an AI chatbot a maths problem and it tends to give an accurate answer only when one already exists somewhere on the internet. If it can't find a solution, it will often spend minutes scanning the web before producing a long, convoluted "estimated solution" that is more often wrong than right. Just ask a teenager whether they still use AI for their maths homework: most have already got cold feet, and that particular form of cheating seems to be on the wane.
The examples of factual errors with the British Prime Minister aren't just simple mistakes; they're a perfect illustration of a core problem known as "AI hallucination." This term describes a phenomenon where a large language model generates an output that is plausible and grammatically correct, but is entirely made up, nonsensical, or factually incorrect. The model isn't "thinking" in the human sense; it's predicting the next most likely sequence of words based on its training data, even if that sequence leads to a false statement.
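To make that mechanism concrete, here is a deliberately toy Python sketch of next-word prediction. The candidate words and their scores are invented for illustration and come from no real model; the point is only that the continuation with the highest probability wins, and truth never enters the calculation.

```python
import math

# Invented scores a model might assign to candidate next words after
# "Liz Truss, the British Prime Minister," -- purely illustrative numbers.
candidate_scores = {
    "appointed": 4.1,
    "addressed": 3.7,
    "reshuffled": 3.2,
    "dissolved": 1.0,
}

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    exps = {word: math.exp(s) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

probs = softmax(candidate_scores)
next_word = max(probs, key=probs.get)

# The most probable continuation is emitted, whether or not the
# resulting sentence is factually correct.
print(next_word, round(probs[next_word], 2))
```

Real systems rank tens of thousands of possible tokens rather than four hand-picked words, but the selection principle is the same: statistical likelihood, not truth.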
This is arguably the most dangerous flaw in current LLMs. Unlike a human who might admit, "I don't know," or "I'm not sure," an AI chatbot is programmed to provide a confident answer. This overconfidence in the face of uncertainty can be deeply misleading. For users, it creates a false sense of trust, as the AI's response is often presented with the same authority as a verified fact. The problem is compounded when a user tries to correct the model, as I did: instead of recognizing its error and seeking a verified source, the model simply generates another plausible-sounding but equally incorrect response, as the Rishi Sunak example shows.
Despite claims that these models have all the data they need for complex tasks, the reality is far from it. They often lack the most current and accurate information, which is a major problem for a tool meant to provide reliable answers.
Beyond their flawed predictions, running these AI models is an incredibly costly and resource-intensive business. They need to constantly collect and analyse massive amounts of data from the internet to stay "updated." This process is not only expensive but also environmentally damaging. The sophisticated processors required to crunch this data consume huge amounts of electricity and require vast quantities of water for cooling.
Given these significant costs and an often disappointing user experience, generative AI apps are struggling to turn a profit. A recent MIT report was quite damning, finding that around 95% of corporate generative-AI pilots fail to deliver a measurable return.
In light of these developments, governments are becoming increasingly wary of integrating AI into critical decision-making processes. In fields like medicine, a human layer is being reintroduced as the final line of defence to prevent catastrophic errors. The technology's current limitations and the high risk of inaccurate outputs mean that a human "safety net" is becoming a necessity.
This deceptive nature has serious ethical implications, particularly in critical fields. Imagine a medical AI "hallucinating" a diagnosis, or a legal AI fabricating a case precedent. The confidence of the output could easily lead a human user to make a disastrous decision. It highlights the urgent need for a "human-in-the-loop" model, where a qualified expert provides a crucial final layer of oversight to verify, and if necessary, override the AI's output. The more these models are integrated into daily life, the more critical it is that we understand their inherent tendency to confidently invent information, and we build systems that protect against these unpredictable and costly errors.
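To illustrate the "human-in-the-loop" idea rather than any real product, here is a minimal Python sketch in which an AI suggestion is never acted on until a qualified reviewer explicitly approves it. The `model_suggest` function is a hypothetical stand-in for whatever model would actually be called, and the diagnosis text is invented.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    text: str
    confidence: float  # the model's self-reported confidence, not ground truth

def model_suggest(case_notes: str) -> Suggestion:
    """Hypothetical stand-in for a call to a diagnostic model."""
    return Suggestion(text="Diagnosis: condition X", confidence=0.92)

def human_review(suggestion: Suggestion) -> bool:
    """A qualified expert must explicitly approve before anything is acted on."""
    print(f"AI suggests: {suggestion.text} (confidence {suggestion.confidence:.0%})")
    return input("Approve? [y/N] ").strip().lower() == "y"

def decide(case_notes: str) -> str:
    suggestion = model_suggest(case_notes)
    if human_review(suggestion):
        return suggestion.text  # the expert, not the model, owns the decision
    return "Escalated for full human assessment"  # AI output set aside

if __name__ == "__main__":
    print(decide("patient presents with ..."))
```

The design choice that matters here is the default path: unless a human actively signs off, the model's output is set aside and the case escalates to full human assessment.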
In short, current large language models seem to be good at predicting the past, not the future. They are exercises in pattern recognition, serving up the most probable option according to the laws of probability rather than anything that stems from human-like thinking. In that context, the prospect of a machine thinking like a human, a being with a soul or spirit, still exists only in the realm of fantasy.