The Limitations of GPT-4 — LessWrong (2024)

Amidst the rumours about a new breakthrough at OpenAI, I thought I'd better publish this draft before it gets completely overtaken by reality. It is essentially a collection of "gaps" between GPT-4 and the human mind. Unfortunately, the rumours around Q* force me to change the conclusion from "very short timelines seem unlikely" to "who the f**k knows".

While GPT-4 has a superhuman breadth of knowledge, writing speed, and short-term memory, it also has a number of important limitations compared to the human mind.

Some of these will be overcome in the near future because they depend on engineering and training data choices. Others seem more fundamental to me, because they are due to the model architecture and the training setup.

These fundamental limitations are why I do not expect further scaling of GPT to lead to AGI. In fact, I interpret continued scaling of the exact current paradigm as evidence that overcoming these limitations is hard.

I expect a scaled-up GPT-4 to exhibit the same strengths and weaknesses, with the improved strengths papering over the old weaknesses in at most a superficial fashion.

I also expect that with further scaling, the tasks GPT cannot do will increasingly load on these fundamental limitations, leading to diminishing returns.

This list is not exhaustive, and there are likely ways to frame even the limitations I identify in a more insightful or fruitful way. For example, I am not sure how to interpret GPT-4's curious inability to understand humor. I hope further limitations will be mentioned in the comments.

Integration of the Senses

GPT-4 cannot really hear, and it cannot really talk.

Voice input is transcribed into text by a separate model, Whisper, and then fed to GPT-4. The output is read aloud by yet another model. This pipeline loses the nuances of the input: pronunciation, emphasis, emotion, accent, and so on. Likewise, GPT-4 cannot modulate its output in terms of speed, rhythm, melody, singing, emphasis, accent, or onomatopoeia.

In fact, due to tokenization it would be fair to say that GPT-4 also cannot really read: all information about the character-level structure that is relevant to spelling, rhyming, and pronunciation must be inferred during training.
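To make this concrete, here is a toy illustration (a made-up two-entry vocabulary, not GPT-4's actual tokenizer): a subword tokenizer turns text into opaque integer IDs, so character-level facts, such as how many r's are in "strawberry", are not directly visible in the model's input.

```python
# Toy subword tokenizer (hypothetical vocabulary, not GPT-4's real one):
# text becomes opaque integer IDs, hiding character-level structure.
vocab = {"straw": 101, "berry": 102, " ": 103}

def tokenize(text: str) -> list[int]:
    """Greedy longest-match segmentation over the toy vocabulary."""
    ids, i = [], 0
    while i < len(text):
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(vocab[piece])
                i += len(piece)
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

ids = tokenize("strawberry")
print(ids)  # [101, 102] -- two opaque IDs
# The model receives only these IDs; counting the three r's requires the
# spelling of each token to have been inferred during training.
```

Real tokenizers use much larger byte-pair vocabularies, but the consequence is the same: the characters inside a token are not part of the model's input.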

The vision component of most open multimodal models is likewise "grafted on": it is trained separately and then connected to the large language model (LLM) and fine-tuned, for example via the CLIP model, which maps images and their descriptions into the same vector space.

This means GPT-4 may not have access to the exact position of objects or to fine details, and it cannot "take a closer look" the way humans do.
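The core mechanic of such a shared embedding space can be sketched in a few lines (toy 3-d vectors with made-up values, not real CLIP embeddings): an image and its best-matching caption are the pair with the highest cosine similarity, and the LLM downstream sees only this compressed vector, not the pixels.

```python
import math

# CLIP-style matching sketch (hypothetical 3-d embeddings): images and
# captions live in one vector space; the best caption maximizes cosine
# similarity. Fine spatial detail is lost in the compressed vector.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

image_embedding = [0.9, 0.1, 0.0]  # hypothetical image-encoder output
captions = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a cat": [0.1, 0.9, 0.2],
}
best = max(captions, key=lambda c: cosine(image_embedding, captions[c]))
print(best)  # "a photo of a dog"
```

Real CLIP embeddings have hundreds of dimensions and are trained contrastively on image–caption pairs, but the matching step is exactly this similarity comparison.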

I expect these limitations to largely vanish as models are scaled up and trained end-to-end on a large variety of modalities.

System 2 Thinking

Humans not only think quickly and intuitively but also engage in slow, reflective thinking to process complex issues. GPT-4's architecture is not meaningfully recurrent; it has a limited number of processing steps for each token, putting a hard cap on sequential thought.

This contrast with human cognition is most evident in GPT-4's unreliable counting ability, but it also shows up in many other tasks. The lack of System 2 thinking may be the most fundamental limitation of current large language models.
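The hard cap on sequential computation can be caricatured as follows (an illustration of fixed-depth computation, not transformer internals): a fixed-depth procedure gets the same number of sequential steps regardless of problem size, while System 2 reasoning can loop for as long as the problem demands.

```python
# Toy contrast: fixed-depth computation vs. an unbounded loop.
# A transformer's layer count plays the role of `depth` here -- the same
# budget of sequential steps per token, no matter how hard the problem is.

def fixed_depth_count(items, depth=4):
    """Counts items, but is only allowed `depth` sequential steps."""
    count = 0
    for _, _item in zip(range(depth), items):
        count += 1
    return count  # wrong whenever len(items) > depth

def unbounded_count(items):
    """A loop that runs until done: steps adapt to the input."""
    count = 0
    for _ in items:
        count += 1
    return count

print(fixed_depth_count("aaaaaaa"))  # 4 -- capped by depth; the true answer is 7
print(unbounded_count("aaaaaaa"))    # 7
```

The standard workaround is chain-of-thought prompting: each generated token buys another forward pass, so emitting intermediate steps effectively rents extra sequential computation.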

Learning during Problem Solving

Humans rewire their brains through thinking; synapses are continuously formed and broken down. When we suddenly understand something, that realization often lasts a lifetime. GPT-4, once trained, does not change during use.

It learns neither from its mistakes nor from correctly solved problems. It notably lacks an optimization step in problem solving that would ensure that previously unsolvable problems become solvable and that this problem-solving ability persists.

The fundamental difference is that in humans, the correct representations for a given problem are worked out during the problem-solving process and then usually persist. GPT-4 relies on the representations learned during training, so genuinely new problems stay out of reach.

Even retraining doesn't solve this issue, because GPT-4 would need many similar problems and their solutions to learn the necessary representations.
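The contrast can be caricatured with two hypothetical classes (not the GPT-4 API): a frozen model answers a problem the same way forever, discarding feedback, while a learner that stores what it worked out solves the same problem correctly ever after.

```python
# Toy contrast (hypothetical classes): frozen inference vs. a learner
# whose problem-solving insights persist.

class FrozenModel:
    def solve(self, problem, feedback=None):
        # Weights are fixed after training; feedback has nowhere to go.
        return "best guess"

class PersistentLearner:
    def __init__(self):
        self.memory = {}  # representations worked out while solving

    def solve(self, problem, feedback=None):
        if feedback is not None:
            self.memory[problem] = feedback  # an insight, once gained, persists
        return self.memory.get(problem, "best guess")

learner = PersistentLearner()
print(learner.solve("p1"))                         # "best guess" -- first attempt
learner.solve("p1", feedback="correct answer")     # works it out once...
print(learner.solve("p1"))                         # "correct answer" -- ...and it sticks

frozen = FrozenModel()
frozen.solve("p1", feedback="correct answer")
print(frozen.solve("p1"))                          # still "best guess"
```

A real fix would need something stronger than a lookup table, of course; the point is only where the learned representation lives and whether it survives the episode.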

Compositionality and Extrapolation

Some theories suggest that the human neocortex, the seat of intelligence, uses most of its capacity to model the interplay of objects, parts, concepts, and sub-concepts. This ability to abstractly model the interplay of parts allows for better extrapolation and learning from significantly less data.

In contrast, GPT-4 learns the statistical interplay between words. Small changes in vocabulary can significantly influence its output, and it requires a vast amount of data to learn connections because it lacks an inductive bias for compositionality.
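What an inductive bias for compositionality buys can be sketched with a mini-grammar (in the spirit of the SCAN benchmark; the grammar here is made up): a system that builds meaning from parts can execute a combination it has never seen, instead of needing examples of every phrase.

```python
# Toy compositional interpreter (hypothetical mini-grammar): meaning is
# built from primitives and modifiers, so novel combinations generalize.

primitives = {"jump": ["JUMP"], "walk": ["WALK"]}
modifiers = {"twice": 2, "thrice": 3}

def execute(command: str) -> list[str]:
    action, *rest = command.split()
    steps = primitives[action]
    if rest:
        steps = steps * modifiers[rest[0]]
    return steps

# Suppose training only ever showed "jump", "walk", and "walk twice".
# The unseen combination "jump thrice" still works, because the meaning
# of the whole is composed from the meanings of the parts.
print(execute("jump thrice"))  # ['JUMP', 'JUMP', 'JUMP']
```

A purely statistical learner would instead need enough examples of each combination to infer the pattern, which is exactly the data-hunger the section describes.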

Limitations due to the Training Setup

Things absent from the training data are beyond the model's reach, including many visual and acoustic phenomena and, especially, physical interaction with the world.

GPT-4 does not possess a physical, mechanical, or intuitive understanding of many world aspects. The world is full of details that become apparent only when one tries to perform tasks within it. Humans learn from their interaction with the world and are evolutionarily designed to act within it. GPT-4 models data, and there is nothing beyond data for it.

This results in inconsistent decisions, an inability to robustly pursue goals, and no understanding of, or even need for, changing things in the world. The input stands alone; for the model it does not represent a real-world situation.

GPT-4's causal knowledge is merely meta-knowledge stored in text. Learning causal models of new systems would require interaction with the system and feedback from it. Due to this missing feedback, there is little optimization pressure against hallucinations.
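Why interaction matters can be shown with a minimal structural causal model (toy setup with made-up numbers): when a hidden confounder drives both X and Y, passively conditioning on X gives a completely different answer than intervening to set X, and only the latter reveals the true causal effect. A model trained purely on observational data never gets to perform the intervention.

```python
import random

# Toy structural causal model: hidden Z causes both X and Y; X has no
# effect on Y. Observation (conditioning) and intervention (do-operator)
# then disagree, and telling them apart requires acting on the system.
random.seed(0)

def sample(do_x=None):
    z = random.random() < 0.5            # hidden common cause
    x = z if do_x is None else do_x      # X copies Z unless we intervene
    y = 1 if z else 0                    # Y depends only on Z, not on X
    return x, y

# Observational: among samples where X happens to be 1, Y is always 1.
obs = [y for x, y in (sample() for _ in range(10000)) if x == 1]
print(sum(obs) / len(obs))               # 1.0 -- X "predicts" Y perfectly

# Interventional: forcing X=1 leaves Y at its base rate of about 0.5.
iv = [y for _, y in (sample(do_x=1) for _ in range(10000))]
print(round(sum(iv) / len(iv), 2))       # ~0.5 -- X has no causal effect on Y
```

Text corpora are overwhelmingly observational in this sense: they record what co-occurred, not what would have happened under an intervention.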

Conclusion

Some of these points probably interact or will be solved by the same innovation. System 2 thinking is probably necessary to move the parts of concepts around while looking for a solution to a problem.

The limitations due to the training setup might be solved by a different one, but that means forgoing cheap and plentiful data. The ability to learn from little data will be required to learn from modalities other than abundant, information-dense text.

It is very unclear to me how difficult these problems are to solve. But I also haven’t seen realistic approaches to tackle them. Every passing year makes it more likely that these problems are hard to solve.

Very short timelines seemed unlikely to me when I wrote this post, but Q* could conceivably solve "System 2 thinking" and/or "learning during problem solving", which might be enough to put GPT-5 over the threshold of "competent human" in many domains.
