ChatGPT-4 – how smart is it today?

Introduction

Sometimes we need to check reality against the hype (including mine!). I personally feel that GPT-4, with the right prompts, can do many of the research and writing tasks that I can do, provided it has access to relevant and current information.[1] GPT-4 offers the potential of an amazing productivity boost to anyone who uses it regularly, and from my perspective, the more I use it, the more it improves my productivity.

There is no single definition of human intelligence, and there are many definitions of Artificial General Intelligence (AGI). A useful definition of intelligence was agreed by a group of 52 psychologists in 1994.

The consensus group defined intelligence as a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. This definition implies that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of cognitive skills and abilities. Building an artificial system that exhibits the kind of general intelligence captured by the 1994 consensus definition is a long-standing and ambitious goal of AI research.[2]

What does GPT-4 think?

Prompt: How would you compare GPT-4 intelligence and capabilities to humans? Do not include explanations as to how GPT-4 makes mistakes in its answers as so do humans.

GPT-4

I will provide five main comparison areas between GPT-4 and humans, highlighting advantages for each.

1. Language understanding and generation:

  • Human advantage: Humans can understand context, idioms, and cultural nuances in language more effectively than GPT-4.
  • GPT-4 advantage: GPT-4 can process and generate text in multiple languages simultaneously, transcending language barriers more efficiently than the average human.

2. Creativity and innovation:

  • Human advantage: Humans excel at creating original ideas and concepts, driven by their unique experiences, emotions, and personal perspectives.
  • GPT-4 advantage: GPT-4 can generate a vast number of diverse and creative outputs in a short time, offering a range of ideas and solutions that may not have been considered by humans.

3. Learning and adaptation:

  • Human advantage: Humans can learn from experiences, adapt to new situations, and apply their knowledge to novel contexts, enhancing their problem-solving abilities.
  • GPT-4 advantage: GPT-4 can process and analyze large amounts of data at an unprecedented speed, enabling rapid learning and adaptation to new information.

4. Emotional intelligence and empathy:

  • Human advantage: Humans possess a natural ability to empathize, allowing them to connect with others, understand emotions, and communicate effectively in emotionally charged situations.
  • GPT-4 advantage: GPT-4 can remain unbiased and objective in emotionally charged discussions, providing logical and rational solutions without being influenced by emotions.

5. Knowledge and expertise:

  • Human advantage: Human experts possess in-depth knowledge and experience in specific fields, enabling them to provide nuanced and context-aware insights.
  • GPT-4 advantage: GPT-4 has access to an extensive range of information from various domains, enabling it to provide quick and informed answers across a wide range of topics.

Conclusion: GPT-4 is an impressive AI, capable of surpassing humans in specific areas such as language processing, data analysis, and creative output generation. However, humans still excel in aspects like emotional intelligence, adaptability, and specialized expertise. Recognizing and leveraging the strengths of both humans and GPT-4 can lead to more effective collaboration and innovation in various fields.

I added follow-on questions, one of which asked GPT-4 to list its top five weaknesses in order of importance:

1. Lack of common sense and reasoning

2. Inability to understand context beyond text

3. Sensitivity to input phrasing

4. Verbose and over-optimized responses

5. Ethical concerns and harmful content

What does research into GPT-4 intelligence tell us?

Most pre-print research papers in AI are published on the Cornell University arXiv website.[3] A search for ‘GPT-4’ in the title field found 21 papers. Most of these papers looked at how GPT-4 performed in specific areas. However, an important paper, ‘Sparks of Artificial General Intelligence: Early experiments with GPT-4’, by a group of Microsoft researchers who had early access to GPT-4, looked at the model more broadly and assessed its capabilities against a broad range of cognitive tasks:

We have presented our initial exploration of GPT-4 across a wide range of tasks and domains, providing supporting evidence to the claim that GPT-4’s abilities are comparable to human-level for many of them.[4]

Comparisons of GPT-4 with human performance

The following research papers provide data that illustrate just how competent GPT-4 is on a range of professional accreditation-level tasks.

Plastic Surgery In-Service Training Exam

The authors of “GPT-4 to GPT-3.5: ‘Hold My Scalpel’ – A Look at the Competency of OpenAI's GPT on the Plastic Surgery In-Service Training Exam” found that GPT-4 scored highly on the plastic surgeon training exam. They noted that high scores on this exam ‘highly correlate with passing the written boards required to become a Board Certified Plastic Surgeon’.[5] The chart below highlights just how much GPT-4 improved on GPT-3.5.

[Chart: GPT-4 versus GPT-3.5 scores on the Plastic Surgery In-Service Training Exam]

Medical Challenge Problems

Another paper examined the ‘Capabilities of GPT-4 on Medical Challenge Problems’. The quote below shows just how impressed the authors were with the performance of GPT-4:

‘…maintain that GPT-4’s exceptional performance on benchmarks serves as an indicator of its potential for being harnessed in medical education and for aiding healthcare professionals with numerous aspects of healthcare delivery.’[6]

The chart below again highlights just how much GPT-4 outperforms GPT-3.5.

[Chart: GPT-4 versus GPT-3.5 performance on medical challenge benchmarks]

The authors of this report went on to conclude:

‘The rate of progress of LLMs has implications beyond the medical profession...large swaths of modern society are predicated on a “grand bargain” in which professional classes invest years or even decades in technical education and training and are, in turn, afforded certain benefits by citizens and governments, including the exclusive right to practice in their field, social prestige, and above-average compensation. Technical disruption of this social contract can have implications not only for the medical field but for numerous other knowledge-intensive professions including law, banking, engineering, accounting, and others.’[7]

Newtonian reasoning benchmarks

The paper ‘Advances in apparent conceptual physics reasoning in GPT-4’ also tested both GPT-3.5 and GPT-4 on a benchmark of Newtonian mechanics, the Force Concept Inventory (FCI). The authors stated:

‘In its first attempt, GPT-3.5 gave a correct answer for fifteen of thirty, or 50%. This is just short of the 60% benchmark suggested by the architects of the FCI to represent when a student has “barely begun to use Newtonian concepts coherently in their reasoning.” …GPT-4, by contrast, achieved a staggering 28/30 on its first attempt...Indeed, its responses come quite close to perfectly demonstrating expert-level competence, with a few very notable exceptions and limitations.'[8]

The only paper specifically mentioning GPT-4 that was critical of its performance against specific benchmarks was ‘Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4’.

'Our experiments show that both ChatGPT and GPT-4 are good at solving well-known logical reasoning reading comprehension benchmarks but struggle at handling out-of-distribution datasets. Their performances on natural language inference tasks requiring logical reasoning still need improving.'[9]

All the papers mention the risk that GPT-4 at times gives incorrect answers and doesn’t realise that these answers are just ‘hallucinations’.[10] I really don’t see a lot of difference between human and GPT-4 errors, although the peer-review process will reduce human errors. A fertile research topic would be comparing how often GPT-4 makes mistakes with how often specialists do, in the fields where GPT-4 already does better than most humans on the relevant tests.

My observations

Over the last eight weeks I have used GPT-4 extensively on a wide range of tasks. As I have noted before, using GPT-4 to write code has its problems: most of the Python code it has written for me has required debugging, which is time-consuming and painful when you don’t know how to program.

I have observed two problems with GPT-4 in its current iteration using ChatGPT Plus. The first is the prompt limit, which can be worked around by using the API. The prompt limit has made it impossible to do some detailed sentiment analysis that I wanted to try with some survey data; sending each survey response to the API separately avoids the limit, as sketched below. The second issue is that GPT-4 cannot remember my previous chat sessions, so while it tracks context well within a chat, it can’t reference my other chats for additional context. This is especially important for the programming tasks I have been getting GPT-4 to do for me.
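
A minimal sketch of this per-response approach looks something like the following, using the pre-1.0 openai Python package that is current at the time of writing; the file name, prompt wording and API key placeholder are illustrative only:

```python
import csv

import openai

openai.api_key = "YOUR_API_KEY"  # placeholder: set your own key

def classify_sentiment(response_text: str) -> str:
    """Send one survey response per call so each request stays well under the prompt limit."""
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Label the survey response as Positive, Neutral or Negative "
                        "and give a one-sentence reason."},
            {"role": "user", "content": response_text},
        ],
    )
    return completion["choices"][0]["message"]["content"]

# survey.csv is an illustrative file with one free-text response per row
with open("survey.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        print(classify_sentiment(row[0]))
```

Because each response is sent on its own, the overall size of the survey no longer matters; only individual responses need to fit within the prompt limit.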

Other than the restrictions above, I have found GPT-4 able to help me with everything I have used it for. One of the most impressive outcomes was exploring how academics could use GPT-4 to provide more detailed feedback on students’ assignments. This approach would require consent from students, with the relevant academics or teachers reviewing all feedback provided by GPT-4. An example of the sort of feedback that GPT-4 can provide is shown below:

GPT-4 Conclusion: Your conclusion provides a helpful summary of the strengths and weaknesses of your essay, and it focuses on providing helpful instructions on how to improve your work for future assessments. It is essential to provide recommendations for future work, and your conclusion does an excellent job of highlighting the importance of policymakers considering the potential ramifications of protectionist measures.
Overall, your essay is well-written, and you have demonstrated a strong understanding of the topic. To further improve your work, you may consider using more formal language in your writing, particularly for postgraduate level work. Additionally, try to avoid using colloquialisms and contractions in your writing, and make sure to check your work for any errors in grammar, spelling, or punctuation. Keep up the great work!

Conclusion

All the research papers above highlight just how much GPT-4 has improved compared to GPT-3.5. The question for all of us is just how quickly GPT-4 and other Generative AI models will continue to improve.

Humans sometimes fail to recognise the power of exponential growth, as this graph of population increase demonstrates:

[Graph: world population growth over time]

If we accept the hypothesis that the capability of AI is improving exponentially, the key question is how quickly it is doubling in capacity (a rough illustration of what different doubling times imply follows this list):

  • GPT-3.5 released November 2022
  • GPT-4 released March 2023
  • Open-source developers have solved the scaling problem, according to the leaked Google memo ‘We Have No Moat, And Neither Does OpenAI’[12] – May 2023
  • Bing Chat with history, embedded directly in the browser – public release – May 2023
  • Code Interpreter for ChatGPT – released to a wider audience – May 2023[13]
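
To make the doubling arithmetic concrete, here is a minimal sketch; the baseline and the candidate doubling times are hypothetical illustrations, not measured values:

```python
# Rough illustration of why the doubling time matters.
# The baseline and doubling times below are hypothetical, not measurements.

def capability_after(months: float, doubling_time_months: float, baseline: float = 1.0) -> float:
    """Exponential growth: capability doubles every doubling_time_months months."""
    return baseline * 2 ** (months / doubling_time_months)

for doubling_time in (4, 8, 12):
    factor = capability_after(24, doubling_time)
    print(f"Doubling every {doubling_time} months -> about {factor:.0f}x the baseline after two years")
```

Even the slowest of these hypothetical doubling times implies a fourfold improvement within two years; the fastest implies a sixty-four-fold improvement.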

Rick Molony, 8 May 2023

Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)

Endnotes

[1] OpenAI has provided limited access to two plug-ins which would address some of the limitations discussed in this article. Currently access is restricted as part of a limited alpha roll-out. I will revisit this conversation when I have access to an expanded GPT-4 plug-in trial.

[2] Microsoft Research paper, Sparks of Artificial General Intelligence: Early experiments with GPT-4, p.4. Note that this quote has been deleted from later versions of the paper.

[3] Cornell University, arXiv, title search for GPT-4, conducted 8 May 2023

[7] arXiv, Capabilities of GPT-4 on Medical Challenge Problems, p.22, 12 April 2023

[10] YouTube, IBM Technology, Why Large Language Models Hallucinate, 20 April 2023 – a 9 min 30 sec video that provides a very clear explanation of why LLMs hallucinate.

[12] Semianalysis.com, leaked Google internal document, ‘We Have No Moat, And Neither Does OpenAI’, published 4 May 2023. Essential reading, as the author outlines just how much progress has been made by the open-source community. I will be checking out the open-source alternatives to GPT-4 and Microsoft Bing.

[13] YouTube, Data Literacy, Trying out Code Interpreter for ChatGPT, released 3 May 2023 – very impressive new functionality. A new form of AI intelligence.
