Introduction
Sometimes we need to check reality against the hype (including mine!). I personally feel that GPT-4, with the right prompts, can do many of the research and writing tasks that I can do, provided it has access to relevant and current information.[1] GPT-4 offers the potential of an amazing productivity boost to anyone who uses it regularly, and from my perspective, the more I use it, the more it improves my productivity.
There is no single definition of human intelligence, and there are many definitions of Artificial General Intelligence (AGI). A useful definition was agreed upon by a group of 52 psychologists in 1994.
The consensus group defined intelligence as a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience. This definition implies that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of cognitive skills and abilities. Building an artificial system that exhibits the kind of general intelligence captured by the 1994 consensus definition is a long-standing and ambitious goal of AI research.[2]
What does GPT-4 think?
Prompt: How would you compare GPT-4 intelligence and capabilities to humans? Do not include explanations noting that GPT-4 makes mistakes in its answers, as humans do too.
GPT-4
I will provide five main comparison areas between GPT-4 and humans, highlighting advantages for each.
1. Language processing:
2. Creativity and innovation:
3. Learning and adaptation:
4. Emotional intelligence and empathy:
5. Knowledge and expertise:
Conclusion: GPT-4 is an impressive AI, capable of surpassing humans in specific areas such as language processing, data analysis, and creative output generation. However, humans still excel in aspects like emotional intelligence, adaptability, and specialized expertise. Recognizing and leveraging the strengths of both humans and GPT-4 can lead to more effective collaboration and innovation in various fields.
I asked follow-on questions, including asking GPT-4 to list its top five weaknesses in order of importance:
1. Lack of common sense and reasoning
2. Inability to understand context beyond text
3. Sensitivity to input phrasing
4. Verbose and over-optimized responses
5. Ethical concerns and harmful content
What does research into GPT-4 intelligence tell us?
Most pre-print research papers in the AI area are published on the Cornell University arXiv website.[3] A search for ‘GPT-4’ in the title field found 21 papers. Most of these papers looked at how GPT-4 performed in specific areas. However, an important paper, ‘Sparks of Artificial General Intelligence: Early experiments with GPT-4’, by a group of Microsoft researchers who had early access to GPT-4, took a broader view and assessed its capabilities against a wide range of cognitive tasks:
We have presented our initial exploration of GPT-4 across a wide range of tasks and domains, providing supporting evidence to the claim that GPT-4’s abilities are comparable to human-level for many of them.[4]
Comparisons of GPT-4 with human performance
The following research papers provide data which illustrates just how competent GPT-4 is on a range of professional-accreditation-level tasks.
Plastic Surgery In-Service Training Exam
The authors of “GPT-4 to GPT-3.5: 'Hold My Scalpel' -- A Look at the Competency of OpenAI's GPT on the Plastic Surgery In-Service Training Exam” found that GPT-4 scored highly on the plastic surgeon training exam. They noted that high scores on this exam ‘highly correlate with passing the written boards required to become a Board Certified Plastic Surgeon’.[5] The chart below highlights just how much GPT-4 improved on GPT-3.5.
Medical Challenge Problems
Another paper examined the ‘Capabilities of GPT-4 on Medical Challenge Problems’. The quote below shows just how impressed the authors were with the performance of GPT-4:
‘…maintain that GPT-4’s exceptional performance on benchmarks serves as an indicator of its potential for being harnessed in medical education and for aiding healthcare professionals with numerous aspects of healthcare delivery.’[6]
The chart below again highlights just how much GPT-4 outperforms GPT-3.5.
The authors of this report went on to conclude:
‘The rate of progress of LLMs has implications beyond the medical profession...large swaths of modern society are predicated on a “grand bargain” in which professional classes invest years or even decades in technical education and training and are, in turn, afforded certain benefits by citizens and governments, including the exclusive right to practice in their field, social prestige, and above-average compensation. Technical disruption of this social contract can have implications not only for the medical field but for numerous other knowledge-intensive professions including law, banking, engineering, accounting, and others.’[7]
Newtonian reasoning benchmarks
The paper ‘Advances in apparent conceptual physics reasoning in GPT-4’ also tested both GPT-3.5 and GPT-4 on a benchmark test - Newtonian Mechanics on the Force Concept Inventory. The authors stated:
‘In its first attempt, GPT-3.5 gave a correct answer for fifteen of thirty, or 50%. This is just short of the 60% benchmark suggested by the architects of the FCI to represent when a student has “barely begun to use Newtonian concepts coherently in their reasoning.” …GPT-4, by contrast, achieved a staggering 28/30 on its first attempt...Indeed, its responses come quite close to perfectly demonstrating expert-level competence, with a few very notable exceptions and limitations.'[8]
The only paper specifically mentioning GPT-4 that was critical of its performance against specific benchmarks was ‘Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4’.
'Our experiments show that both ChatGPT and GPT-4 are good at solving well-known logical reasoning reading comprehension benchmarks but struggle at handling out-of-distribution datasets. Their performances on natural language inference tasks requiring logical reasoning still need improving.'[9]
All the papers mention the risk that GPT-4 at times gives incorrect answers and doesn’t realise that these answers are just ‘hallucinations’.[10] I really don’t see a lot of difference between human and GPT-4 errors, although the peer-review process will reduce human errors. A fertile research topic would be comparing how often GPT-4 makes mistakes relative to specialists in all the relevant fields where GPT-4 does better than most humans on relevant tests.
My observations
Over the last eight weeks I have used GPT-4 extensively on a wide range of tasks. As I have noted before, using GPT-4 to write code has problems: most of the Python code it has written for me has required debugging, which is time-consuming and painful when you don’t know how to program.
I have observed two problems with GPT-4 in its current iteration using ChatGPT Plus. The first is the prompt limit, which can be worked around by using the API. The prompt limit has made it impossible to do some detailed sentiment analysis that I wanted to try with some survey data. The second is GPT-4’s inability to remember my previous chat sessions: while it tracks context well within a chat, it can’t reference my other chats to give it more context. This is especially important with the programming tasks I have been getting GPT-4 to do for me.
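One common way to work around a prompt limit for a task like sentiment analysis is to split the survey responses into batches that each fit within a single prompt. Below is a minimal sketch of that idea; the ~4-characters-per-token heuristic and the token budget are rough assumptions for illustration, not values documented by OpenAI:

```python
# Sketch: batching survey responses so each prompt stays under a token
# budget. The ~4 characters-per-token heuristic and the budget figure
# are rough assumptions, not official values.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: ~4 characters per token for English."""
    return max(1, len(text) // 4)

def batch_responses(responses, max_tokens=6000):
    """Group survey responses into batches that each fit one prompt."""
    batches, current, used = [], [], 0
    for response in responses:
        cost = estimate_tokens(response)
        # Start a new batch when adding this response would exceed the budget.
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append(response)
        used += cost
    if current:
        batches.append(current)
    return batches

surveys = ["Great service overall."] * 50
batches = batch_responses(surveys, max_tokens=500)
print(len(batches), "prompts needed")
```

Each batch could then be sent to the API as its own prompt, with the per-batch sentiment summaries combined in a final pass. A real implementation would use a proper tokenizer rather than the character-count guess above.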
Other than the restrictions above, I have found GPT-4 able to help me with everything I have used it for. One of the most impressive outcomes was looking at how GPT-4 could be used by academics to help provide more detailed feedback on students’ assignments. This approach would require consent from students, with the relevant academics or teachers reviewing all feedback provided by GPT-4. An example of the sort of feedback that GPT-4 can provide is shown below:
GPT-4 Conclusion: Your conclusion provides a helpful summary of the strengths and weaknesses of your essay, and it focuses on providing helpful instructions on how to improve your work for future assessments. It is essential to provide recommendations for future work, and your conclusion does an excellent job of highlighting the importance of policymakers considering the potential ramifications of protectionist measures.
Overall, your essay is well-written, and you have demonstrated a strong understanding of the topic. To further improve your work, you may consider using more formal language in your writing, particularly for postgraduate level work. Additionally, try to avoid using colloquialisms and contractions in your writing, and make sure to check your work for any errors in grammar, spelling, or punctuation. Keep up the great work!
Conclusion
All the research papers above highlight just how much GPT-4 has improved compared to GPT-3.5. The question for all of us is just how quickly GPT-4 and other Generative AI models will continue to improve.
Humans sometimes fail to recognise the power of exponential growth, as this graph of population increase demonstrates:
If we accept the hypothesis that the capability of AI is improving exponentially, the key question is how quickly it is doubling in capacity:
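To make the doubling question concrete, a small piece of arithmetic shows why the doubling time matters so much. The doubling times below are purely hypothetical, chosen to illustrate the sensitivity of the outcome, not measured rates of AI progress:

```python
# Illustration of exponential growth under different hypothetical
# doubling times. The doubling times are assumptions chosen for the
# sake of arithmetic, not measured rates of AI progress.

def growth_factor(years: float, doubling_time: float) -> float:
    """Factor by which an exponentially growing quantity multiplies."""
    return 2 ** (years / doubling_time)

for d in (2, 1, 0.5):  # hypothetical doubling times in years
    print(f"doubling every {d} yr -> x{growth_factor(10, d):,.0f} over 10 years")
```

A capability doubling every two years grows 32-fold in a decade; doubling every six months, it grows more than a million-fold. The gap between those scenarios is exactly why the doubling rate, not just the fact of exponential growth, is the key question.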
Rick Molony, 8 May 2023
Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Endnotes
[1] OpenAI has provided limited access to two plug-ins which would address some of the limitations discussed in this article. Currently access is restricted as part of a limited alpha roll-out. I will revisit this conversation when I have access to an expanded GPT-4 plug-in trial.
[2] Microsoft Research Paper, Sparks of Artificial General Intelligence: Early experiments with GPT-4, p.4. Note in later versions of this paper this quote has been deleted from the paper.
[3] Cornell University, arXiv, title search for GPT-4, conducted 8 May 2023
[4] arXiv, Sparks of Artificial General Intelligence: Early experiments with GPT-4, v.5, 13 April 2023
[5] arXiv, GPT-4 to GPT-3.5: 'Hold My Scalpel' -- A Look at the Competency of OpenAI's GPT on the Plastic Surgery In-Service Training Exam, 4 April 2023
[6] arXiv, Capabilities of GPT-4 on Medical Challenge Problems, 12 April 2023
[7] arXiv, Capabilities of GPT-4 on Medical Challenge Problems, p.22, 12 April 2023
[8] arXiv, Advances in apparent conceptual physics reasoning in GPT-4, 16 April 2023
[9] arXiv, Evaluating the Logical Reasoning Ability of ChatGPT and GPT-4, 20 April 2023
[10] YouTube, IBM Technology, Why Large Language Models Hallucinate, 20 April 2023 – 9 min 30 sec video that provides a very clear explanation as to why LLMs hallucinate.
[11] Abby Watkins, Why It’s Important to Understand Exponential Growth in the Modern World, 28 July 2021
[12] Semianalysis.com, Leaked Google internal document ‘Google "We Have No Moat, And Neither Does OpenAI", published 4 May 2023. Essential reading as the author outlines just how much progress has been made by the open-source community. I will be checking out the open-source alternatives to GPT-4 and Microsoft Bing.
[13] YouTube, Data Literacy, Trying out Code Interpreter for ChatGPT, released 3 May 2023 – very impressive new functionality. A new form of AI intelligence.