GPT-4 Isn't 95.2% Worse Now

GPT-4 Isn't 95.2% Worse Now

A paper from Stanford & Berkeley University being spread around as proof that “GPT-4 is getting dumber” makes no sense.

Here’s the study: “How Is ChatGPT’s Behavior Changing over Time?”

Let me start off by saying that other users have experienced that it does feel inferior and I don’t doubt that. But quoting a 95.2% drop (in that ONE example) is unfair.

Let’s address each of the tasks tested for (solving math problems, answering sensitive/dangerous questions, generating code and visual reasoning)

Solving Math Problems

The paper does state that GPT didn’t provide steps in the June version (even with Chain-of-Thought prompting), so perhaps that’s where it failed. This does indicate that it’s worse in some sense, but math is also one of the weak points for AI.

Answering Sensitive/Dangerous Questions

The rate of answering such questions is something that OpenAI wants to reduce (and so it has). GPT-3.5 did have an increased answer rate in June, which suggests filtering does need to be improved. But answering these (or not) doesn’t mean models are getting dumber.

Generating Code

This section showed GPT-4’s executable code rate reducing, due to the model adding three backticks before the code, thus making it not directly executable.

This wasn’t fair in my opinion since most users use ChatGPT’s web interface, so the model is expected to add code tags to add syntax highlighting and separate it from other model output (if any).

Visual Reasoning

This section just shows both models increasing in rates.


So, people using the findings of the paper as “GPT-4 is getting dumber” doesn’t make sense at all. (And in case you’re wondering, the “97.6% to 2.4%” prime number example is very, very widespread)

Is GPT-4 getting worse over time? (