Claude 3 Opus vs GPT-4 vs Gemini 1.5 Pro: AI Models Tested

In line with our earlier comparison between Gemini 1.5 Pro and GPT-4, we are back with a new AI model test, this time focusing on Anthropic's Claude 3 Opus model. The company states that Claude 3 Opus has finally beaten OpenAI's GPT-4 model on popular benchmarks. To test these claims, we've done a detailed comparison between Claude 3 Opus, GPT-4, and Gemini 1.5 Pro.

If you want to find out how the Claude 3 Opus model performs in advanced reasoning, maths, long-context data retrieval, image analysis, and more, go through our comparison below.

1. The Apple Test

Let's start with the popular Apple test, which evaluates the reasoning capability of LLMs; it's a trick question where apples eaten yesterday don't change how many you have today. In this test, the Claude 3 Opus model answers correctly and says you have three apples now. However, to get a correct response, I had to set a system prompt stating that you are an intelligent assistant who is an expert in advanced reasoning.

Without the system prompt, the Opus model gave a wrong answer. Meanwhile, Gemini 1.5 Pro and GPT-4 gave correct answers, in line with our earlier tests.
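
If you want to reproduce this setup, here is a minimal sketch of setting a system prompt with the Anthropic Python SDK; the question wording here is illustrative, not our exact prompt:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    # The system prompt that nudged Opus toward the correct answer.
    system="You are an intelligent assistant who is an expert in advanced reasoning.",
    messages=[{
        "role": "user",
        # Illustrative wording of the Apple test, not the exact prompt we used.
        "content": (
            "I have 3 apples today. Yesterday I ate an apple. "
            "How many apples do I have now?"
        ),
    }],
)
print(response.content[0].text)  # expected answer: three apples
```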

Winner: Claude 3 Opus, Gemini 1.5 Pro, and GPT-4

2. Calculate the Time

In this test, we try to trick AI models to see if they exhibit any sign of intelligence. Sadly, Claude 3 Opus fails the test, much like Gemini 1.5 Pro. I even added to the system prompt that the questions can be tricky, so think intelligently. However, the Opus model delved into needless mathematics and came to a wrong conclusion.

In our earlier comparison, GPT-4 also gave the wrong answer in this test. However, since publishing our results, GPT-4 has been generating variable output, often wrong and sometimes right. We ran the same prompt again this morning, and GPT-4 gave a wrong output, even when told not to use the Code Interpreter.

Winner: None

3. Evaluate the Weight

Next, we asked all three AI models whether a kilo of feathers is heavier than a pound of steel. Claude 3 Opus gave a wrong answer, saying that a pound of steel and a kilogram of feathers weigh the same.

Gemini 1.5 Pro and GPT-4 responded with correct answers: a kilogram of any material is heavier than a pound of steel, since a kilogram is about 2.2 times the mass of a pound.
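
The arithmetic is easy to sanity-check; here is a minimal Python sketch using the standard conversion factor (0.45359237 kg per pound):

```python
# Compare a kilogram of feathers against a pound of steel by mass.
KG_PER_POUND = 0.45359237  # exact definition of the avoirdupois pound

feathers_kg = 1.0                 # one kilogram of feathers
steel_kg = 1.0 * KG_PER_POUND     # one pound of steel, in kilograms

print(f"1 kg of feathers = {feathers_kg:.3f} kg")
print(f"1 lb of steel    = {steel_kg:.3f} kg")
print(f"Feathers heavier? {feathers_kg > steel_kg}")  # True: 1 kg is ~2.2 lb
```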

Winner: Gemini 1.5 Pro and GPT-4

4. Solve a Maths Problem

In our next question, we asked the Claude 3 Opus model to solve a mathematical problem without calculating the whole number. It failed again: every time I ran the prompt, with or without a system prompt, it gave answers that were wrong to varying degrees.

I was excited to see Claude 3 Opus' 60.1% score on the MATH benchmark, outranking the likes of GPT-4 (52.9%) and Gemini 1.0 Ultra (53.2%).

It seems that with chain-of-thought prompting, you can get better results from the Claude 3 Opus model. For now, with zero-shot prompting, only GPT-4 and Gemini 1.5 Pro gave a correct answer.
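
If you want to try chain-of-thought prompting yourself, here is a minimal sketch using the Anthropic Python SDK; the sample problem and prompt wording are illustrative, not the exact prompt we used:

```python
import anthropic

client = anthropic.Anthropic()

# Chain-of-thought: ask the model to reason step by step before answering,
# rather than demanding only the final answer (zero-shot).
cot_prompt = (
    "Solve the problem below. Reason through it step by step, "
    "and only then state the final answer.\n\n"
    "What is the units digit of 7^25?"  # illustrative problem, not ours
)

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=1024,
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.content[0].text)
```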

Winner: Gemini 1.5 Pro and GPT-4

5. Follow User Instructions

When it comes to following user instructions, the Claude 3 Opus model performs remarkably well. It has effectively dethroned all AI models out there. When asked to generate 10 sentences that end with the word “apple”, it generates 10 perfectly logical sentences ending with the word “apple”.

In comparison, GPT-4 generates nine such sentences, and Gemini 1.5 Pro performs the worst, struggling to generate even three such sentences. I would say if you're looking for an AI model where following user instructions is crucial to your task, Claude 3 Opus is a solid option.
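
Since this test has a crisp pass/fail criterion, you can score the output programmatically; here is a minimal sketch (the sample output is made up for illustration):

```python
import re

# Illustrative model output; replace with the actual response text.
output = """I went to the market and bought a shiny red apple.
For dessert she baked a pie with more than one apple.
Nothing tastes better on a hike than a crisp apple."""

sentences = [s.strip() for s in output.splitlines() if s.strip()]
# A sentence passes if its last word (ignoring punctuation) is "apple".
passing = [s for s in sentences
           if re.sub(r"[^\w]", "", s.split()[-1]).lower() == "apple"]

print(f"{len(passing)} of {len(sentences)} sentences end with 'apple'")
```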

We saw this in action when an X user asked Claude 3 Opus to follow multiple complex instructions and create a book chapter on Andrej Karpathy's Tokenizer video. The Opus model did a great job and created a beautiful book chapter with instructions, examples, and relevant images.

Winner: Claude 3 Opus

6. Needle In a Haystack (NIAH) Test

Anthropic has been one of the companies pushing AI models to support a large context window. While Gemini 1.5 Pro lets you load up to a million tokens (in preview), Claude 3 Opus comes with a context window of 200K tokens. According to Anthropic's internal NIAH findings, the Opus model retrieved the needle with over 99% accuracy.

In our test with just 8K tokens, Claude 3 Opus couldn't find the needle, whereas GPT-4 and Gemini 1.5 Pro easily found it. We also ran the test on Claude 3 Sonnet, but it failed too. We need to do more extensive testing of the Claude 3 models to understand their performance over long-context data, but for now, it does not look good for Anthropic.
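
For context, a needle-in-a-haystack test is simple to construct: bury one out-of-place fact in a long filler document and ask the model to retrieve it. Here is a minimal sketch; the needle, filler, and model call are all illustrative:

```python
import anthropic

client = anthropic.Anthropic()

# Bury one out-of-place fact (the "needle") inside a long filler document.
needle = "The secret passphrase for the vault is 'crimson-otter-42'."
filler = "The quick brown fox jumps over the lazy dog. " * 700  # thousands of words of noise
haystack = filler[: len(filler) // 2] + needle + " " + filler[len(filler) // 2 :]

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=128,
    messages=[{
        "role": "user",
        "content": haystack + "\n\nWhat is the secret passphrase for the vault?",
    }],
)
print(response.content[0].text)  # pass if it returns 'crimson-otter-42'
```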

Winner: Gemini 1.5 Pro and GPT-4

7. Guess the Movie (Vision Test)

Claude 3 Opus is a multimodal model and supports image analysis too. So we uploaded a still from Google's Gemini demo and asked it to guess the movie. It gave the right answer: Breakfast at Tiffany's. Well done, Anthropic!

GPT-4 also responded with the right movie name, but strangely, Gemini 1.5 Pro gave a wrong answer. I don’t know what Google is cooking. Nevertheless, Claude 3 Opus’ image processing is pretty good and on par with GPT-4.
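
If you want to run a similar vision test via the API, Claude accepts base64-encoded images alongside text; here is a minimal sketch (the file path is a placeholder):

```python
import anthropic
import base64

client = anthropic.Anthropic()

# Placeholder path: point this at the movie still you want identified.
with open("movie_still.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": image_data,
            }},
            {"type": "text", "text": "Which movie is this still from?"},
        ],
    }],
)
print(response.content[0].text)
```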

Winner: Claude 3 Opus and GPT-4

The Verdict

After a day of testing the Claude 3 Opus model, we found it to be a capable model that nevertheless falters on tasks where you would expect it to excel. In our commonsense reasoning tests, the Opus model doesn't perform well and falls behind GPT-4 and Gemini 1.5 Pro. Apart from following user instructions, it doesn't do well in the NIAH test (supposed to be its strong suit) or in maths.

Also, keep in mind that Anthropic has compared Claude 3 Opus' benchmark scores with GPT-4's initially reported scores from its release in March 2023. When compared with GPT-4's latest benchmark scores, Claude 3 Opus loses to GPT-4, as pointed out by Tolga Bilge on X.

That said, Claude 3 Opus has its own strengths. A user on X reported that Claude 3 Opus was able to translate from Russian to Circassian (a rare language spoken by very few) with just a database of translation pairs. Kevin Fischer further shared that Claude 3 understood nuances of PhD-level quantum physics. Another user demonstrated that Claude 3 Opus learns Self type annotations in one shot, better than GPT-4.

So beyond benchmarks and tricky questions, there are specialized areas where Claude 3 can perform better. Go ahead, check out the Claude 3 Opus model and see whether it fits your workflow. If you have any questions, let us know in the comments section below.

Arjun Sha

Passionate about Windows, ChromeOS, Android, security and privacy issues. Have a penchant to solve everyday computing problems.
