Meet Groq, a Lightning-Fast AI Accelerator That Beats ChatGPT and Gemini

While using ChatGPT, especially with the GPT-4 model, you must have noticed how slowly the model responds to queries. Not to mention, voice assistants built on large language models, like ChatGPT's Voice Chat feature or the recently released Gemini AI, which replaced Google Assistant on Android phones, are even slower due to the high latency of LLMs. But all of that is likely to change soon, thanks to Groq's powerful new LPU (Language Processing Unit) inference engine.

Groq has taken the world by surprise. Mind you, this is not Elon Musk's Grok, the AI model available on X (formerly Twitter). Groq's LPU inference engine can generate a massive 500 tokens per second when running a 7B model, and around 250 tokens per second when running a 70B model. That is a far cry from OpenAI's ChatGPT, which runs on Nvidia GPUs and offers around 30 to 60 tokens per second.

Groq is Built by Ex-Google TPU Engineers

Groq is not an AI chatbot but an AI inference chip, and it's competing against industry giants like Nvidia in the AI hardware space. It was co-founded by Jonathan Ross in 2016; while at Google, Ross had co-founded the team that built Google's first TPU (Tensor Processing Unit) chip for machine learning.

Later, many employees left Google's TPU team and created Groq to build hardware for next-generation computing.

What is Groq’s LPU?

The reason Groq's LPU engine is so fast compared to established players like Nvidia is that it's built on an entirely different approach.

According to CEO Jonathan Ross, Groq first created the software stack and compiler and then designed the silicon. It went with a software-first mindset to make the performance "deterministic", a key requirement for fast, accurate, and predictable results in AI inferencing.

As for Groq's LPU architecture, it's similar to how an ASIC (application-specific integrated circuit) works, and it's built on a 14nm node. It's not a general-purpose chip for all kinds of complex tasks; instead, it's custom-designed for a specific task, which, in this case, is dealing with sequences of data in large language models. CPUs and GPUs, on the other hand, can do a lot more, but that flexibility comes at the cost of delayed performance and increased latency.

"Groq is a Radically Different kind of AI architecture. Among the new crop of AI chip startups, Groq stands out with a radically different approach centered around its compiler technology for optimizing a minimalist yet high-performance architecture. Groq's secret sauce is this…" (Carlos E. Perez, @IntuitMachine, February 20, 2024)

And with a tailored compiler that knows exactly how the chip's instruction cycle works, latency is reduced significantly. The compiler takes the instructions and assigns them to the right place, reducing latency further. Not to forget, every Groq LPU chip comes with 230MB of on-die SRAM to deliver high performance and low latency with much better efficiency.

Coming to the question of whether Groq chips can be used for training AI models: as mentioned above, the LPU is purpose-built for AI inferencing. It doesn't feature any high-bandwidth memory (HBM), which is required for training and fine-tuning models.

Groq also states that HBM leads to non-determinism in the overall system, which adds to latency. So no, you can't train AI models on Groq LPUs.

We Tested Groq’s LPU Inference Engine

You can head to Groq's website (groq.com) to experience its blazing-fast performance without needing an account or subscription. Currently, it hosts two AI models: Llama 2 70B and Mixtral-8x7B. To check the LPU's performance, we ran a few prompts on the Mixtral-8x7B-32K model, which is one of the best open-source models out there.

Groq's LPU generated a great output at a speed of 527 tokens per second, taking only 1.57 seconds to generate 868 tokens (3846 characters) on a 7B model. On a 70B model, its speed drops to 275 tokens per second, but that's still much higher than the competition.

To compare Groq's AI accelerator performance, we ran the same test on ChatGPT (GPT-3.5, a 175B model) and calculated the performance metrics manually. ChatGPT, which uses Nvidia's cutting-edge Tensor-core GPUs, generated output at a speed of 61 tokens per second, taking 9 seconds to generate 557 tokens (3090 characters).

For a broader comparison, we ran the same test on the free version of Gemini (powered by Gemini Pro), which runs on Google's Cloud TPU v5e accelerator. Google has not disclosed the size of the Gemini Pro model. Its speed was 56 tokens per second, taking 15 seconds to generate 845 tokens (4428 characters).
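For transparency, here is the arithmetic behind those figures. This is a minimal Python helper of our own (not part of any vendor tooling) that divides the token and character counts above by the elapsed time; small differences from the on-screen speeds come down to how each service measures generation time.

```python
# Back-of-the-envelope check of the throughput figures reported above.
# The (tokens, characters, seconds) tuples are the raw numbers from our tests.

def throughput(tokens: int, chars: int, seconds: float) -> tuple[float, float]:
    """Return (tokens per second, characters per token)."""
    return tokens / seconds, chars / tokens

tests = {
    "Groq (Mixtral-8x7B)": (868, 3846, 1.57),
    "ChatGPT (GPT-3.5)": (557, 3090, 9.0),
    "Gemini (Gemini Pro)": (845, 4428, 15.0),
}

for name, (tokens, chars, seconds) in tests.items():
    tps, cpt = throughput(tokens, chars, seconds)
    print(f"{name}: ~{tps:.0f} tokens/s, ~{cpt:.1f} chars/token")
```

Running it gives roughly 550, 62, and 56 tokens per second respectively, in line with the speeds quoted above.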

As for other service providers, the ray-project ran an extensive LLMPerf test and found that Groq performed much better than the competition.

While we have not tested it, Groq's LPUs also work with diffusion models, not just language models. According to the demo, it can generate different styles of images at 1024px resolution in under a second. That's pretty remarkable.

Groq vs Nvidia: What Does Groq Say?

In a report, Groq says its LPUs are scalable and can be linked together using optical interconnect across 264 chips. They can be scaled further using switches, though that adds latency. According to CEO Jonathan Ross, the company is developing clusters that can scale across 4,128 chips, slated for release in 2025 and built on Samsung's 4nm process node.

In a benchmark test performed by Groq using 576 LPUs on a 70B Llama 2 model, it performed AI inferencing in one-tenth of the time taken by a cluster of Nvidia H100 GPUs.

Not just that: Nvidia GPUs took 10 to 30 joules of energy to generate the tokens in a response, whereas Groq took only 1 to 3 joules. In sum, the company says Groq LPUs offer 10x better speed for AI inferencing tasks at one-tenth the cost of Nvidia GPUs.
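To put those joules in perspective, here is a quick worked example. The per-token energy ranges are Groq's claims as quoted above; the 500-token reply length is our own illustrative assumption.

```python
# What Groq's claimed energy figures imply for one typical chat reply.
# Joules-per-token ranges are Groq's claims; the reply length is assumed.

RESPONSE_TOKENS = 500  # assumed length of a typical response

claims = {
    "Nvidia GPUs (per Groq)": (10, 30),  # joules per token
    "Groq LPUs (per Groq)": (1, 3),      # joules per token
}

for name, (lo, hi) in claims.items():
    lo_kj = RESPONSE_TOKENS * lo / 1000
    hi_kj = RESPONSE_TOKENS * hi / 1000
    print(f"{name}: {lo_kj:.1f} to {hi_kj:.1f} kJ per reply")
```

That works out to roughly 5 to 15 kJ per reply on the GPU figures versus 0.5 to 1.5 kJ on the LPU figures, which is where the 10x efficiency claim comes from.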

What Does It Mean For End Users?

Overall, this is an exciting development in the AI space, and with the introduction of LPUs, users are going to experience near-instant interactions with AI systems. The significant reduction in inference time means users can play with multimodal systems instantly, whether speaking to them, feeding in images, or generating images.

So what do you think about the development of LPUs in the AI hardware space? Let us know your opinion in the comment section below. Meanwhile, Groq is already offering API access to developers, so expect much better performance from AI models across apps and services soon; a quick sketch of what that access looks like follows.
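Groq exposes an OpenAI-compatible endpoint, so the standard openai Python client works with a swapped base URL. Treat the endpoint, model id, and key placeholder below as assumptions based on Groq's documentation at the time of writing; check the Groq console for current values.

```python
# Minimal sketch of calling Groq's OpenAI-compatible API.
# Assumptions: the endpoint URL and model id are taken from Groq's docs at
# the time of writing and may change; "YOUR_GROQ_API_KEY" is a placeholder.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # Groq's OpenAI-compatible endpoint
    api_key="YOUR_GROQ_API_KEY",                # create a key in the Groq console
)

response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # the Mixtral model we tested above
    messages=[{"role": "user", "content": "Explain what an LPU is in one paragraph."}],
)
print(response.choices[0].message.content)
```

Because the endpoint mirrors OpenAI's API, existing ChatGPT-based code can often be pointed at Groq by changing only the base URL, API key, and model name.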

Arjun Sha

Passionate about Windows, ChromeOS, Android, and security and privacy issues, with a penchant for solving everyday computing problems.
