by Anirban Ghoshal

Senior Writer

Google Cloud Run now allows AI inferencing on Nvidia GPUs

News

Aug 22, 2024

Cloud Computing, GPUs

The service, currently in preview, will allow enterprises to run their real-time AI inferencing applications serving large language models on Nvidia L4 GPUs inside the managed service.

Google Cloud has updated its managed compute service Cloud Run with a new feature that will allow enterprises to run their real-time AI inferencing applications serving large language models (LLMs) on Nvidia L4 GPUs.

The feature matters to developers because Nvidia GPU support speeds up inferencing on Cloud Run and can help lower costs.

Cloud Run, which was first previewed in April 2019, allows enterprises to run stateless containers that are invocable via HTTP requests.
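To make "stateless containers invocable via HTTP requests" concrete, here is a minimal sketch of the kind of process such a container might run. It uses only the Python standard library. The handler name `EchoHandler` and the helper `make_server` are illustrative, and reading the listening port from a `PORT` environment variable with a fallback of 8080 is an assumption about the runtime contract, not something stated in the article.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Stateless handler: each response depends only on the incoming request."""

    def do_GET(self):
        body = f"hello from {self.path}\n".encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging to keep the sketch quiet.
        pass


def make_server(port=None):
    """Build the server; the PORT variable and 8080 default are assumptions."""
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return HTTPServer(("0.0.0.0", port), EchoHandler)


if __name__ == "__main__":
    make_server().serve_forever()
```

Because the handler keeps nothing in memory between requests, any number of copies can be started or stopped without coordination, which is what lets a platform scale such a service up and down on demand.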

The managed or serverless compute service is also available on Google Kubernetes Engine (GKE), allowing developers to run containerized HTTP workloads on a managed Kubernetes cluster.

The service has arguably been popular among developers because it allows them to run computations or workloads on demand, in stark contrast to a typical cloud instance, which runs for a specified time and is always available.

However, growing demand for running AI workloads, and for doing so on a serverless compute service, pushed Google to add GPU support to Cloud Run.

The combination of GPU support and the serverless nature of the service, according to experts, should benefit enterprises trying to run AI workloads: with Cloud Run they neither need to buy and install compute hardware on-premises nor spend relatively more on spinning up a typical cloud instance.

“When your app is not in use, the service automatically scales down to zero so that you are not charged for it,” Google wrote in a blog post.
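The scale-to-zero argument comes down to simple arithmetic, sketched below. Every number here is hypothetical and chosen only for illustration; none of the rates or usage figures come from Google or from the article. The sketch also assumes the serverless option carries a higher hourly rate than an always-on instance, so the saving depends on how much of the time the service is actually busy.

```python
def always_on_cost(hourly_rate: float, hours_in_period: float) -> float:
    """Cost of an instance that is billed for every hour in the period."""
    return hourly_rate * hours_in_period


def scale_to_zero_cost(hourly_rate: float, active_hours: float) -> float:
    """Cost of a service that is billed only while it is handling traffic."""
    return hourly_rate * active_hours


def break_even_utilization(vm_rate: float, serverless_rate: float) -> float:
    """Fraction of the period above which the always-on instance is cheaper."""
    return vm_rate / serverless_rate


# Hypothetical inputs, not real prices.
VM_RATE = 1.00          # currency units per hour, always-on instance
SERVERLESS_RATE = 1.50  # currency units per hour, billed only when active
HOURS_IN_PERIOD = 720   # a 30-day month
ACTIVE_HOURS = 90       # hours in which the service actually receives traffic

always_on = always_on_cost(VM_RATE, HOURS_IN_PERIOD)
serverless = scale_to_zero_cost(SERVERLESS_RATE, ACTIVE_HOURS)
break_even = break_even_utilization(VM_RATE, SERVERLESS_RATE)

print(f"always-on:     {always_on:.2f}")
print(f"scale-to-zero: {serverless:.2f}")
print(f"break-even utilization: {break_even:.0%}")
```

Under these made-up inputs the service is busy 12.5% of the time, well below the break-even point, so scaling to zero comes out cheaper. A service that is busy most of the day could see the opposite result.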

The company claims that the new feature opens up new use cases for developers, including performing real-time inference with lightweight open models such as Google’s open Gemma (2B/7B) models or Meta’s Llama 3 (8B) to build custom chatbots or on-the-fly document summarization, while scaling to handle spiky user traffic.

Another use case is serving custom fine-tuned gen AI models, such as image generation tailored to your company’s brand, and scaling down to optimize costs when nobody’s using them.

Additionally, Google said that the service can be used to speed up compute-intensive Cloud Run services, such as on-demand image recognition, video transcoding and streaming, and 3D rendering.

But are there caveats?

To begin with, enterprises may worry about cold start, a common phenomenon with serverless services. Cold start refers to the time a service needs to load before it can begin handling requests.

This matters to enterprises because cold start time directly affects latency, for example, the time an LLM takes to reply to a user query sent through an enterprise application.

However, Google seems to have it covered.

“Cloud Run instances with an attached L4 GPU with driver pre-installed starts in approximately 5 seconds, at which point the processes running in your container can start to use the GPU. Then, you’ll need another few seconds for the framework and model to load and initialize,” the company explained in the blog post.

Further, to give enterprises the confidence to try out Cloud Run’s new feature, the company has published cold start times for several lightweight models.

Cold start times for the Gemma 2b, Gemma2 9b, Llama2 7b/13b, and Llama3.1 8b models with the Ollama framework range from 11 to 35 seconds, the company wrote, adding that the figures measure the time to start an instance from zero, load the model into the GPU, and have the LLM return its first word. Other supported frameworks for the service include vLLM and PyTorch. Nvidia NIM can also be deployed on Cloud Run.
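Teams that want to check figures like these against their own deployment could time how long the first piece of a streamed response takes to arrive. The sketch below shows one way to do that in Python. It is not Google's benchmarking method: `fake_llm_stream` is a hypothetical stand-in that simulates a slow start with a short sleep, and a real measurement would replace it with a streaming client for the service under test.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds elapsed, first chunk) for any lazily evaluated stream.

    The stream must not have started its work before this call; for a real
    service, the request itself should be issued inside the timed span.
    """
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def fake_llm_stream(startup_delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    # Simulates instance start plus model load before the first word.
    time.sleep(startup_delay_s)
    yield "Hello"
    yield " world"


elapsed, first = time_to_first_chunk(fake_llm_stream(0.05))
print(f"first chunk {first!r} after {elapsed:.3f}s")
```

Measured this way, a cold request includes instance start, framework and model load, and generation of the first word, which matches the span the company says its figures cover. Repeating the call against a warm instance would isolate how much of the delay is due to the cold start itself.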

Anirban Ghoshal is a senior writer covering enterprise software for CIO.com and databases and cloud and AI infrastructure for InfoWorld.
