Cost-Effective AI with Ollama, GKE GPU Sharing, and vCluster

As organizations scale their AI workloads, two major challenges often emerge: the high cost of underutilized GPUs and the operational complexity of managing isolated environments for multiple teams. Traditionally, assigning a whole GPU to a single pod is inefficient, but managing separate clusters for every team is operationally heavy.

In this post, we'll demonstrate how to solve both problems by combining Google Kubernetes Engine (GKE) GPU time-sharing with vCluster for multi-tenancy. We'll deploy Ollama to serve open models (like Mistral) in isolated virtual environments that share the same physical GPU infrastructure.

The architecture leverages GKE Autopilot to abstract away the physical infrastructure. Instead of managing nodes, you simply deploy workloads, and Autopilot provisions the necessary hardware on demand, including GPUs and their drivers.

This setup lets teams have their own isolated environments, APIs, and Ollama instances, and potentially different models, while running on the same cost-effective, shared GPU nodes. For example, Team A (e.g., Legal Research) and Team B (e.g., Customer Support) can work in separate environments while they share GPU resources.

Architecture diagram: cost-effective AI with Ollama and vCluster on shared GKE GPU nodes.

vCluster lets you create virtual Kubernetes clusters on top of an existing Kubernetes cluster. It supports various tenancy modes, including the shared nodes model that's shown in the diagram, where each virtual cluster gets its own isolated control plane while sharing the underlying worker nodes. Each virtual cluster can be accessed independently by teams who get full admin access to their cluster without interfering with others. This model also lets you leverage host cluster features when needed, and you have the ability to deploy your own controllers and CRDs inside each virtual cluster.

When you use vCluster, you can use any of these tenancy modes:

Shared nodes: The shared nodes mode allows multiple virtual clusters to run workloads on the same physical Kubernetes nodes. This configuration is ideal for scenarios where maximizing resource utilization is a top priority—especially for internal developer environments, CI/CD pipelines, and cost-sensitive use cases.

Private nodes: In private nodes mode, instead of sharing the host cluster's worker nodes, individual worker nodes are joined directly to a vCluster. These private nodes act as that vCluster's worker nodes and aren't shared with other vClusters on the same host cluster.

Auto nodes: You can configure vCluster to automatically provision and join worker nodes based on the node and resource requirements. To use auto nodes, you need vCluster Platform installed and vCluster needs to be connected to it.

Standalone: vCluster Standalone is a different architecture for both the control plane and the nodes. Standalone mode doesn't require a host cluster; vCluster is deployed directly onto nodes, like other Kubernetes distributions. It can run on any type of node, whether bare metal or a VM, and it provides the strictest isolation for workloads because there's no shared host cluster for either the control plane or the worker nodes.

To follow along on the deployment steps, make sure that you have the following installed:

gcloud CLI

vcluster CLI

kubectl

kubectx

Unlike GKE Standard, we don't need to calculate node counts or configure node pools manually. Instead, we create the cluster, let Autopilot provision nodes automatically, and then get credentials.

Set environment variables and create a GKE Autopilot cluster:
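One way to do this is shown below; the cluster name `ollama-demo` is our choice, not a requirement.

```shell
# Replace these placeholder values with your own project and region.
export PROJECT_ID=YOUR_PROJECT_ID
export REGION=YOUR_REGION_ID
export CLUSTER_NAME=ollama-demo  # cluster name is an assumption

# Create a GKE Autopilot cluster; Autopilot manages nodes for you.
gcloud container clusters create-auto "${CLUSTER_NAME}" \
  --project="${PROJECT_ID}" \
  --region="${REGION}"
```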

Replace YOUR_PROJECT_ID and YOUR_REGION_ID with the Google Cloud project and region that you want to use.

Get the credentials to configure your local kubectl:
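Assuming the environment variables set earlier:

```shell
# Fetch cluster credentials and merge them into your local kubeconfig.
gcloud container clusters get-credentials "${CLUSTER_NAME}" \
  --region="${REGION}" \
  --project="${PROJECT_ID}"
```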

With the Autopilot cluster running, we can now create isolated environments for our tenants. We'll create two vClusters, demo1 and demo2. You'll need a vcluster.yaml manifest file for configuration.
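A minimal sketch of how this could look; the namespaces `team-demo1` and `team-demo2` and the `vcluster.yaml` contents are our assumptions, not required values:

```shell
# Minimal vcluster.yaml for the shared nodes model, written inline.
cat > vcluster.yaml <<'EOF'
sync:
  fromHost:
    nodes:
      enabled: true
EOF

# Create both virtual clusters; --connect=false keeps kubectl
# pointed at the host cluster context after creation.
vcluster create demo1 --namespace team-demo1 --connect=false -f vcluster.yaml
vcluster create demo2 --namespace team-demo2 --connect=false -f vcluster.yaml
```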

When you use GKE Autopilot, it might take a few minutes to create the first vCluster. This is because vCluster waits for its own control plane pods to be up and running. Because Autopilot provisions the underlying nodes dynamically in response to this new workload, there's a brief delay while the infrastructure is initialized.

Note: If you receive a warning that you're trying to create a vCluster inside another vCluster, select "No" and then switch back to the correct host context.

Create the deployment manifest for Ollama. This manifest deploys Ollama, uses a Kubernetes Service to expose it on port 11434, and selects nodes that use GPU time-sharing.
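A sketch of what this manifest might look like. The node selectors are GKE's labels for requesting an accelerator type and GPU time-sharing; the filename `ollama.yaml`, the choice of L4 GPUs, and the `max-shared-clients-per-gpu` value of 2 are our assumptions:

```shell
# Write the Ollama Deployment and Service manifest (a sketch).
cat > ollama.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      nodeSelector:
        # Ask Autopilot for an L4 GPU node with time-sharing enabled.
        cloud.google.com/gke-accelerator: nvidia-l4
        cloud.google.com/gke-gpu-sharing-strategy: time-sharing
        cloud.google.com/gke-max-shared-clients-per-gpu: "2"
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: "1"
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
EOF
```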

When the vCluster is active, switch contexts to work inside demo1:
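For example (the `team-demo1` namespace is an assumption from when the vCluster was created; adjust it to match your setup):

```shell
# Connect to demo1; this switches your kubectl context into the vCluster.
vcluster connect demo1 --namespace team-demo1
```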

Deploy Ollama in the virtual environment:
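Assuming the manifest was saved as `ollama.yaml`:

```shell
# Apply the manifest inside the virtual cluster and wait for the rollout.
kubectl apply -f ollama.yaml
kubectl rollout status deployment/ollama
```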

Even though we're in a virtual cluster, when we create pods that request GPUs, the request is synced to the host. GKE Autopilot detects this requirement and automatically attaches the necessary GPU hardware to the nodes that are running your workloads.

With the server running, perform the model pull and test entirely within the virtual cluster context:
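One way to pull Mistral into the running pod; the model name `mistral` matches the example from the introduction:

```shell
# Pull the model inside the Ollama pod (addressed via its Deployment).
kubectl exec deploy/ollama -- ollama pull mistral
```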

Verify the API:
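A quick check against Ollama's generate endpoint, using a port-forward; the prompt text is arbitrary:

```shell
# Forward the Service port locally, then call the Ollama API.
kubectl port-forward svc/ollama 11434:11434 >/dev/null 2>&1 &
sleep 2  # give the port-forward a moment to establish
curl -s http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is GPU time-sharing cost-effective?",
  "stream": false
}'
```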

Repeat the steps to deploy Ollama and pull the model to the second virtual cluster:
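For example (again assuming the `team-demo2` namespace and the `ollama.yaml` filename from earlier):

```shell
# Leave demo1, connect to demo2, and repeat the deployment.
vcluster disconnect
vcluster connect demo2 --namespace team-demo2
kubectl apply -f ollama.yaml
kubectl rollout status deployment/ollama
kubectl exec deploy/ollama -- ollama pull mistral
```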

Now let's switch back to the host cluster context and see what's going on.
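One way to get back to the host cluster:

```shell
# Disconnect from the vCluster and confirm the active context.
vcluster disconnect
kubectl config current-context
```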

Check how many nodes have been provisioned and where the Ollama pods are running:
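From the host cluster context:

```shell
# List the nodes that Autopilot has provisioned for our workloads.
kubectl get nodes
```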

You should see two nodes: one runs the vCluster control-plane components, and the other runs the Ollama instances with L4 GPUs attached (node names will differ in your output).

Check where the Ollama pods are running:
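From the host context, the pods synced from each vCluster appear in the host namespaces with translated names, so filtering on "ollama" should surface both:

```shell
# Show the Ollama pods from both vClusters, including their node placement.
kubectl get pods --all-namespaces -o wide | grep ollama
```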

Notice that both Ollama pods are running on the same node. This node has been provisioned by GKE Autopilot with L4 GPUs and GPU Sharing configured.

By using GKE Autopilot, we've removed the need to manually configure GPU node pools or time-sharing strategies. Autopilot provides resources dynamically, while vCluster ensures that Team A's Legal Research data and Team B's Customer Support bots remain completely isolated. This implementation provides a robust, low-maintenance platform for scaling AI workloads.

Originally published on Google.