GPU Clusters

A GPU cluster is the multi-replica sibling of a GPU instance. It runs the same container, but deployed as N replicas managed by a Kubernetes Deployment, all reachable through a single endpoint URL.

Use a GPU cluster when:

  • You have a stateless service (your own inference API, a custom backend, etc.) you want to run on multiple GPUs for throughput
  • You want a single stable URL load-balanced across the replicas
  • You don't need the managed inference endpoint features (that's User Inference)

Cluster vs Instance vs Model Instance

| | GPU Instance | GPU Cluster | Model Instance |
|---|---|---|---|
| Replicas | 1 | N (2+) | N (managed) |
| Interactive terminal | Yes | Yes (per replica) | No |
| Public URL | Optional, opt-in | Optional, opt-in | Unified API at api.ecohash.com with model:instance_id |
| Auto-scaling | No | No (fixed replicas) | Not yet (planned) |
| OpenAI-compatible routing | No | No (you manage the HTTP server) | Yes (built in) |
| Billing | Hold for estimated duration, refunded on stop | Same | 24h hold, renewed until balance = $0 |
| Best for | Dev / training | Custom service on GPU | Your own model as a managed API |

Launch a cluster

From the console

  1. Compute → GPU Clusters → Create cluster.
  2. Fill in:
    • Name (required, shown in list and used in the public URL)
    • Region, GPU type, GPU count per replica
    • Replicas — 2 minimum, up to the region's GPU budget
    • Container image — your container
    • Estimated duration hours — same semantics as instances; cluster auto-terminates at expiry
    • Startup command — optional override
    • Shared filesystems to attach — useful for shared model weights
    • Expose a public service URL — opt-in checkbox. Tick it to open a public URL at launch and enter the service port your container listens on. Leave it unticked for a private cluster — you can open the URL later from the detail page.
  3. Click Create.
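
The form's constraints can be checked before you open the console. The following is a minimal sketch, not an EcoLink API: `ClusterSpec`, its field names, and `validate` are hypothetical and only mirror the fields listed above. The checks on GPU count, duration, and port range are assumptions; the region's GPU budget is not checked because its value is not given here.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_REPLICAS = 2  # the form states "2 minimum"


@dataclass
class ClusterSpec:
    """Hypothetical mirror of the Create cluster form (not an EcoLink type)."""
    name: str
    region: str
    gpu_type: str
    gpu_count_per_replica: int
    replicas: int
    container_image: str
    estimated_duration_hours: int
    startup_command: Optional[str] = None   # optional override
    expose_public_url: bool = False         # opt-in checkbox
    service_port: Optional[int] = None      # port the container listens on


def validate(spec: ClusterSpec) -> List[str]:
    """Return a list of problems; an empty list means the spec looks fillable."""
    problems = []
    if not spec.name.strip():
        problems.append("name is required")
    if spec.replicas < MIN_REPLICAS:
        problems.append(f"replicas must be at least {MIN_REPLICAS}")
    if spec.gpu_count_per_replica < 1:  # assumption: at least one GPU per replica
        problems.append("gpu_count_per_replica must be at least 1")
    if spec.estimated_duration_hours <= 0:  # assumption: positive duration
        problems.append("estimated_duration_hours must be positive")
    if spec.expose_public_url and spec.service_port is None:
        problems.append("service_port is required when exposing a public URL")
    if spec.service_port is not None and not 1 <= spec.service_port <= 65535:
        problems.append("service_port must be a valid TCP port")
    return problems
```

A spec with one replica, or with the public URL ticked but no port, comes back with a non-empty problem list.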

Public URL

When you opt in to a public service URL, EcoLink mints one at:

https://api.ecohash.com/gpu-clusters/{cluster_id}/service/

The URL is authenticated — requests must carry your console session cookie or an Authorization: Bearer eco_YOUR_KEY header. Anyone hitting the URL without auth gets 401. Traffic is load-balanced across all replicas.

curl https://api.ecohash.com/gpu-clusters/123/service/health \
-H "Authorization: Bearer eco_YOUR_KEY"
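
The same request can be made from Python with only the standard library. This is a sketch under stated assumptions: `build_service_request` and `call_service` are hypothetical helper names, the cluster ID `123` and the `health` path are taken from the curl example, and the path your container actually serves depends on your own HTTP server.

```python
import urllib.error
import urllib.request

BASE_URL = "https://api.ecohash.com/gpu-clusters"


def build_service_request(cluster_id: int, path: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for a path under the cluster's service URL."""
    url = f"{BASE_URL}/{cluster_id}/service/{path.lstrip('/')}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


def call_service(cluster_id: int, path: str, api_key: str, timeout: float = 10.0) -> str:
    """Send the request and return the response body as text."""
    req = build_service_request(cluster_id, path, api_key)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 401:
            # Unauthenticated requests are rejected with 401.
            raise PermissionError("missing or invalid API key") from err
        raise
```

Calling `call_service(123, "health", "eco_YOUR_KEY")` would issue the same request as the curl command above.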

Open the public URL later

If you launched the cluster without ticking the opt-in box, the detail page shows a Create endpoint form once the cluster is running. Enter the service port your container listens on and click Create. The URL appears within a few seconds.

You can change the port at any time by re-submitting the form — the URL stays the same; only the port the URL forwards to changes.

The public URL config persists across stop / resume, so resuming a stopped cluster gives you the same URL.

Scaling

Clusters have fixed replicas — no autoscaling. To scale up or down:

  1. Console → cluster detail page → Scale.
  2. Change the replica count.
  3. Kubernetes rolls out the new configuration.

The hold is adjusted proportionally: scaling up places an additional hold, while scaling down does not refund any of the existing hold. The cluster doesn't know your intent, so nothing is refunded mid-run.

Duration and auto-stop

Same as GPU instances: estimated_duration_hours is a commitment, and the cluster auto-terminates at expiry. Extend before the boundary to keep it running. See Duration and extending.

When a cluster is the wrong answer

If you want:

  • An OpenAI-compatible endpoint for your own model — use a model instance. You get multi-region redundancy, regional failover, and the unified API URL with your API key.
  • A single-GPU dev box — use a GPU instance.

Billing

Cluster hourly rate = hourly_rate_per_gpu × gpu_count × replicas, where gpu_count is the GPU count per replica. The credit hold at launch covers estimated_duration_hours × the cluster hourly rate. The launch dialog shows the per-GPU rate and the total hold.
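
The arithmetic can be sketched as follows. The function names are hypothetical, and the per-GPU rate of 2.50 in the usage note is an illustrative value, not a published price; the launch dialog remains the source of truth for the actual hold.

```python
from decimal import Decimal


def cluster_hourly_rate(hourly_rate_per_gpu: Decimal, gpu_count: int, replicas: int) -> Decimal:
    """Cluster hourly rate = per-GPU rate x GPUs per replica x replicas."""
    return hourly_rate_per_gpu * gpu_count * replicas


def launch_hold(hourly_rate_per_gpu: Decimal, gpu_count: int, replicas: int,
                estimated_duration_hours: int) -> Decimal:
    """Credit hold at launch = estimated duration x cluster hourly rate."""
    return cluster_hourly_rate(hourly_rate_per_gpu, gpu_count, replicas) * estimated_duration_hours
```

For example, with an assumed rate of 2.50 per GPU-hour, 2 GPUs per replica, and 3 replicas, the cluster rate is 2.50 × 2 × 3 = 15.00 per hour, and a 24-hour estimate gives a hold of 360.00.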