GPU Clusters

A GPU cluster is the multi-replica sibling of a GPU instance: the same kind of container, deployed as N replicas behind a Kubernetes Deployment and reachable through a single endpoint URL.

Use a GPU cluster when:

- You have a stateless service (your own inference API, a custom backend, etc.) that you want to run on multiple GPUs for throughput
- You want a single stable URL load-balanced across the replicas
- You don't need the managed inference endpoint features (those come with User Inference)

Cluster vs Instance vs Model Instance

| | GPU Instance | GPU Cluster | Model Instance |
|---|---|---|---|
| Replicas | 1 | N (2+) | N (managed) |
| Interactive terminal | Yes | Yes (per replica) | No |
| Public URL | Optional, opt-in | Optional, opt-in | Unified API at `api.ecohash.com` with `model:instance_id` |
| Auto-scaling | No | No (fixed replicas) | Not yet (planned) |
| OpenAI-compatible routing | No | No (you manage the HTTP server) | Yes (built in) |
| Billing | Hold for estimated duration, refunded on stop | Same | 24h hold, renewed until balance = $0 |
| Best for | Dev / training | Custom service on GPU | Your own model as a managed API |

Launch a cluster

From the console

Compute → GPU Clusters → Create cluster.
Fill in:
- Name (required): shown in the cluster list and used in the public URL
- Region, GPU type, and GPU count per replica
- Replicas: 2 minimum, up to the region's GPU budget
- Container image: the image every replica runs
- Estimated duration (hours): same semantics as for instances; the cluster auto-terminates at expiry
- Startup command: optional override of the container's default command
- Shared filesystems to attach: useful for sharing model weights across replicas
- Expose a public service URL: opt-in checkbox. Tick it to open a public URL at launch, then enter the service port your container listens on. Leave it unticked for a private cluster; you can open the URL later from the detail page.

Click Create.
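The form fields above amount to a launch spec with a few constraints. The Python sketch below is hypothetical: this page documents only the console form, so the `ClusterLaunchSpec` class, its field names, and the `region_gpu_budget` parameter are illustrative assumptions, as is the reading that the region budget caps total GPUs (replicas × GPUs per replica).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ClusterLaunchSpec:
    """Hypothetical mirror of the console's Create cluster form."""
    name: str
    region: str
    gpu_type: str
    gpus_per_replica: int
    replicas: int
    container_image: str
    estimated_duration_hours: float
    startup_command: Optional[str] = None            # optional override
    shared_filesystems: List[str] = field(default_factory=list)
    expose_public_url: bool = False                   # the opt-in checkbox
    service_port: Optional[int] = None                # needed only with a public URL

    def validate(self, region_gpu_budget: int) -> List[str]:
        """Return a list of problems; an empty list means the spec looks launchable."""
        problems = []
        if not self.name:
            problems.append("name is required")
        if self.replicas < 2:
            problems.append("clusters need at least 2 replicas")
        if self.gpus_per_replica < 1:
            problems.append("gpus_per_replica must be at least 1")
        # Assumption: the region budget caps total GPUs across all replicas.
        if self.replicas * self.gpus_per_replica > region_gpu_budget:
            problems.append("total GPUs exceed the region's GPU budget")
        if self.estimated_duration_hours <= 0:
            problems.append("estimated_duration_hours must be positive")
        if self.expose_public_url and not (
            self.service_port and 1 <= self.service_port <= 65535
        ):
            problems.append("a valid service_port is required when exposing a public URL")
        return problems


spec = ClusterLaunchSpec(
    name="embed-svc", region="example-region", gpu_type="example-gpu",
    gpus_per_replica=1, replicas=3, container_image="example/image:latest",
    estimated_duration_hours=24, expose_public_url=True, service_port=8000,
)
print(spec.validate(region_gpu_budget=8))  # → []
```

The region, GPU type, and image values are placeholders, not real catalog entries.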

Public URL

When you opt in to a public service URL, EcoLink mints one at:

https://api.ecohash.com/gpu-clusters/{cluster_id}/service/

The URL is authenticated: every request must carry either your console session cookie or an `Authorization: Bearer eco_YOUR_KEY` header. Requests without auth get `401`. Traffic is load-balanced across all replicas.

curl https://api.ecohash.com/gpu-clusters/123/service/health \
  -H "Authorization: Bearer eco_YOUR_KEY"
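The same authenticated call can be made from Python with only the standard library. This is a sketch: `service_url`, `build_request`, and `call_service` are hypothetical helpers, and it assumes the path after `/service/` is forwarded to your container unchanged, as the `/health` example suggests. A missing or bad key surfaces as `urllib.error.HTTPError` with code 401.

```python
import urllib.request

API_BASE = "https://api.ecohash.com"


def service_url(cluster_id: int, path: str = "") -> str:
    """Build the public service URL for a cluster using the documented pattern."""
    return f"{API_BASE}/gpu-clusters/{cluster_id}/service/{path.lstrip('/')}"


def build_request(cluster_id: int, path: str, api_key: str) -> urllib.request.Request:
    """Prepare an authenticated GET carrying the bearer header."""
    return urllib.request.Request(
        service_url(cluster_id, path),
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def call_service(cluster_id: int, path: str, api_key: str, timeout: float = 10.0) -> bytes:
    """Send the request; urlopen raises urllib.error.HTTPError on 401 and other errors."""
    with urllib.request.urlopen(build_request(cluster_id, path, api_key), timeout=timeout) as resp:
        return resp.read()
```

For example, `call_service(123, "health", "eco_YOUR_KEY")` would return the raw body of the cluster's `/health` response.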

Open the public URL later

If you launched the cluster without ticking the opt-in box, the detail page shows a Create endpoint form once the cluster is running. Enter the service port your container listens on and click Create. The URL appears within a few seconds.

You can change the port at any time by re-submitting the form. The URL stays the same; only the port it forwards to changes.

The public URL config persists across stop / resume, so resuming a stopped cluster gives you the same URL.
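Whichever port you enter, at launch or later, has to match the port your container's HTTP server binds to, and because traffic is load-balanced across replicas the service should keep no per-replica state. Below is a minimal standard-library sketch of such a service. `Handler`, `make_server`, `main`, the `/health` route, and the `SERVICE_PORT` environment variable are illustrative assumptions, as is the assumption that a request to `/gpu-clusters/{cluster_id}/service/health` reaches the container as `/health`.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    """Stateless handler: any replica can answer any request."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def make_server(port: int) -> ThreadingHTTPServer:
    # Bind to all interfaces so traffic routed into the pod can reach the server.
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)


def main() -> None:
    # SERVICE_PORT is hypothetical; use the same port you enter in the console.
    make_server(int(os.environ.get("SERVICE_PORT", "8000"))).serve_forever()
```

Run `main()` as the container's entrypoint (for example through the startup command) and enter the same port in the console.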

Scaling

Clusters have a fixed replica count; there is no autoscaling. To scale up or down:

- In the console, open the cluster detail page and click Scale.
- Change the replica count.
- Kubernetes rolls out the new configuration.

The credit hold is adjusted proportionally. Scaling up places an additional hold; scaling down refunds nothing, because the cluster can't tell whether you intend to scale back up, so we don't refund mid-run.
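A rough sketch of the hold arithmetic follows. This page does not spell out the exact formula, so the sketch assumes "proportionally" means the extra hold covers only the added replicas for the hours remaining before expiry; `scale_hold_delta` is a hypothetical helper.

```python
def scale_hold_delta(hourly_rate_per_gpu: float, gpus_per_replica: int,
                     old_replicas: int, new_replicas: int,
                     remaining_hours: float) -> float:
    """Extra credit hold placed when the replica count changes.

    Assumption: the extra hold covers only the added replicas for the hours
    left before expiry. Scaling down returns 0.0 because nothing is refunded mid-run.
    """
    if new_replicas < 2:
        raise ValueError("clusters need at least 2 replicas")
    added = new_replicas - old_replicas
    if added <= 0:
        return 0.0  # scaling down or no change: no refund, no extra hold
    return hourly_rate_per_gpu * gpus_per_replica * added * remaining_hours


# Hypothetical $2.50/GPU-hour, 1 GPU per replica, 2 -> 4 replicas, 10 hours left.
print(scale_hold_delta(2.5, 1, 2, 4, 10))  # → 50.0
```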

Duration and auto-stop

Same as GPU instances: `estimated_duration_hours` is a commitment, and the cluster auto-terminates when it expires. Extend before expiry to keep it running. See Duration and extending.
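To plan extensions, it helps to compute the expiry time from the launch time and `estimated_duration_hours`. The sketch below is hypothetical: `expiry_time`, `should_extend`, and `extend` are illustrative helpers, the one-hour safety margin is an arbitrary client-side choice, and it assumes an extension pushes the current expiry later by the extra hours.

```python
from datetime import datetime, timedelta, timezone


def expiry_time(launched_at: datetime, estimated_duration_hours: float) -> datetime:
    """When the cluster auto-terminates if it is never extended."""
    return launched_at + timedelta(hours=estimated_duration_hours)


def should_extend(now: datetime, expires_at: datetime,
                  safety_margin: timedelta = timedelta(hours=1)) -> bool:
    """True inside the safety margin before expiry; False once expiry has passed."""
    return expires_at - safety_margin <= now < expires_at


def extend(expires_at: datetime, extra_hours: float) -> datetime:
    """Assumption: extending moves the current expiry later by extra_hours."""
    return expires_at + timedelta(hours=extra_hours)


launched = datetime(2024, 1, 1, tzinfo=timezone.utc)
expires = expiry_time(launched, 24)
print(expires.isoformat())  # → 2024-01-02T00:00:00+00:00
```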

When a cluster is the wrong answer

If you want:

- An OpenAI-compatible endpoint for your own model: use a model instance. You get multi-region redundancy, regional failover, and the unified API URL, authenticated with your API key.
- A single-GPU dev box: use a GPU instance.

Billing

The cluster's hourly rate is `hourly_rate_per_gpu × gpu_count × replicas`, where `gpu_count` is the GPU count per replica. The credit hold at launch is `estimated_duration_hours × hourly_rate`, using that cluster hourly rate. The launch dialog shows the per-GPU rate and the total hold.
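The two billing formulas above translate directly into code. The helper names are hypothetical and the $2.00/GPU-hour rate in the example is a made-up figure, not a published price.

```python
def cluster_hourly_rate(hourly_rate_per_gpu: float, gpus_per_replica: int,
                        replicas: int) -> float:
    """hourly_rate_per_gpu × gpu_count × replicas (gpu_count is per replica)."""
    return hourly_rate_per_gpu * gpus_per_replica * replicas


def launch_hold(hourly_rate_per_gpu: float, gpus_per_replica: int,
                replicas: int, estimated_duration_hours: float) -> float:
    """Credit hold at launch: estimated_duration_hours × cluster hourly rate."""
    return estimated_duration_hours * cluster_hourly_rate(
        hourly_rate_per_gpu, gpus_per_replica, replicas
    )


# 3 replicas × 2 GPUs each at a hypothetical $2.00/GPU-hour, held for 10 hours.
print(cluster_hourly_rate(2.0, 2, 3))  # → 12.0
print(launch_hold(2.0, 2, 3, 10))      # → 120.0
```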
