GPU Clusters
A GPU cluster is the multi-replica sibling of a GPU instance. It runs the same container, deployed as N replicas of a Kubernetes Deployment, all reachable through a single endpoint URL.
Use a GPU cluster when:
- You have a stateless service (your own inference API, a custom backend, etc.) you want to run on multiple GPUs for throughput
- You want a single stable URL load-balanced across the replicas
- You don't need the managed inference endpoint features (that's User Inference)
Cluster vs Instance vs Model Instance
| | GPU Instance | GPU Cluster | Model Instance |
|---|---|---|---|
| Replicas | 1 | N (2+) | N (managed) |
| Interactive terminal | Yes | Yes (per replica) | No |
| Public URL | Optional, opt-in | Optional, opt-in | Unified API at api.ecohash.com with model:instance_id |
| Auto-scaling | No | No (fixed replicas) | Not yet (planned) |
| OpenAI-compatible routing | No | No (you manage the HTTP server) | Yes (built in) |
| Billing | Hold for estimated duration, refunded on stop | Same | 24h hold, renewed until balance = $0 |
| Best for | Dev / training | Custom service on GPU | Your own model as a managed API |
Launch a cluster
From the console
- Compute → GPU Clusters → Create cluster.
- Fill in:
  - Name (required, shown in list and used in the public URL)
  - Region, GPU type, GPU count per replica
  - Replicas: 2 minimum, up to the region's GPU budget
  - Container image: your container
  - Estimated duration hours: same semantics as instances; cluster auto-terminates at expiry
  - Startup command: optional override
  - Shared filesystems to attach: useful for shared model weights
  - Expose a public service URL: opt-in checkbox. Tick it to open a public URL at launch and enter the service port your container listens on. Leave it unticked for a private cluster; you can open the URL later from the detail page.
- Click Create.
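The service port in the form is whichever port the process inside your container binds. As a rough illustration of the kind of stateless service a cluster is meant for, here is a minimal sketch using only the Python standard library. The port 8000, the `/health` path, and the names `HealthHandler` and `make_server` are assumptions made for this example, not anything EcoLink requires. The sketch also assumes the public URL forwards the path after `/service/` to the container unchanged, which this page does not confirm.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Stateless handler: every replica answers the same way."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Suppress the default per-request log line on stderr.
        pass


def make_server(port: int = 8000) -> ThreadingHTTPServer:
    # Bind on all interfaces so traffic from outside the container can reach it.
    # Port 8000 is a hypothetical default; use the port you enter in the form.
    return ThreadingHTTPServer(("0.0.0.0", port), HealthHandler)
```

Calling `make_server().serve_forever()` as the container's startup command would keep the process in the foreground. Because the handler keeps no per-request state, it does not matter which replica the load balancer picks.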
Public URL
When you opt in to a public service URL, EcoLink mints one at:
https://api.ecohash.com/gpu-clusters/{cluster_id}/service/
The URL is authenticated — requests must carry your console session cookie or an Authorization: Bearer eco_YOUR_KEY header. Anyone hitting the URL without auth gets 401. Traffic is load-balanced across all replicas.
curl https://api.ecohash.com/gpu-clusters/123/service/health \
-H "Authorization: Bearer eco_YOUR_KEY"
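The same call can be made from Python. The sketch below only builds the request, so it can be inspected without touching the network. The helper name `build_service_request` is hypothetical, and the `health` path is whatever your own container serves; the URL pattern and the `Authorization: Bearer` header are taken from the description above.

```python
import urllib.request

BASE_URL = "https://api.ecohash.com"


def build_service_request(cluster_id: int, path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated GET to a cluster's service URL."""
    # Strip a leading slash so the joined URL has exactly one separator.
    url = f"{BASE_URL}/gpu-clusters/{cluster_id}/service/{path.lstrip('/')}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


req = build_service_request(123, "/health", "eco_YOUR_KEY")
```

Passing `req` to `urllib.request.urlopen` sends it. A missing or invalid key surfaces as `urllib.error.HTTPError` with `code == 401`.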
Open the public URL later
If you launched the cluster without ticking the opt-in box, the detail page shows a Create endpoint form once the cluster is running. Enter the service port your container listens on and click Create. The URL appears within a few seconds.
You can change the port at any time by re-submitting the form — the URL stays the same; only the port the URL forwards to changes.
The public URL config persists across stop / resume, so resuming a stopped cluster gives you the same URL.
Scaling
Clusters have fixed replicas — no autoscaling. To scale up or down:
- Console → cluster detail page → Scale.
- Change the replica count.
- Kubernetes rolls out the new replica count.
The hold is adjusted in proportion to the change in replicas: scaling up places an additional hold, and scaling down does not refund any of the existing hold (the cluster doesn't know your intent, so we don't refund mid-run).
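To illustrate the asymmetry between scaling up and scaling down, here is a small sketch. This page does not give the exact formula for the adjustment, so the code assumes the extra hold covers the added replicas for the hours remaining in the estimated duration. That assumption, the function name `additional_hold_on_scale`, and the example rate are all hypothetical.

```python
from decimal import Decimal


def additional_hold_on_scale(
    hourly_rate_per_gpu: Decimal,
    gpu_count: int,
    old_replicas: int,
    new_replicas: int,
    remaining_hours: Decimal,
) -> Decimal:
    """Extra hold when the replica count changes (assumed formula, see lead-in)."""
    # Scaling down never yields a refund, so the change is clamped at zero.
    added_replicas = max(0, new_replicas - old_replicas)
    return hourly_rate_per_gpu * gpu_count * added_replicas * remaining_hours


# Hypothetical rate of $2.50 per GPU-hour, 1 GPU per replica, 5 hours left.
scale_up = additional_hold_on_scale(Decimal("2.50"), 1, 2, 4, Decimal("5"))
scale_down = additional_hold_on_scale(Decimal("2.50"), 1, 4, 2, Decimal("5"))
```

Under these assumptions `scale_up` is 25 and `scale_down` is 0. The Scale dialog in the console is the authoritative source for the real figure.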
Duration and auto-stop
Same as GPU instances: estimated_duration_hours is a commitment, and the cluster auto-terminates at expiry. Extend before the boundary to keep it running. See Duration and extending.
When a cluster is the wrong answer
If you want:
- An OpenAI-compatible endpoint for your own model — use a model instance. You get multi-region redundancy, regional failover, and the unified API URL with your API key.
- A single-GPU dev box — use a GPU instance.
Billing
Cluster hourly rate = hourly_rate_per_gpu × gpu_count × replicas, where gpu_count is the GPU count per replica. The credit hold at launch covers estimated_duration_hours × the cluster hourly rate. The launch dialog shows the per-GPU rate and total hold.
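As a worked example of the two formulas above, the sketch below computes the cluster hourly rate and the launch hold. The function names and the $2.50 per-GPU rate are made up for illustration; real rates are shown in the launch dialog. `Decimal` is used to avoid floating-point rounding on currency.

```python
from decimal import Decimal


def cluster_hourly_rate(hourly_rate_per_gpu: Decimal, gpu_count: int, replicas: int) -> Decimal:
    """hourly_rate_per_gpu x gpu_count (per replica) x replicas."""
    return hourly_rate_per_gpu * gpu_count * replicas


def launch_hold(
    hourly_rate_per_gpu: Decimal,
    gpu_count: int,
    replicas: int,
    estimated_duration_hours: int,
) -> Decimal:
    """Credit hold at launch: estimated duration x cluster hourly rate."""
    return cluster_hourly_rate(hourly_rate_per_gpu, gpu_count, replicas) * estimated_duration_hours


# Hypothetical: $2.50 per GPU-hour, 2 GPUs per replica, 3 replicas, 8 hours.
rate = cluster_hourly_rate(Decimal("2.50"), 2, 3)
hold = launch_hold(Decimal("2.50"), 2, 3, 8)
```

With these inputs the cluster costs $15.00 per hour and the launch hold is $120.00.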