Deploying an adapter
A registered adapter sits at /fine-tuned-models until you deploy it. Deploying means attaching the adapter to a LoRA-enabled model instance — the adapter loads into the running serving pod alongside the base model, and clients can immediately call it via the inference API.
Adapters don't run on their own; they always ride on top of a base-model serving pod (a "model instance"). One LoRA-enabled model instance can host many adapters at once.
Two ways to deploy
From either /fine-tuned-models or the fine-tune job's detail page, click Deploy. The dialog offers two modes:
Attach to existing instance
If you already have a LoRA-enabled model instance running with the same base model, this is the fast path — no new GPUs, no startup wait. The dialog lists eligible model instances; pick one and click Attach. The adapter loads in seconds.
Model instances that already have this adapter attached are greyed out so you don't try to attach the same adapter twice.
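The eligibility rules above can be sketched as a small filter. This is only an illustration: the `ModelInstance` record, its field names, and the `some-other-base` model name are assumptions made for the example, not the platform's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class ModelInstance:
    # Hypothetical record; field names are assumptions, not the platform's schema.
    instance_id: int
    base_model: str
    lora_enabled: bool
    running: bool
    attached_adapter_ids: set = field(default_factory=set)


def classify_for_attach(adapter_id, adapter_base_model, instances):
    """Split instances into attachable ones and ones that would be greyed out."""
    attachable, greyed_out = [], []
    for inst in instances:
        # Only running, LoRA-enabled instances can host adapters.
        if not (inst.lora_enabled and inst.running):
            continue
        # The instance must serve the same base model the adapter was trained on.
        if inst.base_model != adapter_base_model:
            continue
        if adapter_id in inst.attached_adapter_ids:
            greyed_out.append(inst.instance_id)  # already attached here
        else:
            attachable.append(inst.instance_id)
    return attachable, greyed_out


instances = [
    ModelInstance(125, "qwen2.5-7b-instruct", True, True),
    ModelInstance(126, "qwen2.5-7b-instruct", True, True, {1}),
    ModelInstance(127, "qwen2.5-7b-instruct", False, True),  # not LoRA-enabled
    ModelInstance(128, "some-other-base", True, True),       # different base model
]
attachable, greyed_out = classify_for_attach(1, "qwen2.5-7b-instruct", instances)
print(attachable, greyed_out)
```

In this sketch, instance 125 is offered for attachment and instance 126 is listed but greyed out, because adapter 1 is already attached to it.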
Launch new instance
If you don't have a LoRA-enabled instance, this option creates one with the adapter's base model and multi-LoRA enabled, then attaches your adapter as the first one on it. Pick a GPU type and count; the dialog defaults to the cheapest GPU that fits the base model.
A new model instance takes a few minutes to start before the adapter can be used. Once it's running, future deploys can use the Attach to existing instance path.
LoRA-enabled vs. regular model instances
When you launch a model instance, the launch dialog has an "Enable multi-LoRA on this instance" checkbox. Tick it to make the instance ready to host adapters. The setting is fixed at launch: you can't toggle it on a running instance, so if you forgot, launch a fresh one.
The cap on how many adapters one instance can host is set at launch (default 16, up to 64). Higher caps use slightly more GPU memory.
Once an instance is LoRA-enabled and running, attaching or detaching an adapter takes seconds: no pod restart, no GPU churn. The base model and any other attached adapters keep serving traffic the whole time.
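As a rough illustration of the launch-time adapter cap described in this section, the check below applies the default of 16 and the maximum of 64. The function name and the lower bound of 1 are assumptions for the example; the platform's own validation may differ.

```python
DEFAULT_ADAPTER_CAP = 16  # default stated in this guide
MAX_ADAPTER_CAP = 64      # upper limit stated in this guide


def resolve_adapter_cap(requested=None):
    """Return the adapter cap to use at launch (hypothetical helper)."""
    if requested is None:
        return DEFAULT_ADAPTER_CAP
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise ValueError("adapter cap must be an integer")
    # Assumption: a LoRA-enabled instance hosts at least one adapter slot.
    if not 1 <= requested <= MAX_ADAPTER_CAP:
        raise ValueError(f"adapter cap must be between 1 and {MAX_ADAPTER_CAP}")
    return requested


print(resolve_adapter_cap())    # falls back to the default
print(resolve_adapter_cap(32))  # accepted as requested
```

Because the cap is chosen at launch, a check like this belongs in whatever script or form prepares the launch request, not in the attach path.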
Calling the adapter
Each attached adapter gets an alias of the form <display_label>:<id> — the display label is whatever you renamed the fine-tuned model to (defaults to ft<jobId>), and the ID is the adapter's row ID. Aliases are visible on /fine-tuned-models and on the model-instance detail page.
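The alias format can be illustrated with a few helpers. These function names are hypothetical, and the parser assumes the ID is a non-negative integer and that any colon inside a display label belongs to the label; neither assumption is confirmed by this guide.

```python
def default_label(job_id):
    """Default display label for a fine-tune job, per the ft<jobId> convention."""
    return f"ft{job_id}"


def make_alias(display_label, adapter_id):
    """Build an alias of the form <display_label>:<id>."""
    return f"{display_label}:{adapter_id}"


def split_alias(alias):
    """Split an alias into (display_label, adapter_id).

    Splits on the last colon, so a label containing colons stays intact
    (an assumption about how labels behave).
    """
    label, sep, raw_id = alias.rpartition(":")
    if not sep or not label or not raw_id.isdigit():
        raise ValueError(f"not a valid adapter alias: {alias!r}")
    return label, int(raw_id)


alias = make_alias(default_label(10), 1)
print(alias)
print(split_alias(alias))
```

If you rename the fine-tuned model, only the label part of the alias changes in this sketch; the ID part stays tied to the adapter.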
Put the alias in the request's model field. For example, if your adapter's alias is ft10:1:
curl https://api.ecohash.com/v1/chat/completions \
  -H "Authorization: Bearer eco_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ft10:1",
    "messages": [
      {"role": "system", "content": "You are a marketing copywriter for Pulse."},
      {"role": "user", "content": "Write a one-sentence tagline that emphasizes speed."}
    ]
  }'
The router resolves ft10:1 against your account's attachments, finds whichever model instance has it loaded, and forwards the request there. You don't need to know which instance hosts it.
To call the base model on the same instance (without the adapter), use the base form <base_model_name>:<instance_id> — for example qwen2.5-7b-instruct:125. Same instance, same pod; the adapter's weights are applied only when the model field matches the adapter's alias.
The model-instance detail page surfaces both forms in the API tab when adapters are attached, so you can copy the exact model string.
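For clients that don't use curl, the same request can be sketched in Python with the standard library. The helper name `build_chat_request` is hypothetical; the URL, header names, and model strings are taken from the examples in this section, and the request is only built here, not sent.

```python
import json
import urllib.request

API_URL = "https://api.ecohash.com/v1/chat/completions"  # from the curl example


def build_chat_request(model, messages, api_key):
    """Build (but do not send) a chat-completions request for a model string."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


messages = [
    {"role": "user", "content": "Write a one-sentence tagline that emphasizes speed."}
]

# Same endpoint and payload shape; only the model field differs.
adapter_req = build_chat_request("ft10:1", messages, "eco_...")
base_req = build_chat_request("qwen2.5-7b-instruct:125", messages, "eco_...")

# To send a request, pass it to urllib.request.urlopen(adapter_req).
print(json.loads(adapter_req.data)["model"])
print(json.loads(base_req.data)["model"])
```

Keeping the model string as the only varying argument makes it easy to compare adapter and base-model output for the same prompt.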
Detaching
Open the model-instance detail page and find the Attached LoRA Adapters panel. Each row has a Detach button — click it, confirm, and the adapter unloads. Other adapters and any in-flight requests are unaffected.
Detaching doesn't delete the adapter file. The adapter stays at /fine-tuned-models and can be re-attached anytime. Detach when:
- You want to free a slot on a full instance (you've hit the per-instance cap).
- You're moving the adapter to a different model instance — the deploy dialog offers a one-click move via detach-then-attach when you try to attach somewhere else.
Capacity
Each LoRA-enabled instance has a per-instance adapter cap (default 16). Once that many adapters are attached, attempting to attach the next one returns a "model instance at capacity" error. You can:
- Detach an adapter to free a slot.
- Launch a second LoRA-enabled instance and attach there.
The cap is configurable at launch time.
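The two ways of handling a full instance can be modelled with a toy in-memory sketch. The `LoraInstance` class, the `InstanceAtCapacity` exception, and the instance names are stand-ins invented for illustration; the real platform reports the condition as a "model instance at capacity" error, and its API shape is not shown in this guide.

```python
class InstanceAtCapacity(Exception):
    """Stand-in for the platform's "model instance at capacity" error."""


class LoraInstance:
    """Toy model of a LoRA-enabled instance with a per-instance adapter cap."""

    def __init__(self, name, cap=16):
        self.name = name
        self.cap = cap
        self.adapters = []

    def attach(self, alias):
        if alias in self.adapters:
            return  # already attached; nothing to do
        if len(self.adapters) >= self.cap:
            raise InstanceAtCapacity(f"{self.name}: model instance at capacity")
        self.adapters.append(alias)

    def detach(self, alias):
        self.adapters.remove(alias)  # frees one slot


def attach_with_fallback(alias, instances):
    """Attach to the first instance with a free slot and return its name."""
    for inst in instances:
        try:
            inst.attach(alias)
            return inst.name
        except InstanceAtCapacity:
            continue
    raise InstanceAtCapacity("all instances at capacity")


first = LoraInstance("instance-a", cap=2)
second = LoraInstance("instance-b", cap=2)
first.attach("ft10:1")
first.attach("ft11:2")

# instance-a is full, so the adapter lands on the second instance.
placed = attach_with_fallback("ft12:3", [first, second])

# Detaching frees a slot on instance-a for the next adapter.
first.detach("ft10:1")
placed_again = attach_with_fallback("ft13:4", [first, second])
print(placed, placed_again)
```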
Region pinning
A fine-tuned adapter is bound to the region where it was trained. The deploy dialog only lists model instances in that region, and the Launch new instance option locks to the same region. To serve the same adapter in a different region, run a new fine-tuning job there.
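The region rule leads to a simple decision, sketched below. The function name, the returned strings, and the region names `region-a` and `region-b` are hypothetical and serve only to illustrate the logic.

```python
def plan_deployment(adapter_region, target_region, eligible_instance_regions):
    """Pick a deployment path for serving an adapter in target_region.

    eligible_instance_regions: regions where you already run a LoRA-enabled
    instance with the adapter's base model (assumed to be known by the caller).
    """
    if target_region != adapter_region:
        # Adapters are bound to their training region, so they can't be moved.
        return f"run a new fine-tuning job in {target_region}"
    if adapter_region in eligible_instance_regions:
        return "attach to existing instance"
    return f"launch new instance in {adapter_region}"


print(plan_deployment("region-a", "region-a", {"region-a"}))
print(plan_deployment("region-a", "region-a", set()))
print(plan_deployment("region-a", "region-b", {"region-b"}))
```

Note the last case: an eligible instance in the target region doesn't help, because the adapter itself is tied to the region where it was trained.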
Lifecycle summary
fine-tuned-models (registered)
│
│ click Deploy
▼
attach to existing OR launch new
│
▼
attached ◄────── re-attach later
│
│ click Detach
▼
detached
│
│ (adapter file stays in your account)
▼
re-deployable indefinitely
Next steps
- User Inference cost & lifecycle — cost model for the model instance the adapter is attached to
- Calling your endpoint — the call-time story is the same as platform models