Launching a fine-tuning job

Once you have a ready_for_training dataset (uploaded directly or generated by synthesis), the wizard at /fine-tune/new runs the actual training.

The shortest path: from /datasets, find the row with status ready to train and click Train. The wizard opens with the dataset prefilled.

The wizard, step by step

1. Dataset

Pre-filled if you arrived from a dataset row. Otherwise, pick from the dropdown of your account's ready_for_training datasets. The step shows the example count and warns if the dataset is below the recommended 1,000 examples.

2. Base model

The dropdown lists every base the platform marks as fine-tune-eligible and that has an active deployment. Pick one. The wizard shows the base's parameter count and the GPU memory each parameter consumes.

3. GPU type and count

Pick a GPU type from the dropdown (the list updates as the platform's catalog changes) and how many to use. The wizard estimates VRAM headroom and decides automatically whether to use 4-bit QLoRA (when the base wouldn't fit in full precision on the chosen GPU). A small banner explains the decision so you know what's happening.
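The fit check behind that banner can be approximated by hand. The sketch below is illustrative only, not the wizard's actual logic: it assumes "full precision" means 16-bit weights, counts the base weights alone (no activations, optimizer state, or adapter), and reserves a hypothetical 30% of VRAM as headroom. The function names and labels are made up for this example.

```python
def weights_gib(params_billion: float, bits_per_param: int) -> float:
    """Approximate memory for the base weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30


def choose_precision(params_billion: float, gpu_vram_gib: float,
                     headroom_fraction: float = 0.3) -> str:
    """Pick a training mode from a rough weights-vs-VRAM comparison.

    headroom_fraction is an assumed share of VRAM kept free for
    activations, gradients, and optimizer state.
    """
    usable = gpu_vram_gib * (1 - headroom_fraction)
    if weights_gib(params_billion, 16) <= usable:
        return "lora-16bit"      # base fits unquantized
    if weights_gib(params_billion, 4) <= usable:
        return "qlora-4bit"      # base fits only when quantized to 4 bits
    return "does-not-fit"        # pick a larger GPU type


if __name__ == "__main__":
    for size, vram in [(7, 24), (13, 24), (70, 24)]:
        print(size, "B on", vram, "GiB:", choose_precision(size, vram))
```

Under these assumptions a 7B base fits unquantized on a 24 GiB card, while a 13B base falls back to 4-bit QLoRA.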

GPU count > 1 enables Distributed Data Parallel (DDP): the trainer splits each batch across ranks, so the job finishes sooner, but you pay the hourly rate for every GPU. For a typical 1k-example dataset on a 7B base, 1 GPU is almost always the right call.

4. LoRA hyperparameters

The wizard splits these into two tiers:

Tier 1 — pick from sane defaults (most users stop here):

| Field | What it does | Sensible default |
| --- | --- | --- |
| Epochs | How many passes over the dataset | 3 |
| Learning rate | Step size during training | 2e-4 (10× a full fine-tuning learning rate, typical for LoRA SFT) |
| LoRA r | Rank of the adapter (capacity vs. size tradeoff) | 16 |
| LoRA alpha | Scaling factor for the adapter (usually 2 × r) | 32 |
| Micro batch size | Per-step batch on each GPU | 4 |
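To get a feel for the capacity vs. size tradeoff of r, you can count the trainable parameters an adapter adds. This is a rough sketch, not platform code: each adapted projection of shape d_in × d_out gains two low-rank matrices with r × (d_in + d_out) parameters in total, and the adapter update is scaled by alpha / r. The model shape used here (32 layers, 4096-wide q_proj and v_proj) is an assumed 7B-class example, not a value read from the platform.

```python
def lora_params_per_module(d_in: int, d_out: int, r: int) -> int:
    """Parameters in one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_total_params(num_layers: int, module_shapes: list[tuple[int, int]], r: int) -> int:
    """Total trainable adapter parameters across all layers."""
    per_layer = sum(lora_params_per_module(d_in, d_out, r)
                    for d_in, d_out in module_shapes)
    return num_layers * per_layer


def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to the adapter's output."""
    return alpha / r


if __name__ == "__main__":
    # Assumed shape: 32 layers, q_proj and v_proj both 4096 x 4096.
    shapes = [(4096, 4096), (4096, 4096)]
    print(lora_total_params(32, shapes, r=16))  # 8388608
    print(lora_scale(32, 16))                   # 2.0
```

Doubling r doubles the adapter size; keeping alpha at 2 × r keeps the scale at 2.0.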

Tier 2 — Advanced (collapsible):

| Field | What it does |
| --- | --- |
| Dropout | Regularization on the adapter; small values 0.0–0.1 |
| Target modules | Which projection layers the adapter modifies (default: q_proj, v_proj) |
| Gradient accumulation | Larger effective batch without more VRAM |
| Max sequence length | Truncate examples longer than this (defaults to base model's context) |
| Save steps | How often to checkpoint mid-epoch |
| Seed | Reproducibility |

If Tier 2 doesn't matter for your case, leave it. The defaults are tuned for the common case (chat-style SFT on a 7B base with a 1k–10k row dataset).
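Micro batch size, gradient accumulation, and GPU count combine into the effective batch, which in turn sets how many optimizer steps a job runs. A minimal sketch of that arithmetic follows; it assumes the last partial batch of each epoch still counts as a step, which may differ from what the trainer does.

```python
import math


def effective_batch(micro_batch: int, grad_accum: int = 1, num_gpus: int = 1) -> int:
    """Examples consumed per optimizer step."""
    return micro_batch * grad_accum * num_gpus


def optimizer_steps(num_examples: int, epochs: int, micro_batch: int,
                    grad_accum: int = 1, num_gpus: int = 1) -> int:
    """Total optimizer steps, counting a trailing partial batch as a step."""
    per_epoch = math.ceil(num_examples / effective_batch(micro_batch, grad_accum, num_gpus))
    return per_epoch * epochs


if __name__ == "__main__":
    # Tier 1 defaults on a 1,000-example dataset, 1 GPU.
    print(optimizer_steps(1000, epochs=3, micro_batch=4))                # 750
    # Same job with gradient accumulation of 4: effective batch 16.
    print(optimizer_steps(1000, epochs=3, micro_batch=4, grad_accum=4))  # 189
```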

5. Cost estimate

The wizard computes an estimated total based on:

  • The chosen GPU's hourly rate × estimated training time (the estimate is computed in seconds and converted to hours; it is a function of dataset size, sequence length, epochs, and batch size).
  • Fixed overhead constants for image pull, model load, dataset prep, evaluation, and register.

The estimate is rendered as a range, not a point. The lower bound assumes everything goes smoothly; the upper bound adds 30% headroom for variable phases like a cold image pull. Both bounds use the same constants the meter charges against, so the real bill almost always lands inside the range.

The wizard also shows the balance impact — your account balance now, minus the upper-bound estimate, equals what you'll have left if training maxes out.
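The range and the balance impact can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the wizard's formula: it treats the fixed overhead as billable seconds at the GPU's hourly rate and applies the 30% headroom to the whole lower bound. The real estimator may apply headroom only to the variable phases. All names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRange:
    lower: float
    upper: float


def estimate_cost(hourly_rate: float, training_seconds: float,
                  overhead_seconds: float, headroom: float = 0.30) -> CostRange:
    """Lower bound: rate x billable hours. Upper bound: lower plus headroom."""
    billable_hours = (training_seconds + overhead_seconds) / 3600.0
    lower = hourly_rate * billable_hours
    return CostRange(lower=lower, upper=lower * (1.0 + headroom))


def balance_after(balance: float, estimate: CostRange) -> float:
    """Balance left if training costs the full upper bound."""
    return balance - estimate.upper


if __name__ == "__main__":
    # Hypothetical numbers: 2.00 per GPU-hour, 50 min training, 10 min overhead.
    est = estimate_cost(hourly_rate=2.0, training_seconds=3000, overhead_seconds=600)
    print(f"{est.lower:.2f} to {est.upper:.2f}")
    print(f"balance after: {balance_after(10.0, est):.2f}")
```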

6. Approve & Train

Two checks before the job starts:

  1. Balance gate — if your account balance is less than the upper-bound estimate, the button is disabled. Top up first.
  2. Region capacity — if the chosen region is out of your GPU type, the button shows the wait. You can pick a different GPU type or wait.

When both are clear, click Approve & Train. The job moves to training, kicks off the GPU pod, and the detail page (/fine-tune/{id}) starts polling for progress.
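The two checks amount to a small decision function. The sketch below only illustrates the order of the gates described above; the function name, its inputs, and the message strings are hypothetical and do not reflect EcoLink's API.

```python
from typing import Optional


def approve_button_state(balance: float, upper_bound_estimate: float,
                         gpus_free_in_region: int,
                         gpus_requested: int) -> tuple[bool, Optional[str]]:
    """Return (enabled, reason_if_blocked) for the Approve & Train button."""
    # Gate 1: balance must cover the upper-bound estimate.
    if balance < upper_bound_estimate:
        return False, "Top up first"
    # Gate 2: the region must have enough GPUs of the chosen type.
    if gpus_free_in_region < gpus_requested:
        return False, "Waiting for capacity"
    return True, None


if __name__ == "__main__":
    print(approve_button_state(5.0, 7.5, gpus_free_in_region=4, gpus_requested=1))
    print(approve_button_state(10.0, 7.5, gpus_free_in_region=0, gpus_requested=1))
    print(approve_button_state(10.0, 7.5, gpus_free_in_region=4, gpus_requested=1))
```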

Job status reference

The detail page renders a different panel for each state. Quick reference:

| State | What you see |
| --- | --- |
| training | Live loss / step count / steps remaining, refreshed every few seconds |
| evaluating | Eval set running. Usually 1–3 minutes |
| trained_pending_register → registered | Adapter being registered. Quick: about a second |
| paused | "Balance reached zero" banner with a Top up link, plus a Resume button. See Auto-pause |
| failed | Error class and last log lines |
| cancelled | Stopped by the user; metered GPU time settled. See Cancelling |
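If you script against job status rather than watching the detail page, the states above suggest a simple polling loop. The sketch below is an assumption-heavy illustration: `fetch_status` is a placeholder you would supply (this page does not document a status endpoint), and treating registered, paused, failed, and cancelled as the states that stop polling is a choice made for this example.

```python
import time
from typing import Callable

# States where nothing changes until the user acts (assumed for this sketch).
STOP_STATES = {"registered", "paused", "failed", "cancelled"}


def wait_for_job(fetch_status: Callable[[], str],
                 interval_s: float = 3.0,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_status() until the job reaches a stop state, then return it."""
    while True:
        status = fetch_status()
        if status in STOP_STATES:
            return status
        sleep(interval_s)


if __name__ == "__main__":
    # Fake status source standing in for a real lookup.
    fake = iter(["training", "training", "evaluating",
                 "trained_pending_register", "registered"])
    print(wait_for_job(lambda: next(fake), sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the loop testable without real delays.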

Auto-pause when your balance hits zero

Training is metered per-minute against your account balance. If the balance reaches zero mid-training:

  1. EcoLink charges for the GPU time consumed up to that minute.
  2. The trainer pod is told to checkpoint and exit gracefully — work in progress is saved, not thrown away.
  3. The job's status flips to paused. The meter stops; no further charges accrue.
  4. The detail page shows a "Balance reached zero" banner with a Top up link to the billing page and a disabled Resume button.
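The sequence above can be modelled as a per-minute loop. This is a toy simulation, not the real meter: it works in integer cents, assumes the final charge is clamped so the balance never goes negative, and uses made-up status labels for the return value.

```python
def simulate_metering(balance_cents: int, rate_cents_per_min: int,
                      minutes_needed: int) -> tuple[str, int, int]:
    """Return (status, minutes_run, remaining_balance_cents)."""
    minutes_run = 0
    while minutes_run < minutes_needed:
        if balance_cents <= 0:
            # Balance exhausted: checkpoint, exit, stop the meter.
            return "paused", minutes_run, 0
        # Charge for this minute, never below zero (assumption).
        balance_cents -= min(rate_cents_per_min, balance_cents)
        minutes_run += 1
    return "training_complete", minutes_run, balance_cents


if __name__ == "__main__":
    # 1.00 balance at 0.30 per minute cannot cover a 10-minute job.
    print(simulate_metering(100, 30, 10))
    # The same balance covers a 3-minute job with 0.10 left.
    print(simulate_metering(100, 30, 3))
```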

To continue from where you left off:

  1. Top up your balance from /billing.
  2. Return to the fine-tuning job's detail page. Resume is now clickable.
  3. Click Resume. EcoLink re-runs the same balance gate (you must have at least the remaining estimated cost), then re-launches the trainer pod from the latest checkpoint. The metering clock starts fresh — the time you spent paused is not billed.

You can also Pause manually at any point: click Pause if you want to free up GPU capacity for something else. Same checkpoint-and-resume flow as the auto-pause.

Cancelling

The Cancel button on the detail page settles whatever GPU time the meter has consumed and stops the job permanently. The adapter file (whatever was checkpointed) stays on EcoLink storage for 30 days in case you want to inspect it; after that it's deleted unless promoted to a registered fine-tuned model.
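To work out when a cancelled job's checkpoint will be deleted, add the retention window to the cancellation time. The sketch assumes the 30 days are counted from the moment of cancellation, which this page does not state explicitly; the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # retention window for cancelled-job adapters


def deletion_time(cancelled_at: datetime) -> datetime:
    """Earliest time the checkpointed adapter may be deleted (assumed start: cancellation)."""
    return cancelled_at + RETENTION


if __name__ == "__main__":
    cancelled = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(deletion_time(cancelled).isoformat())
```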

After registration

When the job lands at registered, the detail page surfaces a Deploy button. That's the next step — attaching the adapter to a model instance so you can call it via the API.

Next steps