Synthesis

Synthesis turns a seeds + skill prompt dataset into a chat-format training dataset by asking a stronger LLM (the teacher) to answer each seed in the persona you specified. The result is a child dataset you can train against the same way as a hand-curated upload.

This is the most common path for distilling a "digital employee" skill: you don't have hand-written answers for every prompt your users will send, but you do have a sense of what those prompts look like.

When to use synthesis

Situation	Use synthesis?
You already have ~1 000 hand-curated answer examples in the right voice	No — upload them as chat format directly
You have a few hundred prompts but no answers	Yes
You want to teach a small model new factual knowledge	Probably not — distillation propagates the teacher's voice, not memorizes facts

Teacher models

Teachers are open-source frontier models served by their own commercial API providers. EcoLink doesn't host the teacher — you pay the provider directly via your own API key (BYOK).

Currently supported teacher providers (the wizard's dropdown updates as new ones come online):

Provider	Model	Strength
DeepSeek	DeepSeek R1 / V3	Strong reasoning + code; good default
Alibaba	Qwen 3 / Qwen 3.5 (Max / Plus)	Strong on Chinese + structured output
Z.AI / Zhipu	GLM-4.5	Generalist
Moonshot	Kimi K2	Long context

If a provider you want isn't there, contact support.

BYOK — bring your own teacher API key

You sign up directly with the teacher's provider (DeepSeek, Alibaba Cloud, etc.), get their API key, and paste it into the synthesis dialog. EcoLink:

Encrypts the key at rest in your account's storage.
Mounts the decrypted key only inside the synthesis job for the duration of the run.
Deletes the mounted copy when the job finishes (success or failure).

The plaintext key never touches our database or logs. You're billed by the teacher provider for whatever the synthesis costs at their price — typically a fraction of a cent per seed for a ~1k-token answer.

Estimating teacher cost: roughly seeds × avg_completion_tokens × teacher_$_per_token. For 1 000 seeds at 500 output tokens each on DeepSeek V3 (~$0.27 per million output tokens at writing), that's ~$0.14. Real costs scale with completion length and which teacher you pick. The synthesis dialog shows a per-provider estimate before you start.

Triggering synthesis

Open a valid_seeds dataset (/datasets/{id}) and click Generate Synthetic Data (or click the same button on the row in the list page).
Pick a teacher from the dropdown.
Paste your API key for that provider. Keys are validated with a no-op call before the job starts; an invalid key fails fast rather than after spending real money.
(Optional) Toggle Rejection sampling — for each seed, the teacher generates 3 candidates and a separate judge call picks the best one. Better quality, ~3× the cost.
Click Start Synthesis. The dataset moves to synthesizing and the detail page polls for progress.

A typical 1 000-seed synthesis takes 5–15 minutes wall clock, depending on the teacher's rate limits and whether rejection sampling is on.

What you get back

When all seeds have been processed (or skipped due to teacher errors), EcoLink creates a child dataset linked to your seeds dataset, status ready_for_training. Open it from the parent's row to:

Read a sample preview of the generated rows.
Download the full JSONL to inspect by hand if you want.
Click Train to launch a fine-tuning job — the wizard prefills the child for you.

Each row in the child carries metadata showing which teacher answered it, useful when you're sweeping teachers to compare quality.

Failure handling

Bad API key: synthesis fails before any cost is incurred — the pre-flight no-op call catches it.
Teacher rate limit: EcoLink backs off and retries; usually no user action needed.
Some seeds fail (teacher returned empty / errored): synthesis still produces a child dataset with the successful rows. The detail page shows a count of skipped seeds. If too many failed, run synthesis again with a different teacher.
All seeds fail: synthesis goes to synthesis_failed; no child dataset is created. The detail page shows the most common error, usually a key/quota issue with the teacher.

Next steps

Launching a job — train an adapter against the synthesized dataset
Datasets — back to dataset basics

When to use synthesis​

Teacher models​

BYOK — bring your own teacher API key​

Triggering synthesis​

What you get back​

Failure handling​

Next steps​