Synthesis
Synthesis turns a seeds + skill prompt dataset into a chat-format training dataset by asking a stronger LLM (the teacher) to answer each seed in the persona you specified. The result is a child dataset you can train against the same way as a hand-curated upload.
This is the most common path for distilling a "digital employee" skill: you don't have hand-written answers for every prompt your users will send, but you do have a sense of what those prompts look like.
When to use synthesis
| Situation | Use synthesis? |
|---|---|
| You already have ~1 000 hand-curated answer examples in the right voice | No — upload them as chat format directly |
| You have a few hundred prompts but no answers | Yes |
| You want to teach a small model new factual knowledge | Probably not — distillation propagates the teacher's voice, not memorizes facts |
Teacher models
Teachers are open-source frontier models served by their own commercial API providers. EcoLink doesn't host the teacher — you pay the provider directly via your own API key (BYOK).
Currently supported teacher providers (the wizard's dropdown updates as new ones come online):
| Provider | Model | Strength |
|---|---|---|
| DeepSeek | DeepSeek R1 / V3 | Strong reasoning + code; good default |
| Alibaba | Qwen 3 / Qwen 3.5 (Max / Plus) | Strong on Chinese + structured output |
| Z.AI / Zhipu | GLM-4.5 | Generalist |
| Moonshot | Kimi K2 | Long context |
If a provider you want isn't there, contact support.
BYOK — bring your own teacher API key
You sign up directly with the teacher's provider (DeepSeek, Alibaba Cloud, etc.), get their API key, and paste it into the synthesis dialog. EcoLink:
- Encrypts the key at rest in your account's storage.
- Mounts the decrypted key only inside the synthesis job for the duration of the run.
- Deletes the mounted copy when the job finishes (success or failure).
The plaintext key never touches our database or logs. You're billed by the teacher provider for whatever the synthesis costs at their price — typically a fraction of a cent per seed for a ~1k-token answer.
Estimating teacher cost: roughly seeds × avg_completion_tokens × teacher_$_per_token. For 1 000 seeds at 500 output tokens each on DeepSeek V3 (~$0.27 per million output tokens at writing), that's ~$0.14. Real costs scale with completion length and which teacher you pick. The synthesis dialog shows a per-provider estimate before you start.
Triggering synthesis
- Open a
valid_seedsdataset (/datasets/{id}) and click Generate Synthetic Data (or click the same button on the row in the list page). - Pick a teacher from the dropdown.
- Paste your API key for that provider. Keys are validated with a no-op call before the job starts; an invalid key fails fast rather than after spending real money.
- (Optional) Toggle Rejection sampling — for each seed, the teacher generates 3 candidates and a separate judge call picks the best one. Better quality, ~3× the cost.
- Click Start Synthesis. The dataset moves to
synthesizingand the detail page polls for progress.
A typical 1 000-seed synthesis takes 5–15 minutes wall clock, depending on the teacher's rate limits and whether rejection sampling is on.
What you get back
When all seeds have been processed (or skipped due to teacher errors), EcoLink creates a child dataset linked to your seeds dataset, status ready_for_training. Open it from the parent's row to:
- Read a sample preview of the generated rows.
- Download the full JSONL to inspect by hand if you want.
- Click Train to launch a fine-tuning job — the wizard prefills the child for you.
Each row in the child carries metadata showing which teacher answered it, useful when you're sweeping teachers to compare quality.
Failure handling
- Bad API key: synthesis fails before any cost is incurred — the pre-flight no-op call catches it.
- Teacher rate limit: EcoLink backs off and retries; usually no user action needed.
- Some seeds fail (teacher returned empty / errored): synthesis still produces a child dataset with the successful rows. The detail page shows a count of skipped seeds. If too many failed, run synthesis again with a different teacher.
- All seeds fail: synthesis goes to
synthesis_failed; no child dataset is created. The detail page shows the most common error, usually a key/quota issue with the teacher.
Next steps
- Launching a job — train an adapter against the synthesized dataset
- Datasets — back to dataset basics