Datasets
A dataset is the input to a fine-tuning job. You create one once and reuse it across as many jobs as you want — every job starts from a validated dataset, so you never have to re-upload to retry training with different hyperparameters.
Datasets live at /datasets in the console.
Two upload modes
When you click New Dataset, you pick one of two paths.
Upload (chat format)
Use this if you already have training examples in OpenAI's chat-completion format. Each line of the JSONL file is one training example. The validator runs on upload and your dataset moves directly to ready to train.
Seeds + skill prompt
Use this if you have prompts but no answers — the common case for distilling a skill from an existing teacher model. Each line of the JSONL file is a seed (a user message), and you also paste a skill prompt that defines the persona/tone/format. Your dataset lands at ready to synthesize, and from the dataset detail page you trigger synthesis: pick a teacher model (DeepSeek R1, Qwen 3, etc.), paste your teacher API key (BYOK), and EcoLink fans each seed through the teacher to produce a chat-format child dataset at ready to train.
Most managed fine-tuning runs use the teacher-gen path: you typically don't have hand-curated examples for a new skill, but you do know roughly which prompts users will send. See Synthesis for the full flow, including teacher choices, BYOK key handling, and rejection sampling.
Chat-format JSONL
Each line is one JSON object with a messages array. Two equivalent shapes are accepted.
Preferred — assistant turn inside messages[]
{"messages":[{"role":"system","content":"You are a marketing copywriter for a SaaS analytics product."},{"role":"user","content":"Write a one-sentence tagline that emphasizes speed."},{"role":"assistant","content":"Ship insights at the speed of your data — analytics that keeps up."}]}
{"messages":[{"role":"system","content":"You are a customer-support agent. Be concise and reassuring."},{"role":"user","content":"My export download keeps failing at 80%."},{"role":"assistant","content":"That usually means the export hit a 5-minute timeout. Try the 'Email me when ready' option in the export dialog — it works in the background and you'll get a download link when it's done. If it still fails, send me the export ID."}]}
Legacy — top-level completion field
{"messages":[{"role":"system","content":"You translate English to formal Japanese."},{"role":"user","content":"Could you send me the report by tomorrow?"}],"completion":"明日までにレポートをお送りいただけますでしょうか。"}
The validator reshapes this into the first form before training. Both work; pick whichever matches the data you already have.
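The reshaping described above can be sketched locally if you want all of your data in the preferred shape before upload. This is only an illustration, not EcoLink's validator code: the function name normalize_example is hypothetical, and what the real validator does when a row has both a completion field and a trailing assistant turn is not documented here.

```python
import json


def normalize_example(line: str) -> dict:
    """Return a row in the preferred shape (assistant turn inside messages)."""
    row = json.loads(line)
    messages = list(row["messages"])
    if "completion" in row:
        # Legacy shape: move the top-level completion into a final assistant turn.
        messages.append({"role": "assistant", "content": row["completion"]})
    return {"messages": messages}


legacy = (
    '{"messages":[{"role":"system","content":"You translate English to formal Japanese."},'
    '{"role":"user","content":"Could you send me the report by tomorrow?"}],'
    '"completion":"明日までにレポートをお送りいただけますでしょうか。"}'
)

# ensure_ascii=False keeps the Japanese text readable in the output file.
print(json.dumps(normalize_example(legacy), ensure_ascii=False))
```

Rows that are already in the preferred shape pass through unchanged, so the function can be run over a mixed file.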
What the validator requires
- One JSON object per line, no trailing commas.
- messages is a non-empty array; each entry has role (system | user | assistant) and content.
- messages[0].role === "system" (the skill description).
- The conversation must end with an assistant turn, either as the last messages[] entry or via a top-level completion field.
- Each field's content is non-empty and ≤ 64 KB.
- Total examples ≥ 100 (hard floor).
- No example exceeds the base model's context window (system + user + assistant tokens combined).
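A local pre-flight check along the lines of these rules can catch most problems before you upload. The sketch below is an approximation rather than the validator EcoLink runs: check_line and check_file are hypothetical names, "64 KB" is assumed to mean 65,536 UTF-8 bytes, content is assumed to be a string, and the context-window rule is left out because it depends on the base model's tokenizer.

```python
import json

MIN_EXAMPLES = 100
MAX_CONTENT_BYTES = 64 * 1024  # assumption: "64 KB" means 65,536 UTF-8 bytes
ROLES = {"system", "user", "assistant"}


def check_text(label: str, value) -> list[str]:
    """Check that one content field is a non-empty string within the size limit."""
    if not isinstance(value, str) or not value:
        return [f"{label}: content must be a non-empty string"]
    if len(value.encode("utf-8")) > MAX_CONTENT_BYTES:
        return [f"{label}: content exceeds 64 KB"]
    return []


def check_line(line: str) -> list[str]:
    """Return the problems found in one JSONL line (empty list means OK)."""
    try:
        row = json.loads(line)  # json.loads already rejects trailing commas
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(row, dict):
        return ["line is not a JSON object"]
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["messages must be a non-empty array"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict) or msg.get("role") not in ROLES:
            problems.append(f"messages[{i}]: bad or missing role")
            continue
        problems += check_text(f"messages[{i}]", msg.get("content"))

    first, last = messages[0], messages[-1]
    if isinstance(first, dict) and first.get("role") != "system":
        problems.append("messages[0].role must be system")

    if "completion" in row:
        # Legacy shape: the assistant turn lives in a top-level field.
        problems += check_text("completion", row["completion"])
    elif not (isinstance(last, dict) and last.get("role") == "assistant"):
        problems.append("conversation must end with an assistant turn")
    return problems


def check_file(lines: list[str]) -> tuple[dict[int, list[str]], bool]:
    """Return problems keyed by 1-based line number, and whether the 100-example floor is met."""
    errors = {}
    for number, line in enumerate(lines, start=1):
        problems = check_line(line)
        if problems:
            errors[number] = problems
    valid_count = len(lines) - len(errors)
    return errors, valid_count >= MIN_EXAMPLES
```

Passing the result of `open(path, encoding="utf-8").read().splitlines()` to check_file gives a per-line report similar in spirit to the one on the dataset detail page.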
Warnings (allowed but surfaced in the UI so you can fix them before training):
- Fewer than 1 000 examples → trained adapter quality will be limited.
- Duplicate user content across rows → likely a data-prep bug.
- Extreme length variance → possible schema drift.
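The first two warnings are easy to check for yourself before uploading. The following is a rough sketch under assumptions: collect_warnings is a hypothetical helper, rows are assumed to be already-parsed examples in the preferred shape, and the length-variance warning is not attempted because this page does not define what counts as "extreme".

```python
from collections import Counter


def collect_warnings(rows: list[dict]) -> list[str]:
    """Return human-readable warnings for a list of parsed chat-format rows."""
    warnings = []
    if len(rows) < 1000:
        warnings.append(f"only {len(rows)} examples (fewer than 1000)")

    seen = Counter()
    for row in rows:
        # Count each distinct user message once per row, so repeats inside
        # a single multi-turn conversation are not flagged.
        seen.update({m["content"] for m in row["messages"] if m["role"] == "user"})

    duplicated = sum(1 for count in seen.values() if count > 1)
    if duplicated:
        warnings.append(f"{duplicated} user message(s) appear in more than one row")
    return warnings
```

An empty list means neither warning applies; the real validator may apply different thresholds or matching rules (for example, whitespace normalization) than this exact-match comparison.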
Seeds JSONL
For the teacher-gen path. Each line is one user message that the teacher model will answer.
{"input":"Write a one-sentence tagline for a SaaS analytics tool that emphasizes speed."}
{"input":"Write a 3-bullet announcement post for a new dashboard feature called 'Smart Filters'."}
{"input":"Suggest five subject-line variations for a winback email to lapsed users."}
You also provide a skill prompt in the wizard — the system prompt that gets pasted as messages[0] of every generated row. Three things to nail in a skill prompt: persona (who's answering), voice + tone (how do they write), and format constraints (length, list vs paragraph, forbidden phrases). Example:
You are a marketing copywriter for a B2B SaaS analytics product called Pulse. You write in a confident, direct voice — short sentences, concrete benefits, no hype words like "revolutionary" or "game-changing". Outputs default to one paragraph unless the prompt asks for a list. Always center the customer's outcome (faster decisions, fewer surprises) over feature names.
The wizard ships three preset skill prompts (Marketing copywriter, Customer-support agent, Code reviewer) that you can drop in and edit, so you start from a working baseline.
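To make the relationship between a seed, the skill prompt, and a generated row concrete, here is a small sketch of how the three pieces fit together. It only illustrates the output shape described above: build_row is a hypothetical helper, the teacher answer is a placeholder string, and actual synthesis happens on EcoLink rather than in code like this.

```python
import json


def build_row(skill_prompt: str, seed_line: str, teacher_answer: str) -> str:
    """Combine a skill prompt, one seed line, and a teacher answer into a chat-format row."""
    seed = json.loads(seed_line)
    row = {
        "messages": [
            {"role": "system", "content": skill_prompt},      # skill prompt becomes messages[0]
            {"role": "user", "content": seed["input"]},       # the seed is the user turn
            {"role": "assistant", "content": teacher_answer}, # the teacher's reply closes the row
        ]
    }
    return json.dumps(row, ensure_ascii=False)


skill_prompt = "You are a marketing copywriter for a B2B SaaS analytics product called Pulse."
seed_line = '{"input":"Suggest five subject-line variations for a winback email to lapsed users."}'

# Placeholder text standing in for whatever the teacher model returns.
print(build_row(skill_prompt, seed_line, "<teacher answer goes here>"))
```

Because the skill prompt is repeated in every row, small wording changes to it affect the whole child dataset, which is why it is worth iterating on before triggering synthesis.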
Validation lifecycle
- You click Create & Upload. The console asks the API for a presigned URL; the file uploads directly to EcoLink's S3 bucket.
- The API enqueues a validate job. A worker picks it up within seconds and downloads the file.
- The validator runs the rules above. The dataset's status moves to valid, valid_with_warnings, invalid, or, for seeds-mode uploads, valid_seeds (ready for synthesis).
- The validator writes a summary (counts, warnings, top-N invalid lines with reasons) to the dataset. Open the dataset detail page to read it.
If a dataset is invalid, the detail page shows the first ~50 line errors with line numbers. Fix them in your source file, then re-upload it as a new dataset.
Reusing a dataset
The same dataset can power any number of fine-tuning jobs. Pick it from the wizard's dataset dropdown each time. Useful for:
- Trying multiple base models against the same data.
- Sweeping hyperparameters (LoRA rank, learning rate, epochs).
- Re-running with a fresh seed.
The dataset row shows used in N jobs so you don't accidentally delete one that's still being trained against. The Delete button is disabled while any active job references it.
Parent / child datasets
When you trigger synthesis on a seeds dataset, EcoLink creates a child dataset that moves through valid_seeds → synthesizing → ready_for_training. The child references its parent for audit. You can train against the child as many times as you want; the parent stays available for re-synthesis if you want different teacher settings.
parent (seeds, skill prompt)
└─ child (chat format, ready_for_training) ← train against this
Open the child to see its from #<parent_id> link.
Limits
- File size: 50 MB per upload. Larger files are rejected at the gateway with HTTP 413.
- At least 100 examples to start training (1 000+ recommended for usable quality).
- Cost: $0 for upload, $0 for validation. Synthesis is also free on EcoLink — you pay your teacher provider directly via BYOK.
If you have a dataset larger than 50 MB, split it: train against a 50 MB shard first to validate the recipe, then concatenate before re-running. Or contact support for a raised limit on your account.
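Splitting has to happen on line boundaries so that every shard is still valid JSONL. A minimal sketch is shown below; split_jsonl and the .partNNN naming are hypothetical, and the default limit assumes "50 MB" means 50,000,000 bytes (the more conservative reading, since the page does not say whether it means decimal or binary megabytes).

```python
from pathlib import Path


def split_jsonl(path: str, max_bytes: int = 50_000_000) -> list[Path]:
    """Split a JSONL file into shards of at most max_bytes, never cutting a line."""
    src = Path(path)
    shards: list[Path] = []
    out = None
    size = 0
    try:
        with src.open("rb") as f:
            for line in f:
                if not line.strip():
                    continue  # drop blank lines
                if not line.endswith(b"\n"):
                    line += b"\n"  # make sure the final line is terminated
                if len(line) > max_bytes:
                    raise ValueError("a single line is larger than max_bytes")
                if out is None or size + len(line) > max_bytes:
                    # Current shard is full (or this is the first line): start a new one.
                    if out is not None:
                        out.close()
                    shard = src.with_name(f"{src.stem}.part{len(shards) + 1:03d}{src.suffix}")
                    shards.append(shard)
                    out = shard.open("wb")
                    size = 0
                out.write(line)
                size += len(line)
    finally:
        if out is not None:
            out.close()
    return shards
```

Calling `split_jsonl("train.jsonl")` would write train.part001.jsonl, train.part002.jsonl, and so on next to the source file. Keep in mind that each shard you upload still has to meet the 100-example floor on its own.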
Next steps
- Synthesis — generate chat-format data from seeds + a teacher model
- Launching a job — train an adapter against a ready dataset