Speech-to-text

Transcribe audio to text with Whisper Large v3 via POST /v1/audio/transcriptions. The endpoint is OpenAI-compatible.

Basic request

curl https://api.ecohash.com/v1/audio/transcriptions \
  -H "Authorization: Bearer eco_YOUR_KEY" \
  -F model=large-v3 \
  -F file=@recording.mp3

Response:

{ "text": "Hello, this is a test of the EcoLink transcription API." }

Parameters

This endpoint takes multipart/form-data (not JSON):

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| model | string | | Required. "large-v3" |
| file | file | | Required. Audio file: mp3, wav, m4a, webm, ogg, flac |
| language | string | auto-detect | ISO-639-1 code: en, es, fr, zh, ja, etc. Specify to skip auto-detect |
| prompt | string | empty | Context prompt to bias the transcription |
| response_format | string | "json" | "json", "text", "srt", "vtt", "verbose_json" |
| temperature | number | 0 | Sampling temperature; raise if you get hallucinations on silent audio |
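
The optional fields can be passed through the OpenAI Python client in the same call as the required ones. The following is a minimal sketch, assuming the client forwards these keyword arguments unchanged as form fields; the helper name `transcribe_file` and its signature are illustrative and not part of the API.

```python
def transcribe_file(client, path, language=None, prompt=None,
                    response_format="json", temperature=0):
    """Send one audio file, including only the optional fields that were set.

    `client` is expected to be an OpenAI-style client pointed at the
    /v1 base URL (hypothetical helper, not part of the API itself).
    """
    fields = {
        "model": "large-v3",
        "response_format": response_format,
        "temperature": temperature,
    }
    # Omit language to keep auto-detect; omit prompt when there is none.
    if language:
        fields["language"] = language
    if prompt:
        fields["prompt"] = prompt
    with open(path, "rb") as audio:
        return client.audio.transcriptions.create(file=audio, **fields)
```

With a client built as `OpenAI(api_key="eco_...", base_url="https://api.ecohash.com/v1")`, a call such as `transcribe_file(client, "recording.mp3", language="en")` would skip auto-detect for that request.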

Response formats

json (default)

{ "text": "Full transcript goes here." }

text

Plain text, no JSON wrapping:

Full transcript goes here.

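srt / vtt (subtitles with timestamps)

The example below is srt. Standard WebVTT output carries the same cues but starts with a WEBVTT header line and uses a period instead of a comma before the milliseconds.
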
1
00:00:00,000 --> 00:00:04,000
Hello, this is a test.

2
00:00:04,000 --> 00:00:08,500
Of the EcoLink transcription API.
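
If you request srt but later need the cue times as numbers, the subtitle text can be parsed back into tuples. This is a small sketch that assumes output shaped like the srt example above (an optional index line, a timing line, then text); `parse_srt` is an illustrative name, not something the API provides.

```python
import re

# HH:MM:SS,mmm as used by SRT; a period before the milliseconds is tolerated too.
_STAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")

def _seconds(stamp):
    h, m, s, ms = map(int, _STAMP.fullmatch(stamp.strip()).groups())
    return h * 3600 + m * 60 + s + ms / 1000

def parse_srt(srt_text):
    """Return a list of (start_seconds, end_seconds, text) tuples."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        # Locate the timing line; everything after it is the cue text.
        for i, line in enumerate(lines):
            if "-->" in line:
                start, end = line.split("-->")
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((_seconds(start), _seconds(end), text))
                break
    return cues
```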

verbose_json (segments + timestamps)

{
  "task": "transcribe",
  "language": "en",
  "duration": 8.5,
  "text": "Hello, this is a test. Of the EcoLink transcription API.",
  "segments": [
    { "start": 0.0, "end": 4.0, "text": "Hello, this is a test." },
    { "start": 4.0, "end": 8.5, "text": "Of the EcoLink transcription API." }
  ]
}
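
When calling the endpoint without an SDK, the verbose_json body can be read with the standard json module. The sketch below assumes a response with the fields shown above and renders one line per segment; `timeline` is an illustrative helper name.

```python
import json

def timeline(verbose_json_text):
    """Render each segment as '[MM:SS.s-MM:SS.s] text'."""
    data = json.loads(verbose_json_text)

    def clock(seconds):
        # Work in whole tenths of a second so rounding never yields ":60.0".
        tenths = round(seconds * 10)
        minutes, rem = divmod(tenths, 600)
        return f"{minutes:02d}:{rem / 10:04.1f}"

    return [
        f"[{clock(seg['start'])}-{clock(seg['end'])}] {seg['text']}"
        for seg in data["segments"]
    ]
```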

Python

from openai import OpenAI

client = OpenAI(api_key="eco_...", base_url="https://api.ecohash.com/v1")

with open("recording.mp3", "rb") as f:
    resp = client.audio.transcriptions.create(
        model="large-v3",
        file=f,
        response_format="verbose_json",
    )

print(resp.text)
for seg in resp.segments:
    print(f"[{seg.start:.2f}-{seg.end:.2f}] {seg.text}")

Limits

  • File size: up to 25 MB
  • Duration: up to 30 minutes per request
  • For longer audio: split client-side into 5–10 minute chunks, transcribe each, concatenate
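
The split, transcribe, concatenate approach for long audio can be sketched as two small helpers: one plans chunk boundaries, the other stitches per-chunk results back together and shifts segment timestamps by each chunk's offset. Both function names and the 600-second default are illustrative choices; the actual audio cutting is left to whatever tool you already use.

```python
def plan_chunks(total_seconds, chunk_seconds=600):
    """Return (start, end) offsets in seconds that cover the whole recording."""
    bounds = []
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        start = end
    return bounds

def merge_transcripts(chunks):
    """Join per-chunk results into one transcript.

    `chunks` is a list of (offset_seconds, text, segments) in playback order,
    where each segment is a dict with start/end/text relative to its chunk.
    """
    text = " ".join(chunk_text.strip() for _, chunk_text, _ in chunks)
    segments = [
        {"start": s["start"] + offset, "end": s["end"] + offset, "text": s["text"]}
        for offset, _, segs in chunks
        for s in segs
    ]
    return text, segments
```

Fixed boundaries can cut a word in half, so in practice you may prefer to move each cut to a nearby pause before uploading.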

Tips

  • Specify language when you know it — avoids auto-detect mistakes on short clips.
  • Use prompt to bias toward domain terms: "This is a discussion about Kubernetes, nginx, and CEL expressions." helps the model spell technical terms.
  • Strip silence before upload if your audio has long silent gaps — Whisper can hallucinate on silence.
  • Prefer mono, 16-kHz audio. Stereo gets downmixed; higher sample rates get resampled.
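
For WAV input, the mono and 16-kHz recommendation can be checked before upload with the standard-library wave module. This is a rough pre-flight sketch; `check_wav` is an illustrative name, and it only inspects the header of uncompressed WAV files, not mp3 or other formats.

```python
import wave

def check_wav(path, want_channels=1, want_rate=16000):
    """Return a list of notes about a WAV file; an empty list means mono 16 kHz."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
    notes = []
    if channels != want_channels:
        notes.append(f"{channels} channels, expected {want_channels}")
    if rate != want_rate:
        notes.append(f"{rate} Hz sample rate, expected {want_rate} Hz")
    return notes
```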

Billing

Speech-to-text bills per second of audio duration (not file size, not processing time).
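
Because billing follows audio duration, a rough estimate can be built from the duration field of verbose_json responses. The sketch below makes two assumptions that this page does not state: that no per-request rounding is applied, and that you supply the per-second price yourself (no rate is assumed here). Both function names are illustrative.

```python
def billable_seconds(responses):
    """Sum the `duration` field (seconds) across verbose_json responses."""
    return sum(r["duration"] for r in responses)

def estimate_cost(responses, price_per_second):
    """Estimate cost; price_per_second must come from your own plan details."""
    return billable_seconds(responses) * price_per_second
```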