Reranker
A reranker takes a query and a list of candidate documents and re-orders the candidates by relevance to the query. Embedding search is fast but approximate; a reranker is slower but considerably more accurate. Typical pipeline:
- Retrieve the top 50–200 candidates via embeddings / BM25 (fast, approximate)
- Rerank those candidates with a cross-encoder (slow, precise)
- Feed the top 5–10 to an LLM for the final answer
Endpoint
POST /v1/rerank — this is a non-OpenAI extension (OpenAI has no rerank endpoint). Shape follows Cohere's rerank API.
curl https://api.ecohash.com/v1/rerank \
  -H "Authorization: Bearer eco_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-large",
    "query": "how does solar power work?",
    "documents": [
      "Solar panels convert sunlight into electricity via the photovoltaic effect.",
      "My cat likes sunlight but cannot produce electricity.",
      "Photovoltaic cells made of silicon transform photons into an electric current."
    ]
  }'
Response:
{
  "results": [
    { "index": 0, "relevance_score": 0.987 },
    { "index": 2, "relevance_score": 0.945 },
    { "index": 1, "relevance_score": 0.012 }
  ],
  "model": "BAAI/bge-reranker-large",
  "usage": { "total_tokens": 64 }
}
The results array is sorted by relevance_score in descending order. Each index refers to a position in the original documents array.
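Since each result carries only an index and a score, the caller has to join the results back to the list that was sent. A minimal sketch of that join, using the sample request and response above; the helper name attach_documents is hypothetical and not part of the API:

```python
def attach_documents(documents, results):
    """Pair each rerank result with the text it refers to.

    `results` is the "results" array from the response; each entry's
    "index" is a position in the `documents` list that was sent.
    """
    return [(documents[r["index"]], r["relevance_score"]) for r in results]


documents = [
    "Solar panels convert sunlight into electricity via the photovoltaic effect.",
    "My cat likes sunlight but cannot produce electricity.",
    "Photovoltaic cells made of silicon transform photons into an electric current.",
]
# The "results" array copied from the sample response.
results = [
    {"index": 0, "relevance_score": 0.987},
    {"index": 2, "relevance_score": 0.945},
    {"index": 1, "relevance_score": 0.012},
]

ranked = attach_documents(documents, results)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

The output order follows the response, so the two photovoltaic sentences come first and the cat sentence last.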
Request parameters
| Parameter | Type | Notes |
|---|---|---|
| model | string | Required. Reranker model ID |
| query | string | Required. The user's query |
| documents | array of strings | Required. 1–1000 candidates |
| top_n | integer | Optional. Return only the top N scores (default: all) |
| return_documents | bool | Optional. Include the document text in each result (default: false, which saves bandwidth) |
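One way to keep requests within the limits in the table is to build the JSON body through a small helper that checks the document count before anything is sent. This is only a sketch: build_rerank_payload is a hypothetical name, and omitting optional fields when they are left at their defaults is a design choice here, not something the API requires.

```python
def build_rerank_payload(model, query, documents, top_n=None, return_documents=False):
    """Build the JSON body for POST /v1/rerank from the documented parameters."""
    documents = list(documents)
    # The table above documents a limit of 1-1000 candidates per request.
    if not 1 <= len(documents) <= 1000:
        raise ValueError("documents must contain between 1 and 1000 items")

    payload = {"model": model, "query": query, "documents": documents}
    # Optional fields are only sent when set, so the server defaults apply.
    if top_n is not None:
        payload["top_n"] = top_n
    if return_documents:
        payload["return_documents"] = True
    return payload


payload = build_rerank_payload(
    "BAAI/bge-reranker-large",
    "how does solar power work?",
    ["Solar panels convert sunlight into electricity.", "My cat likes sunlight."],
    top_n=1,
)
print(payload)
```

The resulting dictionary can be passed as the json argument of an HTTP client call.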
Typical use
import requests


def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[int, float]]:
    """Return (original_index, relevance_score) pairs, best match first."""
    r = requests.post(
        "https://api.ecohash.com/v1/rerank",
        headers={"Authorization": "Bearer eco_YOUR_KEY"},
        json={
            "model": "BAAI/bge-reranker-large",
            "query": query,
            "documents": candidates,
            "top_n": top_n,
        },
        timeout=30,  # avoid hanging indefinitely on a stalled connection
    )
    r.raise_for_status()
    return [(res["index"], res["relevance_score"]) for res in r.json()["results"]]


# After embedding-based retrieval:
candidates = retrieve_top_k(query, k=100)  # your existing vector search
top = rerank(query, [c.text for c in candidates], top_n=5)
# Map the returned indices back to the full candidate objects.
reranked = [candidates[i] for i, _score in top]
Cost vs speed
A reranker runs the query together with each document through a cross-encoder, so compute grows quadratically with document length and linearly with document count. Rough costs:
- 50 short (100-token) docs: ~$0.001 per query
- 200 short docs: ~$0.004 per query
- 50 long (1K-token) docs: ~$0.01 per query
Keep document text concise in reranker input. Use the full document only at LLM time.
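The three rough figures above happen to be consistent with a flat rate per document token (about $0.001 per 5,000 tokens). The sketch below uses that back-derived rate to estimate per-query cost and shows a crude way to shorten reranker input. Both the rate and the helper names are assumptions for illustration, not published pricing or part of the API, and the word-based cap only approximates a token limit.

```python
# Assumption: rate back-derived from the rough figures above
# ($0.001 for 50 docs x 100 tokens = 5,000 tokens). Not a published price.
ASSUMED_USD_PER_TOKEN = 0.001 / (50 * 100)


def estimate_rerank_cost(doc_token_counts):
    """Rough per-query cost in USD, ignoring the query's own tokens."""
    return sum(doc_token_counts) * ASSUMED_USD_PER_TOKEN


def truncate_words(text, max_words):
    """Crude length cap: keep the first `max_words` whitespace-separated words."""
    return " ".join(text.split()[:max_words])


short_docs = estimate_rerank_cost([100] * 50)
many_docs = estimate_rerank_cost([100] * 200)
long_docs = estimate_rerank_cost([1000] * 50)
print(f"{short_docs:.4f} {many_docs:.4f} {long_docs:.4f}")

# Send shortened snippets to the reranker; keep full_text for the LLM step.
full_text = "Solar panels convert sunlight into electricity via the photovoltaic effect."
snippet = truncate_words(full_text, 5)
print(snippet)
```

Under this assumed rate, cutting 1K-token documents down to 100-token snippets reduces the estimate by a factor of ten.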
Tuning
- More candidates isn't always better. If your embedding retrieval already returns the right answer in the top 20, reranking 200 candidates wastes compute. Tune k based on your evaluation set.
- Pair with a light embedding model. Fast approximate retrieval + accurate rerank is the canonical pattern.
- Prefer bge-reranker-large over bge-reranker-base unless latency is critical: the large model is notably better at semantic matching.
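One possible way to tune k against an evaluation set is to measure how often first-stage retrieval already places the known-correct document within the top k, then pick the smallest k that reaches a target recall. The sketch below assumes a toy evaluation set of (retrieval order, correct document ID) pairs; the function name, the IDs, and the data are all hypothetical.

```python
def smallest_k(eval_set, candidate_ks, target):
    """Return the smallest k whose retrieval recall meets `target`, else None.

    `eval_set` holds (ranked_ids, relevant_id) pairs: the first-stage
    retrieval order for a query and the ID of its known-correct document.
    """
    for k in sorted(candidate_ks):
        hits = sum(1 for ranked_ids, relevant_id in eval_set
                   if relevant_id in ranked_ids[:k])
        if hits / len(eval_set) >= target:
            return k
    return None


# Toy evaluation set (hypothetical IDs).
eval_set = [
    (["d3", "d1", "d7", "d2"], "d3"),  # correct doc retrieved at rank 1
    (["d5", "d9", "d4", "d8"], "d4"),  # rank 3
    (["d2", "d6", "d1", "d0"], "d6"),  # rank 2
    (["d8", "d3", "d5", "d7"], "d7"),  # rank 4
]

k_75 = smallest_k(eval_set, [1, 2, 3, 4], target=0.75)
k_100 = smallest_k(eval_set, [1, 2, 3, 4], target=1.0)
print(k_75, k_100)
```

Reranking more than the chosen k candidates adds cost without giving the reranker any additional correct documents to promote, at least on the evaluation set.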