Reranker

A reranker takes a query and a list of candidate documents, then re-orders them by true relevance. Embeddings are fast but approximate; a reranker is slower but much more accurate. Typical pipeline:

  1. Retrieve the top 50–200 candidates via embeddings / BM25 (fast, approximate)
  2. Rerank those candidates with a cross-encoder (slow, precise)
  3. Feed the top 5–10 to an LLM for the final answer

Endpoint

POST /v1/rerank. This is a non-OpenAI extension (OpenAI has no rerank endpoint); the request and response shape follows Cohere's rerank API.

curl https://api.ecohash.com/v1/rerank \
  -H "Authorization: Bearer eco_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-large",
    "query": "how does solar power work?",
    "documents": [
      "Solar panels convert sunlight into electricity via the photovoltaic effect.",
      "My cat likes sunlight but cannot produce electricity.",
      "Photovoltaic cells made of silicon transform photons into an electric current."
    ]
  }'

Response:

{
  "results": [
    { "index": 0, "relevance_score": 0.987 },
    { "index": 2, "relevance_score": 0.945 },
    { "index": 1, "relevance_score": 0.012 }
  ],
  "model": "BAAI/bge-reranker-large",
  "usage": { "total_tokens": 64 }
}

The results array is sorted by relevance_score in descending order. Each index is the zero-based position of that document in the documents array you sent.
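
To make the index mapping concrete, here is a minimal sketch that joins the example response back to the documents array from the request. The helper name attach_documents is hypothetical and not part of the API; it assumes only the response shape shown above.

```python
def attach_documents(documents, response):
    """Pair each result with its source text using the `index` field."""
    return [
        (documents[item["index"]], item["relevance_score"])
        for item in response["results"]
    ]


# Same documents and response as in the example above.
documents = [
    "Solar panels convert sunlight into electricity via the photovoltaic effect.",
    "My cat likes sunlight but cannot produce electricity.",
    "Photovoltaic cells made of silicon transform photons into an electric current.",
]
response = {
    "results": [
        {"index": 0, "relevance_score": 0.987},
        {"index": 2, "relevance_score": 0.945},
        {"index": 1, "relevance_score": 0.012},
    ]
}

ranked = attach_documents(documents, response)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Because the response carries indexes rather than text by default, the caller keeps the original list around and does this lookup locally.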

Request parameters

Parameter         Type              Notes
model             string            Required. Reranker model ID
query             string            Required. The user's query
documents         array of strings  Required. 1–1000 candidates
top_n             integer           Optional. Return only the top N scores (default: all)
return_documents  bool              Optional. Include the document text in each result (default: false, to save bandwidth)
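
As an illustration of how these parameters fit together, the sketch below assembles a request body and checks the documented 1–1000 document limit before any network call. The function build_rerank_payload is a hypothetical helper, and the requirement that top_n be at least 1 is an assumption rather than something the API states.

```python
def build_rerank_payload(model, query, documents, top_n=None, return_documents=None):
    """Assemble a /v1/rerank request body, checking the documented limits."""
    if not model or not query:
        raise ValueError("model and query are required")
    if not 1 <= len(documents) <= 1000:
        raise ValueError("documents must contain between 1 and 1000 items")

    payload = {"model": model, "query": query, "documents": list(documents)}

    # Optional fields are left out unless set, so the server defaults apply.
    if top_n is not None:
        if top_n < 1:  # assumption: the API expects a positive integer
            raise ValueError("top_n must be a positive integer")
        payload["top_n"] = top_n
    if return_documents is not None:
        payload["return_documents"] = bool(return_documents)
    return payload


payload = build_rerank_payload(
    "BAAI/bge-reranker-large",
    "how does solar power work?",
    ["doc a", "doc b"],
    top_n=1,
)
print(payload)
```

Validating client-side is a design choice: it fails fast on an oversized candidate list instead of spending a round trip on a request the server would presumably reject.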

Typical use

import requests

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[int, float]]:
    """Return (index, relevance_score) pairs for the top_n candidates."""
    r = requests.post(
        "https://api.ecohash.com/v1/rerank",
        headers={"Authorization": "Bearer eco_YOUR_KEY"},
        json={
            "model": "BAAI/bge-reranker-large",
            "query": query,
            "documents": candidates,
            "top_n": top_n,
        },
        timeout=30,  # avoid hanging indefinitely on a stalled connection
    )
    r.raise_for_status()
    return [(res["index"], res["relevance_score"]) for res in r.json()["results"]]

# After embedding-based retrieval:
candidates = retrieve_top_k(query, k=100)  # your existing vector search
top = rerank(query, [c.text for c in candidates], top_n=5)
# Map each returned index back to the original candidate object.
reranked = [candidates[i] for i, _score in top]

Cost vs speed

A reranker runs the query paired with each document through a cross-encoder, so compute grows quadratically with document length and linearly with document count. Rough costs:

  • 50 short (100-token) docs: ~$0.001 per query
  • 200 short docs: ~$0.004 per query
  • 50 long (1K-token) docs: ~$0.01 per query
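
The three figures above all work out to roughly $0.20 per million document tokens (for example, 50 × 100 = 5,000 tokens for about $0.001). That rate is inferred from these rough numbers, not a published price, and it ignores query tokens. Under that assumption, a back-of-envelope estimator might look like this:

```python
# Assumed rate, inferred from the rough figures above (not a published price).
ASSUMED_USD_PER_MILLION_TOKENS = 0.20


def estimate_rerank_cost(num_docs, tokens_per_doc,
                         usd_per_million=ASSUMED_USD_PER_MILLION_TOKENS):
    """Rough per-query cost: total document tokens times the assumed rate."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens * usd_per_million / 1_000_000


# The three scenarios listed above: (document count, tokens per document).
for docs, tokens in [(50, 100), (200, 100), (50, 1000)]:
    cost = estimate_rerank_cost(docs, tokens)
    print(f"{docs} docs x {tokens} tokens: ~${cost:.4f} per query")
```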

Keep document text concise in reranker input. Use the full document only at LLM time.
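
One way to follow this advice is to send shortened copies to the reranker while keeping the full texts in a parallel list. The sketch below is a rough illustration: truncate_words and prepare_rerank_inputs are hypothetical helpers, and counting whitespace-separated words is only a crude stand-in for the model's actual tokenizer.

```python
def truncate_words(text, max_words=100):
    """Keep the first `max_words` whitespace-separated words of `text`."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words])


def prepare_rerank_inputs(full_docs, max_words=100):
    """Return shortened copies for the reranker; `full_docs` is left untouched."""
    return [truncate_words(doc, max_words) for doc in full_docs]


full_docs = ["word " * 500, "a short document"]
short_docs = prepare_rerank_inputs(full_docs, max_words=100)

# Positions line up, so a result's `index` also selects the full text for the LLM.
best_index = 1  # stand-in for results[0]["index"] from a rerank response
context_for_llm = full_docs[best_index]
```

Since the two lists share positions, no extra bookkeeping is needed to recover the full document once the reranker has picked its winners.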

Tuning

  • More candidates aren't always better. If your embedding retrieval already returns the right answer in the top 20, reranking 200 candidates wastes compute. Tune k based on your evaluation set.
  • Pair with a light embedding model. Fast approximate retrieval + accurate rerank is the canonical pattern.
  • Prefer bge-reranker-large over bge-reranker-base unless latency is critical; the large model is notably better at semantic matching.
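
For the first tip, one plausible way to tune k is to measure retrieval recall on a labelled evaluation set and pick the smallest k that meets a target. The sketch below assumes each evaluation item is a (relevant document ID, retrieved IDs in rank order) pair. The helper names, the toy data, and the 0.95 target are all illustrative assumptions.

```python
def recall_at_k(eval_set, k):
    """Fraction of queries whose relevant doc appears in the top k retrieved IDs."""
    hits = sum(1 for relevant_id, retrieved in eval_set if relevant_id in retrieved[:k])
    return hits / len(eval_set)


def smallest_k(eval_set, candidates=(10, 20, 50, 100, 200), target=0.95):
    """Return the smallest candidate k whose recall meets the target, else None."""
    for k in sorted(candidates):
        if recall_at_k(eval_set, k) >= target:
            return k
    return None


# Toy evaluation set: (relevant doc ID, retrieved IDs ordered by embedding score).
eval_set = [
    ("d3", ["d3", "d7", "d1"]),
    ("d9", ["d2", "d9", "d4"]),
    ("d5", ["d8", "d6", "d5"]),
]

for k in (1, 2, 3):
    print(k, recall_at_k(eval_set, k))
print("chosen k:", smallest_k(eval_set, candidates=(1, 2, 3), target=0.95))
```

If recall is already near the target at a small k, sending more candidates to the reranker mostly adds cost and latency, which is the point of the tip above.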