Reranker

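A reranker takes a query and a list of candidate documents, then re-orders the candidates by a relevance score computed for each query-document pair. Embedding search is fast but approximate because the query and the documents are encoded separately; a reranker reads them together, which is slower but typically more accurate. Typical pipeline: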

1. Retrieve the top 50–200 candidates via embeddings / BM25 (fast, approximate)
2. Rerank those candidates with a cross-encoder (slow, precise)
3. Feed the top 5–10 to an LLM for the final answer

Endpoint

POST /v1/rerank — this is a non-OpenAI extension (OpenAI has no rerank endpoint). Shape follows Cohere's rerank API.

curl https://api.ecohash.com/v1/rerank \
  -H "Authorization: Bearer eco_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "BAAI/bge-reranker-large",
    "query": "how does solar power work?",
    "documents": [
      "Solar panels convert sunlight into electricity via the photovoltaic effect.",
      "My cat likes sunlight but cannot produce electricity.",
      "Photovoltaic cells made of silicon transform photons into an electric current."
    ]
  }'

Response:

{
  "results": [
    { "index": 0, "relevance_score": 0.987 },
    { "index": 2, "relevance_score": 0.945 },
    { "index": 1, "relevance_score": 0.012 }
  ],
  "model": "BAAI/bge-reranker-large",
  "usage": { "total_tokens": 64 }
}

`results` is sorted by `relevance_score` in descending order. Each `index` refers to a position in the original `documents` array.
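To make the index mapping concrete, here is a minimal sketch that pairs each result with its original text. It reuses the example request and response above; the helper name `attach_documents` is hypothetical and not part of the API.

```python
def attach_documents(documents, response):
    """Pair each result with its original document text, best match first."""
    ranked = []
    for res in response["results"]:
        # "index" points into the documents list that was sent in the request.
        ranked.append((documents[res["index"]], res["relevance_score"]))
    return ranked


documents = [
    "Solar panels convert sunlight into electricity via the photovoltaic effect.",
    "My cat likes sunlight but cannot produce electricity.",
    "Photovoltaic cells made of silicon transform photons into an electric current.",
]

# The example response shown above, as a Python dict.
response = {
    "results": [
        {"index": 0, "relevance_score": 0.987},
        {"index": 2, "relevance_score": 0.945},
        {"index": 1, "relevance_score": 0.012},
    ],
    "model": "BAAI/bge-reranker-large",
    "usage": {"total_tokens": 64},
}

for text, score in attach_documents(documents, response):
    print(f"{score:.3f}  {text}")
```

Because the results carry indices rather than text by default, the caller needs to keep the original list around until the response arrives.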

Request parameters

Parameter	Type	Notes
`model`	string	Required. Reranker model ID
`query`	string	Required. The user's query
`documents`	array of strings	Required. 1–1000 candidates
`top_n`	integer	Optional. Return only the top N scores (default: all)
`return_documents`	bool	Optional. Include the document text in each result (default: false — saves bandwidth)
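As an illustration of how these parameters fit together, the sketch below builds a request body and checks the documented limits before anything is sent. The helper `build_rerank_payload` is hypothetical; only the field names and the 1–1000 limit come from the table above, and the client-side validation is an assumption about what is convenient, not a description of what the server enforces.

```python
def build_rerank_payload(model, query, documents, top_n=None, return_documents=False):
    """Build the JSON body for POST /v1/rerank, validating documented limits."""
    documents = list(documents)
    if not 1 <= len(documents) <= 1000:
        raise ValueError("documents must contain between 1 and 1000 candidates")

    payload = {"model": model, "query": query, "documents": documents}

    # Optional fields are only included when they differ from the defaults.
    if top_n is not None:
        if top_n < 1:
            raise ValueError("top_n must be a positive integer")
        payload["top_n"] = top_n
    if return_documents:
        payload["return_documents"] = True
    return payload


payload = build_rerank_payload(
    "BAAI/bge-reranker-large",
    "how does solar power work?",
    ["doc one", "doc two", "doc three"],
    top_n=2,
)
print(payload)
```

Leaving `return_documents` at its default keeps responses small; the caller can recover the text from `index`.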

Typical use

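import requests

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[tuple[int, float]]:
    """Return (original_index, relevance_score) pairs, best match first."""
    r = requests.post(
        "https://api.ecohash.com/v1/rerank",
        headers={"Authorization": "Bearer eco_YOUR_KEY"},  # replace with your API key
        json={
            "model": "BAAI/bge-reranker-large",
            "query": query,
            "documents": candidates,
            "top_n": top_n,
        },
        timeout=30,  # avoid hanging indefinitely on a stalled connection
    )
    r.raise_for_status()
    return [(res["index"], res["relevance_score"]) for res in r.json()["results"]]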

# After embedding-based retrieval:
candidates = retrieve_top_k(query, k=100)    # your existing vector search
top = rerank(query, [c.text for c in candidates], top_n=5)
reranked = [candidates[i] for i, score in top]

Cost vs speed

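A reranker runs the query together with each document through a cross-encoder, so compute grows linearly with the number of documents and, because of self-attention, roughly quadratically with the length of each query-document pair. Rough costs: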

50 short (100-token) docs: ~$0.001 per query
200 short docs: ~$0.004 per query
50 long (1K-token) docs: ~$0.01 per query

Keep document text concise in reranker input. Use the full document only at LLM time.
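The three rough figures above are consistent with about $0.20 per million document tokens (50 × 100 tokens for about $0.001). The sketch below uses that inferred rate for a back-of-envelope estimate and shows one simple way to shorten reranker input. The rate is an assumption derived from the figures above, not a published price; it ignores query tokens, and the word-based truncation is only a crude proxy for token counts. Both function names are hypothetical.

```python
# Assumed rate, inferred from the rough figures above; not an official price.
ASSUMED_USD_PER_MILLION_TOKENS = 0.20


def estimate_rerank_cost(num_docs, tokens_per_doc):
    """Back-of-envelope cost in USD for one rerank call (query tokens ignored)."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens * ASSUMED_USD_PER_MILLION_TOKENS / 1_000_000


def truncate_words(text, max_words):
    """Keep only the first max_words whitespace-separated words of a document."""
    return " ".join(text.split()[:max_words])


for n, t in [(50, 100), (200, 100), (50, 1000)]:
    print(f"{n} docs x {t} tokens: ~${estimate_rerank_cost(n, t):.4f}")

long_doc = "word " * 500
short_doc = truncate_words(long_doc, 80)  # send this to the reranker, not long_doc
```

Truncated text goes to the reranker only; the full document can still be passed to the LLM once the top results are chosen.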

Tuning

More candidates isn't always better. If your embedding retrieval already returns the right answer in the top 20, reranking 200 candidates wastes compute. Tune k based on your evaluation set.
Pair with a light embedding model. Fast approximate retrieval + accurate rerank is the canonical pattern.
Prefer bge-reranker-large over bge-reranker-base unless latency is critical — the large model is notably better at semantic matching.
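One way to tune k on an evaluation set is to measure, for several values of k, how often first-stage retrieval already places a relevant document in its top k, then pick the smallest k that meets a target. The sketch below assumes a hypothetical evaluation format: a list of (ranked document IDs from retrieval, set of relevant IDs) pairs. The function names, candidate k values, and the 0.95 target are illustrative choices.

```python
def hit_rate_at_k(eval_set, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1 for ranked_ids, relevant_ids in eval_set
        if relevant_ids & set(ranked_ids[:k])
    )
    return hits / len(eval_set)


def smallest_sufficient_k(eval_set, candidate_ks=(10, 20, 50, 100, 200), target=0.95):
    """Smallest candidate k whose hit rate reaches the target, or None."""
    for k in sorted(candidate_ks):
        if hit_rate_at_k(eval_set, k) >= target:
            return k
    return None


# Toy evaluation set: retrieval output per query plus the known relevant IDs.
eval_set = [
    (["d3", "d7", "d1"], {"d7"}),  # relevant document at rank 2
    (["d2", "d9", "d4"], {"d4"}),  # relevant document at rank 3
    (["d5", "d6", "d8"], {"d5"}),  # relevant document at rank 1
]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(eval_set, k))
print(smallest_sufficient_k(eval_set, candidate_ks=(1, 2, 3), target=0.9))
```

Candidates beyond that k add reranking cost without giving the reranker anything new to find, which is the situation the first tip describes.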
