# FleetSearch — llms.txt
# Self-description for AI agents and LLM-driven clients.

## What this is
FleetSearch is a self-hosted, explainable search engine over the OTROTL
fleet of sites. Ranking is a transparent hybrid: BM25 lexical evidence,
LSA latent-semantic similarity, link-graph authority (enhanced PageRank
with TrustRank seeding and same-host dampening), freshness decay, and
title/anchor matching — fused with Reciprocal Rank Fusion for candidate
generation, then ordered by a linear blend. Nothing is a black box: every
result carries a full score_breakdown.

## Endpoints (HTTPS; /api/v1/* also answers at /v1/*)
- GET /api/v1/search?q=&limit=&offset=&host=
  Standard JSON search. Results include url, canonical_url, title,
  plain-text snippet, host, lang, score, score_breakdown, content_hash
  (sha256 of the extracted text), fetched_at (ISO-8601), and jsonld_types.
- GET /api/v1/agent/search?q=&limit=&host=&cursor=
  Same pipeline, agent ergonomics:
  * Cursor pagination: each full page returns next_cursor, an opaque
    token. Pass it back unchanged as ?cursor= for the next page and stop
    when next_cursor is null. Malformed cursors get
    400 {"error":"bad_cursor"}.
  * NDJSON streaming: send "Accept: application/x-ndjson" to receive one
    JSON object per line — N result lines, then one final meta line:
    {"meta": {query, total_candidates, latency_ms, lsa_active,
    query_log_id, next_cursor}}.
- GET /api/v1/agent/doc/{doc_id}
  Full extracted plain text (capped at 50,000 chars, with a "truncated"
  flag set when the cap applies), headings, jsonld_types, pagerank_pct,
  inlinks_count, and a citation block: {suggested, content_hash,
  retrieved_at}. Use citation.suggested verbatim to cite a result
  verifiably.
- POST /mcp
  Model Context Protocol endpoint (JSON-RPC 2.0, stateless, protocol
  revision 2025-06-18). Methods: initialize, ping, tools/list, tools/call.
  Tools:
  * fleet_search  — query the index (query, limit 1..50, optional host).
  * fleet_fetch   — full document + citation by doc_id OR url.
  * fleet_explain — exact score_breakdown of one document for a query.
- GET /api/v1/suggest?q=    Query completions.
- GET /health               Liveness (DB + cache + corpus size).
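
The cursor, NDJSON, and MCP conventions above can be sketched in Python.
This is a minimal illustration under stated assumptions, not an official
client: fetch_page is a hypothetical callable standing in for a GET to
/api/v1/agent/search with "Accept: application/x-ndjson" that returns the
response body as text, and the meta line is assumed to be the only object
whose sole key is "meta". The tools/call envelope uses the standard MCP
params.name / params.arguments shape; the argument names come from the
tool list above.

```python
import json
from typing import Callable, Dict, Iterator, List, Optional, Tuple


def parse_ndjson(body: str) -> Tuple[List[dict], Dict]:
    """Split an NDJSON body into result objects and the final meta object."""
    results: List[dict] = []
    meta: Dict = {}
    for line in body.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between objects
        obj = json.loads(line)
        if set(obj) == {"meta"}:  # assumption: meta line has only this key
            meta = obj["meta"]
        else:
            results.append(obj)
    return results, meta


def walk_pages(fetch_page: Callable[[Optional[str]], str]) -> Iterator[dict]:
    """Yield every result, following next_cursor until it is null."""
    cursor: Optional[str] = None
    while True:
        results, meta = parse_ndjson(fetch_page(cursor))
        yield from results
        cursor = meta.get("next_cursor")  # opaque: pass back unchanged
        if cursor is None:
            break


def mcp_tool_call(request_id: int, name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 body for POST /mcp (method tools/call)."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
```

For example, mcp_tool_call(1, "fleet_search", {"query": "login", "limit": 10})
produces a body that can be serialized with json.dumps and posted to /mcp.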

## How scoring works (score_breakdown, formula v1)
final = 0.32*lexical + 0.18*semantic + 0.22*authority + 0.10*freshness
        + 0.12*title + 0.06*anchor
- lexical:   BM25 (k1=1.5, b=0.75); the raw score s is saturated to
             [0,1) via s/(s+10)
- semantic:  cosine of query vs document in a 256-dim LSA latent space
- authority: PageRank percentile over the crawled link graph (TrustRank-
             seeded teleport, intra-host links dampened 0.5x, nofollow
             links excluded)
- freshness: 2^(-age_days/90) — half-life 90 days
- title/anchor: fraction of query terms in the title / inbound anchor text
Every result's score_breakdown lists each component's raw value, weight
and weighted contribution, plus formula_version ("v1"). Weights may be
re-tuned; breaking shape changes will bump formula_version.

## Rate limits (per client IP, 60-second window)
- Search/agent/MCP: 30 requests/min anonymous, 120 requests/min with an
  Authorization header.
- 429 responses carry Retry-After (seconds); successful responses carry
  X-RateLimit-Limit and X-RateLimit-Remaining. Please back off on 429.
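
One way to back off on 429 is sketched below. It is an illustration, not
a required client behavior: send is a hypothetical zero-argument callable
that performs one request and returns (status, headers, body), headers is
assumed to be a plain dict with the exact "Retry-After" spelling, and the
exponential fallback for a missing or unparsable header is an assumption
of this sketch.

```python
import time
from typing import Callable, Mapping, Tuple


def get_with_backoff(
    send: Callable[[], Tuple[int, Mapping[str, str], str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, Mapping[str, str], str]:
    """Retry on 429, waiting the number of seconds given by Retry-After."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return status, headers, body
        try:
            delay = float(headers.get("Retry-After", ""))
        except ValueError:
            # Assumption: fall back to 1s, 2s, 4s, ... if the header is
            # missing or not a plain number of seconds.
            delay = 2.0 ** attempt
        sleep(max(delay, 0.0))
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts")
```

Passing sleep as a parameter keeps the helper testable without real waits.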

## Crawler
Our crawler identifies as:
  FleetSearchBot/1.0 (+https://onetimelogin.com/fleetsearch-bot)
It honors robots.txt (including Crawl-delay, capped at 30s), meta robots
noindex/nofollow, rel=nofollow, and canonical URLs; fetches each host
serially with at least a 1s delay; and only indexes pages that allow it.
Blocking the UA in robots.txt removes a site from the index at its next
recrawl.
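
Site owners can preview how a robots.txt file reads for this user agent
with Python's standard urllib.robotparser. This is a rough sketch: the
standard-library parser is not the crawler's own and may differ in edge
cases, check_robots is a hypothetical helper, and combining the 1s floor
with the 30s cap as min(max(delay, 1), 30) is an assumption about how the
two limits stated above interact.

```python
from typing import Tuple
from urllib.robotparser import RobotFileParser

BOT_UA = "FleetSearchBot/1.0 (+https://onetimelogin.com/fleetsearch-bot)"


def check_robots(robots_txt: str, url: str) -> Tuple[bool, float]:
    """Return (allowed, effective_delay_seconds) for the bot and URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = parser.can_fetch(BOT_UA, url)
    requested = parser.crawl_delay(BOT_UA)  # None if no Crawl-delay applies
    if requested is None:
        delay = 1.0  # documented minimum delay per host
    else:
        # Assumed combination: at least 1s, Crawl-delay capped at 30s.
        delay = min(max(float(requested), 1.0), 30.0)
    return allowed, delay


# A robots.txt that blocks the bot from the whole site.
blocked = "User-agent: FleetSearchBot\nDisallow: /\n"
allowed, _ = check_robots(blocked, "https://example.com/page")  # allowed is False
```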

## Notes for agents
- Pin citations to content_hash: if a page changes, the hash changes.
- total_candidates is capped at 200 by the candidate pipeline; use the
  cursor to walk pages deterministically (score DESC, doc_id ASC).
- Queries are English-centric (stemming + stopwords); max 200 chars.
- Contact / issues: operations team via https://onetimelogin.com (this
  service is part of the OTROTL fleet).
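
To check that a fetched document still matches the hash a citation was
pinned to, something like the following may work. It assumes content_hash
is the lowercase hex SHA-256 of the UTF-8 encoded extracted text, which
this file does not spell out, and it is only meaningful when the
document's "truncated" flag is false, since capped text would no longer
be the full input to the hash. Both helper names are hypothetical.

```python
import hashlib


def matches_content_hash(text: str, content_hash: str) -> bool:
    """Compare the SHA-256 of text against a pinned content_hash."""
    # Assumption: the service hashes UTF-8 bytes and reports hex digits.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest == content_hash.strip().lower()


def clamp_query(query: str, max_chars: int = 200) -> str:
    """Trim a query to the documented 200-character maximum."""
    return query[:max_chars]
```

A mismatch means the page changed since the citation was made (or that
the encoding assumption above does not hold), so the document should be
fetched again before it is cited.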