# FleetSearch - llms.txt
# Self-description for AI agents and LLM-driven clients.

## What this is
FleetSearch is a self-hosted, explainable search engine over the OTROTL fleet
of sites. Ranking is a transparent hybrid: BM25 lexical evidence, LSA
latent-semantic similarity, link-graph authority (enhanced PageRank with
TrustRank seeding and same-host dampening), freshness decay, and title/anchor
matching. These signals are fused with Reciprocal Rank Fusion for candidate
generation, then ordered by a linear blend. Nothing is a black box: every
result carries a full score_breakdown.

## Endpoints (HTTPS; /api/v1/* also answers at /v1/*)

- GET /api/v1/search?q=&limit=&offset=&host=
  Standard JSON search. Results include url, canonical_url, title,
  plain-text snippet, host, lang, score, score_breakdown, content_hash
  (sha256 of the extracted text), fetched_at (ISO-8601) and jsonld_types.

- GET /api/v1/agent/search?q=&limit=&host=&cursor=
  Same pipeline, agent ergonomics:
  * Cursor pagination: each full page returns next_cursor (opaque token).
    Pass it back as ?cursor= for the next page; stop when next_cursor is
    null. Treat it as opaque. Malformed cursors get
    400 {"error":"bad_cursor"}.
  * NDJSON streaming: send "Accept: application/x-ndjson" to receive one
    JSON object per line. N result lines are followed by one final meta line:
    {"meta": {query, total_candidates, latency_ms, lsa_active,
    query_log_id, next_cursor}}.

- GET /api/v1/agent/doc/{doc_id}
  Full extracted plain text (capped at 50,000 chars, "truncated" flag),
  headings, jsonld_types, pagerank_pct, inlinks_count, and a citation block:
  {suggested, content_hash, retrieved_at}. Use citation.suggested verbatim
  to cite a result verifiably.

- POST /mcp
  Model Context Protocol endpoint (JSON-RPC 2.0, stateless, protocol
  revision 2025-06-18). Methods: initialize, ping, tools/list, tools/call.
  Tools:
  * fleet_search: query the index (query, limit 1..50, optional host).
  * fleet_fetch: full document + citation by doc_id OR url.
  * fleet_explain: exact score_breakdown of one document for a query.

- GET /api/v1/suggest?q=
  Query completions.

- GET /health
  Liveness (DB + cache + corpus size).

## How scoring works (score_breakdown, formula v1)

    final = 0.32*lexical + 0.18*semantic + 0.22*authority
          + 0.10*freshness + 0.12*title + 0.06*anchor

- lexical: BM25 (k1=1.5, b=0.75), saturated to [0,1) via s/(s+10)
- semantic: cosine of query vs document in a 256-dim LSA latent space
- authority: PageRank percentile over the crawled link graph
  (TrustRank-seeded teleport, intra-host links dampened 0.5x, nofollow
  links excluded)
- freshness: 2^(-age_days/90), i.e. a half-life of 90 days
- title/anchor: fraction of query terms in the title / inbound anchor text

Every result's score_breakdown lists each component's raw value, weight and
weighted contribution, plus formula_version ("v1"). Weights may be re-tuned;
breaking shape changes will bump formula_version.

## Rate limits (per client IP, 60-second window)
- Search/agent/MCP: 30 requests/min anonymous, 120 requests/min with an
  Authorization header.
- 429 responses carry Retry-After (seconds); successful responses carry
  X-RateLimit-Limit and X-RateLimit-Remaining. Please back off on 429.

## Crawler
Our crawler identifies as:

    FleetSearchBot/1.0 (+https://onetimelogin.com/fleetsearch-bot)

It honors robots.txt (including Crawl-delay, capped at 30s), meta robots
noindex/nofollow, rel=nofollow, and canonical URLs; fetches each host
serially with at least a 1s delay; and only indexes pages that allow it.
Blocking the UA in robots.txt removes a site from the index at recrawl.

## Notes for agents
- Pin citations to content_hash: if a page changes, the hash changes.
- total_candidates is capped at 200 by the candidate pipeline; use the
  cursor to walk pages deterministically (score DESC, doc_id ASC).
- Queries are English-centric (stemming + stopwords); max 200 chars.
- Contact / issues: operations team via https://onetimelogin.com (this
  service is part of the OTROTL fleet).
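
The scoring formula and the NDJSON cursor protocol described in this file can be exercised client-side with a short sketch like the one below. It is illustrative only, not the service's implementation. All names in it (`WEIGHTS_V1`, `saturate_bm25`, `freshness`, `blend_v1`, `parse_ndjson`, `walk_results`, `fetch_lines`) are hypothetical. It assumes the component values passed to `blend_v1` are already in [0, 1], that a meta line is a JSON object whose only key is "meta", and that the caller supplies the HTTP transport.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

# Weights copied from the "formula v1" line in this file; they may be re-tuned.
WEIGHTS_V1: Dict[str, float] = {
    "lexical": 0.32, "semantic": 0.18, "authority": 0.22,
    "freshness": 0.10, "title": 0.12, "anchor": 0.06,
}


def saturate_bm25(raw_bm25: float) -> float:
    """Map a non-negative raw BM25 score s into [0, 1) via s / (s + 10)."""
    return raw_bm25 / (raw_bm25 + 10.0)


def freshness(age_days: float) -> float:
    """Exponential decay with a 90-day half-life: 2^(-age_days / 90)."""
    return 2.0 ** (-age_days / 90.0)


def blend_v1(components: Dict[str, float]) -> float:
    """Linear blend of the six component values using the v1 weights."""
    return sum(weight * components[name] for name, weight in WEIGHTS_V1.items())


def parse_ndjson(lines: Iterable[str]) -> Tuple[List[dict], dict]:
    """Split one NDJSON response into (result objects, meta object).

    Assumption: the meta line is the only object whose sole key is "meta".
    """
    results: List[dict] = []
    meta: dict = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        obj = json.loads(line)
        if obj.keys() == {"meta"}:
            meta = obj["meta"]
        else:
            results.append(obj)
    return results, meta


def walk_results(
    fetch_lines: Callable[[Optional[str]], Iterable[str]],
    max_pages: int = 10,
) -> Iterator[dict]:
    """Yield results page by page until next_cursor is null.

    fetch_lines(cursor) is a caller-supplied (hypothetical) function that
    returns the NDJSON lines of one page; cursor is None for the first page.
    The cursor is passed back unchanged, since it is an opaque token.
    """
    cursor: Optional[str] = None
    for _ in range(max_pages):  # hard stop so a buggy server cannot loop us
        results, meta = parse_ndjson(fetch_lines(cursor))
        yield from results
        cursor = meta.get("next_cursor")
        if cursor is None:
            break
```

In practice `fetch_lines` would wrap a GET to /api/v1/agent/search with the "Accept: application/x-ndjson" header, adding ?cursor= only when the cursor is not None, and would wait for the Retry-After interval on a 429. Comparing `blend_v1` against a result's reported score is only a sanity check: if formula_version is not "v1" or the weights have been re-tuned, the weights listed in that result's score_breakdown take precedence over the constants here.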