hello12343214 53 minutes ago [-]
I use Gemini and Cursor for enterprise software implementation, but they often suggest incorrect solutions to edge cases and unique config requirements. An AI that has a higher likelihood of being accurate is very appealing. I'll give Sup AI a try over the next few days at work.
Also, discovering HLE was great... scrolling through some of the questions brings back memories of college organic chem.
scottmu 8 hours ago [-]
I want to clarify what Ken meant by "entropy in the output token probability distributions." Whenever an LLM outputs a token, it chooses that token from all possible tokens. The model assigns every possible output token a probability (APIs typically expose this as a log-probability), and these probabilities form a distribution that sums to 1. Entropy is a measure of uncertainty: it quantifies whether a token probability distribution is certain (one token has a 99.9% probability and the rest share the remaining 0.1%) or uncertain (every token has roughly the same probability, so the selected token is essentially random). Low entropy is the former case, and high entropy is the latter.
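A minimal sketch of that calculation, using made-up illustrative distributions (not output from any particular model):

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a token probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Near-certain: one token at 99.9%, nine tokens share the leftover 0.1%.
confident = [0.999] + [0.001 / 9] * 9
# Maximally uncertain: ten tokens, all equally likely.
uniform = [0.1] * 10

print(round(entropy(confident), 3))  # low entropy, close to 0
print(round(entropy(uniform), 3))    # high entropy: log2(10) ≈ 3.322
```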
Ensembling usually hits a wall at latency and cost. Running these in parallel is table stakes, but how are you handling the orchestration-layer overhead when one provider (e.g., Vertex or Bedrock) spikes in P99 latency? If you're waiting for the slowest model to get entropy stats, the DX falls off a cliff. Are you using speculative execution or a timeout/fallback strategy to maintain a responsive TTFT?
supai 13 hours ago [-]
A few things:
- We do something similar to OpenRouter: we measure the latency of the different providers to ensure we always get the fastest results
- Users can cancel a single model stream if it's taking too long
- The orchestrator is pretty good at choosing which models to use for which task. The confidence scoring and synthesis at the end is the hard part that you can't do naively, but the orchestrator plays the biggest role in optimizing cost and speed. I've made sure we don't exceed 25% extra cost or time for the vast majority of queries, compared to equivalent prompts in ChatGPT/Gemini/etc.
This is viable, IMO, because you can run multiple less-intelligent models at lower thinking effort and beat a single more-intelligent model at high thinking effort. Reducing thinking effort speeds up each prompt dramatically.
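The timeout-and-cancel pattern for parallel model streams can be sketched with asyncio; the model names and latencies below are made up for illustration, not Sup AI's actual orchestration code:

```python
import asyncio

# Hypothetical stand-in for a real provider SDK call.
async def call_model(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # simulate provider latency
    return f"{name}: answer"

async def fan_out(timeout: float = 1.0) -> list[str]:
    """Run several models in parallel; drop any that exceed the timeout."""
    tasks = [
        asyncio.create_task(call_model(name, delay))
        for name, delay in [("fast-a", 0.1), ("fast-b", 0.2), ("slow-c", 5.0)]
    ]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:  # cancel stragglers instead of waiting out a P99 spike
        task.cancel()
    return sorted(t.result() for t in done)

print(asyncio.run(fan_out()))  # ['fast-a: answer', 'fast-b: answer']
```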
There is interesting research in the correlation of entropy with accuracy and hallucinations:
- https://www.nature.com/articles/s41586-024-07421-0
- https://arxiv.org/abs/2405.19648
- https://arxiv.org/abs/2509.04492 (when only a small number of probabilities are available, which is something we frequently deal with)
- https://arxiv.org/abs/2603.18940
- tons more, happy to chat about if interested
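For what it's worth, here is one simple way to bound entropy when only the top-k probabilities are exposed (a rough sketch of the problem, not the method from the paper above): lump the unseen tail mass into a single pseudo-token for a lower bound, or spread it uniformly over the rest of the vocabulary for an upper bound.

```python
import math

def entropy_bounds(top_probs, vocab_size):
    """Bound full-distribution entropy (bits) given only top-k probabilities.

    The tail mass (1 - sum(top_probs)) is unobserved: treating it as a
    single pseudo-token gives a lower bound; spreading it uniformly over
    the remaining vocabulary gives an upper bound.
    """
    head = -sum(p * math.log2(p) for p in top_probs if p > 0)
    tail = 1.0 - sum(top_probs)
    if tail <= 0:
        return head, head
    lower = head - tail * math.log2(tail)           # tail as one pseudo-token
    n_rest = vocab_size - len(top_probs)
    upper = head + tail * math.log2(n_rest / tail)  # tail spread uniformly
    return lower, upper

lo, hi = entropy_bounds([0.6, 0.2, 0.1], vocab_size=50_000)
print(round(lo, 3), round(hi, 3))
```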
The sequential steps are then:
1. Ensemble RAG
2. Orchestrator
3. Models in parallel
4. Synthesizer
And retries for low-confidence answers (though that's pretty optimized, with selective retries of only the weak portions of the answer).
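The flow above, as an illustrative skeleton with every stage stubbed out (all function names here are hypothetical placeholders, not Sup AI's actual API):

```python
import asyncio

def retrieve(query):                         # 1. ensemble RAG
    return ["retrieved snippet"]

def pick_models(query, context):             # 2. orchestrator routes the task
    return ["model-a", "model-b"]

async def run_model(model, query, context):  # 3. models run in parallel
    return f"{model} draft"

def synthesize(drafts):                      # 4. synthesizer merges + scores
    return " / ".join(drafts), 0.9           # (answer, confidence)

async def answer(query, threshold=0.5):
    context = retrieve(query)
    models = pick_models(query, context)
    drafts = await asyncio.gather(*(run_model(m, query, context) for m in models))
    final, confidence = synthesize(drafts)
    if confidence < threshold:
        pass  # selective retry of the low-confidence portions would go here
    return final

print(asyncio.run(answer("example query")))  # model-a draft / model-b draft
```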