Perplexity guides our decisions on model scaling.
How It Works:
Evaluate candidate models on the same held-out dataset and select the one with the best trade-off between low perplexity and inference speed/cost (a small scoring sketch appears at the end of this section).
Key Benefits:
Real-World Use Cases:
Use a held-out set of at least 10K tokens for stable perplexity estimates.
Perplexity tracks output quality only indirectly; human evaluation is still vital.
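A minimal sketch of the selection step described above, assuming a Python setup with Hugging Face `transformers` and PyTorch (the section does not specify tooling): each candidate is scored by perplexity on the same held-out text and timed, so the perplexity-versus-speed trade-off can be compared directly. The candidate model names and the held-out file path are placeholders, not values from the original.

```python
# Sketch only: compare candidate models by held-out perplexity and rough speed.
import math
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

CANDIDATES = ["gpt2", "gpt2-medium"]   # placeholder candidate checkpoints
HELD_OUT_PATH = "held_out.txt"         # held-out text, ideally >= 10K tokens
MAX_LEN = 1024                         # tokens scored per chunk

def evaluate(model_name: str, text: str) -> tuple[float, float]:
    """Return (perplexity, seconds per 1K tokens) for one candidate."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    ids = tok(text, return_tensors="pt").input_ids[0]

    nll_sum, n_tokens = 0.0, 0
    start = time.perf_counter()
    with torch.no_grad():
        # Score the held-out stream in fixed-size chunks; each token is
        # predicted exactly once across chunks.
        for i in range(0, ids.size(0) - 1, MAX_LEN):
            chunk = ids[i : i + MAX_LEN + 1].unsqueeze(0)
            out = model(chunk, labels=chunk)        # mean NLL over the chunk
            nll_sum += out.loss.item() * (chunk.size(1) - 1)
            n_tokens += chunk.size(1) - 1
    elapsed = time.perf_counter() - start
    return math.exp(nll_sum / n_tokens), 1000 * elapsed / n_tokens

text = open(HELD_OUT_PATH).read()
for name in CANDIDATES:
    ppl, sec_per_1k = evaluate(name, text)
    print(f"{name}: perplexity={ppl:.2f}, sec/1K tokens={sec_per_1k:.3f}")
```

The printed table is only half of the decision: the measured perplexities would be weighed against the per-token latency (and deployment cost) of each candidate before picking the model to scale.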