The Power of Golden Datasets in LLM and Agent Evaluation