How DeepRails aggregates granular evaluations into actionable scores
Our evaluation methodology differs from traditional approaches by decomposing AI responses into their atomic components: we evaluate at the claim level, not the response level. This granular analysis is designed to catch the subtle errors, hallucinations, and quality issues that holistic, response-level evaluations often miss.
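To make the claim-level structure concrete, here is a minimal Python sketch. It is illustrative only, not DeepRails' implementation: the ClaimEvaluation type, the evaluate_claims function, and the toy_verify stand-in verifier are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ClaimEvaluation:
    claim: str          # one atomic, independently checkable statement
    supported: bool     # whether the verifier judged the claim correct
    confidence: float   # verifier confidence in [0.0, 1.0]

def evaluate_claims(
    claims: list[str],
    verify: Callable[[str], tuple[bool, float]],
) -> list[ClaimEvaluation]:
    """Score each atomic claim independently of the others."""
    return [ClaimEvaluation(c, *verify(c)) for c in claims]

# Toy verifier standing in for a real fact-checking model (hypothetical).
def toy_verify(claim: str) -> tuple[bool, float]:
    return ("Paris" in claim, 0.9)

results = evaluate_claims(
    ["The Eiffel Tower is in Paris.", "The Eiffel Tower is in Rome."],
    toy_verify,
)
for r in results:
    print(r)
```

Because each claim is verified in isolation, a mistake in one claim cannot mask or contaminate the verdicts on the others.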
Research shows that decomposing complex judgments into atomic decisions improves accuracy by 23-31% compared to holistic evaluation (Chen et al., 2023). Evaluating each claim independently prevents a single error from distorting the full evaluation, pinpoints exactly where failures occur, and lets prompt engineers apply focused, efficient fixes.
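Continuing the hypothetical sketch above, the aggregation step might look like the following: each claim contributes equally to the score, so one failed claim lowers the score proportionally and is reported for a targeted fix instead of invalidating the entire response. Again, this is an assumed illustration, not DeepRails' production scoring.

```python
def aggregate(results: list[ClaimEvaluation]) -> dict:
    """Roll independent per-claim verdicts up into an actionable score.

    The score is the fraction of supported claims, and the failing
    claims are listed so fixes can target the exact point of failure.
    """
    if not results:
        return {"score": None, "failures": []}
    supported = sum(r.supported for r in results)
    return {
        "score": supported / len(results),
        "failures": [r.claim for r in results if not r.supported],
    }

summary = aggregate(results)
print(f"score={summary['score']:.2f}, failures={summary['failures']}")
```

Running this on the two-claim example above yields a score of 0.50 with the Rome claim flagged, rather than a single pass/fail verdict for the whole response.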
References

Chen, L., Zaharia, M., & Zou, J. (2023). FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG evaluation using GPT-4 with better human alignment. arXiv preprint arXiv:2303.16634.
Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W. T., Koh, P. W., … & Hajishirzi, H. (2023). FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., … & Snoek, J. (2019). Can you trust your model's uncertainty? Evaluating predictive uncertainty under dataset shift. Advances in Neural Information Processing Systems, 32.
Sagi, O., & Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1249.
Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). FEVER: A large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355.