Development and Validation of the Provider Documentation Summarization Quality Instrument for Large Language Models

Authors: Emma Croxford, Yanjun Gao, Nicholas Pellegrino, Karen K. Wong, Graham Wills, Elliot First, Miranda Schnier, Kyle Burton, Cris G. Ebby, Jillian Gorskic, Matthew Kalscheur, Samy Khalil, Marie Pisani, Tyler Rubeor, Peter Stetson, Frank Liao, Cherodeep Goswami, Brian Patterson, Majid Afshar

Abstract: As Large Language Models (LLMs) are integrated into electronic health record
(EHR) workflows, validated instruments are essential to evaluate their
performance before implementation. Existing instruments for provider
documentation quality are often unsuitable for the complexities of
LLM-generated text and lack validation on real-world data. The Provider
Documentation Summarization Quality Instrument (PDSQI-9) was developed to
evaluate LLM-generated clinical summaries. Multi-document summaries were
generated from real-world EHR data across multiple specialties using several
LLMs (GPT-4o, Mixtral 8x7b, and Llama 3-8b). Validation included Pearson
correlation for substantive validity, factor analysis and Cronbach’s alpha for
structural validity, inter-rater reliability (ICC and Krippendorff’s alpha) for
generalizability, a semi-Delphi process for content validity, and comparisons
of high- versus low-quality summaries for discriminant validity. Seven
physician raters evaluated 779 summaries and answered 8,329 questions,
achieving over 80% power for inter-rater reliability. The PDSQI-9 demonstrated
strong internal consistency (Cronbach’s alpha = 0.879; 95% CI: 0.867-0.891) and
high inter-rater reliability (ICC = 0.867; 95% CI: 0.867-0.868), supporting
structural validity and generalizability. Factor analysis identified a 4-factor
model explaining 58% of the variance, representing organization, clarity,
accuracy, and utility. Substantive validity was supported by correlations
between note length and scores for Succinct (rho = -0.200, p = 0.029) and
Organized (rho = -0.190, p = 0.037). Discriminant validity distinguished high-
from low-quality summaries (p < 0.001). The PDSQI-9 demonstrates robust
construct validity, supporting its use in clinical practice to evaluate
LLM-generated summaries and facilitate safer integration of LLMs into
healthcare workflows.

Source: http://arxiv.org/abs/2501.08977v1
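
For readers unfamiliar with the internal-consistency statistic reported in the abstract, the following is a minimal sketch of how Cronbach's alpha is conventionally computed from a matrix of item scores. It is not the authors' analysis code. The function name `cronbach_alpha` and the toy ratings are hypothetical, and the toy data uses three items rather than the nine in the PDSQI-9.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per rated summary,
    one column per instrument item). Uses sample variances (n - 1)."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two rated summaries")
    # Variance of each item (column) across all rated summaries.
    item_vars = [variance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the total score (row sum) across all rated summaries.
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 4 summaries scored on 3 Likert-style items.
toy_ratings = [
    [5, 4, 5],
    [3, 3, 4],
    [2, 2, 1],
    [4, 5, 4],
]
alpha = cronbach_alpha(toy_ratings)
print(f"Cronbach's alpha (toy data): {alpha:.3f}")
```

Values closer to 1 indicate that the items vary together, which is the sense in which the abstract describes an alpha of 0.879 as strong internal consistency.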
