Now more than ever, clinicians can access an incredible amount of data about their patients. Electronic health records (EHRs) offer a massive repository of information about each individual: notes of all kinds, laboratory results, imaging data, scanned forms, and saved images. Soon, we may even be able to add data from wearable devices such as personal fitness trackers into the mix.
However, this breadth of information can be both a blessing and a curse. Clinicians can learn more about their patients from the medical chart than was previously possible—but only if they are able to rapidly and accurately sort through that information and find the most relevant points for a given clinical encounter.
This is particularly true when it comes to talking with patients about their goals of care, a phrase now common in medicine, but one with widely varying definitions. A comprehensive explanation is that goals of care are “the overarching aims of medical care for a patient that are informed by patients’ underlying values and priorities, established within existing clinical context, and used to guide decisions about the use of or limitation on specific medical interventions.” Each person’s individual goals are nuanced and shaped by multiple dimensions of their life, not just the medical or social ones. It takes time to understand someone’s goals, and they typically can’t be whittled down to fit neatly into a checkbox in the EHR.
Notes that may contain information about a patient’s goals and values are often strewn about the chart across numerous visits and hospitalizations. New tools within EHR systems such as Epic include a centralized place where advance care planning and conversations about goals of care can be documented. However, these tools are not always used by the clinical team, and even when they are, they may end up contributing further to the expanding volume of notes that makes finding relevant information difficult.
Ultimately, clear goals leading to decision-making must emerge from discussions with the patient, who is the final arbiter regarding what actions the medical team should take. In addition, just as individuals have naturally evolving preferences outside of the clinical environment, so too might their preferences evolve within the clinical environment. For this reason, it’s essential that clinicians regularly discuss goals of care with their patients. These conversations are most fruitful when the clinician is aware of previous discussions on this topic and has a good grasp of the patient’s biopsychosocial context. This is particularly important in the hospital setting, where many patients face crisis and must participate in shared decision-making with a clinician they’ve never met.
Adding Something New to the Clinical Toolbox
To address this challenge, our team at Duke Forge developed a tool to help clinicians rapidly identify and analyze the rich information contained in free-text notes in the EHR—information that could prepare them to discuss goals of care with a patient. Under the initial guidance of Drs. Azalea Kim and David J. Casarett, the team employed machine learning methods, including one called natural language processing (NLP), to develop an algorithm capable of identifying chart notes likely to be most helpful to clinicians.
Similar tools have been applied across a wide range of clinical situations, such as diagnosing new diseases, monitoring disease symptoms, or evaluating response to treatment. Several studies have used such tools to assess adherence to quality indicators for the delivery of palliative care. Most recently, an NLP model was developed to identify EHR documentation of previous conversations between patients and clinicians about goals of care. These experiences show that NLP can potentially expedite the review of large amounts of free-text EHR data. However, the work done to date is limited in its capacity to help clinicians engage patients at the point of care, as it depends on existing documentation of conversations about goals of care.
Our project, which started in 2018, initially developed an algorithm by training an NLP model on a set of 958 notes randomly selected from a pool of approximately 5.3 million clinical notes belonging to Medicare patients receiving care with Duke Health. A team of general medicine and palliative care physicians labeled each note to indicate whether it was relevant in preparing for a goals-of-care conversation with a new patient. Among this population, the model performed very well, with an area under the curve (AUC) of 0.84.
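To make the AUC figure concrete, the sketch below computes it on a handful of made-up labels and scores. It assumes that AUC here means the area under the receiver operating characteristic curve, which can be read as the probability that a randomly chosen relevant note receives a higher score than a randomly chosen irrelevant one. The function name `roc_auc` and the example data are illustrative only and are not taken from the project.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (relevant, irrelevant) note pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one relevant and one irrelevant note")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: 1 = physician marked the note relevant, 0 = not relevant.
example_labels = [1, 0, 1, 0, 0, 1]
example_scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]
example_auc = roc_auc(example_labels, example_scores)  # 7 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ordering and 1.0 to a perfect ordering, so a value of 0.84 means most relevant/irrelevant pairs are ranked in the right order.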
Next, our team further validated the model using a patient population that would more closely resemble the group of patients for whom the tool was ultimately intended: adults admitted to the hospital who had a high risk of mortality within the next 6 months. For these patients, hospitalization can be a pivotal time to clarify their wishes and make decisions about what they want to prioritize in their lives. Patients in such circumstances usually have had numerous encounters with the healthcare system before they are hospitalized, meaning that they have even more information in their medical record to sort through than the average patient would.
Over several months in the fall of 2019, we reviewed data on all patients admitted to the general medicine service at Duke University Hospital. We partnered with the Duke Institute for Health Innovation (DIHI) to employ a novel tool for predicting mortality risk and identify patients who were at increased risk of mortality during their inpatient stay, within 30 days of admission, or within 6 months of admission. We selected 100 patients to represent note samples across the range of probabilities identified by the model and accessed their clinical notes from the EHR for the year leading up to the hospitalization. From each of these patients, approximately 20 notes that were assigned a higher probability by the model were randomly selected and reviewed by a group of hospital medicine physicians to judge whether the note would be relevant to preparing for a goals-of-care conversation. A total of 1,977 notes were labeled. The hospital medicine physicians were prompted to answer the following two “yes/no” questions about each note:
“If you only had 10 minutes to review a patient’s past clinical notes to prepare for a goals of care conversation, is this a note you would want to read?”
“If yes, is this an extremely important note?”
These questions were selected to best represent the intended use case for the model: clinicians with limited time preparing to meet with a patient and discuss goals of care. The NLP model was trained on these labeled notes. In this population of high-risk hospitalized patients, the model operated with an AUC of 0.81—excellent performance, given the complexity of the task. In the hospital setting, a clinician might need to sort through tens if not hundreds of patient notes while searching for discussions about goals of care. This NLP model capably narrows the search to those notes most likely to be helpful.
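One way to picture how such a model narrows the search is a simple shortlist: score every note in the chart, then surface only the highest-scoring few. The sketch below is an assumption about how the output might be used, not a description of the project's software; `ScoredNote`, `top_notes`, the note identifiers, and the probabilities are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredNote:
    note_id: str
    probability: float  # model's estimate that a clinician would want to read the note


def top_notes(notes: List[ScoredNote], k: int = 5, min_probability: float = 0.0) -> List[ScoredNote]:
    """Return up to k notes with the highest probabilities, best first."""
    eligible = [n for n in notes if n.probability >= min_probability]
    return sorted(eligible, key=lambda n: n.probability, reverse=True)[:k]


# Hypothetical chart with four scored notes.
chart = [
    ScoredNote("progress-001", 0.12),
    ScoredNote("palliative-consult-014", 0.91),
    ScoredNote("discharge-007", 0.67),
    ScoredNote("telephone-022", 0.05),
]
shortlist = top_notes(chart, k=2)
```

The cutoff `k` stands in for the clinician's time budget: with only 10 minutes available, a short ranked list is more useful than a probability attached to every note.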
Other groups are developing NLP models for similar purposes. The study by Lee and colleagues mentioned earlier [6] resulted in the development of an NLP algorithm that can identify notes containing documentation of conversations about goals of care with exceptional accuracy. However, the tool’s performance was not perfect: in one of their randomly selected samples (n=300) of patients with serious illness, the tool did not identify any notes containing goals-of-care documentation. While this likely reflects the broader issue of insufficient documentation, it also underscores the point that searching explicitly for documented goals-of-care discussions is not always useful at the point of care. Our algorithm differs in that it does not identify notes solely containing documented discussions; rather, the model was developed to predict whether the clinician would find the note helpful in preparing for a discussion about goals of care.
Sounding a Cautionary Note
Although the performance of this model is promising for its eventual application in clinical care, there are some limitations to consider. The model was trained and validated only on data from our single academic medical center. Although we also conducted external validation, the data used for that purpose was collected from our academic medical center as well. So while we’re confident of the model’s performance on notes in our center, we can’t be equally confident of the model’s generalizability to other settings. That said, we hypothesize that with retraining the model would perform well elsewhere.
An additional issue is that using NLP techniques comes with some downsides, one of which is difficulty in identifying how the model weighs note content and generates predictions. We did not conduct a deep dive to evaluate textual elements that were most highly weighted by the model. Nevertheless, clearly identifying such content is not a typical practice in NLP model development, nor does the lack of such information diminish the performance of these models.
Our model is also limited by the predefined outcome label. We limited the note labeling by clinicians to two questions with binary responses. In addition, each note was rated by only one clinician. It’s fair to expect that clinicians will vary in terms of their impressions of note relevance to discussions about goals of care. We therefore anticipate that these constraints contributed to worse model performance than would be possible using multiple raters per note and a more robust measure of note relevance.
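If a future labeling round did use two raters per note, their agreement could be quantified before training. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic for two raters; it is offered as one possible approach, not something the project did. The function name and the ratings are made up for illustration.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two raters giving yes (1) / no (0) labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of notes")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a_yes = sum(rater_a) / n
    p_b_yes = sum(rater_b) / n
    # Agreement expected if each rater answered independently at their own base rate.
    expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    if expected == 1.0:
        # Kappa is undefined (0/0) when both raters give one identical answer
        # to every note; returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical answers from two physicians to "would you want to read this note?"
physician_1 = [1, 1, 0, 0, 1, 0, 1, 0]
physician_2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(physician_1, physician_2)
```

In this made-up example the raters agree on 6 of 8 notes, but half of that agreement would be expected by chance, giving a kappa of 0.5.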
Next Steps and Further Challenges
Despite these limitations, our team judged the model’s performance in the setting of our academic medical center to be excellent. We are therefore looking to the next steps in the model’s evaluation and improvement. These include assessing model performance from the clinicians’ perspective in the context of a live clinical workflow and determining the model’s impact on clinician practices. This may include increased documentation of advance care planning and goals of care, or more important clinical measures, such as delivering goal-concordant care. Given the subjectivity of classifying notes as ones a provider “would want to read,” there is also potential to train models to match the preferences of a specific provider. This may take the form of asking clinicians using the tool the question “Was this note helpful or not?” and adapting the model based on their responses. In time, the provider would be working with a “smart” EHR that can quickly bring to the provider’s attention the content that they will find most useful or applicable to their clinical work.
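The feedback loop described above could take many forms. As a minimal sketch, assume each provider gets a small logistic regression over the words in a note, nudged by one gradient step every time they answer "helpful" or "not helpful". This is an illustrative assumption, not the project's design; the class `FeedbackAdapter`, its parameters, and the example tokens are hypothetical.

```python
import math
from typing import Dict, Iterable


class FeedbackAdapter:
    """Per-provider logistic regression over word presence, updated one answer at a time."""

    def __init__(self, learning_rate: float = 0.1) -> None:
        self.learning_rate = learning_rate
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def predict(self, tokens: Iterable[str]) -> float:
        """Estimated probability that this provider finds the note helpful."""
        z = self.bias + sum(self.weights.get(t, 0.0) for t in set(tokens))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, tokens: Iterable[str], helpful: bool) -> None:
        """One stochastic gradient step on the log-loss for a single yes/no answer."""
        unique = set(tokens)
        error = (1.0 if helpful else 0.0) - self.predict(unique)
        self.bias += self.learning_rate * error
        for t in unique:
            self.weights[t] = self.weights.get(t, 0.0) + self.learning_rate * error


adapter = FeedbackAdapter()
family_meeting = ["hospice", "family", "meeting"]
refill_request = ["medication", "refill"]
before = adapter.predict(family_meeting)  # 0.5 before any feedback

# Simulated feedback: this provider keeps marking one kind of note helpful
# and the other kind not helpful.
for _ in range(20):
    adapter.update(family_meeting, helpful=True)
    adapter.update(refill_request, helpful=False)

after_helpful = adapter.predict(family_meeting)
after_unhelpful = adapter.predict(refill_request)
```

A real system would likely start from the shared model's scores rather than from zero, and would need safeguards against overfitting to a few clicks; this sketch shows only the update mechanics.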
However, even after accounting for these limitations and assuming additional improvements as development proceeds, significant barriers to implementation remain. Even with a model that works perfectly, there are challenges to integrating any new software into the EHR, let alone an NLP model operating on free-text clinical documentation to support the clinical workflow. Our team encountered several obstacles while exploring implementation at Duke Health, chief among which was the lack of a feasible pipeline for interval data extracts. Other challenges include applying the model to assign probabilities to notes and then uploading the data back into the EHR. Such a process would require hardware infrastructure for data storage and model operation, as well as software infrastructure to support the data extraction and upload from and back to the EHR. Put simply, there would be significant costs to model integration, costs that would greatly exceed the costs of model development and validation. These realities, coupled with the parallel development in our academic medical center of a complementary solution to the same clinical problem, ultimately led us to pause further work on model integration.
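The extract, score, and upload steps described above can be sketched as a single batch job that runs at a fixed interval. Everything in this sketch is hypothetical: the `Note` record, the `run_interval_batch` function, and the keyword scorer that stands in for the NLP model. It makes no claim about how any EHR vendor exposes notes or accepts scores; it only shows the shape of the loop that the surrounding infrastructure would have to support.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class Note:
    note_id: str
    modified_at: datetime
    text: str


def run_interval_batch(
    notes: Iterable[Note],
    last_run: datetime,
    score: Callable[[str], float],
) -> Tuple[Dict[str, float], datetime]:
    """Score notes modified since the last run; return the scores and the next checkpoint."""
    new_notes = [n for n in notes if n.modified_at > last_run]
    scores = {n.note_id: score(n.text) for n in new_notes}
    # Advance the checkpoint only as far as the newest note actually processed.
    next_checkpoint = max((n.modified_at for n in new_notes), default=last_run)
    return scores, next_checkpoint


def keyword_scorer(text: str) -> float:
    """Stand-in for the NLP model; a real deployment would call the trained model."""
    return 1.0 if "goals of care" in text.lower() else 0.0


# Hypothetical extract from the EHR.
extract = [
    Note("n1", datetime(2019, 9, 1, 8, 0), "Routine medication refill."),
    Note("n2", datetime(2019, 9, 2, 14, 30), "Family meeting to discuss goals of care."),
    Note("n3", datetime(2019, 9, 3, 9, 15), "Physical therapy evaluation."),
]
scores, checkpoint = run_interval_batch(extract, datetime(2019, 9, 1, 12, 0), keyword_scorer)
```

The returned `scores` dictionary is what would be written back to the EHR, and `checkpoint` would be stored so the next run picks up only newer notes. The hard parts named in the text, namely getting the extract out and the scores back in, sit outside this function.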
As EHR vendors build additional capabilities for implementing machine learning and artificial intelligence, there will be more opportunities to integrate similar NLP models in a financially viable way. Despite its strong performance, our model to identify notes helpful to prepare for conversations about goals of care is not yet ready for prime time. Nonetheless, we’re encouraged by our experience and believe that it demonstrates that the joint efforts of clinicians and data scientists can lead to innovation with the potential for significant impact on clinical care. Advances such as this will be necessary to continue improving the quality of care, and particularly the patient experience of care, as both EHRs and the healthcare system continue to grow in complexity.
Many Duke colleagues made critical contributions to the design, development, and validation of the model. In addition to those named above, these include Mina Boazak, Julie Childers, Victoria Christian, Allison Dunning, Brian Griffith, Ricardo Henao, Erich Huang, Andy Mumm, Andrew Olson, Eric Poon, Ursula Rogers, Shelley Rusincovitch and Myung Woo; Health Data Science Interns Matias Benitez and Qi (Dylan) Liu; and collaborators from the Duke Institute for Health Innovation, including Suresh Balu, Michael Gao, and Mark Sendak. The development of the model was supported by the Duke Forge and a Joint Liability Steering Committee Safety Grant award.
1. Secunda K, Wirpsa MJ, Neely KJ, Szmuilowicz E, Wood GJ, Panozzo E, McGrath J, Levenson A, Peterson J, Gordon EJ, Kruser JM. Use and Meaning of "Goals of Care" in the Healthcare Literature: a Systematic Review and Qualitative Discourse Analysis. J Gen Intern Med. 2020 May;35(5):1559-1566. doi: 10.1007/s11606-019-05446-0. Epub 2019 Oct 21. PMID: 31637653; PMCID: PMC7210326.
2. Lindvall C, Lilley EJ, Zupanc SN, Chien I, Udelsman BV, Walling A, Cooper Z, Tulsky JA. Natural Language Processing to Assess End-of-Life Quality Indicators in Cancer Patients Receiving Palliative Surgery. J Palliat Med. 2019 Feb;22(2):183-187. doi: 10.1089/jpm.2018.0326. Epub 2018 Oct 17. PMID: 30328764.
3. Chan HYL, Lee DTF, Woo J. Diagnosing Gaps in the Development of Palliative and End-of-Life Care: A Qualitative Exploratory Study. Int J Environ Res Public Health. 2019 Dec 24;17(1):151. doi: 10.3390/ijerph17010151. PMID: 31878235; PMCID: PMC6982034.
4. Lee KC, Walling AM, Senglaub SS, Bernacki R, Fleisher LA, Russell MM, Wenger NS, Cooper Z. Improving Serious Illness Care for Surgical Patients: Quality Indicators for Surgical Palliative Care. Ann Surg. 2020 Jun 3. doi: 10.1097/SLA.0000000000003894. Epub ahead of print. PMID: 32502076.
5. Udelsman BV, Moseley ET, Sudore RL, Keating NL, Lindvall C. Deep Natural Language Processing Identifies Variation in Care Preference Documentation. J Pain Symptom Manage. 2020 Jun;59(6):1186-1194.e3. doi: 10.1016/j.jpainsymman.2019.12.374. Epub 2020 Jan 9. PMID: 31926970.
6. Lee RY, Brumback LC, Lober WB, Sibley J, Nielsen EL, Treece PD, Kross EK, Loggers ET, Fausto JA, Lindvall C, Engelberg RA, Curtis JR. Identifying Goals of Care Conversations in the Electronic Health Record Using Natural Language Processing and Machine Learning. J Pain Symptom Manage. 2021 Jan;61(1):136-142.e2. doi: 10.1016/j.jpainsymman.2020.08.024. Epub 2020 Aug 25. PMID: 32858164; PMCID: PMC7769906.
7. Brajer N, Cozzi B, Gao M, Nichols M, Revoir M, Balu S, Futoma J, Bae J, Setji N, Hernandez A, Sendak M. Prospective and External Evaluation of a Machine Learning Model to Predict In-Hospital Mortality of Adults at Time of Admission. JAMA Netw Open. 2020 Feb 5;3(2):e1920733. doi: 10.1001/jamanetworkopen.2019.20733. PMID: 32031645.