The success of the IBM Watson team and their automated question-answering system on Jeopardy opened the eyes of many to the potential of computerized “understanding” of human language. The victorious IBMers were quick to clarify that their goal isn’t to create a robot army of game show champs but to apply these technologies to tackle important societal problems — starting with healthcare.
They couldn’t have picked a better target. By some estimates, 70% of all clinically useful information is formatted as unstructured free text. And thanks to a gradual shift toward pay-for-performance and fee-per-patient models of reimbursement, healthcare organizations are finally getting serious about using their data to realize efficiencies. With so much useful information captured only in clinicians’ narratives, hospital administrators and policymakers are grappling with what to do with all that free text.
Natural language processing, or NLP, is software designed to turn unstructured free text into structured values. The hope is that NLP can help answer doctors’ questions at the point of care much in the way Watson responded to Alex Trebek. NLP ideally would automatically populate important variables in patient registries or as part of quality-improvement initiatives. Effective NLP could play a critical role in finally answering healthcare’s most obvious and important unanswered questions: What are we doing, to whom are we doing it and is it working? All that leads to this question: Is natural language processing ready for wide use?
Vendors beyond IBM have recognized these needs and are responding with NLP-based systems ranging from the automated assignment of billing codes to “enterprise” natural language processing systems. In the Intro to Clinical NLP tutorial I teach at the annual meeting of the American Medical Informatics Association, I’ve noticed a shift in the audience from curious researchers and students to hospitalists and EMR vendors interested in implementing NLP. This exciting shift has led me to explain the fundamentals that people must know about clinical NLP to make decisions about using this technology in their organizations. I exclude important but familiar software-related concerns such as “buy versus build,” product selection criteria and vendor lock-in. Think of these as the NLP-specific factors you need to keep in mind when considering NLP-based systems.
1. How good is good enough?
For over 50 years, researchers have shown that NLP can be applied with high levels of accuracy for any number of tasks, from extracting symptoms, treatments and tests from the texts of medical records to automatically assigning billing codes. So why hasn’t NLP been widely adopted? There are economic reasons, of course, but the nature of the technology itself is largely to blame.
It’s important to recognize that, except in the most trivial of applications, NLP won’t be 100% accurate. Understanding “how good is good enough” is therefore the first question potential users must answer. If you can tolerate accuracy in the high 70% to mid-90% range, you’re in the ballpark for using NLP.
A related question is what the estimated prevalence of the “target” that you’re trying to understand is. For example, are you interested in extracting tumor stage from cancer-related pathology reports, which should be present in nearly every report? Or are you looking for a rarer target such as evidence of falls in a collection of nursing notes? If the target appears only once for every thousand records, you may have trouble accumulating enough records to “train” a system. You still might be tempted to try out a specialized “fall detection” NLP commercial system for that rare target. But beware — evaluating the system on your own data with so few instances can be equally challenging.
2. What type of system will you use?
NLP systems can take different approaches, such as rules-based or grammar-based. Asking which is better is like judging the utility of a hammer versus a screwdriver. The answer depends on the intended use. This fact has prevented vendors from rolling out a one-size-fits-all product and complicates the selection of an appropriate system. A basic understanding of the different approaches to NLP is helpful in matching the approach to the problem. Below is a gross oversimplification of the different approaches to NLP and their pros and cons.
Rules-based: To extract a measure that consistently appears in records, such as a reported blood pressure score, ejection fraction or tumor stage, a simple rules-based approach might be best. These approaches involve searching for patterns in documents. The upside is that these are simple to get started with because nearly every programming language features some flavor of “regular expressions.” A drawback is that you’re only as good as the rules you’ve defined; the slightest unaccounted-for variation equals a miss. Once you start accounting for all variations of a targeted concept plus negation (e.g., “patient shows no signs of ...”), the number of rules created and maintained can become unwieldy.
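As an illustration, a rules-based extractor can be as small as a single regular expression. The pattern and note snippets below are hypothetical — a minimal sketch, not production-grade rules, and it makes no attempt at the negation handling described above:

```python
import re

# Hypothetical sketch of a rules-based extractor for ejection fraction (EF).
# The pattern and example notes are illustrative, not from any real system.
EF_PATTERN = re.compile(
    r"\b(?:ejection fraction|LVEF|EF)\b[^0-9%]{0,15}(\d{1,2})\s*%",
    re.IGNORECASE,
)

def extract_ef(note: str):
    """Return the first ejection-fraction percentage found, or None."""
    m = EF_PATTERN.search(note)
    return int(m.group(1)) if m else None

print(extract_ef("Echo today. LVEF estimated at 55%."))        # 55
print(extract_ef("Ejection fraction: 35 %, mildly reduced."))  # 35
print(extract_ef("No echo performed this admission."))         # None
```

Every phrasing the pattern doesn’t anticipate (“EF of fifty-five percent,” a value split across lines) is a silent miss — which is exactly the maintenance burden described above.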
Grammar-based: A “discharge” from a hospital and a “discharge” from a wound are two very different things. The sheer number of ways a medical concept can be expressed may make it necessary to introduce an understanding of how the expression is used. Grammar-based approaches consider how a term is used in order to map terms to dictionaries of concepts such as the National Library of Medicine’s Unified Medical Language System.
These approaches are helpful in dealing with the complexities of medical language but can be slow to run and often leave the user with long lists of potential matches to sort through (“APC” = activated protein c, aerobic plate count, antibody producing cells, age period cohort ...).
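To make the idea concrete, here is a toy sketch of dictionary lookup with context-based sense disambiguation. The mini-dictionary stands in for a real terminology such as the UMLS; the concept names and context cues are invented for illustration:

```python
# Toy concept dictionary: each sense of an ambiguous term carries context
# cues. A real system would map to UMLS concept identifiers; these entries
# are illustrative assumptions only.
CONCEPTS = {
    "discharge": [
        {"concept": "hospital discharge", "cues": {"home", "hospital", "instructions"}},
        {"concept": "wound discharge", "cues": {"wound", "purulent", "drainage"}},
    ],
}

def disambiguate(term, sentence):
    """Pick the sense whose context cues best overlap the sentence."""
    words = set(sentence.lower().split())
    candidates = CONCEPTS.get(term.lower(), [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(c["cues"] & words))["concept"]

print(disambiguate("discharge", "Patient ready for discharge home tomorrow"))
print(disambiguate("discharge", "Purulent discharge noted at the wound site"))
```

Even this toy version shows the trade-off: resolving a sense requires enumerating candidates first, which is where the long lists of potential matches come from.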
Machine learning-based: Machine learning is a subfield of artificial intelligence that combines math with computational brute force to “learn” patterns. Machine learning approaches are good at finding “most likely” matches, and many algorithms provide a weight or score describing the algorithm’s confidence in the match. Most machine learning algorithms used in NLP are what are referred to as “supervised” approaches, meaning they learn by example. To find descriptions of pneumonia in free text, supervised approaches rely on having examples of known cases with pneumonia. Anyone using machine learning must therefore be aware of the potential cost of providing those training examples. Systems will blur the lines among these approaches, but most can be safely classified into one of these three categories.
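A minimal supervised-learning sketch, using a hand-rolled naive Bayes classifier over a few made-up note snippets labeled for pneumonia (1) or not (0). Real systems need far larger, expert-labeled training sets; everything here is invented for illustration:

```python
import math
from collections import Counter

# Tiny invented training set: (note text, pneumonia label).
train = [
    ("bilateral infiltrates consistent with pneumonia", 1),
    ("cough fever and consolidation on chest xray", 1),
    ("clear lungs no acute cardiopulmonary process", 0),
    ("ankle fracture no respiratory complaints", 0),
]

# Count word occurrences per class ("learning by example").
counts = {0: Counter(), 1: Counter()}
labels = Counter()
for text, label in train:
    labels[label] += 1
    counts[label].update(text.split())
vocab = set(counts[0]) | set(counts[1])

def score(text, label):
    # Log prior plus log likelihood with add-one smoothing; the score
    # doubles as a crude confidence weight for the match.
    s = math.log(labels[label] / sum(labels.values()))
    total = sum(counts[label].values())
    for w in text.split():
        s += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return s

def predict(text):
    return max((0, 1), key=lambda lab: score(text, lab))

print(predict("infiltrates and fever suggest pneumonia"))  # 1
print(predict("no acute process lungs clear"))             # 0
```

The point of the sketch is the dependency it exposes: every labeled example in `train` had to come from somewhere, and in a clinical setting that somewhere is usually expensive expert annotation.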
3. What’s your process to implement NLP?
There are two key elements that every NLP implementation process needs in some form: training and testing. Training involves teaching a system the nuances of your data, using a body of data known as a training set. Will the system be trained using your actual data? You might find it acceptable for the training to not rely on your real data if you’re confident that your clinical data is similar enough to the data originally used to train the system. Either way, you can only be confident that a system works in
your environment by conducting a proper evaluation using a test set, a set of notes in which you will evaluate the performance of a system.
In NLP testing, accuracy is most often gauged with the metrics of recall, precision and their harmonic mean, otherwise referred to as the F-measure. These relatively simple measures are akin to the more clinically familiar sensitivity, specificity and area under the receiver operating characteristic curve. Of course, “good enough” can be measured in a number of dimensions, from ease of implementation and maintenance to speed of computation to accuracy.
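These metrics reduce to simple arithmetic over true positives, false positives and false negatives; a quick sketch with made-up counts:

```python
def evaluate(tp, fp, fn):
    """Standard NLP evaluation metrics from raw counts."""
    # Recall: share of true targets the system actually found.
    recall = tp / (tp + fn)
    # Precision: share of the system's outputs that were correct.
    precision = tp / (tp + fp)
    # F-measure: harmonic mean of precision and recall.
    f = 2 * precision * recall / (precision + recall)
    return recall, precision, f

# Illustrative counts: 80 correct hits, 20 false alarms, 10 misses.
r, p, f = evaluate(80, 20, 10)
print(round(r, 3), round(p, 3), round(f, 3))  # 0.889 0.8 0.842
```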
4. What problem are you trying to solve?
Since there is no one “best approach” for clinical NLP, it becomes important to be very clear on the problem you’re trying to solve. The specific use will also dictate the acceptable performance criteria of a system. Is your application intended to contribute to clinical decision support at the point of care? If so, you likely require rapid computation, and you’re probably more interested in presenting information that you’re highly confident in (higher precision) versus capturing all possibly relevant information (higher recall). If NLP is used to find patients with similar conditions, whether for observational studies or registry population, precision may again be important but “real time” becomes less critical. If NLP is embedded in a biosurveillance application, end users may place an emphasis on recall, or casting a wider net to avoid missing any possible outbreaks.
I recommend that anyone considering implementing NLP document the specific goal of the system and acceptable performance in terms of computational time, recall, precision and F-measure before engaging vendors or consultants. Goals such as “facilitate quality improvement” aren’t granular enough to guide decisions. Goals need to be at the level of “extract tumor stage values from the postoperative pathology reports of patients with prostate cancer.”
5. Have you conducted a walk-through?
There are usually many steps that need to be automated in a production system before NLP itself comes into play. Consider the previous example of extracting tumor stage data. The system user wants the reports narrowed down by several filters: they need to be pathology reports, the reports need to be postoperative and only for prostate cancer. Can you narrow your patient population with a technique easier and more precise than NLP? Are there structured data elements such as standard note titles or ICD-9 codes you can use? Or do you have to use NLP just to figure out which reports to use NLP on (a not uncommon conclusion)? These are all questions that should be answered before thinking too seriously about specific NLP systems.
The best way to begin answering these questions is to conduct a “mock” end-to-end pipeline from raw data in the source system to the NLP software. Start with 10 or so “cases” and be sure to capture every part of the process in which automation will be required.
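Such a structured pre-filtering step might look like the sketch below. The field names, note title and report contents are assumptions for illustration (ICD-9 code 185 is malignant neoplasm of the prostate):

```python
# Sketch of narrowing the document set with structured metadata so NLP
# only runs on reports that need it. Field names and note titles here
# are hypothetical; real EMR extracts will differ.
reports = [
    {"title": "SURGICAL PATHOLOGY REPORT", "icd9": ["185"], "postop": True,
     "text": "Gleason 7, stage pT2c."},
    {"title": "DISCHARGE SUMMARY", "icd9": ["185"], "postop": True,
     "text": "Discharged home in stable condition."},
    {"title": "SURGICAL PATHOLOGY REPORT", "icd9": ["174.9"], "postop": False,
     "text": "Breast biopsy, benign."},
]

def needs_nlp(report):
    """Apply the three structured filters before any NLP runs."""
    return (report["title"] == "SURGICAL PATHOLOGY REPORT"
            and "185" in report["icd9"]
            and report["postop"])

selected = [r for r in reports if needs_nlp(r)]
print(len(selected))  # 1 report survives the structured filters
```

If a filter like `postop` turns out not to exist as structured data in your source system, that is exactly the kind of gap the mock pipeline is meant to surface.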
This is also a good opportunity to get a feel for how the “target” appears in the record. Is your target in only one record type? Is it semistructured? Is it spread throughout the document or only present in one section — for example, in the “findings” section versus “history and physical”? This will help you determine which type of system might be most appropriate. Keep careful track of the raw documents used in this pilot exercise so you can use them in the next critical aspect of implementing NLP — the evaluation.
6. Who owns the evaluation?
No NLP system should be adopted in production until it has proved itself to be “good enough” in your environment using your data. Here’s why: NLP systems trained to work in one environment may not work as well in a new environment with different clinicians, coders, nurses, note titles and EMR systems. Without a proper evaluation with your own data, you may be blind to certain types of misses that occur consistently, introducing bias into your results. This is especially true of rules-based systems, which are rigid in what they catch or miss.
While the details of a thorough evaluation are the topic of textbooks and scholarly articles, you can apply a few guidelines to be sure you’re getting what you bargained for. “Owning” the evaluation means setting the metrics and acceptable performance criteria in advance and blinding any proposed system and its owners to the “answers” contained in the test set. First, be sure your test set includes a meaningful number of cases. It takes at least 30 cases for a sample to begin to be representative. When we run evaluations, we perform a statistical power analysis to determine what number to include.
While that might seem academic, depending on the size of the investment and the importance of the system’s accuracy, it may be worth basing your evaluation on this relatively simple statistical result, which considers sample size and the estimated prevalence of the target. Second, be sure your test set is a mirror image of the environment in which you intend to deploy the system. If only one in 100 cases has some record of the treatment you’re interested in finding, don’t conduct an evaluation with 50% of cases containing the treatment and 50% without. Such results will look great on paper and fail in the real world.
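A back-of-the-envelope sketch of both points: how many records a given prevalence requires to yield enough positive cases, and roughly how wide the uncertainty on an accuracy estimate is at a given test-set size. The 95% z-value and the target of 30 positives are conventional choices, not prescriptions, and this normal approximation is a rough stand-in for a full power analysis:

```python
import math

def records_needed(prevalence, min_positives=30):
    """Records to review so the test set contains enough positive cases."""
    return math.ceil(min_positives / prevalence)

def ci_half_width(p_hat, n, z=1.96):
    """Normal-approximation 95% confidence half-width for a proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# At 1% prevalence, ~3000 records are needed to expect 30 positives.
print(records_needed(0.01))  # 3000
# A recall of 0.85 measured on only 30 positives carries wide uncertainty.
print(round(ci_half_width(0.85, 30), 3))  # 0.128
```

The second number is the point of the exercise: a headline recall of 85% from a 30-case test set really means “somewhere between roughly 72% and 98%,” which may or may not clear your predefined bar.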
Some vendors may require that your data be shared with them before an evaluation to ensure that their system is properly trained for your environment. That’s fine, as long as the system and its owners remain blinded to the “answers” in the test set ultimately used in the evaluation. Data in the test set shouldn’t appear in the training set — that’s like taking a test with a cheat sheet in front of you.
Answering these questions gives health IT leaders a lot to do. But human language is a tricky thing, full of nuance and double meaning. With a basic familiarity with the nature of the technology and some best practices surrounding its implementation, however, NLP can unlock insights that would otherwise remain buried in our mountains of electronic medical record data. As former Jeopardy champions Ken Jennings and Brad Rutter learned, ready or not, machine interpretation of human language is coming.