Monday, January 13, 2014

Lessons Learned Building the Infrastructure for the Department of Veterans Affairs Million Veteran Program

Originally appeared in InformationWeek under the title "Big Government Software Projects: 11 Lessons Learned"
The technical difficulties surrounding the launch of Healthcare.gov might eventually prove to be one of the most consequential failed government IT rollouts of our time. It joins a long line of large-scale IT failure stories. Yet despite the finger pointing and political circus surrounding the launch, government will continue to require new information systems in order to fulfill its obligation to provide services to its citizens. We need to learn from these megaprojects.
For the last five years I had the great fortune of leading the information technology portion of what has been called one of the "5 Most Innovative Government Big Data Projects." The Department of Veterans Affairs' Million Veteran Program (MVP) set out to enroll and gather the health and biological information of 1 million veterans to facilitate faster and more cost-effective discovery of the relationships between human genes and our health. To make this possible we developed several integrated applications that pull from multiple systems across the VA to manage the mail, calls, surveys, schedules, samples, and logistics across several departments within this giant government bureaucracy.  
Today, MVP systems have helped enroll more than 200,000 veterans, and I'm personally off to a new set of challenges, making this a good time to reflect back on lessons learned. Some of what follows are descriptions of what we did right. Most, however, are lessons learned the hard way. Of course, no two IT projects are the same, and what we did isn't directly comparable to Healthcare.gov. However, it's my hope that some of my experiences in attempting to garner resources, build teams, identify barriers, and negotiate solutions across institutional silos will be useful to people involved in large-scale system development and implementation.
The need for more data
First, here's a little about the project to give you some context. Our DNA affects everything from how we process nutrients to how we react to illness and process drugs. But which of our 30,000 genes make us more likely to suffer from, say, alcoholism? Figuring out the patterns that matter requires the participation of enormous numbers of patients -- a factor that has limited scientific discovery. When the Honorable Eric Shinseki was appointed Secretary of Veterans Affairs by President Obama, one of the "Transformative Initiatives" he invested in was the Million Veteran Program, meant to address the shortage of DNA for research and to ensure our veterans receive the cutting-edge care that can come from genomic discovery.
With a core of nine to 12 people we developed a series of applications that on a weekly basis send out 20,000 pieces of mail, process approximately 2,000 blood samples and surveys, field 2,000 calls in a call center, and manage consent forms from 40 VA hospitals. This project will only truly be considered a success when it leads to better healthcare for veterans. That said, MVP's recruitment and enrollment systems are working, and the first genomic samples are being shipped off for processing. 
Here are 11 lessons we learned along the way.
1. Chance favors the prepared 
The organization that did this work, the Massachusetts Veterans Epidemiology Research & Information Center, or MAVERIC, is a multi-disciplinary research-and-development organization with experience conducting large-scale research projects for the VA. When Secretary Shinseki declared his intent to transform the VA, we had already presented a whitepaper and held conversations with research leaders on the importance of capturing and making genomic data reusable. We invested the time of a small team to develop prototypes to hash out critical details before any real dollars were approved to develop the infrastructure for MVP. If you invest a small amount of resources in the potential "next big thing" for your organization, the worst that can happen is that your team stays engaged and sharpens its skills. If you guess right, you've got a head start when leaders look for projects to fund. With luck, you get to help set the agenda. 
2. Talk to people who've done it 
Rather than just rely on vendors, whitepapers, and a key contact or two in the early research stages, we did an all-out cold-call campaign to everyone we could find with experience on similar projects. We identified some 15 organizations and asked about the scope, the resources required, how they defined success, and what they planned to accomplish in the next six months. A great question, we learned, was "What would you do differently if you had it to do over?" This might seem intrusive or even inappropriate, but we found professionals quite happy to share and grateful to have new contacts doing similar work. In several cases we were hosted for half-day or all-day visits.
The research helped in three ways. It previewed the technical and political challenges we would face. It gave us confidence in our early estimates of timelines and required resources. And it provided contacts for future advice. We also realized that, when it came to building a system to enroll enormous cohorts, no one had all the answers. This made us more comfortable hacking through the inevitable problems as they arose. 
3. Hire great people…
Hiring highly motivated and talented individuals is the single biggest reason we made good progress. The best hires shared one trait: All were very strongly attracted to the mission. The chance to build information systems that can improve lives is rare and important enough that people even found it worth accepting a government salary and the inevitable red tape that comes with government work. 
4. …and remove people who aren't great, quickly 
There will be people who seem terrific on paper and even during interviews but for whatever reason don't work out. It's very tempting to hope that things will change or to ignore the issue altogether. Instead, after you've taken the right steps to rectify the situation, these individuals should be removed as soon as it becomes clear to you and key team members that things aren't working out. This was almost unheard of in the government organization I worked for, and I wasn't excited to break from tradition. However, sticking with a poor fit longer than absolutely necessary has a reverberating effect. A results-oriented team quickly realizes when one individual is hurting its combined efforts and grows resentful.
5. Insource first, outsource selectively 
The nature of our project required developing and releasing several different applications in rapid succession. My initial plan was to give my in-house team, with its close proximity to the business, the most complex and time-sensitive modules. I assigned development of smaller but critical modules to consulting firms to develop in parallel. This turned out to be a mistake. After several disappointing results it became clear that the blame belonged not with our external partners (many of whom were excellent) but with the structure I had chosen to accomplish the work.
I am now firmly in the camp of Fred Brooks and his "mythical man month" idea: In a nutshell, nine women cannot make a baby in one month by each carrying in parallel. My initial setup meant that our in-house team was completing its modules and then correcting the shortcomings of contractor deliverables. The project in its early stages was simply too complex and amorphous to be successfully parallelized. Instead, we shifted to a model where development of every module was led by in-house developers, with contractors used to supplement our in-house team. We brought project management in-house and instead contracted for embedded, onsite developers, analysts, and QA people -- whose work was overseen by the in-house team. This change improved communication and teamwork, and it let us quickly spot sub-par contractors and replace them with high performers. This oversight is especially important in government where the decision of which vendor to partner with is often determined by contracting offices, not the person responsible for delivering the project. After the change, the quality and quantity of deliverables increased substantially.
6. Find a champion and use the clout as needed 
Enterprise-scale software systems by definition cross the fiefdoms of many chieftains, each of whom can kill your small but determined caravan as it travels through. The larger the organization, the smaller and less significant your caravan can appear to each chief. Although it is important to understand the incentives of any chief you depend on, there are some who will only be swayed by superior firepower. For such situations there is no substitute for a champion at the highest levels of the organization. It was not by chance that the first people to give their blood samples and health data to the Million Veteran Program were the highest levels of VA leadership. Neither was it coincidence that several resistant chiefs were reminded of the origins of samples on an as-needed basis. 
7. Write your plan in sand 
We were never more confident about what the system would have to do than the day before development began. It went downhill from there as the "business" learned from trial and error how to recruit and enroll enormous numbers of veterans from across the country. This story is familiar to those involved with software development, but it can be blasphemous to those in government project planning.
Most large-scale efforts to build things in government (and elsewhere) require a book-thick collection of specifications and requirements to be signed off in advance of "breaking ground." The whiplash of the dotcom era taught many of us in IT that, although the waterfall method might lead to perfectly capable buildings or fighter jets, the malleable nature of software requires a different kind of planning. As our understanding of the complexity of the project grew and as the business learned just what was possible, our detailed 12-month and even six-month project plans became works of fiction. We learned that three months was about the longest period for which we could predict deliverables with some confidence. We kept six- and 12-month plans and submitted them as instructed to the appropriate overseers, but shifted them to 20,000- and 40,000-foot views.
Those timelines will vary by project, but the lesson is that in complex projects with novel or unfamiliar business processes, it's essential to use project plans at several levels of granularity and to revisit them often. Note: This is very different from constantly revisiting project goals. If goals change frequently, you've got bigger problems. 
8. Establish one ultimate business owner 
An enterprise-scale system means a long list of stakeholders, each with valuable opinions on how your software should work. We were blessed and cursed by an overabundance of talented and devoted stakeholders with opinions that varied, appropriately, with their worldviews and priorities. We dutifully identified and catalogued such priorities. However, the only way we could sustain progress was by establishing one ultimate business owner who could settle disputes and trump others in a timely manner. With some effort we convinced all stakeholders that the MVP project director -- the primary day-to-day business owner -- would be the arbiter of conflicting priorities. We made clear that delays in reaching consensus would delay deliverable software.
Getting everyone to honor the arrangement wasn't a simple task, even when agreed to in theory. Several influential stakeholders in the habit of being listened to continued "suggesting" alternative approaches directly to the development team, resulting in hours-long distractions. The direct change requests stopped only when it became clear that the development team would change direction solely with the MVP project director's consent. Enforcing this required empowering every member of the development team to politely convey this process to individuals holding higher rank, and standing firmly behind them when they did.
9. Prepare for an evolving software development method
Before this project, I'd already treated software development methods such as agile, waterfall, and the Rational Unified Process like traffic signs in Boston -- more suggestion than law. What happened in this project, though, was a full metamorphosis from one process to another.
In the earliest stages of the project, we worked in full-fledged agile mode. We benefited tremendously from being able to yell across the hall to the people ultimately responsible for using the system. They were quick to jump in, debate a finer point of development, challenge assumptions, and hash out problems in a hurry. We were able to prototype rapidly.
As the project team grew, the project touched more systems and the demands for deliverables increased from above. The emphasis shifted from "how might this work?" to "make sure it works." The interrelated nature of the systems meant that a change to the mail routine now affected the site coordinator application, which affected our reporting system, and so on. We needed to add structure and a greater emphasis on quality assurance and testing.
Where frequent conversations with MVP operations partners were an enormous asset in the earliest stages, now such conversations became a liability. I actually had to curtail unannounced visits by MVP operations and politely ask this same group of people that had been working so successfully with us in agile mode for months to visit less frequently. We went from two-week coding sprints between releases to six- to eight-week runs, and we formalized requirements-gathering sessions. The close relationship the combined team had fostered in our agile days probably lessened the hurt and suspicion as I chased people away from our developers (sometimes literally). Looking back, the ability to prepare for such a shift in advance would have been helpful.
10. Let the business dictate priorities and developers explain the consequences
The software team naturally develops strong opinions on project priorities after months of automating business processes. I learned, however, that expressing those opinions creates a slippery slope. Because business owners have limited experience with software development, they find it frustrating that seemingly simple changes can't be made the way they once were to paper-based processes. Why does a simple request like changing the order of mailings cause a two-week delay in deployment? Amid such frustration, any perception that the software team is dictating business priorities pours fuel on that fire.
If we came to the opinion while reviewing priorities that option A was a higher priority than option B, we learned not to advocate for one or the other. Instead we found it far more effective to outline the effects of one choice versus another from as unbiased a position as possible. This seemingly subtle distinction made a world of difference.
11. Protect your project from outsiders until it's ready to leave the nest
Our project's survival, in no uncertain terms, depended on reaching 100,000 enrolled within one year of deployment. With that ultimatum in mind we treated everything else as a threat to our existence.
Survival meant we had to resist efforts by others, even within our own organization, to subsume, host, develop, or partner with our program until we had a running, working system. It was a double-edged sword, because many of the people suggesting expanding our scope were great supporters and in a position to advance the team's agenda. Their suggestions were exactly the type of work we really did want to be involved with -- such as becoming the platform for recruitment and enrollment across the VA, or integrating with existing, patient-facing, web-based platforms. How could we protect the current effort while not alienating the group that would offer future opportunities?
At first, I tried to explain the downstream effects of taking on a new element, the time it would take to expand the data model, how slightly divergent business processes would increase development and testing time, and so on. It sounded like IT jargon. I needed a better way to protect the project's timelines while not burning bridges.
Remember when I said that not dictating priorities made a world of difference? Rather than reject any proposal to join forces, switch hosts, or incorporate new goals, I learned to express enthusiasm -- which was easy because they really were projects I wanted to take on, at some point. Then I would explain how the new plans could delay our 100,000-enrollment goal. Unfortunately, I was sure to relay, decisions to change project goals and deadlines felt above my pay grade, but with permission we could make it happen. Not surprisingly, our priorities remained unchanged as our 100,000-enrollment goal came from the top.
Today, as MVP recruits the next 200,000 enrollees and the first samples get queued up for processing, MAVERIC is transitioning the hosting of the MVP recruitment and enrollment systems to the VA's Office of Information Technology. There, it will be dressed with the proper fittings for a production system of its age. Like a nervous parent I'll watch from a distance, hoping we raised it right and trusting that that's enough. I'll be forever grateful for the opportunity and thankful that we were able to avoid making the wrong kind of headlines. But frankly, I'm pretty excited about becoming an empty nester.
A most sincere thanks to visionary, mentor, and friend Dr. Louis Fiore and the MAVERIC team for making so many interesting projects possible. The views expressed in this article are the author's and not those of the Department of Veterans Affairs.

Wednesday, February 20, 2013

6 Questions to Guide Natural Language Processing Strategy

Originally appeared in InformationWeek Healthcare on Feb. 18, 2013


The success of the IBM Watson team and their automated question-answering system on Jeopardy opened the eyes of many to the potential of computerized “understanding” of human language. The victorious IBMers were quick to clarify that their goal isn’t to create a robot army of game show champs but to apply these technologies to tackle important societal problems — starting with healthcare.

They couldn’t have picked a better target. By some estimates 70% of all clinically useful information is formatted as unstructured free text. And thanks to a gradual shift toward pay-for-performance and fee-per-patient models of reimbursement, healthcare organizations are finally getting serious about using their data to realize efficiencies. With so much useful information captured only in clinicians’ narratives, hospital administrators and policymakers are grappling with what to do with all that free text.

Natural language processing, or NLP, is software designed to turn unstructured free text into structured values. The hope is that NLP can help answer doctors’ questions at the point of care much in the way Watson responded to Alex Trebek. NLP ideally would automatically populate important variables in patient registries or as part of quality-improvement initiatives. Effective NLP could play a critical role in finally answering healthcare’s most obvious and important unanswered questions: What are we doing, to whom are we doing it and is it working? All that leads to this question: Is natural language processing ready for wide use?

Vendors beyond IBM have recognized these needs and are responding with NLP-based systems ranging from the automated assignment of billing codes to “enterprise” natural language processing systems. In the Intro to Clinical NLP tutorial I teach at the annual meeting of the American Medical Informatics Association, I’ve noticed a shift in the audience from curious researchers and students to hospitalists and EMR vendors interested in implementing NLP. This exciting shift has led me to explain the fundamentals that people must know about clinical NLP to make decisions about using this technology in their organizations. I exclude important but familiar software-related concerns such as “buy versus build,” product selection criteria and vendor lock-in. Think of these as the NLP-specific factors you need to keep in mind when considering NLP-based systems.

1. How good is good enough?
For over 50 years, researchers have shown that NLP can be applied with high levels of accuracy for any number of tasks, from extracting symptoms, treatments and tests from the texts of medical records to automatically assigning billing codes. So why hasn’t NLP been widely adopted? There are economic reasons, of course, but the nature of the technology itself is largely to blame.

It’s important to recognize that, except in the most trivial of applications, NLP won’t be 100% accurate. Understanding “how good is good enough” is therefore the first question potential users must answer. If you can tolerate accuracy in the high 70% to mid-90% range, you’re in the ballpark for using NLP.


A related question is the estimated prevalence of the “target” you’re trying to understand. For example, are you interested in extracting tumor stage from cancer-related pathology reports, where it should be present in nearly every report? Or are you looking for a rarer target such as evidence of falls in a collection of nursing notes? If the target appears only once for every thousand records, you may have trouble accumulating enough records to “train” a system. You still might be tempted to try out a specialized commercial “fall detection” NLP system for that rare target. But beware — evaluating the system in your own data with so few instances can be equally challenging.

2. What type of system will you use?
NLP systems can take different approaches, such as rules-based or grammar-based. Asking which is better is like judging the utility of a hammer versus a screwdriver. The answer depends on the intended use. This fact has prevented vendors from rolling out a one-size-fits-all product and complicates the selection of an appropriate system. A basic understanding of the different approaches to NLP is helpful in matching the approach to the problem. Below is a gross oversimplification of the different approaches to NLP and their pros and cons.

Rules-based: To extract a measure that consistently appears in records, such as a reported blood pressure, ejection fraction or tumor stage, a simple rules-based approach might be best. These approaches involve searching for patterns in documents. The upside is that they are simple to get started with, because nearly every programming language features some flavor of “regular expressions.” A drawback is that you’re only as good as the rules you’ve defined; the slightest unaccounted-for variation equals a miss. Once you start accounting for all variations of a targeted concept plus negation (e.g., “patient shows no signs of ...”) the number of rules created and maintained can become unwieldy.
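To make this concrete, here is a minimal sketch in Python (the note snippet and the pattern are invented for illustration) of the kind of regular-expression rule such systems are built from; a real system would need many more patterns to cover the ways clinicians actually phrase an ejection fraction:

```python
import re

# Hypothetical snippet of a clinical note, made up for illustration.
note = "Echo performed today. LVEF is estimated at 55%. No signs of edema."

# One simple rule: capture an ejection fraction reported as "EF ... NN%".
# Real notes phrase this many ways ("EF of 50-55%", "ejection fraction 55"),
# and every unaccounted-for variation is a miss until another rule is added.
ef_pattern = re.compile(r"\b(?:LV)?EF\b[^0-9%]{0,20}(\d{1,2})\s*%", re.IGNORECASE)

match = ef_pattern.search(note)
if match:
    print("Ejection fraction found:", match.group(1) + "%")
else:
    print("No ejection fraction matched -- add another rule or accept the miss.")
```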

Grammar-based: A “discharge” from a hospital and a “discharge” from a wound are two very different things. The sheer number of ways a medical concept can be expressed may make it necessary to introduce an understanding of how the expression is used. Grammar-based approaches consider how a term is used in order to map terms to dictionaries of concepts such as the National Library of Medicine’s Unified Medical Language System.

These approaches are helpful in dealing with the complexities of medical language but can be slow to run and often leave the user with long lists of potential matches to sort through (“APC” = activated protein c, aerobic plate count, antibody producing cells, age period cohort ...).
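Here is a toy illustration of the underlying idea -- using the words around a term to decide which concept it refers to. The sentences and the two-entry "concept dictionary" are invented; real grammar-based systems do far more, including parsing and mapping to standard vocabularies such as the UMLS:

```python
# Toy word-sense disambiguation: choose a concept for "discharge" based on
# the words that appear near it. The cue lists are invented for illustration.
CONTEXT_CUES = {
    "hospital discharge (administrative event)": {"hospital", "home", "discharged", "disposition"},
    "wound discharge (clinical finding)": {"wound", "purulent", "drainage", "incision"},
}

def disambiguate(sentence):
    words = set(sentence.lower().replace(".", "").split())
    # Score each candidate concept by how many of its cue words appear nearby.
    scores = {concept: len(words & cues) for concept, cues in CONTEXT_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unresolved -- needs a richer model"

print(disambiguate("Patient was discharged to home from the hospital on day 3."))
print(disambiguate("Purulent discharge noted at the incision site."))
```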

Machine learning-based: Machine learning is a subfield of artificial intelligence that combines math with computational brute force to “learn” patterns. Machine learning approaches are good at finding “most likely” matches, and many algorithms provide a weight or score describing the algorithm’s confidence in the match. Most machine learning algorithms used in NLP are what’s referred to as “supervised” approaches, meaning they learn by example. To find descriptions of pneumonia in free text, supervised approaches rely on having examples of known cases with pneumonia. Anyone using machine learning must therefore be aware of the potential cost of providing those training examples. Systems will blur the lines among these approaches, but most can be safely classified into one of these three categories.
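As a rough sketch of the supervised, learn-by-example style described above, here is a tiny classifier built with scikit-learn (assumed to be available). The handful of labeled sentences is invented; a real system would need hundreds or thousands of expert-annotated notes, and that annotation effort is the training cost mentioned above:

```python
# Minimal supervised-learning sketch: classify notes as mentioning pneumonia.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = [
    "chest xray shows right lower lobe infiltrate consistent with pneumonia",
    "patient febrile with productive cough, starting antibiotics for pneumonia",
    "no acute cardiopulmonary process, lungs are clear",
    "follow up for knee pain, no respiratory complaints",
]
train_labels = [1, 1, 0, 0]  # 1 = pneumonia documented, 0 = not (toy labels)

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_texts), train_labels)

new_note = "persistent cough and fever, infiltrate seen on imaging"
probability = model.predict_proba(vectorizer.transform([new_note]))[0][1]
print(f"Estimated probability of a pneumonia mention: {probability:.2f}")  # the confidence score
```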

3. What’s your process to implement NLP? 
There are two key elements that every NLP implementation process needs in some form: training and testing. Training involves teaching a system the nuances of your data, using a body of data known as a training set. Will the system be trained using your actual data? You might find it acceptable for the training to not rely on your real data if you’re confident that your clinical data is similar enough to the data originally used to train the system. Either way, you can only be confident that a system works in your environment by conducting a proper evaluation using a test set, a set of notes in which you will evaluate the performance of a system.

In NLP testing, accuracy is most often gauged with the metrics of recall, precision and their harmonic mean, otherwise referred to as the F-measure. These relatively simple measures are akin to the more clinically familiar sensitivity, specificity and area under the receiver operating characteristic curve. Of course, “good enough” can be measured in a number of dimensions, from ease of implementation and maintenance to speed of computation to accuracy.
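For readers who haven't worked with these metrics, here is the arithmetic behind them, using invented counts from a hypothetical test set:

```python
# Invented counts from a hypothetical evaluation on a test set of notes.
true_positives = 80    # targets the system found that were really there
false_positives = 20   # things the system "found" that weren't targets
false_negatives = 40   # real targets the system missed

precision = true_positives / (true_positives + false_positives)   # 0.80
recall = true_positives / (true_positives + false_negatives)      # ~0.67
f_measure = 2 * precision * recall / (precision + recall)         # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} F-measure={f_measure:.2f}")
```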

4. What problem are you trying to solve?
Since there is no one “best approach” for clinical NLP, it becomes important to be very clear on the problem you’re trying to solve. The specific use will also dictate the acceptable performance criteria of a system. Is your application intended to contribute to clinical decision support at the point of care? If so, you likely require rapid computation, and you’re probably more interested in presenting information that you’re highly confident in (higher precision) versus capturing all possibly relevant information (higher recall). If NLP is used to find patients with similar conditions, whether for observational studies or registry population, precision may again be important but “real time” becomes less critical. If NLP is embedded in a biosurveillance application, end users may place an emphasis on recall, or casting a wider net to avoid missing any possible outbreaks.

I recommend that anyone considering implementing NLP document the specific goal of the system and acceptable performance in terms of computational time, recall, precision and F-measure before engaging vendors or consultants. Goals such as “facilitate quality improvement” aren’t granular enough to guide decisions.  Goals need to be at the level of “extract tumor stage values from the postoperative pathology reports of patients with prostate cancer.”

5. Have you conducted a walk-through?
There are usually a lot of steps that need to be automated in a production system before IT gets to using NLP. Consider the previous example of extracting tumor stage data. The system user wants the reports narrowed down by several filters: they need to be pathology reports, the reports need to be post-operative and only for prostate cancer. Can you narrow your patient population with a technique easier and more precise than NLP? Are there structured data elements such as standard note titles or ICD-9 codes you can use? Or do you have to use NLP just to figure out which reports to use NLP on (a not uncommon conclusion)? These are all questions that should be answered before thinking too seriously about specific NLP systems.

The best way to begin answering these questions is to conduct a “mock” end-to-end pipeline from raw data in the source system to the NLP software. Start with 10 or so “cases” and be sure to capture every part of the process in which automation will be required.

This is also a good opportunity to get a feel for how the “target” appears in the record. Is your target in only one record type? Is it semi-structured? Is it spread throughout the document or only present in one section — for example, in the “findings” section versus “history and physical”? This will help you determine which type of system might be most appropriate. Keep careful track of the raw documents used in this pilot exercise so you can use them in the next critical aspect of implementing NLP — the evaluation.

6. Who owns the evaluation?
No NLP system should be adopted in production until it has proved itself to be “good enough” in your environment using your data. Here’s why: NLP systems trained to work in one environment may not work as well in a new environment with different clinicians, coders, nurses, note titles and EMR systems. Without a proper evaluation with your own data, you may be blind to certain types of misses that occur consistently, introducing bias into your results. This is especially true of rules-based systems, which are rigid in what they catch or miss.

While the details of a thorough evaluation are the topic of textbooks and scholarly articles, you can apply a few guidelines to be sure you’re getting what you bargained for. “Owning” the evaluation means setting the metrics and acceptable performance criteria in advance and blinding any proposed system and its owners to the “answers” contained in the test set. First, be sure your test set includes a meaningful number of cases. It takes at least 30 for a sample to begin to be representative. When we conduct evaluations, we run a statistical power analysis to determine what number to include. While that might seem academic, depending on the size of the investment and the importance of the accuracy of the system, it may be worth grounding your evaluation in this relatively simple statistical result, which considers sample size and the estimated prevalence of the target. Second, be sure your test set is a mirror image of the environment you intend to deploy the system in. If only one in 100 cases has some record of the treatment you’re interested in finding, don’t conduct an evaluation with 50% containing the treatment and 50% without. Such results will look great on paper and fail in the real world.
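As a rough, back-of-the-envelope stand-in for the power analysis mentioned above, you can estimate how many notes to review so that the test set contains enough positive cases. The expected recall, margin of error, and prevalence below are all illustrative assumptions:

```python
import math

# How many notes must we review to have enough positive cases to estimate
# recall within a chosen margin of error? This uses the standard sample-size
# formula for a proportion; all numbers below are illustrative assumptions.
z = 1.96                 # 95% confidence
expected_recall = 0.85   # guess at how well the system performs
margin = 0.10            # acceptable +/- error on that estimate
prevalence = 0.05        # fraction of notes that actually contain the target

positives_needed = math.ceil(z**2 * expected_recall * (1 - expected_recall) / margin**2)
notes_to_review = math.ceil(positives_needed / prevalence)

print(f"Need ~{positives_needed} positive cases, so review ~{notes_to_review} notes.")
```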

Some vendors may require that your data is shared with them before an evaluation to ensure that their system is properly trained for your environment. That’s fine, as long as the system and its owners remain blinded to the “answers” in the test set ultimately used in the evaluation. Data in the test set shouldn’t appear in the training set — that’s like taking a test with a cheat sheet in front of you.

Answering these questions gives health IT leaders a lot to do. But human language is a tricky thing, full of nuance and double meaning. With a basic familiarity with the nature of the technology and some best practices surrounding its implementation, however, NLP can unlock insights that would otherwise remain buried in our mountains of electronic medical record data. As former Jeopardy champions Ken Jennings and Brad Rutter learned, ready or not, machine interpretation of human language is coming.

Monday, November 5, 2012

Business Intelligence ≠ Healthcare Intelligence



Reprinted from an editorial originally appearing in Health 2.0 News
Dr. G makes her rounds on the medical ward looking for clues. She is worried. There has been an increase in the number of urinary infections related to catheters over the past month and she is not sure why.  Right now, her nurses are going from room to room, taking a manual count of all the patients with catheters in place. They do this every day and tabulate the results onto a spreadsheet. An infection control expert will then review all the positive urine cultures and figure out which came from patients with catheters. That expert will then review the medical records of those patients and decide if each has a true catheter-related urinary tract infection, or CAUTI. It is a tedious and time consuming process. 
These issues have important ramifications for Dr. G as well as the hospital administration. The CAUTI rates get reported on a publicly available website so prospective patients, administrators, and third-party payors can compare “quality.”  Given a propensity for manual error, some ambiguity in definitions leaving the final call open to subjectivity, and political and financial pressure to have low rates, data validity and reliability become suspect.  This scenario plays out in hospitals across the country daily, frustrating doctors and compromising patient care.
The greatest obstacle to measuring and improving the quality of care is the lack of access to quantifiable data describing pathways of care and their outcomes.  Put simply, those responsible for improving healthcare rarely have access to the three most basic and fundamental questions of quality improvement: what has been done, to whom was it done, and did it work? This problem is only likely to get worse as U.S. healthcare reimbursement shifts from pay for service to pay for performance or global payment models in which the ability to measure and improve care has a direct effect on the bottom line.
To remedy this situation, healthcare is turning to “business intelligence” or BI for short.  This class of technologies has helped other industries realize tremendous efficiencies through the use of data warehouses, reporting packages, dashboards of metrics, and analytics.  The premise of learning from the past to improve the future is obviously correct and there is little doubt of the importance of these technologies in helping healthcare address and improve its quality.
Unfortunately, what has been absent in the rush to implement BI solutions is a recognition of the fundamental differences between the nature of business versus healthcare information.  These differences have a significant effect on the designs and approaches needed to deliver meaningful “intelligence.”  For example:
● Large amounts of useful information are stored as unstructured free text – Whereas most business intelligence data is quantifiable (e.g., sales of a department last quarter), studies have found that up to 70% of the information useful in making care decisions is formatted as narrative free text.  Any healthcare intelligence solution that doesn’t provide access to this information in quantitative form is therefore working with, at best, 30% of the available information.
● The heterogeneous nature of medical data – Business intelligence applications are, for the most part, designed to handle continuous or discrete variables.  While some healthcare data is this straightforward, much of it is formatted as one of several modalities of imaging data, signal data (e.g., EKG, EEG), or biological information such as the presence or absence of a protein or gene.  While such data can be discretized or converted to formats amenable to analysis, little is currently offered in the way of tools to facilitate such conversions.
● Temporal resolution of events – Whereas much of the temporal data generated in business intelligence applications adheres to a consistent calendar (e.g., quarters or fiscal years), most healthcare data is relative to events. This requires the ability to consider time relative to events such as the number of days a Foley catheter has been inserted in a patient or the length of a stay in a hospital (see the short sketch after this list).
● Information where it’s needed most – A reasonable expectation is that healthcare intelligence be delivered at the point of care.  In other words, a goal is the facilitation of clinical decision support, not just “off line” analytics.  This implies an infrastructure capable of fast access to data and high performance computing.
● Context is key – Anyone who’s been involved in measuring quality has heard the question “where did that data come from?”  Yet one would be hard pressed to identify a BI product that offers data provenance to the end user.  This matters because very little, if any, of the data stored in the course of caring for patients is intended for secondary uses such as quality measurement.  As a result, understanding the context surrounding information is critical to interpreting its value.
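To make the event-relative time point concrete, here is a small sketch in Python with pandas (the patients, dates, and column names are invented) of the kind of calculation a calendar-oriented BI tool handles poorly -- counting catheter-days from each patient's insertion event rather than from a fiscal quarter:

```python
import pandas as pd

# Invented example: days a urinary catheter has been in place per patient,
# measured from the insertion event rather than a calendar period.
catheters = pd.DataFrame({
    "patient_id": ["A", "B", "C"],
    "inserted": pd.to_datetime(["2012-10-01", "2012-10-20", "2012-10-28"]),
    "removed": pd.to_datetime(["2012-10-09", None, "2012-11-02"]),  # B's is still in place
})

as_of = pd.Timestamp("2012-11-05")
end = catheters["removed"].fillna(as_of)                      # still in place -> count to today
catheters["catheter_days"] = (end - catheters["inserted"]).dt.days

print(catheters[["patient_id", "catheter_days"]])
```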
There has been considerable progress in the development and use of technologies capable of addressing these healthcare-specific needs.  Natural language processing technologies are finally becoming available outside of the labs of their creators. Databases can store ever-greater amounts and formats of data using novel NoSQL data models.  High performance computing clusters have proven capable of serving up data at the speeds required for real-time clinical decision support.  Machine learning algorithms can quickly sort cohorts by disease type or risk categories.  There’s nothing preventing database designers and programmers from recording where the data came from.  However, the application of technology without careful consideration of healthcare’s specific needs will lead to little more than a temporary spike in hospital IT investments and BI vendor stock prices.
If, on the other hand, we build solutions based on the needs of healthcare intelligence, Dr. G and her colleagues may one day have access to a button that can automatically calculate CAUTI rates.  Better yet, it will point her toward the antibiotics, wards, providers, and practices most likely to have swayed the rates in either direction. If all hospitals were to use a similar button, patients and clinicians could have confidence in quality measures and comparison between hospitals becomes possible.  Dr. G, in turn, can get back to what she does best – taking care of patients rather than spending time in front of her computer.

Tuesday, April 3, 2012

Why Sequencing the Genome Shouldn't Explain Disease

Part of the series, Preparing for the Personalized Medicine Revolution.

According to Kevin Davies in his great read, The $1000 Genome, when Nobel Laureate biologist Sydney Brenner was asked what he thought of genomics and personalized medicine, he broke out laughing. He noted that the field of astronomy took a strange turn, splitting into astronomy and astrology.  Astronomy is science.  Astrology deals with our odds of meeting a mysterious stranger because of the alignment of celestial orbs.  Today's so-called personalized medicine is, as he called it, 'genology'.

This isn't so much a scathing indictment of a failed enterprise as a warning against assigning too much credit to a fledgling field.  He's not alone.  A group of scientists put together a policy paper warning of the dangers of getting too far ahead of a science that will yield real fruit over decades, not months.

To understand why this is the case, it helps to understand the type of science that has been mostly used in this space and the role of genes in determining health and disease.

The GWAS

Most genomic science conducted to date looks for correlations between single or, more recently, multiple points in the genome and a binary condition (disease vs. no disease).  The result is the identification of nucleotides (A, C, T, or G) that appear more often in case or control groups.  These types of studies are called Genome Wide Association Studies, or GWAS.  On the right is a typical GWAS graphic for identifying which nucleotide(s) are most significant - the higher the dot appears, the more significant the correlation.
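For the statistically inclined, the test behind each dot is conceptually simple. Here is a toy version in Python with scipy (the allele counts are invented) of the per-position comparison a GWAS repeats across hundreds of thousands or millions of positions:

```python
from scipy.stats import chi2_contingency

# Invented counts of two alleles at a single position, in cases vs. controls.
#                  allele A   allele G
cases_counts    = [1200,      800]
controls_counts = [1000,     1000]

chi2, p_value, dof, expected = chi2_contingency([cases_counts, controls_counts])
print(f"chi2={chi2:.1f}, p={p_value:.2e}")
# In the GWAS graphic described above, this position's dot height would be -log10(p).
```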

In some cases, these studies have yielded signals that, upon further exploration, indicate a strong association between people that have a certain gene and disease.  Folks with the APoE4 gene are more likely to get Alzheimer's disease.  People with a harmful BRCA1 or BRCA2 mutation are more likely to get breast cancer and certain other cancers.  We learn this by finding individual nucleotides which are part of genes, then researching what's known about what those genes do and how they interact with other genes.

And while our relatively early GWAS work has taught us a lot, the GWAS has been blasted for failing to explain disease.  But that's like blaming the hammer for its inability to loosen a screw.  This particular tool is designed to use simple correlations to tell us where to look.  Like a submarine sending pings off into the abyss, the GWAS explores the wide expanse of millions of data points to seek signals that might indicate that further exploration is worthwhile.  You don't expect that ping to return the names and intentions of the captain and crew of an approaching vessel - just to tell you that there's something there of note.  The stronger the signal, the more likely there's something of interest worth exploring.

The Incredibly Complex World of Disease

There's more to the discovery story than the inability of the method to produce an avalanche of new therapies and diagnostics.  The truth is most diseases do not evolve from mutations in our genes, and so sequencing our genomes, whether looking at SNPs or the whole sequence, isn't enough to predict or prevent them (link to a good sci american article on the topic).  Disease, unfortunately, is not so simple.  There's environment, lifestyle and a whole slew of other biological processes such as how proteins behave, the metabolism of substances, and the micro-organisms that live within us and play an important role in how our systems function, including the way we process drugs.  These are the other 'omics soon to follow genomics, with companies scrambling to drop the price of sequencing the RNA and microbiomes of different organisms.

In fact, understanding disease is less like finding one single marker of interest and more like a George Rhoads kinetic sculpture.  Except instead of a dozen or so balls following a complex but fixed path, there are several million pathways and hundreds of billions of balls in motion, and their pathways are determined by dynamic, rather than fixed, rail systems.

Understanding disease and health requires somehow modeling the potentially billions of interactions between these layers of biology and our environment - a task scientists have really only recently begun.  With so many data points to consider and inter-relations at play, scientists capitalize on knowledge bases of known gene, protein, and in some cases metabolomic data, combining intuition about disease with the mathematics of complex networks.  The result is less like the simple chart shown above, and more like a network of Internet activity.

Progress has been made in mapping these networks to understand "disease pathways."  In such work, scientists attempt to model a known disease's pathway across the genes, RNA, proteins, and metabolism (most work of this type takes on genes -> proteins).  These pathways help us move from initial discovery of where to look (thanks GWAS!) all the way to which proteins one might want to target in the design of a therapy.
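Here is a deliberately tiny sketch of that idea in Python with networkx (every node and edge is invented) -- representing a pathway as a graph and walking from a statistical hit to candidate protein targets a few interactions away:

```python
import networkx as nx

# A toy "disease pathway": nodes are genes/proteins, edges are interactions.
# Everything here is invented; real efforts use curated pathway databases.
pathway = nx.Graph()
pathway.add_edges_from([
    ("GWAS_hit_gene", "gene_B"),
    ("gene_B", "protein_X"),
    ("protein_X", "protein_Y"),
    ("gene_B", "protein_Z"),
])

# From the initial statistical signal to candidate targets a few hops away.
for target in [n for n in pathway.nodes if n.startswith("protein")]:
    print(" -> ".join(nx.shortest_path(pathway, "GWAS_hit_gene", target)))
```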

The challenges introduced by this new 'omic era are so significant that rival pharmaceutical companies have begun sharing this pre-clinical data with one another.  This has spawned a new industry, with companies selling data warehousing and analytical services to pharmaceutical companies to help them make the most of their own and others' pre-clinical data.  In effect, drug development has shifted, at least a bit, from a bench science to an information science, where the discovery of meaningful patterns in large data sets is critical to success.

Of course, discovery science and the science needed for clinical validation are two different things.  And as difficult as the above-described discovery work is, at least these challenges can be tackled by improvements in computing, sequencing, and math.  Arguably more difficult is re-engineering the well-established industrial and federal complexes responsible for approving a drug's use in hospitals.  With a price tag of $80 million and several years to bring a single drug to market, discovery will not be the bottleneck to personalized medicine for long.  The shortcomings of using existing models such as the randomized controlled trial (RCT) and the need for alternative approaches are of critical concern...and will be the topic of my next post.

Tuesday, March 13, 2012

Personalized Medicine's Next Frontier: Next-Gen Phenotyping

Part of the series, "Preparing for the Personalized Medicine Revolution"

Within just a year or two there will be devices that cost less than $10k that are capable of sequencing entire genomes accurately for $100 a chip.  So what happens when genomic data is a commodity?  In this second article we take on the other half of personalized medicine - the phenotype.

For the last few years of genomic science a large part of the focus has been on managing data from the genotype, and understandably so.  It's the kind of big data that leads scientists to publish lines like, "For larger genomes, the choice of assemblers is often limited to those that will run without crashing." (link)  The other reason scientists could afford to focus on genomic data is that most genomic science to date relies on the assumption of a binary "phenotype."  In one corner, 4,000 patients with colon cancer.  In the other, 4,000 without colon cancer.  Run a couple million t-tests and voila - a significant difference between those with the disease and those without.

Unfortunately, disease isn't binary.  The irony is, genomics is proving this in ways we hadn't anticipated.  This week in the New England Journal of Medicine, Gerlinger et al. showed that even within the same tumor one can find different genetic signatures.  That means that a biopsy that catches one part of the tumor may show a positive prognosis while a millimeter to the left, not so much.

As we move on from discovering the low-hanging fruit and start understanding what exactly these early findings mean, we need to start treating the phenotype for what it really is - a complex intermingling of a patient's age, environmental exposures, other illnesses, and a more granular description of the disease of interest. So where do we find this phenotype?

Building on a Foundation of Sand

Our most obvious source of phenotype data is the electronic medical record (EMR).  The good news is, thanks to aggressive financial incentives from the US Government, we're seeing enormous growth in the adoption of EMRs (just topped 50%).  The bad news - EMRs are not created to support research.  They are created to allow clinicians to learn what one another did, to provide a legal document of events, and to facilitate reimbursement. This has very real negative repercussions for attempts to learn from the EMR - whether for quality measurement, research, or phenotyping for discovery or decision support.

First, in many hospitals the data needed to create a useful phenotype model is spread across different systems and locations.  For example, radiology reports documenting the growth of a lung nodule live in a PACS system, the clinic notes describing family history of lung cancer are in an EMR system (if you're lucky), and the biopsy report was generated by a third-party lab that was kind enough to store its findings as a scanned PDF.

If one is lucky enough to muster together the relevant source documents, it then becomes necessary to wade through the copious unstructured narrative.  By some estimates, more than 70% of the useful information is stored as free text.  There are tools to try to turn free text into structured data and methods for assessing the accuracy of data elements.  In the tools department, natural language processing (NLP) and machine learning have shown great promise.  Unfortunately, they've been showing that promise since 1967 with little traction outside the labs of their creators.  This is because most approaches to NLP build out libraries of rules, patterns, vocabularies, and so on toward a specific task.  Once validated for that task, in that specific data set, they work wonderfully.  But move them from the very specific use case they were designed for and it's back to square one.

The only widely adopted standards typically found in the EMR are reimbursement related (e.g., ICD-9 codes and CPT codes) and more than a few dozen studies have shown that the accuracy of such codes can be dependent on the disease the patient is being seen for, the department, whether assigned automatically or by humans, whether the patient is inpatient vs outpatient, the position of the moon, etc.  I once collaborated on a research project in which we were asked to pull important quality measures from the post-operative pathology reports of patients with lung cancer, prostate cancer, and colorectal cancer.  As part of that work we discovered that less than 20% of the reports that we identified via ICD-9 code were actually related to the disease of interest.  I was skeptical.  After all, these are the path reports pulled within a pretty specific date range of the appearance of very specific cancer-related codes.  I shared the news with my surgical oncology friend who bet me her career that my numbers were wrong.  Being the paranoid researcher that I am, we revisited each of the several hundred reports.  I'm sorry to say, we were right, she has yet to deliver her career (you know who you are), and I have yet to trust ICD-9 codes since.  Yet these are the foundation of so many of our upcoming pay-for-performance measures...but that's another article.


Phenotype-Driven Decision Support

Solving the phenotype issue is about more than discovering which genetic markers are worth pursuit and development into diagnostic tools or therapeutic targets.  Most EMR-based clinical decision support is built on rules-based protocols - a pretty straightforward series of "if - then" statements.  If the patient is on Plavix, then consider holding off on Warfarin. If the patient is female and over 50, then consider a mammogram.  And so on.  This approach works for some of the earliest discovered biomarkers whose correlation to disease is so strong that rules can be made.  For example, if the female patient is BRCA1 or BRCA2 positive, then consider prophylactic treatment for breast cancer.  But now that we're moving past the low-hanging fruit we're quickly moving into the language of probabilities - distributions of risk that change based on any number of characteristics in the phenotype.  To provide decision support at the point of care that can take advantage of the power of personalized medicine, we need to be able to plug patient-specific parameters into mathematical models that can feed back tailored answers.  If those systems are to have a shot at providing answers in anything near real time, large portions of complex disease models must be pre-computed using access to structured and accurate phenotype data (remember, we're talking about hundreds of variables, not the 5-7 used in most regression-based systems today).
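To illustrate the contrast, here is a minimal sketch (the patient record, the rule, and the risk weights are all invented) of a classic if-then rule next to the kind of probability-style output personalized medicine pushes us toward:

```python
# Invented patient record for illustration only.
patient = {"sex": "F", "age": 62, "meds": {"clopidogrel"}, "brca1_positive": False}

# Classic rules-based decision support: a hard-coded "if - then" statement.
if "clopidogrel" in patient["meds"]:
    print("Rule fired: review bleeding risk before adding warfarin.")

# Personalized-medicine support looks more like a probability that shifts with
# many phenotype and genotype variables (the weights here are made up).
risk = 0.05
risk += 0.10 if patient["age"] > 60 else 0.0
risk += 0.25 if patient["brca1_positive"] else 0.0
print(f"Estimated risk for this patient: {risk:.0%}")
```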

A Way Forward

I'm hopeful that as the reimbursement model shifts toward capitated care models and pay-for-performance, the need for stronger measurement will lead to an outcry for data that can be learned from.  Or maybe it will be oncology departments, about to be crushed by an onslaught of new "tailored" treatments, that will argue that effective care requires effective information served in real time.

We've laid some of the foundation for this work in the VA with large-scale data warehousing.  The NIH's Clinical and Translational Science Awards (CTSAs) have led to similar warehousing efforts in at least 60 academic hospitals. In terms of structuring that data, we've released open source, machine-learning-based tools that can be used by non-technical folks to find "cases like these" based on examples.  An area we're just starting to explore is the use of NoSQL models to store and access phenotype data faster, and pairing that with some pretty cool visualization & analysis tools.  However, it's important to recognize that too many of these efforts are designed to compensate for the real issue.  As long as there is no demand for the EMR to capture reliable, quantifiable data that can be learned from, the best we can hope for is really slick workarounds.




Thursday, March 1, 2012

The $100 Genome: Preparing for the Personalized Medicine Revolution

DNA is the blueprint of life, telling our cells what to become and when to become it.  Today we are making enormous progress in unlocking the health implications of genes, but because of technological and economic barriers, we're looking with blinders on.  To account for our inability to read all 3.2 billion base pairs in the human genome, we have concocted some pretty smart methods of looking at the parts that we think matter.  We look at specific markers that few people have (called single nucleotide polymorphisms, or SNPs) or at a few of the areas that we believe are involved in processes related to specific diseases (gene-based microarray studies).  While these approaches have been fruitful, it is widely believed that affordable and accurate reads of the entire genome (i.e., whole genome sequencing) will open the door to a whole new level of diagnostic and therapeutic discovery.


Like the desktop computer, laptop, cell phone, and tablet, there is a tipping point where high enough quality and low enough price will cause payers and providers to switch from target-specific tests to whole genome sequencing.  Just as today's smart phones make it difficult to justify purchasing separate phones, cameras, and gaming devices, the whole genome contains specific targets (e.g., a BRCA1 & BRCA2 test for breast cancer) as well as everything else one might want to learn from the genome.  And because our genes do not change often, a single sequence can be referenced again and again as new diseases are suspected, new treatments considered, and new clinical findings discovered.  In other words, referencing our genome will become a routine aspect of patient care.

Many posit that $1000 represents the tipping point for whole genome sequencing.  It's quite an ambitious goal, considering that the first sequence of a whole human genome took 13 years and cost $3 billion.  Yet since that first sequence was published in 2001, prices have dropped in unprecedented fashion.  Today there are several companies claiming to be within a year of the $1000 genome, including Life Technologies, Illumina, Oxford Nanopore Technologies, IBM, and Complete Genomics.  Telling of how quickly this field is advancing, there are companies reporting progress toward the $100 genome.  Most notable is Genia, a 10-person start-up that will use a semiconductor, or chip-based, approach to provide highly accurate whole genome sequencing from $8000 machines that use $100 chips.  If they deliver as suggested, these machines and their chips will be in mass production within the year.

Ready or not, the tipping point is about to arrive.  No stakeholder of our healthcare system, from scientist, to nurse, to hospital CIO, to patient, will be unaffected by the change that will soon occur.  Now is the time to think carefully about what it will all mean so that we can make the most of the opportunities.  Entire books could be written about each of these impending areas of change, and by the time they're written they will likely be outdated.  To get the dialog started, here are just a few of the impacts that we need to start preparing for.  Working our way from bench to bedside, here's what I've got so far (more to follow):

Barrier to Personalized Medicine #1: The Free Rider Dilemma

Personalized Medicine's Next Frontier - Next Gen Phenotyping

Why Genome Sequencing Shouldn't Explain Disease - Understanding Discovery Work to Date





Barrier to Personalized Medicine #1: The Free Rider Dilemma

Part of the series, "Preparing for the Personalized Medicine Revolution"

The free rider dilemma is an economics term describing when one portion of a society benefits from the contributions of others without contributing themselves. Drug companies and scientists have long complained of the effects of the free rider dilemma on drug discovery and testing.  Everyone wants access to cutting edge drugs but few want to participate in the research necessary to bring them to market.  Because of the nature of genomic science, this dilemma is sure to stifle the growth of personalized medicine.

Much of today's medically relevant genomic discovery work looks for relationships between specific genomic characteristics, sometimes called the "genotype," and specific patient characteristics, or "phenotype."    These studies are referred to as genome wide association studies, or GWAS.  Most GWASes have looked at diseases that are known to be at least somewhat likely to be inherited.  Since we inherit our DNA from our parents, and we inherit some risk of certain diseases from our parents, the thinking goes that these are the diseases that an inspection of our genes can tell us the most about.

To conduct these studies, a group of subjects is divided into two groups, those with the phenotype of interest (e.g., diabetes, cancer, etc.) and those without.  Then we compare every variable (in this case, specific values in the genome) across both groups to learn if any of these variables are more likely to occur in one group versus the other.  Most studies up to this point have used "SNP chips" that contain around a million positions in the genome.  Because there are so many variables being compared, there's a threat that we could be finding spurious relationships just due to chance.  The way to compensate for this is to require a very strong measure of correlation (the "p-value").  To give you an idea of just how strong, your average clinical trial looks for a p-value of at least p < .05 to consider a correlation valid.  The widely accepted p-value for a GWAS is p < .0000001.  And how are such strong values reached?  With the addition of many more subjects into the study.  As a result, too many studies with thousands of subjects are finding "potential" targets and suggesting that with several thousand more patients, these targets may prove to be meaningful.
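The arithmetic behind that threshold is a simple multiple-comparisons correction: divide the usual significance level by the number of positions tested.

```python
# Why GWAS thresholds are so stringent: a Bonferroni-style correction divides
# the familiar alpha by the (roughly one million) tests run on a SNP chip.
alpha = 0.05
tests = 1_000_000

print(alpha / tests)   # 5e-08 -- the same neighborhood as the threshold cited above
```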

In other words, if we do not overcome the free rider dilemma, we will be limited in what we can learn, which in turn limits the development of new diagnostics and treatments.  There are institution-specific programs in place to try to fill this gap.  Some hospitals such as Dana Farber and Vanderbilt are consenting all patients for access to 'spent' samples and their EMR.  Related efforts such as the eMERGE consortium will combine de-identified data from multiple hospitals in efforts to make genomic science possible.  At the Department of Veterans Affairs, the Million Veteran Program is underway, collecting blood, EMR, and survey data from one million Veteran volunteers for use in advancing genomic science and personalized medicine.

While these are important institutional efforts, the free rider dilemma is a societal problem - not just an institutional one.  All stakeholders of healthcare must be educated as to just what is at stake.  Genomic discovery is literally redefining disease.  Historically, cancers have been named by the organs they are first discovered in (prostate cancer, colon cancer, etc).  And trials are conducted based on these gross definitions.  But now we're learning that there are actually hundreds of types of colon cancer - some of which have more in common genetically with breast cancer!  And depending on the type, they react quite differently to different types of treatments. Dividing a specific cancer into different types means you divide the eligible patient population.  How do you learn which drugs work best on which specific type of cancer? Without a patient population willing to participate in research, you don't.    


What's missing is an honest, open, and widespread dialog with all of healthcare's stakeholders.  The risks of participating must always be center stage in any conversation about research participation.  However, in light of what's at stake, it is time to also include the very real risk of non-participation.

The good news is, when it comes to discovery work, little more is needed from the patient than consent to learn from tissue and blood that will otherwise be destroyed and access to their medical record to better understand their phenotype.  Even better news - with over half of the cancer-related drugs in the development pipeline designed to target specific cancer types, there is huge opportunity for patients to receive "tailored" drugs in clinical trials that may work far more effectively than our current treatments.  The practice of treating based on an individual's biology won't be limited to cancer treatment.  We're already seeing this pattern emerge in all sorts of areas from mental health, to rare infections, to heart disease.  And that's the promise of personalized medicine.