What Health Data Science and Raising Chickens Have in Common

April 29, 2019

In my most recent blog post, I wrote about Susan Jones*, a 65-year-old patient and Medicare recipient, and described how a multidisciplinary team at Duke constructed an AI-powered workflow to help patients like her.

Building that workflow was a laborious, 18-month-long process. We assembled a band of explorers comprising clinicians, nurses, pharmacists, machine learning experts, statisticians, and informaticists, and blazed a path into an only partially understood data wilderness. Clearing this path and developing the machine learning logic to navigate it allows us to predict unplanned hospitalization 1.6 million times a month.

This is the state of data science in health: our current electronic health record systems were never designed for data science, which requires us to “hack” a system that is optimized for billing and documentation to present data in a manner appropriate for statistical analysis or machine learning. This means that each new data science use case is a new expedition into the unknown, with new trails to be blazed. This is a far different state of affairs than we see when Google Assistant tells you about the traffic on the way to work, or Amazon suggests a purchase—where we’re still clearing undergrowth and breaking dirt paths, the tech giants have completely mapped and built modern data highways because they’ve truly become “data science first” companies.

If we are to make Duke a place that will build robust, rapidly testable, and iterative data science and artificial intelligence for the benefit of our patients and clinicians, we must quickly acquire the culture and technical underpinnings to make this feasible.

One example of the outlook needed can be found in a recent white paper by Jonas and colleagues titled Cloud Programming Simplified: A Berkeley View on Serverless. This report, from a group of authors in the University of California, Berkeley’s Department of Electrical Engineering and Computer Sciences, is the follow-up to a seminal 2009 white paper titled Above the Clouds: A Berkeley View of Cloud Computing.

The paper is a 33-page tome, but it lays down a marker that we can’t ignore. A couple of weeks ago, one of our lead data scientists said that what he wants is “Lambda for machine learning at Duke,” and I absolutely agree—this paper provides a lot of the necessary background to understand this desire.

And what on earth is “Lambda”? It’s a new “serverless” approach to computing created by Amazon Web Services (AWS). With Lambda, you simply issue commands to a “computer in the sky” without having to worry about any of the underlying infrastructure. We use an analogous capability, also provided by Amazon, every day: when you tell Amazon you want a Lego Millennium Falcon kit, it arrives on your doorstep without you needing to attend to the details of pulling it from a particular shelf in a particular warehouse, boxing it, selecting the shipper, and finding the most efficient route to your house. These details are all abstracted away. Lambda performs a similar task, but in this case it abstracts away the underlying server hardware, networking, load balancing, operating system, and operating system libraries.
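To make this concrete, here is a minimal sketch of what a Lambda function can look like in Python. The lambda_handler(event, context) entry point is the standard convention for Python Lambda functions sitting behind an HTTP front end; the risk score and the "age" and "prior_admissions" fields are hypothetical stand-ins for a real model.

```python
import json

def score_patient(features):
    # Hypothetical stand-in for a real trained model's inference step.
    age = features.get("age", 0)
    prior_admissions = features.get("prior_admissions", 0)
    return min(1.0, 0.002 * age + 0.05 * prior_admissions)

def lambda_handler(event, context):
    # AWS invokes this entry point on demand; the server hardware, OS,
    # scaling, and load balancing behind it are all abstracted away.
    patient = json.loads(event["body"])
    return {
        "statusCode": 200,
        "body": json.dumps({"risk": score_patient(patient)}),
    }
```

Notice what is missing: there is no server to provision, patch, or monitor anywhere in this file.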


Here’s another analogy.

Last year, while I was reading my children Laura Ingalls Wilder’s Little House… series, I came across what struck me as a great lesson for my kids. After yet another move west, the Ingalls family had started raising chickens. The passage offered a remarkable object lesson in just how much work it took for the family simply to be able to eat a chicken:

  • The Ingalls had to barter for a rooster and hen.
  • They had to build a coop to protect the rooster and hen.
  • They had to feed their pair.
  • The hen had to lay and hatch eggs.
  • The Ingalls had to feed, protect, and grow the chicks to adulthood.
  • In the meantime, the family could finally eat eggs.
  • It was many months before enough chickens had reached maturity for the family to even consider eating one.

Caring for a flock of chickens is clearly a lot of work—work that’s further complicated by surprise blizzards, wolves, and other tribulations. It’s also a good analog for cultivating a bunch of servers to execute data science workloads:

  • You start small.
  • It takes time to reach critical mass.
  • There are multiple predators, environmental threats, and problems with your henhouse and coop along the way before you can even start eating.

To continue our chicken analogy, using “serverless” approaches in the context of data science is roughly equivalent to the modern experience of cooking an excellent chicken cacciatore. To get a good result, you merely have to drive to the local grocery store, pick up a whole, plucked, and cleaned organic free-range chicken, combine it with ingredients also bought conveniently at the grocer, and cook it using modern appliances. Your recipe is your “Lambda function.”
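In code, “cooking the recipe” amounts to invoking the deployed function and reading back the result. Here is a hedged sketch using boto3, the AWS SDK for Python; the function name score-patient and the payload fields are assumptions for illustration only.

```python
import json

import boto3

# Invoke a deployed Lambda function by name ("score-patient" is
# hypothetical) without knowing anything about the server it runs on.
client = boto3.client("lambda")
response = client.invoke(
    FunctionName="score-patient",
    Payload=json.dumps({"body": json.dumps({"age": 65, "prior_admissions": 2})}),
)
print(json.loads(response["Payload"].read()))
```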

Good data scientists and their collaborators have great recipes and are great cooks. However, asking them to also be farmers is a heavy lift and not the best use of their time. And just as most of us would be at a loss raising livestock on the South Dakota frontier in the late 19th century, there is a good case for leaving the server wrangling to those with appropriate resources and skillsets.

In other words, serverless allows us to focus on cooking, not animal husbandry.

Modern machine learning is an intensely data-hungry discipline that will require flexible, scalable, secure, and robust infrastructure to facilitate training models and deploying ML inference—especially in a healthcare setting, where a failed workflow for population stratification can have far more serious consequences than neglecting to dispatch advertisements to a cohort of browsers.

Lambda functions are relatively new, and there are undoubtedly kinks still to be worked out, but new features are being added almost daily, and the direction the wind is blowing is clear.

Having been an alpha tester for Google’s Big Science engineering team and a beta tester of what ultimately became the Google Cloud Platform, I think it’s fair to say that the engineering and security teams at GCP, AWS, and Microsoft Azure are really good at industrialized data and compute agriculture. No one outside of big tech can compete at that scale.

My hypothesis is that for us to be really productive at actionable data science and machine learning, we need to commoditize infrastructure as much as possible. This will free us to focus on the high-value work where our methodological and clinical expertise will shine.


*A pseudonym, but a real patient.
