18 June 2017

The unappreciated risks of deploying IT within Complex Systems

Put another way: The real-world patient safety implications of undertaking large-scale change (deploying large IT projects) within a tightly coupled complex adaptive system (the National Health Service).

“Complex systems are not responsive to warnings of unimaginable or highly unlikely accidents”
— Charles Perrow 1981¹

I should prefix this with a brief warning: I am not, and do not pretend to be, an expert on sociology, organisational theory or system safety. However, I do have a fairly good understanding of the above and plenty of personal professional experience within the specific ‘change field’ of IT deployment. I also have domain experience within the NHS.

The Problems
Theory (a skeleton overview)
Worked Example
Conclusion

The Problems

Normal Accidents and Complex Systems

There is a naïve view within the NHS (and other healthcare organisations around the world) that the introduction of IT systems will, among other things, do one or more of the following:

Improve productivity & “patient flow”.²³ “For every hour physicians provide direct clinical face time to patients, nearly 2 additional hours is spent on EHR”
Improve patient outcomes.⁴ “Mortality rate … increased from 2.80% … to 6.57% (36 of 548) after … implementation”
Decrease/Eliminate common prescribing errors.⁴⁵⁶ “…that many of these errors are not caught by CPOE systems in use today”
Never have to be used in a major incident/ eliminate business continuity concerns.⁷ “…problems with health IT can disrupt care delivery and harm patients”
Be loved by users, especially after they have had training and so “actually know how to use the system properly and better”.

Unfortunately reality has a different set of options. There are a number of complex and deeply nuanced reasons for this… but that would be a whole different series of posts. The papers referenced against each of the above ‘outcomes’ show that the majority of deployments do not achieve much from this list and are often left with the exact opposite. One of the often ignored causes of this harm is the well recognised pattern of system accidents that are inevitable in complex systems. Frustratingly, much of this harm can be mitigated in healthcare⁸ by learning from the wealth of published theory.

Lack of Evidence, Scrutiny, Monitoring and Effective Change Management

In modern medicine the introduction of a new drug, treatment or procedure is highly scrutinised and evidence based. There are trials, audits and research studies that evaluate the effectiveness and safety of these changes. However, when it comes to the introduction of IT systems to the clinical environment there is no such scrutiny: there is no requirement for evidence, and there is no ongoing supervision or reporting of adverse incidents.

IT deployments are a change management process. At the risk of veering off-topic, there is a vast amount of theory on how best to manage and implement change, and yet within the NHS there is either wilful ignorance of this amongst IT staff/project managers or (perhaps more worryingly) a complete lack of understanding. The emphasis is always on the project or the software, never on the “product” (providing safe, efficient healthcare) or the end users who deliver it (health care professionals). This is at odds with change management theories (such as Kotter, Lewin or McKinsey), which all focus on the users and how the change will improve the product.

Work as Imagined vs. Work as Done

Mark Graban
@MarkGraban
When the MA takes your temp and weight & writes it on a post it note until she gets to the EMR. Ah the digital revolution. 🤓
5:34 PM · April 17, 2017

This is a classic example of IT for ‘Work as Imagined’ vs. the real world needs of IT for ‘Work as Actually Done’, and it represents technology in the NHS in a nutshell. The focus is only ever on the problem as perceived by committee or management, not the problem as it is in the real world. This is a well recognised problem in the literature³ as a cause for failure of IT deployments to produce ‘expected’ benefits and gains.

Theory (a skeleton overview)

There are many theories encompassing the topic explored in this post. I will try to cover the basics and provide a starting point for further reading (at some point I may expand this section, but for now further research is left as an exercise for the reader).

Human Factors & Swiss Cheese Model

The practice of modern healthcare within the NHS is a complex, tightly coupled and decidedly human process that carries risk.⁹ The most common body of theory in this area is Human Factors (the understanding of interactions between humans and other system elements). The Swiss Cheese Model¹⁰ describes the series of defensive measures inherent in a system to reduce risk and prevent accidents.

Normal Accident theory (NAT)

NAT was described by Perrow in 1981 in his analysis of the events surrounding the accident at Three Mile Island.¹ He described how accidents are inevitable in extremely complex systems: given the nature and tightly coupled complexity of such systems, multiple failures that interact with each other will occur despite best efforts to avoid them.

“In normal accidents the particular trigger is relatively insignificant; the interaction is significant”
— Charles Perrow 1981¹

High Reliability theory (HRT)

HRT is the antithesis of Normal Accident theory. It attempts to describe the factors that contribute to reliability within an organisation.¹¹ The premise is that accidents are avoidable in complex organisations if enough planning and preparation is performed. Briefly, it describes four organisational characteristics that limit failure:

Prioritisation of both safety and performance.
A culture of reliability that is decentralised and paradoxically centralised to allow decision making authority to be available where needed.
A learning organisation that learns and changes following accidents, incidents and near-misses.
Redundancy that extends beyond technology to behaviour.

More recently there has been work on attempting to view NAT and HRT as complementary rather than competitive.¹² This has focused on different temporal perspectives and on how reframing the analysis of NAT within an open system can address the non-falsifiability of the theory.

Safety-I and Safety-II

Other relevant reading material includes the excellent White Paper by Professor Erik Hollnagel “From Safety-I to Safety-II”¹³ which puts into context NAT and HRT in a very approachable and healthcare specific manner and introduces the concepts of Safety-I and Safety-II. Two brief quotes for definitions as follows:

“Most people think of safety as the absence of accidents and incidents (or as an acceptable level of risk). In this perspective, which we term Safety-I, safety is defined as a state where as few things as possible go wrong.”
— Hollnagel et al. 2015¹³

“Safety management should therefore move from ensuring that ‘as few things as possible go wrong’ to ensuring that ‘as many things as possible go right’. We call this perspective Safety-II; it relates to the system’s ability to succeed under varying conditions.”
— Hollnagel et al. 2015¹³

Complex Adaptive Systems

Wikipedia does a fine job defining and describing Complex Adaptive Systems as follows:

A complex adaptive system is a system in which a perfect understanding of the individual parts does not automatically convey a perfect understanding of the whole system’s behaviour.

They are complex in that they are dynamic networks of interactions, and their relationships are not aggregations of the individual static entities, i.e., the behaviour of the ensemble is not predicted by the behaviour of the components. They are adaptive in that the individual and collective behaviour mutate and self-organize corresponding to the change-initiating micro-event or collection of events.

Worked Example

A patient’s discharge from hospital / Electronic test requesting and reports

This example will examine the unintended impact of the introduction of (another) system to view results and of electronic test requesting. The detailed background is beyond scope; the salient points needed for sufficient context are:

The new system for viewing lab test results introduced the new functionality of having insight into when a specific sample had been received by the laboratory and analysis was ‘in-progress’.
The new electronic test requesting system has a poorly designed ‘Directory’ of available tests not consistent with day-to-day practice or common sense.

Events unfold as follows. This scenario is fictitious but incorporates a number of real examples and situations that have resulted from the deployment of a new IT system.

1. Decision to discharge.
The decision to discharge the patient was taken and specific criteria were set to enable a safe discharge: serum potassium improving following changes in medication.
2. Organisation and planning discharge.
Hospital transport is booked, and specific start times for packages of care are organised to coincide with the patient’s arrival home.
3. Planning blood test.
The Junior Doctor requests the necessary blood test to monitor the serum potassium at the end of the working day (as it is not time efficient to access a computer, log in and follow the laborious process of requesting tests).
As they are unfamiliar with the online system and/or are tired, they forget that, unlike ‘traditional’ paper requests where they would simply write “U/E”, the online directory lists this specific test as “Electrolyte Profile”. When they select, by habit, the first presented test after typing “U”, the wrong test is requested: “Urea”, a single analyte, and not the potassium needed to support the discharge.
4. Blood sample taken.
The blood sample is collected from the patient in the early morning so results will be available by the planned discharge time.
5. Results reviewed.
The team note that the laboratory has only tested serum urea rather than the full electrolyte profile. A manual “add-on” request is sent to the lab.
6. Results reviewed, again.
There are still no results for the remaining electrolytes. Remembering that the new results system displays status information, the Junior Doctor opts to use this system to check on the status of the “add-on” test, as the ‘traditional’ method of ringing the laboratory and navigating a time consuming and patronising set of interactive voice prompts would be unproductive. The results viewing system confirms that the additional tests are “in-progress”.
7. Patient misses arranged hospital transport slot.
8. Results reviewed, for the third time.
There are still no results for the remaining electrolytes. The Junior Doctor decides to spend the time being on hold and navigating the IVR to contact the laboratory for an update. Perhaps there has been an equipment malfunction or emergency? After waiting to get through, the laboratory staff report that the results should be available as they have already been released.
9. Attempt to re-publish the results to the report system.
This fails. The results are verbally released.
10. Results reviewed.
The potassium has improved following the medication changes and the patient can be safely discharged.
11. Patient discharged.
Taken home by the next available hospital transport.
Due to the later than expected discharge, the package of care is unable to meet the patient on arrival, so the patient is returned to hospital.

In the above example events conspired to result in a delayed discharge, which is itself patient harm. Equally, the circumstances could have been different: this could have simply been a set of routine blood tests that picked up a dangerously high potassium. Similar events could have conspired and ultimately, due to the lack of a timely result, the patient could have died. You are probably thinking that there are other fail-safes (slices of the Swiss Cheese) to prevent this, and you would be right, but it is disturbingly becoming more routine (possibly due to workload or other reasons) for abnormal results to be highlighted by the lab a considerable time (4+ hours) after they are “published” to reporting systems.

Let us now explore the above example and apply the theory to learn more about how this theoretical scenario unfolded, with the aim of understanding where different approaches to deployment could have prevented or reduced the risk of harm:

The chain of events leading to harm begins in step 3. The root cause (though it would be unfair to blame one single event, see Perrow’s NAT¹) is that it is difficult and time consuming to request blood tests on the new requesting system. This in turn led to the Junior Doctor making the decision (logical, and only recognisable as unwise with specific human factors training to make them more aware of their limitations) to request all tests at the end of the day, when they are most tired, distracted and prone to make errors. This ultimately leads to the “wrong” test being requested.
The second significant inflection point is in step 6. If the virtues of the new reporting system had not been explained to the Junior Doctor, they would have rung the lab at this point and the error with the lab systems would have been identified earlier. The other contributing factor is the time consuming and patronising set of interactive voice prompts. This system is in place to serve a purpose but, much like the IT deployment we are analysing, it had unintended consequences.
The final error/accident is the failure of the laboratory systems to publish released results. If this had not happened the scenario would have been cut short.

The above are the three separate occasions when the slices of the Swiss Cheese¹⁰ could have prevented the ultimate outcome. It is also clear that there was no single event that caused this scenario to evolve; instead it was a series of minor errors that combined to form this one event and outcome, exactly as described by Perrow’s NAT.¹ Attempting to apply HRT, it is clear that NHS organisations aspire to the four main characteristics that limit failure. Unfortunately, for a number of complex and deeply nuanced reasons (a whole other series of posts), in reality it is rare that any of the four are applied in practice. Similarly there is almost no application of Safety-II; the emphasis is almost exclusively within the Safety-I frame of reference.

For a specific lesson from this scenario: the introduction and design of the new test requesting and reporting system did not make a meaningful attempt to understand the system beyond ‘Work as Imagined’ to encompass ‘Work as Actually Done’. If it had, there may have been a realisation that being able to request tests quickly and effortlessly is a real requirement. Further analysis of the whole system may have highlighted that a consequence of ignoring this requirement is exposure to an increased risk of errors due to human factors, as described above.
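The directory mismatch in the scenario (typing “U” and accepting the first suggestion) can be illustrated with a minimal sketch. Everything here is hypothetical: the test names, the alias table and both function names are invented for illustration and are not taken from any real requesting system. The first function searches only the official directory names (‘Work as Imagined’); the second also matches the shorthand clinicians actually write (‘Work as Actually Done’).

```python
# Hypothetical test directory: names are illustrative only,
# not taken from any real requesting system.
DIRECTORY = ["Electrolyte Profile", "Full Blood Count", "Urea", "Urine Culture"]

# Hypothetical aliases reflecting the shorthand used on paper request forms.
ALIASES = {
    "U/E": "Electrolyte Profile",
    "U&E": "Electrolyte Profile",
    "FBC": "Full Blood Count",
}


def suggest_as_imagined(typed: str) -> list[str]:
    """Prefix match against official directory names only."""
    prefix = typed.casefold()
    return [name for name in DIRECTORY if name.casefold().startswith(prefix)]


def suggest_as_done(typed: str) -> list[str]:
    """Match clinician shorthand first, then fall back to official names."""
    prefix = typed.casefold()
    from_aliases = [
        target
        for alias, target in ALIASES.items()
        if alias.casefold().startswith(prefix)
    ]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(from_aliases + suggest_as_imagined(typed)))


if __name__ == "__main__":
    print(suggest_as_imagined("U"))  # official names only
    print(suggest_as_done("U"))      # shorthand-aware suggestions
```

With this toy data, a tired user who types “U” and accepts the first suggestion gets “Urea” from the first function and “Electrolyte Profile” from the second. Ranking shorthand matches first is itself a design assumption that would need checking with actual end users.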

Unfortunately reality is far more complex than this example hints at. The test requesting and reporting system has many users and stakeholders, each with different requirements which have influenced its design and deployment; a whole post would barely scratch the surface in exploring the significant interactions and complexities of this. This example is more of a taster to show a whole new set of underappreciated risks.

Conclusion

I would suggest that fundamentally you should not be deploying IT projects - novel or not - in healthcare (especially not at scale) if you:

Don’t even have insight into the inherent system risks or any of the theory touched on in the post.
Don’t actually understand how a system works (not just how it is designed to work - see IT for work as imagined vs. real world needs of IT for work as actually done).
Don’t have effective (and tested) systems in place to monitor for and measure adverse consequences.^b
Are knowingly not addressing basic system requirements in order to meet deployment deadlines or targets. Safety trumps all; as outlined in this post, basic requirements can have a bigger impact than may be first apparent.

A cautionary lesson (and aside): if you believe that well designed, validated and redundant systems within a well understood system are infallible when you remove the human component, picture this. An aircraft with multiple redundant and validated systems enters an unexpected failure mode where inbuilt flight envelope protection systems cause the plane to violently pitch down and hurtle towards the ground, completely unresponsive to the pilots’ attempts to save the aircraft. Well, this happened to Qantas Flight 72; there have been some lessons learned but no root cause identified.^a

My recommendations would therefore be:

At a minimum read: Normal Accidents by Charles Perrow and the references on this post.
Establish out-of-band organisational/system monitoring^b to analyse impact of deployment and pick-up warning signs of failure — a slightly different recommendation than that made by Meeks et al.¹⁴
Teams leading deployment and training should be experienced in:
1. Human factors
2. The (actual not theoretical) workflow of the system within which they are deploying.
TALK TO YOUR ACTUAL END USERS (not just their managers and service leads) to understand how a system works and what impact IT will have end-to-end. But for this to be meaningful keep in mind:

John Watson
@dr_lungs
Replying to @dr_lungs @DrUmeshPrabhu and @doctorcaldwell
Trouble is nobody asks the FY1s and if they do they're too nervous/polite to authority to say what they really think
9:18 PM · May 21, 2017

As ever I welcome any feedback, comments, criticism or suggestions on what I have outlined.

Notes:

[a]: Technical side note/rabbit hole for the curious: Airbus flight control law failure modes don’t account for this failure. You can only enter Direct Law (no envelope protection) or Alternate Law 2 (no high AoA or high speed protection) if two ADIRUs fail, not if there isn’t parity across all three ADIRUs (which also isn’t/wasn’t a cause to fail out an ADIRU).

[b]: And no, existing “Incident Reporting” systems such as DATIX are not the type of monitoring I am talking about. In the current climate (and for that matter the culture of many organisations) in the NHS these systems record a metaphorical scraping of the tip of the iceberg. Dr Gordon Caldwell has tweeted about some of the reasons behind this. The type of monitoring that should be aspired to is something similar to the Neuraxial Connector Surveillance programme that is being run in Wales. There needs to be minimally obtrusive (or ideally completely unobtrusive) baseline monitoring before, during and after deployment in order to pick up issues. Again, there could be a whole series of posts on the topic of effective monitoring.
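As a rough illustration of the baseline monitoring described in note [b], the sketch below compares one metric before and after a deployment. It is a sketch under stated assumptions only: the metric (minutes from test request to result being viewed), the sample values, the function names and the 25% review threshold are all invented for illustration and are not drawn from any real programme.

```python
from statistics import median

# Invented sample data: minutes from test request to result being viewed.
BASELINE_MINUTES = [60, 65, 70, 62, 68]      # collected before deployment
POST_DEPLOY_MINUTES = [90, 85, 200, 95, 80]  # collected after deployment


def turnaround_shift(baseline: list[float], current: list[float]) -> float:
    """Relative change in median turnaround versus the baseline."""
    base = median(baseline)
    return (median(current) - base) / base


def needs_review(
    baseline: list[float], current: list[float], threshold: float = 0.25
) -> bool:
    """Flag the deployment for review if the median has worsened by more
    than `threshold` (an arbitrary, hypothetical cut-off)."""
    return turnaround_shift(baseline, current) > threshold


if __name__ == "__main__":
    shift = turnaround_shift(BASELINE_MINUTES, POST_DEPLOY_MINUTES)
    flagged = needs_review(BASELINE_MINUTES, POST_DEPLOY_MINUTES)
    print(f"median shift: {shift:.0%}, review needed: {flagged}")
```

The median is used rather than the mean so that a single extreme delay does not dominate the comparison. A real monitoring programme would need far more than one metric and would need the baseline collected unobtrusively, as the note argues.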

“The worst pain a man can suffer: to have insight into much and power over nothing.” ― Herodotus

Published: 18 June 2017
Last Update: 03 April 2023
