The concept of “GIGO”–Garbage In, Garbage Out–has been around almost as long as computer programming itself.
GIGO is the idea that, no matter how well written and definitive a computer program or algorithm is, if you feed it bad data the resulting output will be “bad”–i.e., have no useful meaning or, at worst, misleading meaning.
Nothing surprising here–as programmers we are well aware of this problem and often take great pains to protect an algorithm implementation against “Garbage In”.
It’s not possible to protect against all such cases, of course, human nature being what it is.
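As a minimal sketch of what such protection can look like, here is an input check for a routine that averages temperature readings. The function name and the plausibility bounds are my own assumptions, chosen only for illustration.

```python
import math

def mean_temperature_c(readings, low=-90.0, high=60.0):
    """Average a list of Celsius readings, rejecting obvious garbage.

    The low/high bounds are assumed plausibility limits for illustration.
    """
    clean = []
    for value in readings:
        # Reject non-numeric values (bool is excluded because it subclasses int).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"non-numeric reading: {value!r}")
        # Reject NaN, infinity, and values outside the plausible range.
        if not math.isfinite(value) or not (low <= value <= high):
            raise ValueError(f"implausible reading: {value!r}")
        clean.append(float(value))
    if not clean:
        raise ValueError("no readings supplied")
    return sum(clean) / len(clean)
```

Checks like these catch malformed values, but they cannot catch data that looks plausible and is simply the wrong data for the question being asked.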
Which brings us to the story behind this blog posting: the improper use of Generative AI to “make decisions” in ways that are impactful in the most damaging ways.
The starting point for this story: the state of Iowa in the United States is one of several states that have recently passed laws aimed at protecting young students from exposure to “inappropriate” materials in the school setting.
The Gazette (a daily newspaper in Cedar Rapids, Iowa) has the story of a school district in its area that has chosen to use AI (Machine Learning) to determine which books may run afoul of this new law.
Their reasoning for using AI? Assistant Superintendent of Curriculum and Instruction Bridgette Exman told The Gazette that it was “simply not feasible to read every book…”
Sounds reasonable, right?
Well, the school district chose to generate the list of proscribed books by “feeding it a list of proscribed books [provided from other sources]” and seeing if the resulting output list presented “any surprises” to a staff librarian.
See the problem here? As noted in a blog about the news story:
It appears that people who don’t understand how to use Machine Learning misused it–GIGO?–and now have a trained AI that they think will allow them to filter out inappropriate books without having a human read and judge them.
This misuse of AI/ML is not uncommon–we’ve seen cases where law enforcement has trained facial recognition programs in a way which creates serious racial bias, for instance.
We, as IT professionals, need to be aware of and on the lookout for such misuses, as we are in the best position to spot such situations and understand how to avoid them.
The simple definition of entropy–the reality is much more complex–is the state of order (or disorder) of a system. It can also be described as the state of information embedded in a system: lower entropy means more information.
An example often used to explain entropy is that of a system that starts as an ice cube. An ice cube is a highly ordered state of water–the individual water molecules are arrayed into a regular, repeating pattern of crystals. This pattern can be easily described. Each molecule is locked in place with no freedom to move. It is highly ordered.
Apply heat to the ice cube. It melts. The individual molecules are now free to move in the resulting liquid–water–and so the overall pattern can no longer be easily described. It is more disordered.
The water, in going from solid to liquid, has increased its entropy. The application of heat has made this change possible.
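The ordered-versus-disordered contrast can be sketched numerically using Shannon entropy. This is a related but distinct measure from thermodynamic entropy, so treat it as an analogy only. The two distributions below are invented for illustration: an “ice-like” system in which one state is certain, and a “water-like” system spread evenly over eight states.

```python
import math

def shannon_entropy(probabilities):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # States with probability 0 contribute nothing, so they are skipped.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

# Hypothetical distributions over 8 possible states.
ice_like = [1.0, 0, 0, 0, 0, 0, 0, 0]  # ordered: one state is certain
water_like = [1 / 8] * 8                # disordered: all states equally likely

print(shannon_entropy(ice_like))    # ordered system: 0 bits
print(shannon_entropy(water_like))  # disordered system: 3 bits
```

The fully ordered case needs no bits to describe which state the system is in, while the evenly spread case needs three.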
Of course, we can move in the opposite direction: we can remove heat from the liquid water to return it to its highly-ordered, low entropy state.
The universe, as a whole, moves from a state of low entropy to high entropy–stars are running down, galaxies collapsing.
(For an interesting sci-fi take on this concept, read “The Last Question” by Isaac Asimov.)
Only in small, local environments–the freezer compartment of your refrigerator, for instance–can the general trend towards increased entropy be reversed.
To summarize one version of entropy: a low entropy system contains more information than a high entropy system.
What does this have to do with Information Technologists like ourselves?
We are agents of entropy change.
Think about a content delivery system that we might be developing. Certainly there are many ways to describe the purpose of the system–to deliver data to the end-user, to allow new concepts to be generated, and the like.
All of those purposes can be summed up in one simple description:
We have created a system that permits a local decrease in entropy by adding and collecting information. We can create these systems to be used by anyone, anywhere in the world, to increase knowledge and thereby decrease entropy.
We can work against the general trend of the universe. This is an amazing power to hold. We can use it–or allow it to be used–for good or evil purposes.
Artificial intelligence has exploded upon the world in the form of the generative AI chatbot known as ChatGPT.
Only five days after its launch to the general public it had garnered one million users, far outpacing the uptake–at least by that metric–of any other program or social media system introduced since the dawn of the Internet.
And that amazing pace of uptake has not slowed. By the end of the second month after its introduction, it had shot up to 100 million users.
Others have spent much blog space on examining the why and how of the generative AI revolution that seems to be taking place. And much of that narrative extols the transcendent possibilities of the future of humanity in partnership with this new form of machine intelligence.
I want to take a somewhat different view here–one that is more admonitory and intentional in nature.
As is the case for any new technology we, as IT professionals, will be one of the cheerleading groups for generative AI use more widely in society–though it appears that little help is needed there.
It is also incumbent on us to serve the role of technology guardian on behalf of the society we inhabit. Most users of this new technology will not have the in-depth knowledge we have about its shortcomings, and so cannot make fully informed judgements about its safe and proper use.
Some technology experts have warned of apocalyptic and even existential crises attendant upon the widespread use of ChatGPT and similar technologies. This is well and good–we need adverse voices to make us aware of potential problems to society.
I want to point out another pitfall that appears to await us as we rush to the use of generative AI: the fact that, in one way, generative AI seems to mimic humans all too well.
We, and they, are able to lie with sincerity and authenticity.
If we treated AI with the same sense of skepticism with which we treat other humans–who we are aware harbor the same darker impulses that we are capable of–this would not be a major issue.
But, interacting with AI, we seem to be more willing to suspend this skeptical viewpoint. This seems natural as we do not have the same belief in machines failing, and there are few non-verbal clues we can rely on to determine veracity.
This is made worse by the fact that ChatGPT’s goal is to mimic human behavior and language, and can do so with astonishing ease and rapidity.
So, we are led to consider a new “threat” from ChatGPT: that it can appear to provide definitive and truthful answers that can be taken at face value. And in some cases, those deceptive answers can do great harm.
Does this mean that we need to call an immediate halt to the widespread use of ChatGPT as some groups have already done? For instance, Italy has already banned the use of ChatGPT. Legislation has been introduced in the US Congress to regulate its use (interestingly, the legislation itself was written by ChatGPT).
I think banning or severely restricting may be a step too far. Pausing may be a better step to take as we grapple with the downsides of this new technology.
Even that, however, may be seen as too much.
I would like to suggest another alternative: that we use our unique position as IT leaders and thinkers to cultivate in our clients, our friends, and ourselves a healthy sense of skepticism about the trustworthiness of this new tool.
Much like most of us already do with social media, we need to critically examine the claims that generative AI makes when we interact with it. ChatGPT and the like are only as good as the people who train them and the material chosen for that training.
ChatGPT is not an infallible Oracle of Delphi. It’s a tool, trained by humans to interact with humans in a “human” manner.
As my wife and I were watching coverage of the Orion capsule’s return at the end of NASA’s first Artemis mission, I mentioned to her that at the end of the Apollo program in the 1970s I never imagined it would be half a century before we returned to the Moon.
After a pause she asked a question: “Did this program use any of the hardware of the original Apollo program?”
I was a bit taken aback–I often forget that people who are not space enthusiasts like me wouldn’t know such things–but told her that this was all new hardware, and that the original Apollo hardware and their designs were long gone.
Which reminded me of a trope that is common when it comes to the Apollo program–the myth of “Lost Technology”.
What is “Lost Technology”?
The definition often used by “The Lost Technology of XXX” TV programs is any process or product produced in the past that we no longer understand and do not have the original process to reproduce.
Now, in strict terms this may be true in a few cases. We do not know how Damascus Steel was produced in the Near East beginning in the 3rd century CE, a process that was no longer in use by the early 19th century CE. Does this mean this technology is lost?
Modern artisans have produced an equivalent to Damascus Steel, so while we do not know how the ancients produced it, we can make its replacement today using modern processes and materials.
Does this mean the production of Damascus Steel is a “Lost Technology”? Yes, in the sense that we do not know how it was originally produced. No, in the sense that we can produce its equivalent today, but using different techniques.
I would argue that this definition of “Lost Technology” has little useful meaning. While there is certainly value in knowing how ancient civilizations accomplished a specific task or produced a specific product, the fact that we can use modern techniques to accomplish the same outcome says we never lost the ability to create the end-product.
What we did lose was the institutional knowledge that the technology depended on in its original form.
(I would add to this definition that some knowledge was never written down to keep it secret–this seems to have been the case for Damascus Steel.)
This is what happened with the Apollo program processes and designs. While we still have many of the original designs in blueprint or document form, the institutional knowledge is almost completely gone–those who had it are no longer with us, and the few that are still around probably can no longer remember.
So, is the Apollo project technology “lost”? In a very narrow sense, yes. We can no longer produce a Saturn V rocket in the same form that it existed 50 years ago–we don’t have the skilled craftsmen who could, for instance, do the hand-drilling of the rocket engine injector baffle plates or hand-weld the propellant piping seams.
But this is where the definition of “Lost Technology” becomes meaningless.
Why, with the knowledge and processes advanced by 50 years, would we want to try to produce the same rocket engines in the same way it was done then? We can do far better with what we have learned since then, with the systems we now have.
Except for those–I am one–who would love to see that lovely old beast back in operation for one more flight, the fact that we can no longer produce it exactly as it was means little. Today, we can actually do better.
IT processes and products hardly seem old enough to fall prey to this “Lost Technology” syndrome, but computer technology changes much faster than the technologies of old.
And yet, we do see some of the effects of technology obsolescence that are close to producing “lost technologies”.
Quite a few institutions still rely on decades-old programs written in Cobol, a language no longer actively taught and for which few tools still exist.
The Defense Department’s Strategic Automated Command and Control System (DDSACCS), which is used to send and receive emergency action messages to US nuclear forces, runs on a 1970s IBM computing platform. It still uses 8-inch floppy disks to store data. “Replacement parts for the system are difficult to find because they are now obsolete.”
Whatever you may have, it’s no doubt more current than the system that air traffic controllers use to tell pilots about weather conditions at Paris’s Orly Airport: Windows 3.1. That’s not a typo – these flight-critical systems use an operating system that came out in 1992. When the machines went down in November 2015, planes were grounded while the airport had to find an IT guy who could deal with computers that ancient.
Sparkler Filters of Conroe, Texas, prides itself on being a leader in the world of chemical process filtration. If you buy an automatic nutsche filter from them, though, they’ll enter your transaction on a “computer” that dates from 1948. Sparkler’s IBM 402 is not a traditional computer, but an automated electromechanical tabulator that can be programmed (or more accurately, wired) to print out certain results based on values encoded into stacks of 80-column Hollerith-type punched cards.
All of these, of course, represent situations in which the product or system could be updated using more modern techniques, so they are not truly “lost”, except insofar as the original technologies are no longer in common use, and the users would be hard-pressed to make substantial changes or updates.
And therein, to me, lies the beauty of computer technology, its history, and its likely future.
We IT practitioners work in a world where nothing truly disappears or is lost. We keep old systems alive where appropriate, and we use the latest techniques to build new systems better than the old.
The myth of “Lost Technology” is just that–a myth.
Although I am glad that “lost” technologies are kept around in some form for us to see how far we’ve come, and to appreciate the amazing accomplishments of those who came before us.
First was Ethernet over coax–initially thicknet (10Base5) and then thinnet (10Base2). Both used a bus topology and both were limited in the distance a single segment could span–roughly 1,640 feet (500 meters) for thicknet and 600 feet (185 meters) for thinnet. Because of those limitations, these versions of Ethernet did not make deep inroads into the market.
In the late 1980s, the invention of a version of Ethernet carried over twisted-pair cabling, and which used a star topology, kicked off a land-rush to connect computers and devices together. Combined with the invention of network bridges, routers, and other devices allowing connection of local Ethernet networks to the burgeoning Internet, wired networking became the dominant model.
While not ideal for some applications, this wired model served the market well until the invention of WiFi in the late 1990s. (It’s interesting to note that radio-based networking predated even Ethernet. The Aloha radio network was launched in 1971 and actually provided the template for Ethernet protocol.)
WiFi met an emerging market need, driven by the desire to interconnect networkable devices in places where wiring was not possible or not cost-effective. Short distances–up to typically several hundreds of feet–could be bridged, allowing devices to be movable or in hard-to-reach places.
Aside from improvements in WiFi speeds and encryption mechanisms, little changed over the following years in terms of the distances over which WiFi could be used. Some systems were built using high-gain antennas and specialized receivers and transmitters that extended the range up to several miles, but these were costly and required modified protocol stacks to deal with error conditions unique to radio links.
The rise of IoT drove the need for a new, low-cost wireless network that could provide connectivity over the distances some IoT sensors required. Soil humidity sensors on farms; engine sensors on mobile machinery; monitoring systems on drones. Now the distances needing to be covered could be several miles. Combined with the need to keep power consumption low, existing WiFi systems were not up to the challenge.
For a while, cellular data systems filled the need, but those required high power budgets and were typically expensive.
LoRa (short for “Long Range”, a low-power, long-range radio technology) is a protocol–and accompanying hardware–that provides exactly the networking capability needed for the new IoT world.
LoRa provides a mechanism which allows the user to determine the desired tradeoff between power, distance, and data rate. Of course, these are not independent of one another, but within limits they can be reasonably determined. And LoRa is not without its own limitations–packet size is small, though for most IoT uses it suffices.
As an example, a LoRa system can be set up to provide a data rate in the range of hundreds to thousands of bits per second over distances of several kilometers with ease. Distances of 700 kilometers and more have been achieved in experimental systems; small satellites (cubesats) using LoRa easily communicate with simple ground stations on a daily basis. While the data rates may seem low, they are adequate for most remotely-positioned IoT devices.
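To make the tradeoff concrete, here is a small sketch based on the nominal LoRa bit-rate formula, Rb = SF × (BW / 2^SF) × CR, where SF is the spreading factor, BW the bandwidth, and CR the coding rate (4/5 through 4/8). The 125 kHz bandwidth and 4/5 coding rate are assumed defaults chosen for illustration, and real throughput will be lower once packet overhead and duty-cycle limits are counted.

```python
def lora_bit_rate(spreading_factor, bandwidth_hz=125_000, coding_rate_denominator=5):
    """Nominal LoRa bit rate in bits per second.

    Rb = SF * (BW / 2**SF) * (4 / coding_rate_denominator)
    """
    if not 7 <= spreading_factor <= 12:
        raise ValueError("this sketch assumes spreading factors 7 through 12")
    if not 5 <= coding_rate_denominator <= 8:
        raise ValueError("coding rate must be 4/5, 4/6, 4/7, or 4/8")
    # Each symbol takes 2**SF chips, so the symbol rate falls as SF rises.
    symbol_rate = bandwidth_hz / (2 ** spreading_factor)
    return spreading_factor * symbol_rate * 4 / coding_rate_denominator

for sf in range(7, 13):
    print(f"SF{sf}: {lora_bit_rate(sf):7.0f} bit/s")
```

Higher spreading factors trade data rate for range: under these assumptions the rate falls from several thousand bits per second at SF7 to a few hundred at SF12.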
And when not transmitting data the LoRa hardware can be shut down (as IoT sensors tend to be episodic in their data delivery) lowering power requirements to the range in which small batteries can power devices for months at a time.
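A rough sketch of the arithmetic behind that claim follows. The battery capacity and current figures in the example call are hypothetical placeholders, not measurements from any particular LoRa module, and the estimate ignores battery self-discharge.

```python
def battery_life_days(capacity_mah, sleep_ma, active_ma, active_seconds_per_hour):
    """Estimate battery life in days for a duty-cycled device."""
    if not 0 <= active_seconds_per_hour <= 3600:
        raise ValueError("active_seconds_per_hour must be between 0 and 3600")
    active_fraction = active_seconds_per_hour / 3600
    # Time-weighted average of the active and sleep currents.
    average_ma = active_ma * active_fraction + sleep_ma * (1 - active_fraction)
    if average_ma <= 0:
        raise ValueError("average current must be positive")
    return capacity_mah / average_ma / 24

# Hypothetical device: 1000 mAh battery, 0.05 mA asleep,
# 120 mA while awake and transmitting for 10 seconds each hour.
print(round(battery_life_days(1000, 0.05, 120, 10)), "days")
```

With these made-up numbers the average draw is well under half a milliamp, which is what puts battery life in the range of months.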
LoRa is, at its most basic, a point-to-point network but the introduction of LoRaWAN standards, and the use of a gateway device, makes it possible to have widely distributed devices that can be interconnected in much the same manner as provided by WiFi.
And LoRaWAN has taken off amazingly in the last few years. Estimates are that over 170 million IoT devices are connected using LoRaWAN in 100 countries. In 2016 the Netherlands became the first country to have a nation-wide LoRaWAN network. Other countries have quickly followed this trend.
In fact, a coalition of enthusiasts has used this intersection of open-source software and low-cost hardware to set up open LoRaWAN networks worldwide. The Things Network boasts more than 20,000 LoRaWAN gateways in operation in 151 countries, all available for any member of the public to use.
To see an example of what can be done with LoRa, check out Tiny GS, a group of enthusiasts that set up low-cost satellite ground stations and receive telemetry from cubesats. Info on what my ground station has received can be found by logging into the Tiny GS website, selecting “Stations” from the hamburger menu, and searching for “Fall”.
Learning about LoRa and LoRaWAN by implementing one yourself is a great introduction to this networking concept, and will prepare you for interactions and projects with our IoT-using clients.
In our current peri-COVID world, we all now have far more experience than we could ever have imagined in remote working.
Our homes are now our offices; dress codes have become more relaxed; we can work somewhat more flexible hours to accommodate our personal lives.
This has all come at a cost, of course. The biggest, in my opinion, is the need for higher bandwidth and more reliable Internet connections to our homes. In many cases, Internet Service Providers (ISPs) have been hard-pressed to provide new pipes, and “last mile” service installations have lagged.
The Internet core network has similarly been stressed–in analysis done comparing pre- and peri-COVID data in several cities around the world, backbone data usage has gone up by as much as 40% year-to-year.
Much of this “need for speed” has been driven by widespread use of teleconferencing software. Zoom, Microsoft Teams, Skype, Chime and others are in constant use around the world. Even with clever bandwidth-saving measures, the massively increased use of teleconferencing has created a level of demand that will probably remain with us post-COVID.
One of the contributors to the need for higher bandwidth in teleconferencing is the requirement to transmit timely and clear representations of speech in a digital format. Generally, audio is highly resistant to most compression technologies–it’s too full of unpredictable data patterns and, with noise added in, becomes even more of a problem.
A number of coder/decoder algorithms have been invented for the problem of transforming speech, in particular, to a digital form. Some are very clever, making use of models of speech generation to build compression models that are reasonably efficient of time and bandwidth. The models are made much more complex by the need to model a wide range of languages–many of which have substantial differences in their phonemes. Add in accents, speaking rate, and other variables and the models become extremely complex.
With the long history of language coder/decoder research, it would be easy to believe that there would be nothing new under the sun.
And that would be wrong.
Google has announced a new speech coding algorithm that appears to use much less bandwidth than existing algorithms, while preserving speech clarity and “normalness” better.
The new algorithm, named “Lyra”, is based on research into a newer class of models for speech coding: generative models.
One of the major issues with using these generative models is their computational complexity. Google has offered a solution to that problem and the solution appears to offer better performance, at lower bandwidth, and with better apparent normalness to the sound quality.
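To put “lower bandwidth” in perspective, here is a quick calculation of the bit rate of uncompressed speech and the ratio a low-bitrate codec would have to achieve. The 16 kHz, 16-bit mono format is a common choice for wideband speech, and the 3,000 bit/s codec rate is an assumed round figure for a very-low-bitrate codec, not a number taken from this post.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of uncompressed PCM audio, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def compression_ratio(raw_bps, coded_bps):
    """How many times smaller the coded stream is than the raw stream."""
    if coded_bps <= 0:
        raise ValueError("coded bit rate must be positive")
    return raw_bps / coded_bps

raw = pcm_bit_rate(16_000, 16)  # wideband mono speech
assumed_codec_bps = 3_000       # assumption: a very-low-bitrate speech codec
print(raw, "bit/s uncompressed")
print(f"{compression_ratio(raw, assumed_codec_bps):.0f}x smaller when coded")
```

Under these assumptions the codec has to discard all but about one part in eighty-five of the raw stream, which is why a model of how speech is produced matters more than generic compression.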
The Google webpage announcing this news has examples of their algorithm in action compared to existing, widely used algorithms. The results are quite impressive.
What impacts will this have on teleconferencing? Google predicts that it will make teleconferencing possible over lower bandwidth connections, and provide an algorithm that can be incorporated into existing and new applications.
Google plans to continue work in this area, most importantly to provide implementations that can be accelerated through GPUs and TPUs.
Be sure to listen for more exciting developments in speech coding, no matter what algorithm you use….
Most recently OpenAI, a machine learning research organization, announced the availability of CLIP, a general-purpose vision system based on neural networks. CLIP outperforms many existing vision systems on many of the most difficult test datasets.
It’s been known for several years from work by brain researchers that there exist “multimodal neurons” in the human brain, capable of responding not just to a single stimulus (e.g., vision) but to a variety of sensory inputs (e.g., vision and sound) in an integrated manner. These multimodal neurons permit the human brain to categorize objects in the real world.
The first example found of these multimodal neurons was the “Halle Berry neuron”, found by a team of researchers in 2005 and which responds to pictures of the actress–including those that are somewhat distorted, such as caricatures–and even to typed letter sequences of her name.
Many more such neurons have been found since this seminal discovery.
The existence of multimodal neurons in artificial neural networks has been suspected for a while. Now, within the CLIP system, the existence of multimodal neurons has been demonstrated.
This evidence for the same structures in both the human brain and neural networks provides a powerful tool for better understanding the functioning of both, and for better developing and training AI systems using neural networks.
The degree of abstraction found in the CLIP networks, while a powerful investigative tool, also exposes one of its weaknesses.
As a result of the multimodal sensory input nature of CLIP, it’s possible to fool the system by providing contradictory inputs.
For instance, providing the system a picture of a standard poodle results in correct identification of the object in a substantial percentage of cases. However, there appears to exist in CLIP a “finance neuron” that responds to pictures of piggy banks and “$” text characters. Forcing this neuron to fire by placing “$” characters over the image of the poodle causes CLIP to identify the dog as a piggy bank with an even higher percentage of confidence.
This discovery leads to the understanding that a new attack vector exists in CLIP, and presumably other similar neural networks. It’s been called the “typographic attack”.
This appears to be more than an academic observation–the attack is simple enough to be done without special tools, and thus may appear easily “in the wild”.
As an example of this, the CLIP researchers showed the network a picture of an apple. CLIP easily identified the apple correctly, even going so far as to identify the type of the apple–a Granny Smith–with high probability.
Adding a handwritten note to the apple with the word “iPod” on it caused CLIP to identify the item as an iPod with an even higher probability.
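The mechanism can be imitated with a toy model. This is not CLIP and does not use its code or weights: it is a simple softmax over two evidence scores per label, with every number invented, meant only to show how a strong response to written text can outvote what the picture looks like.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(labels, image_evidence, text_evidence):
    """Add both evidence sources per label; return (best label, probability)."""
    scores = [i + t for i, t in zip(image_evidence, text_evidence)]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda k: probabilities[k])
    return labels[best], probabilities[best]

labels = ["Granny Smith", "iPod"]
image_evidence = [4.0, 0.0]  # invented: the picture looks like an apple
no_note = [0.0, 0.0]         # no text in the scene
ipod_note = [0.0, 9.0]       # invented: the written word "iPod" fires strongly

print(classify(labels, image_evidence, no_note))
print(classify(labels, image_evidence, ipod_note))
```

In this toy, the note flips the answer to “iPod” and does so with more confidence than the original, correct answer had.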
The more serious issues here are easy to see: with the increased use of vision systems in the public sphere it would be very easy to fool such a system into making a biased categorization.
There’s certainly humor in being able to fool an AI vision system so easily, but the real lesson here is two-fold.
The identification of multimodal neurons in AI systems can be a powerful tool to understanding and improving their behavior.
With this power comes the need to understand and prevent the misuse of this power in ways that can seriously undermine the system’s accuracy.
With great power comes great responsibility, as Spiderman has said.
As IT professionals, we are all painfully aware of the need for high-quality security in the systems we work with and deliver.
We know that if a system containing sensitive user information, such as bank account numbers, is not properly protected we risk exposure of that data to hackers and the resultant financial losses.
Encryption of data in flight and at rest; database input sanitizing; array bounds checking; firewalls; intrusion detection systems. All these, and more, are familiar security standards that we daily apply to the systems we design, implement, and deploy. eCommerce websites; B2B communications networks; public service APIs. These are the systems to which we apply these best practices.
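As a reminder of how small some of these practices are in code, here is a minimal sketch of one of them, database input handling, using Python’s built-in sqlite3 module. The table and column names are made up for the example.

```python
import sqlite3

def find_account(conn, owner_name):
    """Look up accounts by owner using a parameterized query."""
    # The ? placeholder makes the driver treat owner_name strictly as data,
    # never as part of the SQL statement.
    cursor = conn.execute(
        "SELECT owner, balance FROM accounts WHERE owner = ?",
        (owner_name,),
    )
    return cursor.fetchall()

# Hypothetical in-memory database for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT, balance INTEGER)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("alice", 100), ("bob", 250)],
)

print(find_account(conn, "alice"))         # [('alice', 100)]
print(find_account(conn, "x' OR '1'='1"))  # []
```

The second lookup is a classic injection string. Because it is bound as a parameter rather than pasted into the SQL text, it matches no owner and returns nothing.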
If we do not take due care, we risk the public’s confidence in the banking system, the services sector, and even the Internet itself.
Even the widespread issues that could result from breaches of these systems pale in comparison, I believe, to those that could result from breaches of systems that are more pervasive and more directly impactful on our everyday lives.
Much of our modern world is dependent on the workings of its vast infrastructure. Roadways, power plants, airports, shipping ports–all of these are fundamental to our existence. Infrastructure security is such an important issue that the United States government has an agency dedicated to it: the Cybersecurity & Infrastructure Security Agency–CISA.
Here in the US we just had a reminder of how important this topic is.
Just yesterday there was an intrusion into a water treatment plant in Oldsmar, Florida in which the attacker attempted to raise the amount of sodium hydroxide by a factor of 100, raising it from pipe-protecting levels to an amount that is potentially harmful to humans.
The good news is that the change was noticed by an attentive administrator, who then reversed the change before it could take effect. The system in question has been taken offline until the intrusion is investigated and proper steps taken.
It’s unclear at this point whether the attacker was a bored teenager or a nation-state, or something in-between, but the effect would have been the same: danger to 15,000 people and a resulting lack of trust in the water delivery system.
As of the writing of this blog post there is little detail about how the hack was accomplished, though it appears that the hacker gained the use of credentials permitting remote access to the water treatment management system. From there, it was only a matter of the hacker poking around to find something of interest to “adjust”.
The Florida Governor has called this incident a “national security threat”, and in this case I don’t believe he is indulging in hyperbole.
CISA considers the US water supply one of the most critical infrastructure elements, and devotes an entire team of specialists to this topic.
What should we take as a lesson from this?
I believe this incident is a cogent example of how brittle our national infrastructure is to bad actors. Further, I believe that this incident makes abundantly clear that we need a renewed focus on updating, securing, and minimizing the attack surface of existing infrastructure control systems.
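One small example of what minimizing the attack surface can mean in software is a control system that refuses implausible setpoint changes no matter whose credentials were used. The sketch below is hypothetical: the function name, the limits, and the units are placeholders of mine, not values from any real treatment plant or control product.

```python
def check_setpoint_change(current_ppm, requested_ppm,
                          min_ppm=50.0, max_ppm=200.0, max_step_fraction=0.25):
    """Return (accepted, reason) for a requested dosing setpoint.

    All limits are hypothetical placeholders, not real water-treatment values.
    """
    # Hard bounds: values outside this window are never accepted remotely.
    if not (min_ppm <= requested_ppm <= max_ppm):
        return False, "requested value is outside the allowed range"
    # Rate limit: a single change may move the setpoint only so far.
    allowed_step = current_ppm * max_step_fraction
    if abs(requested_ppm - current_ppm) > allowed_step:
        return False, "change is too large for a single step"
    return True, "accepted"

print(check_setpoint_change(100.0, 110.0))     # small adjustment
print(check_setpoint_change(100.0, 10_000.0))  # hundredfold increase
```

A check like this does not stop an intruder from logging in, but it limits what a stolen credential can do before a human notices.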
As IT professionals it is our responsibility to lend our expertise and unique viewpoint to inform our leaders in government and industry of the issues, their importance, and their potential solutions. To do so actively, and to do so regularly.
Over the last few years I’ve seen a number of articles on how, as IT professionals, we can work to build users’ trust in the systems we produce. Clearly this is important, as a system that is not trusted by its targeted users will not be used, or will be used inefficiently.
This seems an obvious topic of interest to IT professionals.
For instance, if customers of a bank do not trust the mobile app that lets them interact with their funds to accurately complete requested actions, it won’t be used.
But there’s a flip side to this trust coin that is not often talked about or studied: how do we design systems that we can be sure will not be trusted by all-too-trusting humans when it is inappropriate or unsafe to do so?
We actually experience this in our everyday lives, often without thinking about what it really means.
One example: compared to Google Maps on my phone, I have lower trust in my car’s navigation system to get me to the destination by the quickest route. As an IT professional, I know that Google Maps has access to real-time traffic information that the built-in system does not, and so I will rely on it more if getting to my destination in a timely manner is important.
My wife, who is not in the IT business, has almost complete trust in the vehicle navigation system to get her where she wants to go without making serious mistakes.
In a case like this, it’s not really of monumental importance which one of us can be accused of misplaced trust in a system. But there are cases where it’s very important.
For instance, current autonomous vehicles available to the general public are SAE level 3, which means they must be monitored by a human who is ready to intervene should it be necessary. If a Tesla computer cannot find the lane markings, it notifies the driver and hands over control.
But how many reports have we seen of Tesla drivers who treat the system as though it can take care of all situations, thereby making it safe for them to engage fully in other activities from which they cannot easily be interrupted?
One could say “there will always be stupid people” but this just sweeps the important problem under the rug: how do we design systems which instill an appropriate level of trust in the user? Clearly the Tesla system in these cases, or the context of the system’s use, instilled too much trust on the part of the user.
Unsurprisingly the study found that a user’s opinion of the technology is the biggest determining factor in the user’s trust in the product. Surprisingly, the study also found that users who had either a positive or negative opinion of the technology tended to have higher levels of trust.
This makes something clear: if we are to design systems that are to be trusted appropriately, we must understand that the relationship between the user’s knowledge, mood, and opinion of the system is more complex than we might imagine. We need to take into account not just the level of trust we can instill through the system’s interaction with the human, but also other confounding factors: age, gender, education. How to elicit and use this information in a manner that is not intrusive and doesn’t itself generate distrust is not currently clear–more study is needed.
As IT professionals, we must be aware that instilling a proper level of trust in the systems we build is important and focus on how to achieve that.
I have a fondness for watching documentaries about aviation disasters.
Now, before you judge me as someone with a psychological disorder–we all slow down when we see an accident on the highway, but planes crashing into each other or the ground?–let me explain why I watch these depressing films and what it has to do with IT work.
I should start by noting that, as a private pilot, I have a direct interest in why aviation accidents happen. Learning from others’ mistakes is an important part of staying safe up there.
But then there’s another reason I watch the documentaries, one that has only recently become clear to me: seeing how mistakes are made in a domain where mistakes can kill can be generalized to understand how mistakes can be avoided in other domains where the results, while less catastrophic to human life, are still of high concern.
In my case, and likely in anyone’s case who is reading this, that’s the domain of IT work.
The most important fact I take away from the aviation disaster stories is that disasters are rarely the result of a single mistake but result from a chain of mistakes, any one of which if caught would have prevented the negative outcome.
Let me give an example of one such case and see how we, as IT professionals, might learn from it.
On the night of July 1, 2002, two aircraft collided over Überlingen, Germany, resulting in the death of 71 people onboard the two aircraft.
The accident investigation that followed determined that the following chain of events led to the disaster:
The Air Traffic Controller in charge of the safety of both planes was overloaded as the result of the temporary departure of another controller in the center.
An optical collision warning system was out of service for maintenance but the controller had not been informed of this.
A phone system used by controllers to coordinate with other ATC centers had been taken down for service during his shift.
A change to the TCAS (Traffic Collision Avoidance System) on both aircraft that would have helped–and which was derived from a similar accident months earlier–had not yet been implemented.
The training manuals for both airplanes provided confusing information about whether TCAS or the ATC’s instructions should take priority if they conflicted.
Another change to TCAS, which would have informed the controller of the conflict between their instructions and TCAS instructions, was not yet deployed.
Many issues led to the disaster (which, thankfully, have since been resolved)–but the important thing to note is that if any one of these issues had not arisen, the accident would likely not have happened.
That being true, what can we learn from this?
I would argue that the “system” of air traffic control, airplane systems design, and crew training, each taken individually, could have recognized that its own issue might lead to a disaster and dealt with it in a timely manner. This is true even though each issue could have been (and probably was) dismissed as being of little importance by itself.
In other words, having a mindset that any single issue should be addressed as soon as possible without detailed analysis of how it could contribute to a negative outcome might have made all the difference here.
And here is where I think we can apply some lessons from this accident, and many others, to our work on IT projects.
Absent evidence to the contrary, we should always assume that a single issue during a project could have negative implications that are not immediately obvious, and that it should be addressed and remediated as soon as practicable.
The difficult part of implementing this advice lies in judging whether a single issue could affect the entire project, and in weighing the cost of immediate remediation against the cost of failure. There is no easy answer to this–I tend to believe that unless there is a strong argument showing why a single event cannot become part of a failure chain, it should be fixed now. Alternatively, if the cost of immediate remediation is seen as greater than the cost of failure, then the issue can be safely put aside–but not ignored–for the time being.
To put this into perspective in our line of work:
Let’s imagine a system to be delivered that provides web-based consumer access to a catalog of items.
Let’s further imagine that the following are true:
The catalog data is loaded into the system database using a CSV export of data from another system of ancient vintage.
Some of the data imported goes into text fields.
Those text fields are directly used by the services layer.
Some of those text fields determine specific execution paths through the service layer code.
That service code assumes the execution paths can be completely specified at design time.
The UI layer is designed assuming that delivery of catalog data for display will be “browser safe”–i.e., no characters that will not display as intended.
This is a simple example, and over-constrained, but I think you can see where this is going.
If the source system has data, to be placed in the target system’s text fields, that contains characters not properly handled by the services layer and/or the UI layer, bad actions are likely to result.
For instance, some older systems permit the use of text documents produced in MSWord that promote raw single- and double-quote characters to “curly versions” and take the resulting Unicode data in raw form. Downstream this might result in failure within the service layer or improper display in the UI layer.
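To make the import step concrete, here is a minimal sketch in Python of what coding against this at the CSV stage might look like. The column names, the sample row, and the helper names (clean_field, load_catalog) are all made up for illustration, and the mapping covers only the four curly quote characters mentioned above; a real import would need a fuller policy for what to normalize and what to reject.

```python
import csv
import io

# Hypothetical mapping from "curly" punctuation to plain ASCII equivalents.
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
})


def clean_field(value: str) -> str:
    """Normalize curly quotes and drop non-printable characters."""
    value = value.translate(SMART_PUNCTUATION)
    return "".join(ch for ch in value if ch.isprintable()).strip()


def load_catalog(csv_text: str) -> list[dict[str, str]]:
    """Parse CSV text and clean every field before it goes any further."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: clean_field(val or "") for key, val in row.items()}
        for row in reader
    ]


# Assumed sample export: curly quotes plus a stray control character (bell).
sample = "sku,description\nA1,\u201cDeluxe\u201d widget\u2019s box\x07\n"
rows = load_catalog(sample)
print(rows)
```

The point is not this particular cleanup rule. It is that the import step deals with the possibility on its own, without waiting to learn whether anything downstream would actually break.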
Most of us, as experienced IT professionals, would likely never let this happen. We would sanitize the data at some point in the process, and/or provide protections in the service/UI layers to prevent such data from producing unacceptable outcomes.
But, for a moment, I want you to think of this as less than an argument for “defense in depth” programming. I want you to think of it as taking each step of the process outlined above as a separate item without knowing how each builds to the ultimate, undesirable outcome, and deciding to mitigate it on the basis of the simple possibility that it might cause a problem.
For example, if the engineer responsible for coding the CSV import process says “the likelihood of having problems with bad data can be ignored or taken care of in the services layer”, my suggested answer would be “you cannot be sure of that, and if we cannot be sure it won’t happen, you need to code against it”.
And, I would give the same answer to the services layer engineer who says “the CSV process will deal with any such issues”. You need to code against it.
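The same idea applied to the services layer might look like the following sketch, again in Python and again with invented names: a "kind" text field that selects the execution path, two placeholder handlers, and a render_item function. It assumes nothing about what the import step did. Unknown values are refused outright, and the output is escaped before it is handed to the UI.

```python
import html


def describe_physical(item: dict[str, str]) -> str:
    return f"Ships by mail: {item['name']}"


def describe_digital(item: dict[str, str]) -> str:
    return f"Download: {item['name']}"


# The execution paths known at design time, keyed by a text field's value.
HANDLERS = {"physical": describe_physical, "digital": describe_digital}


def render_item(item: dict[str, str]) -> str:
    """Pick an execution path from item['kind'], refusing unknown values."""
    kind = item.get("kind", "").strip().lower()
    handler = HANDLERS.get(kind)
    if handler is None:
        # Fail loudly rather than fall through to an unintended path.
        raise ValueError(f"unexpected catalog kind: {kind!r}")
    # Escape on the way out so the UI never receives raw markup characters.
    return html.escape(handler(item))


print(render_item({"kind": " Digital ", "name": "Tom & Jerry <HD>"}))
```

Neither engineer had to prove the other layer would fail before writing their own check, which is exactly the mindset being argued for here.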
It may sound like I’m simply suggesting that “defensive coding” is a good idea–and it is. But–and perhaps the example given is too easy–I would argue that the general idea I am suggesting is that you need to have a mindset that removes each and every item in a possible failure chain without knowing, for certain, that it could be a problem.
This suggestion is not without its drawbacks, and I would encourage you to provide your thoughts, pro or con, in the comments section of this blog.
In the meantime, I’ll be over here watching another disaster documentary….