Rohit Mathur

Six Sigma

Using Control Charts in Software Applications

Identifying common-cause and special-cause variation in processes is key to process improvement

Published: Monday, July 15, 2019 - 11:03

Whatever the process or type of data collected, all data display variation. This is also true in software development. Any measure or parameter of interest to our business will vary from time period to time period, e.g., number of incidents per week or month, time taken in resolving incidents, number of tickets encountered in a production support environment per month, and defect density in code.

Understanding variation is about being able to describe the behavior of processes or systems over time. This variation can be stable, predictable, and routine, or unstable, unpredictable, and exceptional. Being able to distinguish between stable or common-cause variation, and unstable or special-cause variation, helps us to decide the type of action needed to improve the process. The control chart, developed by Walter Shewhart, is the tool that enables us to do so.

Contrary to what is often taught and practiced, Shewhart’s control chart is relevant to all types of processes (manufacturing and service) and all data, without the need for the data to satisfy certain statistical conditions (in particular, the condition of being “normally distributed”). Indeed, Shewhart was specific about the need to have statistical techniques that really work in practical situations and are suitable for our data, rather than the need for data that are suitable for our statistical techniques.

Understanding variation and the use of control charts therefore does not require understanding of probabilities, normal distribution, binomial distribution, or any other probability distribution. Neither does it require understanding of tests of significance, p-values, or confidence intervals, or the techniques for regression and correlation.

However, to be able to use control charts, we must first understand the two types of variation.

Common-cause variation is the routine, random variation inherent in any process based on the circumstances in which it exists and operates. This type of variation is always present; it is inherent in the process and is predictable. Examples include variation that arises from existing procedures, process design, or the method of training.

A vast majority (94 percent, according to W. Edwards Deming) of issues or problems are due to common-cause variation. In such cases improvement is possible only through fundamental changes in the system, method, or procedure. It is a waste of time to investigate individual incidents of common-cause variation and take action because the variation is inherent in the system. You can’t chase individual points of common-cause variation; you can only address the overall reason for the variation and take action on that overall cause.

Special-cause variation is anything over and above routine variation. It is exceptional and hence unpredictable. Such variation is not always present and is caused by a specific event. For example, a new person joins a team and afterward a lot of software defects are noticed. Those defects are due to a special cause: not the system, but a specific event, the hiring of a new person.

Only about 6 percent of problems and issues are due to special-cause variation. In such cases we need to find out the reasons for variation to determine the special cause and take action to eliminate it. In the employee example above, we might determine that the employee was not properly trained, and so the action would be to train that employee and ensure that future employees are properly trained.

Two types of mistakes

There are two types of mistakes possible when investigating variation in a system.

Type 1: Mistaking a common cause for a special cause: Asking for the reason for variation when the variation is inherent in the process. For example, if an incident occurs due to a “disk space” problem (which is a common cause), we need to proactively monitor “space” and improve the method of working so as to reduce such problems. Investigating in isolation why that particular incident occurred would be treating it as a special cause.

Type 2: Mistaking a special cause for a common cause: Not taking note of a significant deterioration in performance simply because it still remains within specifications. For example, ignoring an upward trend in service level agreement (SLA) timings (e.g., program ends at a particular time) simply because these remain within specified limits.

The concepts of common and special causes and the two types of mistakes can be explained through a simple example given by Deming himself in his book, The New Economics (MIT Press, second edition, 2000), and elaborated by Brian L. Joiner in Fourth Generation Management (McGraw-Hill Education, 1994). This example is illustrated below.

Patrick and the school bus

The chart below was prepared by 11-year-old Patrick Nolan. It shows the time at which the school bus picked him up on successive school days (covering a period of a few weeks), in time order. On most days the pick-up time falls within a narrow band, but there are two days on which specific events delayed the pick-up time considerably: One day there was a new driver, on another a faulty door closer.


Figure 1: Time of arrival of school bus, by Patrick Nolan, 11 (credit: W. Edwards Deming, The New Economics)

Some of the factors that might affect the pick-up time on any day are the weather, the amount of traffic, how long the bus driver waited for the children at previous stops, and what time the driver got started. The variation within the narrow band is due to a combination of these factors—i.e., common causes of variation, present all the time in the process. To investigate the reasons for variation between any two timings in this narrow band will be a futile exercise—a Type 1 mistake, which Deming referred to as “tampering.” You can’t control the weather, traffic, or wait times, so chasing down that variation would be fruitless.

The new driver and the faulty door closer are special causes of variation. These are not always present in the process, appear sporadically, and come from outside the usual process. We need to investigate the reasons for such variation, and if we do not do so, we would be making a Type 2 mistake.

Control charts help distinguish the two types of variation

A control chart or process behavior chart is a statistical tool used to distinguish between process variation resulting from common causes, and variation resulting from special causes. It is a time-series graph with three horizontal lines—a central line (representing the average), and upper and lower control limits (or the upper and lower natural process limits as these are often, more appropriately, referred to). The “process limits” or “control limits” are computed from process data and indicate the extent of common-cause variation.

Individual variations in outputs can have any number of contributory causes. The process behavior chart guides us as to when it is economically worthwhile to look for a reason why a particular result occurred (i.e., look for a special cause), and when it is not worthwhile to do so.

Stable processes
If all outputs from the process are within the process limits and do not show any patterns or trends, then the process is said to be in statistical control or stable.

In such cases, all the outcomes or results are typical of those produced by the whole system of common causes. Common causes do not produce changes in behavior; they produce continuing similar behavior and a “predictable” process.

Stable processes display controlled variation—i.e., variation present due to inherent properties of the process. This variation is due to the way the process has been designed, built, and set up as well as the way people have been trained to work on it.

Controlled variation is caused by many random causes acting simultaneously, with no single cause predominant. This means that we cannot assign a particular cause to any variation observed when the variation is controlled within the process limits.

Understanding the above-mentioned implications of stable processes and making use of the guidance provided by process behavior charts can result in huge savings for businesses.

Many organizations put a lot of fruitless effort into explaining monthly, weekly, or daily differences in sales, production, profits, and performance of all kinds, when in fact the variation is due to common causes. This futile practice also increases stress for employees, especially when individual below-average results are often seen as justification for aggressive management action, and above-average results are seen as evidence of the effects of such action.

The simple but important message is that as long as the data lie between the process limits, do not get distracted by short-term data or (even worse) individual data points.

Unstable processes
In contrast to stable processes, unstable processes show changes in behavior and are unpredictable. This is uncontrolled variation—i.e., variation present due to sources outside the process, which prevent it from performing as well as it could.

Uncontrolled variation occurs due to causes alien to the process, or outside causes. The variation observed can be attributed to a single cause that is dominant, known as a special cause.

Unlike common causes, it is profitable to discover and remove special causes. (Obviously, we would only want to remove a special cause if the change in behavior is bad. If the change is good, we would still want to identify the cause, but the purpose for doing so would be to see if it is possible to absorb it into the system.)

Here the process behavior chart helps us in two ways: It warns us of the appearance of special causes and also indicates approximately when they arrive, by giving us a signal, which is a vital clue to identify them.

The difference between stable and unstable processes is, therefore, of great practical importance. If we look for the cause of some particular detail in the data, are we likely to find something useful? Or are we beginning a fruitless exercise, mistakenly believing we have seen something important, and so mistaking a common cause for a special cause (a Type 1 mistake)?

Shewhart described process limits as “economic limits” because experience showed that they minimize the combined cost of making the two kinds of mistakes when interpreting process data: i.e., reacting to short-term data when you shouldn’t, and failing to act when you should.

Software case study

For a software case, we consider an “incident” to be an unplanned interruption to an IT service or a reduction in the quality of an IT service. We have used control charts to good effect to reduce the number of incidents. For each application, the total number of incidents per week is plotted as a time series; an XmR control chart is the most appropriate chart for these data.

For the X chart showing the number of incidents every week, the upper and lower control limits are calculated based on the point-to-point changes, i.e., differences between consecutive values in the data, which are known as “moving ranges.”

In addition to the X chart, a chart showing the moving ranges (the mR chart) is also plotted from week to week. This time-series graph gives us a signal to look for a special cause when the mR value (i.e., the difference between two consecutive values of x) crosses the upper control limit in the mR chart.
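
The limit calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in the case study: it assumes the standard XmR scaling factors of 2.66 and 3.268 for moving ranges of two consecutive values, which come from SPC references such as Wheeler's book and are not stated in this article. The function names and the weekly counts are hypothetical, and flooring the lower limit at zero is an assumption that suits count data.

```python
# Standard XmR scaling factors for moving ranges of two consecutive values.
# Assumption: taken from standard SPC references, not from this article.
X_FACTOR = 2.66    # scales the average moving range into X-chart limits
MR_FACTOR = 3.268  # scales the average moving range into the mR upper limit


def moving_ranges(values):
    """Absolute differences between consecutive values."""
    return [abs(b - a) for a, b in zip(values, values[1:])]


def xmr_limits(values):
    """Return central lines and limits for the X chart and the mR chart."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a moving range")
    ranges = moving_ranges(values)
    x_bar = sum(values) / len(values)
    mr_bar = sum(ranges) / len(ranges)
    return {
        "x_center": x_bar,
        "x_upper": x_bar + X_FACTOR * mr_bar,
        # Assumption: incident counts cannot be negative, so floor at zero.
        "x_lower": max(0.0, x_bar - X_FACTOR * mr_bar),
        "mr_center": mr_bar,
        "mr_upper": MR_FACTOR * mr_bar,
    }


# Hypothetical weekly incident counts, for illustration only.
weekly_incidents = [9, 7, 10, 8, 12, 6, 9, 8]
limits = xmr_limits(weekly_incidents)
for name, value in limits.items():
    print(f"{name}: {value:.3f}")
```

Because the limits are derived from the week-to-week changes rather than from an assumed distribution, the same calculation applies to any of the measures mentioned earlier, such as resolution time or tickets per month.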

Here are the various types of signals that we encountered in our control charts:
Signal a: The number of incidents in a particular week crosses the upper control limit in the X chart.
Signal b: The mR value for a particular week is higher than the upper control limit in the mR chart.
Signal c: Eight or more consecutive points on the X chart fall above the central line, i.e., above the average.
Signal d: Eight or more consecutive points on the X chart fall below the central line, i.e., below the average.
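
The four signal types listed above could be checked automatically along the following lines. This is a sketch under stated assumptions, not the tooling used in the case study: the function name, the sample data, and the limit values are hypothetical, and the run length is left as a required parameter because run rules are worded slightly differently from source to source.

```python
def find_signals(values, x_center, x_upper, mr_upper, run_length):
    """Return (index, signal) pairs for signal types a, b, c, and d.

    run_length is the number of consecutive points on one side of the
    central line that counts as a run; choose it to match your own rule.
    A run is reported once, at the point where it reaches run_length.
    """
    signals = []
    above = below = 0  # current run lengths above / below the central line
    for i, x in enumerate(values):
        # Signal a: a point above the upper control limit on the X chart.
        if x > x_upper:
            signals.append((i, "a"))
        # Signal b: a moving range above the upper limit on the mR chart.
        if i > 0 and abs(x - values[i - 1]) > mr_upper:
            signals.append((i, "b"))
        # Signals c and d: runs on one side of the central line.
        above = above + 1 if x > x_center else 0
        below = below + 1 if x < x_center else 0
        if above == run_length:
            signals.append((i, "c"))
        if below == run_length:
            signals.append((i, "d"))
    return signals


# Hypothetical weekly counts and limits, for illustration only.
weekly_incidents = [9, 7, 20, 8, 7, 6, 5, 7, 6, 4, 5, 6]
found = find_signals(weekly_incidents, x_center=8.0, x_upper=16.0,
                     mr_upper=10.0, run_length=8)
print(found)  # [(2, 'a'), (2, 'b'), (3, 'b'), (11, 'd')]
```

In this made-up series the spike in week 2 triggers signals a and b, the drop back to normal in week 3 triggers signal b again, and the sustained run below the central line ends with signal d.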

For us, signals a, b, and c signified a degraded system, and the cause could be internal or external to the system. Finding the cause and eliminating it improved the system. Signal d signified an improvement in the system, and finding out what caused it and replicating it led to continuous improvements. This signal was most often the result of changes aimed at process improvement, and it indicated that the changes we had made were fruitful.

Whenever we notice a signal, we explore the reasons with various teams. We have found that in many cases recurring issues cropped up due to certain causes internal to the system. The teams have then worked on these causes to resolve the issues and prevent or reduce the frequency of their recurrence. Also, many times there have been external issues that have caused the problem. Awareness of those causes has also served as a warning in similar situations in the future. In this way we have made improvements in various applications.

To understand how to compute control limits and make control charts, we have found the book Making Sense of Data–SPC for the Service Sector (SPC Press, 2012) by Donald J. Wheeler very useful.

With the help of control charts, process improvements were made, and the average number of incidents per week was drastically reduced. One such example is shown below:



Figure 2: X chart for Application A, before process improvement

The average for Application A was 8.67 incidents per week.

Special causes found
1. It can be seen from the above control chart that during the week of May 8–15, the number of incidents crossed the upper control limit, indicating a “special cause” that needed to be identified. Investigation revealed software issues: a Java process that was down and a ransomware virus.
2. Noting this, the team worked on solving the issue and fixed it.
3. A number of other systemic improvements were also made, including:
• Purge facility for user, as provided in new implementation
• Unwanted alerts removed from server
• Configuration changes made so that incidents are stopped

From July 17 onward, nine consecutive points were seen below the current average, clearly indicating that the behavior of the process had changed for the better. Therefore, the control chart was redrawn, beginning from this date, as shown below.


Figure 3: X chart for Application A, after process improvement


From this new control chart, we see that the average number of incidents per week has fallen to 4.5, a very substantial reduction from the earlier average of 8.67 per week.
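
Redrawing the chart from the point where the process changed amounts to computing a separate central line and limits for each segment of the data. The sketch below shows one way to do that. It assumes the standard XmR scaling factor of 2.66, which is not stated in this article, and the function names and weekly counts are invented for illustration; they are not the Application A data.

```python
def x_chart_limits(values):
    """Central line and natural process limits for an X chart (XmR method)."""
    if len(values) < 2:
        raise ValueError("each segment needs at least two values")
    center = sum(values) / len(values)
    ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = sum(ranges) / len(ranges)
    # 2.66 is the standard XmR scaling factor (an assumption, see lead-in).
    # Counts cannot be negative, so the lower limit is floored at zero.
    return center, max(0.0, center - 2.66 * mr_bar), center + 2.66 * mr_bar


def rebaseline(values, shift_index):
    """Limits for the data before shift_index and from shift_index onward."""
    return x_chart_limits(values[:shift_index]), x_chart_limits(values[shift_index:])


# Invented weekly counts with a sustained drop starting at index 6.
weekly_incidents = [9, 8, 10, 7, 9, 9, 5, 4, 5, 4, 5, 4]
before, after = rebaseline(weekly_incidents, shift_index=6)
print("before: center=%.2f lower=%.2f upper=%.2f" % before)
print("after:  center=%.2f lower=%.2f upper=%.2f" % after)
```

Keeping the two segments separate matters: limits computed across both periods would be inflated by the shift itself and would hide later signals in the improved process.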

Conclusion

Adopting control charts (or process behavior charts) more widely in software projects could result in great benefit to the software organization as well as its customers. This would be of profound value to management when deciding what type of action to take to solve and prevent issues.

If a special cause is found, we need to find the root cause and eliminate it. If the cause is internal to the process, then such action itself will result in process improvement. If the cause is external to the process, we can coordinate with people who are in a position to do something about it, to examine whether it can be eliminated or its frequency of occurrence reduced.

Also, if we carry out process changes aimed at improvement, we can infer that these changes have yielded benefit if, after the changes have been made, we find eight or more consecutive values below the average.

References
Deming, W. Edwards. The New Economics–for Industry, Government & Education, MIT Press, second edition, 2000.
Joiner, Brian L. Fourth Generation Management–The New Business Consciousness, McGraw-Hill Education, 1994.
Neave, Henry R. “12 Days to Deming,” active learning course.
Neave, Henry R. “SPC–Back to the Future,” parts 1, 2, 3, and 4.
Wheeler, Donald J. Making Sense of Data–SPC for the Service Sector, SPC Press, 2012.


About The Author


Rohit Mathur

Rohit Mathur was educated in India and graduated from NIT Silchar in mechanical engineering. He is a Project Management Professional (PMP) certified by the Project Management Institute (PMI) and holds the Professional Scrum Master (PSM) certification from Scrum.org. He has worked for more than 19 years in the software industry in various roles and for three years in hard disk drive manufacturing. Earlier, on a data warehouse project, he was instrumental in reducing the run time of various programs by monitoring their duration daily and using control charts. For the last few years he has worked on improving software quality, primarily in production support projects, making extensive use of control charts (process behavior charts).

Comments

Common Cause Reduction in Software

Control charts can help monitor software system performance. 

You can also use control charts, Pareto charts and root cause analysis to eliminate bugs that are the common cause of errors. And you don't need much data to do it.

Here's how: https://www.qimacros.com/pdf/dirty30.pdf

Other mistakes

First, good work on this...we don't get enough examples from other industries of the types of measures that could be monitored with process behavior charts. You might consider a follow-on article describing some of those measures you discussed in your first paragraph or so. I would like to be able to add those to a list I keep so when someone tells me "...yeah, sounds good, but you couldn't do that in MY industry..." I can show them examples. Sometimes it helps. 

There are other errors you can make while investigating variation, even if you're using process behavior charts. Subgrouping unlike things is one--fortunately, if you understand what the charts are saying, the voice of the process will tell you about this, too, by hugging the centerline. Lack of good operational definitions will hurt you, too. One of the worst errors I ever made was allowing a company to just send me the data each month so I could put it on a chart for them...I think I will put together an article on this one; Dirk might find it useful. 

Re RIP STAUFFER

Thanks, RIP STAUFFER. Yes, I believe this kind of work has not been done much before. I will try to write a follow-up article as well.