Every company needs computer systems that consistently perform as they’re expected to, without a system crash, calculation error or single bit of corrupted data. Yet most organizations will admit that their systems are far from meeting these expectations. At the very least, the software might not be designed to work the way users want it to, or it might not function properly in the user’s operating environment. Yet the cost of a computer system failure can be, and often is, astronomical.
All software is defective. Even the best products created by the best developers can fail. News articles such as the following from the May 21, 2002, issue of USA Today provide dramatic examples:
“Operations aboard the International Space Station were back to normal a day after a three-hour shutdown of life support and scientific equipment, NASA said. An oxygen generator, the carbon-dioxide removal unit and the air-conditioning system all shut down.… Engineers said bad computer data led to the automatic shutdown. A cooling system in the Russian-built living quarters stopped working, and onboard computers started shutting off equipment as a precaution.”
Research results tell the same story. In a study of 1,500 software projects, defects were identified in all but two within the first year of the software’s use. (See Capers Jones, Conflict and Litigation Between Software Clients and Developers, Version 12, Software Productivity Research, 2004.)
Computer systems aren’t reliable for five reasons:
Defects are unintentionally designed and/or coded into the software.
Software manufacturers typically test only about 10 percent of the situations in which the software will be used.
Software is released with known defects. As the release date approaches, developers decide which defects to fix and which to defer.
Many defects reported by customers aren’t repaired. Instead, they’re prioritized, and some are fixed while others aren’t.
Differences between an organization’s technical environment and that of the developer can cause the software to malfunction.
From a business perspective, computer system reliability is a question of controlling the risks of lost productivity, lost product, inaccurate or lost data, poor decisions and the liabilities associated with defective products.
Because of product safety concerns, the U.S. Food and Drug Administration and other regulators have required companies to demonstrate the reliable performance of computer systems used in medical devices, manufacturing or quality assurance. This process is known as “software validation.” (See General Principles of Software Validation: Final Guidance for Industry and FDA Staff, Center for Devices and Radiological Health, Center for Biologics Evaluation and Research, Food and Drug Administration, 2002.)
From both a business and a regulatory perspective, it’s reasonable to expect that the time and money spent on ensuring a system’s reliability is commensurate with the risks posed by system failures.
Risk management involves identifying the potential harm that can result from a system failure. The degree of harm coupled with the probability of occurrence determines the level of risk associated with a potential failure. Actions are taken to reduce the potential risk, and the system’s performance is monitored to determine if additional preventive actions are needed.
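The harm-times-probability pairing described above can be pictured as a simple scoring sketch. The 1-3 scales and the thresholds below are illustrative assumptions for demonstration, not values prescribed by the article or any standard:

```python
# Illustrative risk-scoring sketch: severity of harm and probability of
# occurrence are each rated 1-3 and combined into a risk level. The scales
# and cutoffs are assumptions chosen for this example.

def risk_level(severity: int, probability: int) -> str:
    """Classify a potential failure by severity (1=minor, 3=severe)
    and probability of occurrence (1=rare, 3=frequent)."""
    score = severity * probability        # ranges from 1 to 9
    if score >= 6:
        return "high"
    if score >= 3:
        return "moderate"
    return "low"

# A severe but rare failure still warrants moderate treatment.
print(risk_level(severity=3, probability=1))   # moderate
print(risk_level(severity=3, probability=3))   # high
```

In practice the output of a scoring step like this is what drives the follow-up actions the article describes: reducing the risk and monitoring performance to see whether further preventive action is needed.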
Computer system validation involves confirming, by objective evidence and examination, that a system can reliably perform to its requirements throughout its normal operating range.
The Computer System Risk Management and Validation Life Cycle (or RiskVal Life Cycle) is a process for procuring, implementing and using systems that meet company needs and regulatory requirements. Risks of system failure are managed and the system is tested to determine if it will perform reliably.
This risk management and validation process includes nine phases:
Conceptualize the system
Define requirements
Select vendors and systems
Design solutions to issues
Construct solutions to issues
Prepare for use
Test the system
Use in a production environment
Retire the system
Applying the risk management and validation process to a system requires a team effort. The team includes the system owner, a representative from information technology, a validation expert and, in regulated industries, a representative from quality assurance.
A system’s risk level can be used to guide the application of the risk management and validation process. For low-risk systems, some activities might not be necessary. The methods used might be defined only at a high level, with the results documented in memos rather than reports.
For high-risk systems, however, the methods should be well-documented and reviewed for adequacy. Most, if not all, risk management and validation process activities should be performed. The high-risk functions performed by the system must be thoroughly tested, including tests for error handling and use under abnormal conditions. Additional controls and safeguards can be added to the system or business process to reduce the risk of failure. The performance of the risk management and validation process and its subsequent results should be documented in sufficient detail to enable an independent evaluation of the results’ validity.
Moderate-risk systems are treated like high-risk systems, only with less rigor and intensity in applying the risk management and validation process.
Before applying the risk management and validation process to a system, an organization must document the policies and procedures it will use. The policies state when risk management and validation are needed and detail the requirements for both. Procedures define the processes used to select, procure, implement, manage and maintain computer systems, as well as simultaneously manage risks and validate the system. (See www.bpaconsultants.com for additional information on such policies and procedures.)
During the conceptualization phase, an individual or team determines whether something needs to be automated. The team then develops an initial concept of the required system. A high-level risk assessment is conducted to identify the potential dangers associated with such automation and to guide the use of the risk management and validation process.
An automated pharmaceutical plant provides an example. A drug is manufactured by mixing reagents in a reactor. The reaction gives off intense heat. If the reactor overheats, the drug will be contaminated with byproducts. The reactor is kept at an acceptable temperature by circulating cold water through a cooling jacket. In this case, the automation software is considered to be high-risk because a lost batch of the drug could cost more than $100,000, and contamination could result in the death of a patient.
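The cooling safeguard in this example can be pictured as a minimal control-cycle sketch. The setpoint, high limit and action names below are hypothetical, chosen only to illustrate the logic, not figures from an actual plant:

```python
# Minimal sketch of the safeguard described above: keep the reactor within
# an acceptable temperature band by adjusting cooling-water flow, and abort
# the batch if the high limit is exceeded anyway. All values are assumed.

TARGET_C = 60.0       # assumed setpoint for the reaction
HIGH_LIMIT_C = 80.0   # assumed threshold above which byproducts form

def control_step(temp_c: float) -> str:
    """Return the action for one control cycle given the reactor temperature."""
    if temp_c > HIGH_LIMIT_C:
        return "abort-batch"          # contamination risk: stop the reaction
    if temp_c > TARGET_C:
        return "increase-cooling"     # circulate more cold water
    return "hold"                     # within the acceptable band

print(control_step(85.0))   # abort-batch
```

The high-risk classification in the paragraph above is exactly why logic this simple still warrants rigorous validation: a wrong branch here could cost a batch or harm a patient.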
If management supports taking the concept of the potential automation to the next phase, a formal project plan is prepared.
A successful business application is one that reliably does what its users need it to do. Consequently, a system can’t be properly selected or tested without adequately defining requirements. Generally, these are users’ requirements, technical requirements, and requirements for software providers such as financial stability, support services and industry expertise. The requirements are revised as needed during the risk management and validation process.
The purpose of the third phase is to purchase hardware and software that meet the defined requirements from qualified vendors.
The best selection processes are those where alternative systems are evaluated firsthand through demo CDs, vendor demonstrations and/or visits to similar organizations that use the system. The selection process should result in:
A clear picture of how well each alternative meets the requirements
Documentation on why the selected system was chosen
Qualification of the selected vendors
The methods used to qualify application developers depend on the system’s risk level. With existing systems, qualification could be based on your experience with the vendor and its products. With new, low-risk systems, vendors can be qualified based on their reputation, a third-party registration (e.g., ISO 9001) or a survey. For moderate- and high-risk systems, an on-site audit is recommended. If the software was developed using defect prevention and detection techniques, it’s less likely to fail. (See Capers Jones, Software Quality, Analysis and Guidelines for Success, International Thomson Computer Press, 1997.) For high-risk systems, a review of known defects in the software is also strongly recommended.
Training and support services also are contracted, if necessary.
At the beginning of phase 4, the system is installed for use in a test environment. Needed integrations with other systems are designed and issues are addressed.
Even a carefully chosen system often isn’t ideal. It might not have a feature you need, it might not meet all regulatory requirements or it might not have a safeguard that you want. To discover such issues, the project team must identify the requirements the system doesn’t meet and conduct a detailed risk evaluation.
For example, if the drug company wants to run the plant unattended overnight, operators will have to be notified if the reactor temperature exceeds the desired limits. A system that simply sets off alarms in the plant doesn’t meet the requirement. Two of the general options used for resolving system limitations can be applied in this example:
Change the system. Software can be integrated that will page an operator when the alarm is set off.
Change the requirement to eliminate the issue. Drop the requirement that the plant run unattended overnight.
Two additional options are available for resolving discrepancies between the system and requirements, or to reduce risk:
Introduce protective measures in the process in which the system is used
Provide instructions and training to users to help prevent operator error
During phase 4, configurations are designed for configurable systems, such as setting up user groups and privileges, naming variables, developing workflows and creating reports.
At this point, a validation plan is written to describe what the validation will include.
In phase 5, the solutions developed in the previous phase are implemented and formally or informally tested. Returning to the example of the pharmaceutical plant, software that pages operators will be purchased and integrated with the plant automation software. It will be configured so that it will page the next operator on the on-call list if the first operator doesn’t respond within five minutes.
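The escalation rule just described, page the first operator and move to the next if there is no acknowledgment within five minutes, can be sketched as follows. The operator names and the acknowledgment callback are hypothetical stand-ins for a real paging interface:

```python
# Sketch of the on-call escalation rule: page operators in list order,
# stopping at the first acknowledgment. `acked(op)` stands in for waiting
# up to ACK_TIMEOUT_MIN minutes for operator `op` to respond to the page.

ACK_TIMEOUT_MIN = 5   # timeout from the configuration described above

def escalate(on_call: list[str], acked) -> list[str]:
    """Return the operators paged, in order, stopping at the first
    acknowledgment."""
    paged = []
    for op in on_call:
        paged.append(op)      # send the page to this operator
        if acked(op):         # acknowledged within the timeout
            break
    return paged

# First operator misses the page; the second responds.
print(escalate(["ana", "ben", "cai"], acked=lambda op: op == "ben"))
# ['ana', 'ben']
```

A test of this configuration during the qualification phase would deliberately let the first page time out to confirm that the second operator is actually paged.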
The risk evaluations are revised, and if necessary the residual risks associated with potential failures or hazards are reduced. If the risk is still high and can’t be further reduced, a risk-benefit analysis is performed to determine if the potential benefits from using the system outweigh the risks.
Phases 4 and 5 will mesh if solutions are implemented and tested when they’re developed.
In phase 6, the project team learns the system and prepares to use it. For example, pharmaceutical operations personnel will learn how to operate the automation software so that they can prepare user procedures, training materials and, in the next phase, validation tests. The information technology department prepares the procedures that it will need to support users and maintain the system. IT personnel become familiar with the system so that they can support it.
The system’s structure is captured in a specification that lists all the system’s hardware and software with relevant technical requirements. In the future, the specification will be kept up to date with changes to the system.
In phase 7, the system is formally tested before it’s used in operations. Testing demonstrates that the system has been properly installed (installation qualification) and is capable of meeting requirements (operational qualification).
Some aspects of the pharmaceutical plant automation that would be tested include:
Ability to process information and perform operations completely and accurately through defined limits
Security controls on access to the system and the ability to perform specific tasks
Transferring data and/or commands between the control software and devices (e.g., valves) and vice versa (e.g., measuring temperature via thermocouples)
Maintaining data without loss, corruption or unintentional modification
Performance levels (e.g., response times), especially when high-transaction volumes are an issue
Methods that control risks associated with failure (e.g., the paging system)
Operating system services used by the software, such as loading conditions, file operations, handling error conditions and memory constraints
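The first item in the list above, processing completely and accurately through defined limits, is typically exercised with boundary tests. A minimal sketch, assuming a hypothetical temperature-range check as the function under test:

```python
# Boundary-test sketch: exercise a function at and just beyond its defined
# limits. The range-check function and its limits are stand-ins for
# illustration, not part of any real plant software.

LOW_C, HIGH_C = 20.0, 80.0   # assumed operating range

def temp_in_range(temp_c: float) -> bool:
    """Accept temperatures within the defined operating limits, inclusive."""
    return LOW_C <= temp_c <= HIGH_C

# Boundary cases: both limits are accepted; values just outside are rejected.
cases = {20.0: True, 80.0: True, 19.9: False, 80.1: False}
results = {t: temp_in_range(t) == expected for t, expected in cases.items()}
print(all(results.values()))   # True
```

Operational qualification documents runs like this so an independent reviewer can confirm the system behaves correctly at the edges of its normal operating range, not just in the middle of it.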
Phase 8 extends from when the system is installed in the production environment to when it’s retired. The major activities of this phase are:
Ensuring that the system works in the production environment. In the pharmaceutical plant, three test batches of the drug will be produced, and data on the reactions will be carefully monitored to ensure that the automation maintains appropriate control of the reactions.
Using the system in compliance with approved user procedures
Supporting and managing the system following approved IT procedures
Using a documented change control process, including appropriate testing, to ensure that the system retains its ability to function properly after changes and to provide records of changes
Monitoring failures to determine if additional actions are needed to reduce their frequency or consequences
Planning for the system’s retirement is normally done at the same time as planning for a new system. The process required to preserve or transfer data to the new system is carefully evaluated, planned and tested.
The RiskVal Life Cycle helps ensure that purchased systems are capable of doing what an organization requires and are compliant with relevant regulatory requirements. The system might still fail, but the failures will be less frequent and the consequences less severe. In addition, the process will create the documentation needed to demonstrate to regulatory bodies that the system has been appropriately validated.
Note: This article is based in part on The Computer System Risk Management and Validation Life Cycle by R. Timothy Stein, which will be available soon from Paton Press.
R. Timothy Stein, Ph.D., is the founder and president of Business Performance Associates (www.bpaconsultants.com). Stein has more than eight years of computer system implementation and validation experience, and splits his time between software validation activities and assisting organizations in establishing quality systems. Paton Press will publish Stein’s book, entitled The Computer System Risk Management and Validation Life Cycle, in the summer of 2005. You can contact him at firstname.lastname@example.org or (408) 366-0848. RiskVal is a registered trademark of R. Timothy Stein.