The following is a series of questions commonly asked by beginners to the RMQ discipline. The written responses are intended to provide a high-level answer with a summary of the relevant topics, practices and/or techniques. As many of the FAQs address broad topics, the responses cannot address the topic in complete detail, but instead point the reader to sources where additional information can be found.
The RMQSI Knowledge Center is here to help. The site is designed to accommodate both reliability beginners, who may not know where to begin, and experienced practitioners looking for information on a specific RMQ practice.
Beginners are encouraged to walk through the RMQ introduction, which begins with the General Terms and Definitions, and continue through the numbered blocks (above) of the RMQ Beginner’s Road Map. Additional FAQs are provided below.
Experienced practitioners can search by specific RMQ Topics using the navigation tool on the Home Page. After selecting a specific topic from the drop-down list, the navigation graphic is activated, and provides users with quick links to training, tools, etc. related to the selected topic. One can also access the full collection of RMQ Topic Pages through the “Knowledge” menu option at the top of the page, under the “By R&M Topic” category.
Search by Resource
Users may also search or browse through all resources of a particular type (e.g., Manuals, Tech Briefs, etc.) by selecting the desired resource under the “By Type” category under the Knowledge menu option.
Ask us a Question
If you’re having trouble finding the information you need, or if you have a very specific question, please feel free to Submit a Free Inquiry to one of our Subject Matter Experts (SMEs) using the provided link.
Testing can be performed for a variety of purposes. One simple example is the difference between failure discovery testing and demonstration testing. In the former, failures are desirable because they enable engineers to identify how a product or system will fail, and then take steps to correct/modify the design to prevent such failures or delay their onset. This approach is commonly known as Reliability Growth Testing (RGT), though Reliability Growth can also be achieved without testing. RGT is typically performed on a collection of prototypes in the later design stages, such that there is time for additional modifications based on any design flaw(s) identified during the testing. Analysis-based reliability growth techniques have grown in popularity because they are performed earlier in the design process, when it is less costly to alter the design.
Demonstration testing, on the other hand, is generally used to evaluate a product’s or system’s performance by demonstrating a certain level of reliability (e.g., how long it will operate before it fails). This Reliability Demonstration/Qualification Testing (RDT/RQT) is typically performed in the latter stages of product development, at which point failures are undesirable and underperformance may lead to costly and/or time-consuming design reevaluations. A similar type of demonstration testing, known as Production Reliability Acceptance Testing (PRAT), is usually performed during the production phase of a product’s life cycle to ensure that the production process can achieve the reliability of the design – in other words, to ensure that the manufacturing process does not introduce flaws that hinder performance.
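As a concrete illustration of demonstration test planning, the total test time required for a zero-failure demonstration can be computed if an exponential (constant failure rate) life distribution is assumed. This is only a sketch of one common textbook result, not a prescription from any particular standard; the function name and values are illustrative:

```python
import math

def zero_failure_test_time(mtbf_required, confidence):
    """Total test time (in the same units as the MTBF) needed to
    demonstrate mtbf_required at the given confidence level, assuming
    exponentially distributed times to failure and zero failures
    observed during the test: T = -MTBF * ln(1 - C)."""
    return -mtbf_required * math.log(1.0 - confidence)

# e.g., demonstrating a 2,000-hour MTBF at 90% confidence requires
# roughly 4,605 cumulative test hours spread across the units on test
hours = zero_failure_test_time(2000, 0.90)
```

The test time can be accumulated across multiple units tested in parallel, which is one reason demonstration tests are often run on several production-representative articles at once.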
Additional screening may be performed on a production lot to eliminate parts/systems with latent defects (defects that cannot be discovered through inspection). This Reliability Screening may include Environmental Stress Screening (ESS), Burn-In or Highly Accelerated Stress Screening (HASS), which differ in the type and severity of the stimuli applied to products to precipitate premature failures.
Once a general testing approach has been selected, it cannot simply be performed as-is. Instead, a number of factors must be considered and planned in order to achieve the desired results (or to test the appropriate factor). Design of Experiments is the common term and practice for this necessary due diligence in planning a test. Factors to consider include cost and/or schedule restrictions, applied stimuli (e.g., force, load cycling, temperature, voltage, etc.), available equipment, acceleration factors, and so on.
Although testing has long been central to reliability engineering, a more recent approach involves Modeling & Simulation (M&S) and other analysis-based activities. Such alternatives should be considered, as Testing and Simulation options may provide a more suitable fit for the available resources.
Testing is performed to observe a component’s or system’s behavior, which is often quantified through various performance metrics. Collecting and analyzing the data is a critical component for any test, and ensures that meaningful knowledge is gleaned from the test results. A Failure Reporting, Analysis and Corrective Action System (FRACAS) is an important tool for collecting and analyzing data down to the Root Cause of failure. Weibull Analysis is an important data analysis technique for characterizing life and failure mode characteristics of a part, product or system.
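To make the Weibull Analysis mentioned above concrete, a minimal sketch of the classic median-rank regression method is shown below, assuming complete (unsuspended) failure data. The function name is our own; real analyses would also handle censored data and goodness-of-fit checks:

```python
import math

def weibull_fit(failure_times):
    """Estimate Weibull shape (beta) and scale (eta) by median-rank
    regression: least-squares fit of ln(-ln(1-F)) against ln(t)."""
    t = sorted(failure_times)
    n = len(t)
    xs, ys = [], []
    for i, ti in enumerate(t, start=1):
        f = (i - 0.3) / (n + 0.4)          # Benard's median-rank approximation
        xs.append(math.log(ti))
        ys.append(math.log(-math.log(1.0 - f)))
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    beta = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
           sum((x - xbar) ** 2 for x in xs)
    eta = math.exp(xbar - ybar / beta)     # intercept gives ln(eta) = xbar - ybar/beta
    return beta, eta
```

The shape parameter beta characterizes the failure mode: beta < 1 suggests infant mortality, beta near 1 suggests random failures, and beta > 1 suggests wearout.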
This brief overview of the different reliability testing strategies only scratches the surface. A more in-depth discussion of the different types of tests, and of the factors to consider when planning a specific type of test, can be found in the System Reliability Toolkit V.
The inherent (i.e., designed-in) RMQ of a product is influenced by the strategy defined by the RMQ Program. This strategy originates from the RMQ requirements and available company resources to design, develop, test, manufacture and support a product over its entire life cycle.
Optimizing overall RMQ program costs considers the overall affordability of the intended product, both from the manufacturer’s investment to achieve the required (or desired) levels of RMQ, and from the customer’s perspective in using the product (cost of ownership). There needs to be a balance between the amount of money/resources that a company can afford to invest to achieve profitable success in a competitive environment and the amount of financial “burden” that a customer can be expected to pay to buy and maintain that product.
An automobile that is characterized by “overly robust” RMQ performance (compared to competitors) will have a prohibitive price tag for most customers. The automobile manufacturer would need to invest significant resources to design in those performance levels and may never be able to recover that investment (i.e., make a profit).
An automobile that is characterized by “sub-standard” RMQ performance (compared to competitors) will have had significantly less invested in the inherent design and development of the vehicle. The manufacturer may sell more cars at a lower price, and make money on post-warranty service and repairs, but the customer may have to invest significantly more (over the years) than would have been required to purchase the more expensive vehicle. The manufacturer will profit, but at what overall cost (loss of customer base/market share, potential liability, etc.)?
Activities that can help to optimize overall RMQ costs include:
Many “design for reliability” based activities essentially consist of a process that forces engineers to implement sound design principles. More specifically, it requires designers to consider the entire spectrum of potential use conditions (e.g., environments, loading, storage, etc.) and appropriately address the relevant factors in the product’s or system’s design. This typically includes Materials Selection considerations based on environmental conditions (e.g., heat, corrosivity) and the potential failure mechanisms (e.g., fatigue, wear) of the product’s application. Reliability Physics/Physics-of-Failure Analysis is a popular technique for considering the physics behind a component’s root causes of failure, and estimating their likelihood of occurrence. These types of decisions generally form the basis for subsequent Parts Selection and Application decisions that ultimately dictate the longevity of the system. In some cases, Counterfeit Parts Solutions may be needed to ensure that the components that make their way into the system are, in fact, of the intended reliability and quality, particularly when utilizing Commercial Off the Shelf (COTS) items. If a system remains operational over an extended period of time, Component Obsolescence Planning may be necessary if spares for a particular part that requires replacement become unavailable.
In addition to components, reliability is also designed into the system through the implementation of similar predictive-based analyses. Component failures are ultimately responsible for eventual system failures. Accordingly, known and/or observed component failure modes must be considered from an overall system perspective in order to identify the potential causes of a complete system failure. System-level analyses, such as a Failure Modes, Effects and Criticality Analysis or a Fault Tree Analysis, are useful approaches to determine the effects of lower-level component failures on higher-level assemblies and the entire system. These and other Reliability Modeling techniques help to identify critical components (e.g., linchpins) and to prevent or reduce susceptibility to single points of failure, through the subsequent introduction of redundancy or other failure-mitigating measures into the system’s design. Considering and/or anticipating possible mechanisms of failure during a product’s design helps to avoid costly oversights and produces a design that is more capable of withstanding the operational stressors that precipitate failures.
A number of Reliability Modeling and Prediction techniques of varying complexity have been developed for different types of analyses. A Failure Modes, Effects and Criticality Analysis (FMECA), for example, is a bottom-up analysis of the system that assists in the identification of possible failure modes, and the subsequent effects on the system. A Fault Tree Analysis, on the other hand, is a top-down review of the possible event combinations that can precipitate an undesired, high-level event (e.g., loss of system power). These are commonly used techniques to evaluate the design of a system, and identify and eliminate single-point failures (e.g., the failure of a single component that can cause a complete system failure).
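One common way to prioritize the failure modes identified in an FMECA is a Risk Priority Number (RPN), the product of severity, occurrence and detection ratings. The worksheet entries and ratings below are entirely hypothetical, and real programs tailor the rating scales to their own criteria:

```python
# Hypothetical FMECA worksheet rows: (failure mode, severity, occurrence,
# detection), each rated on a 1-10 scale.  RPN = S x O x D.
modes = [
    ("Pump seal leak",      7, 5, 3),
    ("Bearing seizure",     9, 2, 4),
    ("Controller lockup",   6, 4, 8),
    ("Connector corrosion", 4, 6, 6),
]

# Rank modes from highest to lowest criticality
ranked = sorted(modes, key=lambda m: m[1] * m[2] * m[3], reverse=True)
for name, s, o, d in ranked:
    print(f"{name:22s} RPN = {s * o * d}")
```

Note that a high-severity mode can still rank low if it is rare and easily detected, which is why many practitioners review severity separately rather than relying on RPN alone.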
There are additional techniques like Reliability Block Diagrams (RBD) that model a system as a collection of blocks representing the subsystems and/or components of the system. The blocks are graphically connected in a manner that reflects the series and parallel relationships between these lower-level assemblies/subsystems. Using such diagrams, one can estimate the reliability of the system from the reliability of its “parts”. A more detailed discussion of the various system modeling techniques can be found in the Reliability Modeling RELease Guide.
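The series/parallel arithmetic behind a Reliability Block Diagram can be sketched in a few lines. Assuming independent blocks, a series path requires every block to survive, while a parallel (redundant) group survives if any block does; the example system below is hypothetical:

```python
from functools import reduce

def series(rels):
    """All blocks must survive: R = product of block reliabilities."""
    return reduce(lambda acc, r: acc * r, rels, 1.0)

def parallel(rels):
    """System survives if any block survives: R = 1 - product of unreliabilities."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r), rels, 1.0)

# Hypothetical RBD: a power supply (R = 0.99) in series with a
# redundant pair of controllers (R = 0.95 each)
system_reliability = series([0.99, parallel([0.95, 0.95])])
```

Here redundancy raises the controller group from 0.95 to 0.9975, illustrating how an RBD quantifies the benefit of eliminating a single point of failure.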
To utilize these system-level models to predict a system’s reliability, one must first know (or predict) the reliability of the lower-level assemblies and specific parts/components. Several techniques can be used to perform these predictions, differing in complexity, input requirements and perceived accuracy. Some of the most common part/component reliability prediction techniques include:
- Statistical Data Analysis
- The availability of part/component failure data collected from testing, simulations and/or operational use typically lends itself to the traditional statistical data analysis approach. Using this technique, engineers attempt to match an appropriate statistical distribution (e.g., the Weibull distribution) to the dataset. Once the appropriate distribution is identified (i.e., the data matches the behavior of a specific distribution), analysts can predict/estimate a number of important reliability characteristics.
- Physics-of-Failure (PoF) Modeling
- This approach considers the various operational factors that affect the part/component’s predominant failure mechanisms (i.e., the most common types of failure). The developed models then predict the time to, or likelihood of, failure based on the underlying physics of the root failure mechanisms. However, based on the specificity of these models, there are many instances where a valid model may not be available.
- Empirical Approach
- In the absence of part-specific data, failure data from similar part-types or legacy components (i.e., surrogate data) used in similar operating conditions can provide a reasonable estimate of a part’s/component’s behavior.
- One might also use empirical prediction models, which adjust a base failure rate for a specific part-type with coefficients that quantify the impact of different operational stresses (e.g., part size, operating environment, loading, etc.) on a part/component’s likelihood of failure.
- Other Approaches
- Build and test, stress-strength interference analysis, modeling and simulation, etc.
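The empirical approach above can be sketched as a base failure rate adjusted by multiplicative “pi” factors. The factor names and values below are made up for illustration only; real factors come from handbook tables such as those in HDBK-217Plus, not from this sketch:

```python
def predicted_failure_rate(lambda_base, pi_factors):
    """Empirical model form: adjust a part-type base failure rate by
    multiplicative pi factors for environment, quality, stress, etc."""
    rate = lambda_base
    for factor in pi_factors.values():
        rate *= factor
    return rate

# Illustrative (made-up) inputs: base rate in failures per 10^6 hours,
# with pi factors penalizing a harsh environment and elevated temperature
lam = predicted_failure_rate(
    0.005,
    {"environment": 4.0, "quality": 1.5, "temperature": 2.2},
)
```

Summing the predicted rates of all parts (a series assumption) then yields a first-cut system failure rate for use in the system-level models described earlier.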
Some of the most popular tools and publications for reliability prediction have been developed by the staff of the RMQSI Knowledge Center. For example, HDBK-217Plus(TM): 2015, the latest version of the electronic reliability prediction model handbook that replaced the industry’s MIL-HDBK-217, is a product offered exclusively by the RMQSI Knowledge Center. Our staff also developed multiple revisions of the Nonelectronic Parts Reliability Databook (NPRD-2016), the Electronic Parts Reliability Databook (EPRD-2014) and the Failure Modes/Mechanisms Distributions (FMR-2013), three of the RMQ field’s most popular references for surrogate part/component reliability data.
The Reliability Prediction RELease Guide describes the differences between the various part- and system-level reliability techniques in greater detail, while the System Reliability Toolkit V provides a comprehensive, in-depth discussion on reliability modeling and prediction.
Data is a valuable commodity to a reliability program, because it provides a metric by which performance and longevity can be measured. Analysis of the collected data can also reveal specific characteristics of a part’s or component’s behavior (characteristic life, infant mortality vs. wearout failure), and even identify the Root Cause(s) of Failure, as well as possible causes of premature failures (e.g., production flaws, improper maintenance, etc.), when applicable. Furthermore, in the absence of failure data for the part/component in question, we can sometimes estimate its reliability (or some related metric) utilizing data from similar and/or legacy products with the same use conditions. Thus, there are a number of benefits that emphasize the importance of establishing an effective Reliability Data Collection and Analysis program.
From the RMQ perspective, raw data comes from any of the Testing and Simulation being performed, from production processes on the manufacturing floor, and from actual operation of products/services after delivery to the customers (both before and after the warranty period). Organizations can also obtain and use data generated by others on similar types of products and either compare them with their own experience, or use them as surrogate data sources if they don’t have corresponding data of their own. That being said, a company can collect as much data as it wants, but if it doesn’t invest in the skilled resources necessary to analyze and properly interpret it, then the collected data provides no added value to the organization. Even worse, it can lead the organization to make costly decisions based on “bad” interpretations. However, when done properly, analyzed data becomes a valuable input for component and system Reliability Modeling and Prediction, and can also be useful for system Affordability estimates and considerations during conceptual design stages.
For RMQ activities, useful data comes in the form of accumulated hours, number of failures experienced, root causes of failures, failure modes, number of maintenance actions required (and how long it took to fix each one), quality process monitoring (accept/reject, process capability), effectiveness of corrective actions, dollars invested in performing individual RMQ activities, etc. By collecting quality data during testing and/or operation, and employing the appropriate combination of detailed statistical analyses (e.g., a Weibull Analysis), insight can be gained into product failure rates, predominant failure modes (and their failure characteristics), the effectiveness of maintenance activities, identification of processes drifting out of spec, Return on Investment (ROI) for the RMQ Program, and the effectiveness of corrective actions implemented to “fix” any noted deficiencies. A structured approach to such a process is known as a Failure Reporting, Analysis and Corrective Action System (FRACAS), in which data is collected and analyzed throughout the product/system life cycle to identify behavioral trends and applicable solutions. This type of information is of value not only to the system for which it was collected/analyzed, but also to future systems utilizing similar components and/or assemblies.
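A simple first analysis of FRACAS records is a Pareto ranking of root causes, which shows where corrective-action dollars will do the most good. The records below are entirely hypothetical; real FRACAS entries carry far more fields (dates, operating hours, corrective action status, etc.):

```python
from collections import Counter

# Hypothetical FRACAS records: (unit serial, reported failure mode, root cause)
records = [
    ("SN-014", "no output",    "cold solder joint"),
    ("SN-031", "intermittent", "connector fretting"),
    ("SN-022", "no output",    "cold solder joint"),
    ("SN-019", "overheat",     "blocked fan inlet"),
    ("SN-045", "no output",    "cold solder joint"),
]

# Pareto view: which root causes account for the most failures?
for cause, count in Counter(r[2] for r in records).most_common():
    print(f"{count}x  {cause}")
```

In this toy dataset a single workmanship cause dominates, which would point the corrective-action effort at the soldering process rather than the design.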
There are two concepts associated with extending the life of a product. One deals with developing a robust product design to begin with (inherent product RMQ) so that the life of the product is longer than it would have been if less robust materials, parts, processes, etc. had been used.
The other is related to steps that can be taken to extend the service life of a product that has been on the market (or in customers’ possession) for a “very long time” and is approaching its end of life. The resources may not be available to develop or purchase a full-blown replacement product (time, Affordability). For a practical example, consider an old car, with very high mileage, lots of wear and tear, body rust, etc. What are your options if you can’t afford to buy a new car? You can replace the engine, overhaul the drive train, paint or replace rusted body parts, etc. For any of these actions, a conscious decision has been made to extend the life of an existing product. These factors and the decision-making process are what is known as a Lifetime Extension Assessment.
On a larger scale, the same types of decisions can be made (and frequently are) about extending the useful life of aircraft, ships, bridges, dams, and trains. Life extension activities can cover mechanical or electronic commodities, or even major software upgrades in lieu of hardware replacement. Hardware upgrades may include the use of newer technology materials (e.g., lighter composites) and more advanced electronics (reduced size, but with significantly more functionality). Based on the age of the system, Component Obsolescence Planning is often required as aging components become unavailable. In such cases, the selection of suitable replacement components requires the same Parts/Materials Selection considerations as in the original design phase. The Reliability Data Collection and Analysis from the system’s development and operation, particularly from a Failure Reporting, Analysis and Corrective Action System (FRACAS), can often provide useful information for component selection decisions and life extension assessments in general.
Reliability and quality are different product attributes related to overall performance and perceptions in the marketplace.
On one hand, reliability can be viewed as being inherent to the product design: it will perform reliably in the intended environment for the intended time under the intended conditions based on the level of reliability inherent in the parts, materials, software and processes used to develop that product. A reliable product does not, however, necessarily translate into a high quality product. If a product is not easy to use, if it does not look attractive, or if poor quality processes degrade the inherent reliability built into the product, then the product will likely be viewed as having poor quality and, possibly, as being unreliable.
On the other hand, a product can have “poor” inherent reliability, in that it doesn’t meet its reliability requirements, but may still be considered to have “high” quality if the product meets all of the quality attributes/processes defined for it by the manufacturer (easy to use, attractive, manufacturing processes in statistical control, quality processes that do not degrade the inherent reliability designed into the product). A high-quality product, therefore, does not necessarily mean that the product is highly reliable (or even acceptably reliable). In this case, the customer may still consider the product to be of poor quality (even though the quality processes are “within spec”) simply because it is unreliable.
A combination of high designed-in product reliability and high-quality processes that do not degrade that inherent product reliability will likely translate into a customer’s perception of high quality for a company’s products. Quality Management represents the practice of implementing controls to ensure that the inherent reliability of a product’s or system’s design is maintained through the item’s production and introduction to operation in the field. This often involves statistical controls, including the Testing and Simulation and subsequent Data Collection and Analysis of components during and after production, and the Root Cause Analysis of identified shortfalls in a component’s performance. The activities performed to design reliability into the system are covered in a number of the other FAQs on this page.
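The “statistical control” of production processes mentioned above can be sketched with simple control limits: establish a baseline, set limits at the mean plus or minus three standard deviations, and flag measurements that fall outside them. This toy version uses the plain standard deviation of the baseline for simplicity, whereas real control charts (e.g., an individuals/moving-range chart) estimate process variation differently; all values here are made up:

```python
import statistics

def control_limits(baseline):
    """Control limits at mean +/- 3 standard deviations of the baseline.
    (Simplified: real SPC charts use chart-specific sigma estimates.)"""
    mu = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)
    return mu - 3 * sigma, mu + 3 * sigma

# Hypothetical baseline measurements of some process characteristic
baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 10.0, 9.9]
lcl, ucl = control_limits(baseline)

# Flag new measurements that fall outside the control limits
out_of_control = [x for x in [10.0, 10.1, 11.4, 9.9] if not lcl <= x <= ucl]
```

An out-of-control signal does not by itself identify the problem; it triggers the Root Cause Analysis described above before the process drift can degrade the product’s inherent reliability.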
Even with an understanding of the various reliability practices and approaches, it can still be difficult to determine which activities are appropriate for one’s own system or organization. To begin, an organization must first define its reliability program, which should include reliability requirements, available resources (personnel, expertise, equipment), organizational culture and responsibilities, and other important (and unique) organizational factors. Reliability Problem Solving is then the process of identifying the appropriate reliability activities, and tailoring these practices to the conditions defined in the reliability program. While fairly simple in theory, it can be rather complex to perform, as special circumstances often come into play. The type of product(s) designed and/or manufactured by a company will often dictate, or at least influence, the activities that are performed. For example, reliability screening (i.e., tests to remove products with latent (unnoticeable) defects) is performed far more often for electronic components. Similarly, an organization that specializes in a particular type of system (e.g., pumps, relays, etc.) will often forgo failure discovery testing, because it is already very familiar with the typical failure modes of these types of systems. It is not feasible to address even a fraction of the possible scenarios in a brief overview such as this, but the RMQSI Knowledge Center provides a free reliability self-assessment tool called RASTER to help get engineers started with this process.
From a different perspective, an effective reliability program may focus not on the development of reliable products, but instead on ensuring that reliable products perform to expectations. In other words, some organizations rely on effective Asset Management practices to ensure that they get the most out of their investment in products and systems. This requires a combination of data collection and maintenance activities to monitor the system’s performance and appropriately address anomalies by identifying and/or correcting the root cause. Such collection, analysis and remediation efforts are typically performed as part of a Reliability Centered Maintenance program, which, when effectively applied, provides a cost-effective solution for asset management.