RSC Topic: Risk Management & Risk Register

  • Failure Mode and Effects Analysis

    Failure Mode and Effects Analysis (FMEA) is a structured, step-by-step method used to identify and evaluate potential failures in a product, process, or system before they occur. It focuses on how something can fail (failure modes), why it might fail (causes), and what happens if it fails (effects).

    In manufacturing and operations, FMEA is typically performed by a cross-functional team and follows a standardized sequence:

    • Define the scope of the product, process, or system being reviewed.
    • List functions and requirements for each step, component, or subsystem.
    • Identify potential failure modes for each function (ways it might not meet requirements).
    • Determine potential effects of each failure mode on the customer, process, or downstream operations.
    • Identify potential causes and existing controls that detect or prevent each failure mode.
    • Assign ratings for severity, occurrence, and detection, then calculate a risk priority metric (such as Risk Priority Number, RPN).
    • Prioritize failure modes based on the ratings and define specific actions to reduce risk.
    • Update the analysis after actions are implemented, revising ratings and documentation.
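
    As a minimal sketch of the rating and prioritization steps above: the RPN is conventionally the product of the severity, occurrence, and detection ratings. The 1 to 10 scales, the tie-break rule, and all class, field, and example names below are illustrative assumptions rather than part of any specific FMEA standard.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class FailureMode:
    """One FMEA worksheet row (field names are illustrative)."""
    description: str
    severity: int    # assumed scale: 1 (negligible effect) to 10 (most severe)
    occurrence: int  # assumed scale: 1 (rare) to 10 (frequent)
    detection: int   # assumed scale: 1 (almost certainly detected) to 10 (unlikely to be detected)

    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")
        return self.severity * self.occurrence * self.detection


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Sort by RPN, highest first; ties go to the higher severity (assumed rule)."""
    return sorted(modes, key=lambda m: (m.rpn(), m.severity), reverse=True)


# Hypothetical failure modes for illustration only.
modes = [
    FailureMode("Torque below specification", severity=8, occurrence=3, detection=4),
    FailureMode("Label missing", severity=5, occurrence=2, detection=2),
    FailureMode("Seal not seated", severity=9, occurrence=2, detection=6),
]
ranked = prioritize(modes)
for m in ranked:
    print(f"{m.rpn():4d}  {m.description}")
```

    After corrective actions are implemented, the same rows would be re-rated and re-sorted, which corresponds to the final update step in the sequence. Breaking ties on severity is one possible design choice; teams may apply different rules.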

    FMEA can be applied at different levels, such as design FMEA (DFMEA) for product designs and process FMEA (PFMEA) for manufacturing or service processes. It is often used alongside Root Cause Analysis (RCA): RCA investigates why an actual failure occurred, while FMEA anticipates and ranks potential failures so they can be addressed systematically.

  • Product Safety

    Product safety commonly refers to the set of practices, controls, and requirements used to ensure that a product does not introduce unacceptable risk to people, property, or the environment throughout its lifecycle. In industrial and regulated manufacturing, it links design, production, testing, documentation, and field feedback to prevent hazards arising from normal use, reasonably foreseeable misuse, or failures.

    Key elements of product safety in manufacturing

    In an operations context, product safety typically includes:

    • Hazard identification and risk assessment: Systematically analyzing how a product might cause harm (for example, mechanical, electrical, chemical, software, or usability-related hazards) and evaluating the associated risks.
    • Design controls: Engineering features, materials, and interfaces so that safety requirements are built into the design, including fail-safes, redundancy, and protective limits where appropriate.
    • Process controls and validation: Ensuring manufacturing processes consistently produce products that meet defined safety requirements through documented procedures, qualifications, and in-process checks.
    • Inspection and testing: Verifying and validating safety-related characteristics, such as pressure tests, functional safety checks, labeling, and traceable inspection records.
    • Labeling and information for use: Providing clear markings, warnings, instructions, and limitations of use needed for safe operation, maintenance, and disposal.
    • Change and configuration control: Managing design and process changes so that product safety impacts are assessed, documented, and communicated before implementation.
    • Field feedback and corrective action: Monitoring in-service performance, incidents, and customer feedback, and using structured CAPA processes when safety-related nonconformities are identified.

    Product safety and regulated environments

    In regulated industries such as aerospace, medical devices, automotive, and certain process industries, product safety is closely tied to quality management systems and sector-specific standards. It is typically addressed through:

    • Documented safety requirements and acceptance criteria integrated into design and production records.
    • Traceability of critical components, materials, and process parameters that affect safety.
    • Formal review, approval, and version control of safety-related documents, such as specifications, work instructions, test methods, and software.
    • Evidence packages supporting audits and regulatory reviews, including risk analyses, verification/validation results, and change histories.

    Operational view in OT/IT and MES/ERP environments

    From a systems perspective, product safety appears in how data and workflows are set up across OT and IT:

    • MES integration: Routing, work instructions, and data collection steps that enforce safety-critical operations, signoffs, and tests at the right process stages.
    • ERP and configuration management: Managing bills of material, approved supplier lists, and controlled revisions for safety-critical parts and materials.
    • Electronic records: Capturing and preserving evidence that each unit or lot met defined safety requirements, including test results, deviations, and concessions.
    • Access control and permissions: Restricting who can modify safety-related parameters, documents, or software in production systems.

    Common confusion

    • Product safety vs. worker safety (occupational safety): Product safety focuses on the safety of the delivered product in use. Worker safety focuses on protecting employees and contractors while they manufacture, test, or service the product. The two areas are related but governed by different requirements and practices.
    • Product safety vs. product quality: Quality covers whether a product meets specified requirements. Product safety focuses specifically on avoiding harm, which may involve requirements beyond traditional quality attributes like performance or aesthetics.

    Relation to aerospace quality standards such as AS9100

    In aerospace and similar high-consequence sectors, standards such as AS9100 incorporate product safety expectations into design, production, configuration management, and risk management clauses. Organizations using these standards typically treat product safety as a cross-functional responsibility that connects engineering, operations, quality, and supply chain, supported by documented processes and objective evidence.

  • Fault Tree Analysis (FTA)

    Fault Tree Analysis (FTA) is a structured, top-down method used to analyze how combinations of component, process, or human failures can lead to a defined undesired event, such as a safety incident, equipment failure, or critical nonconformance. It represents the logical relationships between basic faults and the top event in a graphical tree using standardized symbols and logic gates.

    How Fault Tree Analysis works

    FTA typically starts with a clearly defined top event (for example, “loss of containment in reactor” or “incorrect part installed on aircraft assembly”). The analysis then proceeds by repeatedly asking what conditions or failures could cause that event, and mapping them in a tree-like structure:

    • Top event: The system-level failure or hazardous condition being analyzed.
    • Intermediate events: Higher-level causes that can contribute to the top event.
    • Basic events: Lowest-level causes that are not further decomposed (e.g., component failure, human error, software fault, incorrect parameter).
    • Logic gates: Symbols (commonly AND and OR gates) that show how combinations of events lead to higher-level failures.

    In regulated manufacturing and safety-critical industries, FTA is often used to:

    • Identify combinations of failures that could lead to hazardous conditions or noncompliant product.
    • Support risk assessments, safety cases, and reliability analyses for complex systems.
    • Provide a traceable structure for investigations and root cause analysis, especially where evidence must support regulatory or customer review.

    Use in industrial and regulated environments

    Within industrial operations, FTA commonly appears in:

    • Process and equipment design: Evaluating how instrument, control, and mechanical failures can propagate through a production line, production cell, or automated system.
    • Safety and risk management: Supporting functional safety, hazard analysis, and risk mitigation activities, especially where formal justification of risk controls is required.
    • Quality and reliability engineering: Mapping failure paths that could lead to scrap, rework, escapes, or field failures, often in conjunction with FMEA and other analysis methods.
    • Root cause analysis: Providing a structured, evidence-based investigation framework when simpler tools (such as 5 Whys) are not sufficient for complex or safety-critical issues.

    FTA can be performed qualitatively, focusing on structure and logic of failure paths, or quantitatively, where probabilities are assigned to basic events to estimate the likelihood of the top event.
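
    A quantitative evaluation can be sketched as follows, assuming the basic events are statistically independent: an AND gate multiplies its input probabilities, and an OR gate takes one minus the product of the complements. The tree, event names, and probability values are hypothetical and chosen only to illustrate the arithmetic.

```python
from math import prod


def and_gate(probabilities):
    """Output occurs only if every input occurs (independent inputs assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output occurs if at least one input occurs (independent inputs assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities per demand (illustrative values only).
p_sensor_fails = 0.01
p_alarm_fails = 0.02
p_relief_valve_stuck = 0.001

# Intermediate event: detection is lost if the sensor OR the alarm fails.
p_detection_lost = or_gate([p_sensor_fails, p_alarm_fails])

# Top event: overpressure is not relieved if detection is lost AND the relief valve is stuck.
p_top_event = and_gate([p_detection_lost, p_relief_valve_stuck])

print(f"P(detection lost) = {p_detection_lost:.4f}")
print(f"P(top event)      = {p_top_event:.2e}")
```

    If basic events share a common cause, the independence assumption does not hold and these simple gate formulas can misestimate the top-event probability.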

    What FTA includes and excludes

    FTA typically includes:

    • Systematic mapping of potential failure paths for a single defined top event.
    • Hardware, software, process, and human failure modes that can be logically combined.
    • Explicit assumptions, conditions, and boundary definitions for the analysis.

    FTA typically does not include:

    • Bottom-up enumeration of all possible failure modes without a specific top event (that is more characteristic of FMEA).
    • Project management, scheduling, or resource planning functions.
    • Guarantees of system safety or compliance; it is an analysis tool, not a certification.

    Common confusion

    Fault Tree Analysis vs. FMEA: FTA is top-down, starting from a defined undesired event and working backward to identify contributing faults. Failure Mode and Effects Analysis (FMEA) is bottom-up, starting from component or process failure modes and examining their effects. In practice, the two are often used together in manufacturing and regulated environments.

    Fault Tree Analysis vs. 5 Whys / Ishikawa diagrams: 5 Whys and fishbone (Ishikawa) diagrams are simpler tools often used for early problem structuring. FTA is more formal and logic-based, and is commonly used when a rigorous, documented analysis of system failures is required.

    Connection to root cause and investigations

    In aerospace, pharmaceutical, medical device, and other regulated sectors, FTA-style analysis is frequently used as part of root cause analysis and safety investigations. The fault tree structure helps document how evidence supports specific failure paths, how alternative paths were evaluated, and where controls or design changes may interrupt those paths.

  • Risk-Based Prioritization

    Risk-based prioritization is the practice of ranking actions, issues, or investments according to their assessed level of risk, so that limited resources are directed first to the items with the highest potential impact and likelihood of occurrence.

    In industrial and manufacturing environments, risk-based prioritization commonly refers to using structured risk assessments to decide which problems, projects, or controls should be addressed first. This can apply to equipment maintenance, CAPA activities, process deviations, cybersecurity vulnerabilities, change controls, or compliance gaps.

    Key characteristics

    Risk-based prioritization typically involves:

    • Defining risk criteria, such as safety, product quality, regulatory impact, financial loss, data integrity, or supply continuity.
    • Scoring likelihood and impact using a consistent scale (for example, qualitative categories or numeric scores).
    • Combining scores into an overall risk rating (for example, a risk matrix or risk ranking formula).
    • Ordering the backlog or plan so that higher risk-rated items are addressed before lower risk items, within practical constraints.
    • Reviewing and updating priorities as new data, incidents, or regulatory expectations emerge.
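
    The scoring, rating, and ordering steps above might look like the following minimal sketch. The 5-point scales, the multiplication rule, the rating thresholds, and the backlog items are all assumptions made for illustration; an organization's own risk assessment method would define the actual values.

```python
# Assumed 5-point qualitative scales mapped to numeric scores.
LIKELIHOOD = {"rare": 1, "unlikely": 2, "possible": 3, "likely": 4, "almost certain": 5}
IMPACT = {"negligible": 1, "minor": 2, "moderate": 3, "major": 4, "severe": 5}


def risk_score(likelihood, impact):
    """Combine the two ratings into one score (likelihood x impact, an assumed formula)."""
    return LIKELIHOOD[likelihood] * IMPACT[impact]


def risk_rating(score):
    """Map a score to a matrix band; the thresholds are illustrative."""
    if score >= 15:
        return "high"
    if score >= 6:
        return "medium"
    return "low"


# Hypothetical backlog: (work item, likelihood, impact).
backlog = [
    ("Update form template", "likely", "negligible"),
    ("Patch historian server", "possible", "major"),
    ("Close overdue deviation on filling line", "likely", "severe"),
    ("Replace worn guard interlock", "unlikely", "severe"),
]

# Highest score first; sorted() is stable, so equal scores keep their backlog order.
ordered = sorted(backlog, key=lambda item: risk_score(item[1], item[2]), reverse=True)

for name, likelihood, impact in ordered:
    score = risk_score(likelihood, impact)
    print(f"{score:2d}  {risk_rating(score):6s}  {name}")
```

    Practical constraints such as deadlines or resource availability would typically be applied on top of this ordering rather than encoded in the score itself.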

    Operational use in manufacturing and regulated environments

    In operations and manufacturing systems, risk-based prioritization commonly shows up in:

    • Quality and CAPA: Prioritizing investigations, corrective actions, and preventive actions based on product and patient impact, recurrence risk, and regulatory exposure.
    • Maintenance and asset management: Choosing which equipment to inspect, repair, or upgrade first, based on failure consequences for safety, quality, and uptime.
    • Change control: Determining which process or system changes require deeper review, validation effort, or phased implementation.
    • Cybersecurity and OT/IT controls: Ranking vulnerabilities, system patches, and network hardening tasks according to potential operational disruption and data or safety impact.
    • Audit and remediation planning: Sequencing remediation tasks after internal or external audits according to risk level and deadlines.

    What it includes and excludes

    Risk-based prioritization includes the decision logic and process used to order work items based on risk. It does not itself define how risk is measured; that is provided by the organization’s risk assessment method or framework. It is also distinct from simple time-based or first-in-first-out prioritization, which do not consider risk level.

    Common confusion

    Risk-based prioritization is often mentioned alongside related concepts:

    • Risk assessment: Identifying and analyzing risks. Risk-based prioritization uses the output of risk assessment to order actions.
    • Risk management: The broader lifecycle of identifying, assessing, treating, and monitoring risk. Risk-based prioritization is one step within that lifecycle.
    • Criticality analysis: Focused on ranking assets or processes by their importance. Risk-based prioritization may use criticality as one input but typically considers likelihood and impact more broadly.

    Relation to standards and systems

    Many industry and quality standards encourage or reference risk-based thinking, and organizations often implement risk-based prioritization within MES, QMS, CMMS, and cybersecurity tools. In these systems, risk scores or criticality ratings are used to sort queues, escalate events, or trigger workflows according to defined thresholds.

  • Risk Appetite

    Risk appetite commonly refers to the amount and type of risk an organization is willing to accept in pursuit of its objectives. It is a high-level, strategic concept that sets boundaries for decision making across functions such as operations, quality, IT/OT, and supply chain.

    What risk appetite includes

    In an industrial or manufacturing context, risk appetite typically:

    • Is defined by senior leadership and, where applicable, overseen by a board or governance body
    • Expresses how much variation from targets, disruption, or uncertainty is acceptable to achieve business goals
    • Covers multiple risk types such as safety, quality, regulatory compliance, cybersecurity, supply chain, and financial risk
    • Guides the design of controls, monitoring, and escalation thresholds in OT/IT systems, MES, QMS, and ERP workflows
    • Provides a reference point for prioritizing mitigations, investments, and contingency planning

    Risk appetite is usually articulated in qualitative terms (e.g., “very low appetite for product quality and safety risk”) and may be supported by quantitative indicators (e.g., defect rates, downtime levels, or incident frequencies that are considered acceptable).

    What risk appetite is not

    Risk appetite is distinct from:

    • Risk tolerance: More specific acceptable variation around a particular metric or objective (for example, tolerance for a certain number of minor deviations per quarter). Tolerances are often numeric and operational.
    • Risk capacity: The maximum level of risk the organization could theoretically bear before threatening its viability. Appetite is chosen; capacity is a constraint.
    • Individual risk decisions: Day-to-day approvals, deviations, or change controls should align with risk appetite, but are not the appetite itself.

    Operational role in manufacturing and regulated environments

    In regulated industrial operations, risk appetite is used to align how strict or flexible processes and systems should be. Examples include:

    • Setting how conservative safety interlocks, alarm limits, or access controls should be in OT and automation systems
    • Determining how much residual risk is acceptable when qualifying new equipment, materials, or suppliers
    • Defining when deviations, nonconformances, or cyber events must be escalated, investigated, or result in production holds
    • Informing investment decisions in redundancy, backup systems, and business continuity for critical manufacturing lines
    • Aligning quality management and CAPA priorities with the organization’s stated appetite for quality and compliance risk

    Risk appetite is often documented as part of enterprise risk management, information security governance, or quality and safety policies, and then referenced in procedures such as change control, vendor qualification, incident response, and validation.

    Common confusion

    Risk appetite is commonly confused with:

    • Risk tolerance: Appetite is high level and strategic; tolerance is detailed and metric-specific. For example, an organization may have low appetite for data loss, with a tolerance of zero loss of regulated records but a limited tolerance for temporary reporting delays.
    • Risk attitude of individuals: Personal comfort with risk does not define organizational risk appetite. Formal governance defines appetite to maintain consistency across teams and sites.

    Connection to systems and standards

    While specific frameworks and standards may use slightly different terms, risk appetite generally underpins how organizations implement controls across MES, ERP, QMS, and cybersecurity programs. It influences how strictly requirements are interpreted, how exceptions are handled, and how much residual risk is accepted when balancing throughput, cost, and compliance.

  • Hazard Analysis

    What hazard analysis is

    Hazard analysis is a systematic process used to identify, describe, and characterize potential sources of harm (hazards) associated with a system, activity, or operation before and during its use.

    It focuses on understanding **what could go wrong**, under what conditions, and with what possible consequences, without yet deciding how to control or accept the risk.

    Typical elements of a hazard analysis

    A hazard analysis commonly includes:

    – Defining the system, process, or operation being studied, including boundaries and interfaces.
    – Identifying conditions, events, or failures that could lead to undesired outcomes such as injury, environmental impact, product nonconformance, loss of function, or equipment damage.
    – Describing how each hazard could be initiated, propagate through the system, and be detected.
    – Characterizing each hazard in terms of context, affected components, potential consequences, and any existing safeguards.
    – Documenting assumptions, data sources, methods used, and rationale for identified hazards.

    Outputs are usually a structured list or database of hazards with their causes and potential consequences, plus traceable documentation that can be reviewed and updated over the lifecycle of the system or process.

    Use in industrial and manufacturing environments

    In industrial operations and regulated manufacturing, hazard analysis is often:

    – An input to formal **risk assessment** and risk control activities.
    – Used during process design, technology transfers, and changes to equipment, automation, or control systems.
    – Integrated with quality and safety approaches such as FMEA, HAZOP, and process hazard analysis to support compliant, documented decision-making.

    The results of a hazard analysis support engineering, operations, quality, and safety teams as they later prioritize risks, specify controls, and maintain configuration and change records.

  • Risk Register

    A risk register is a structured document or database used to record, track, and update risks over time. It typically lists each identified risk along with key attributes such as description, likelihood, impact, owner, current controls, planned actions, and status. In industrial and regulated manufacturing environments, it is a core tool for formal risk management.

    Typical contents of a risk register

    While formats vary, most risk registers capture at least the following information for each risk:

    • Risk ID or unique reference
    • Risk description, including cause and potential consequence
    • Category (for example: safety, quality, compliance, cybersecurity, supply chain)
    • Likelihood and impact ratings, often combined into a risk priority or score
    • Current controls or mitigations in place
    • Planned actions or additional mitigations, with target dates
    • Risk owner responsible for monitoring and follow-up
    • Status (for example: open, in progress, closed, accepted)
    • Review dates and notes from periodic reassessments
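
    One way to picture these fields is as a simple record type. The sketch below is an assumption-laden illustration: the field names, the likelihood x impact score, the status values, and the sample entries are hypothetical, and a real register would follow the organization's own template.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class RiskEntry:
    """One row of a risk register (illustrative field names)."""
    risk_id: str
    description: str          # include cause and potential consequence
    category: str             # e.g. safety, quality, compliance, cybersecurity, supply chain
    likelihood: int           # assumed 1-5 scale
    impact: int               # assumed 1-5 scale
    owner: str
    current_controls: list = field(default_factory=list)
    planned_actions: list = field(default_factory=list)
    status: str = "open"      # open, in progress, closed, accepted
    next_review: Optional[date] = None
    review_notes: list = field(default_factory=list)

    @property
    def score(self) -> int:
        """Combined rating; multiplication is an assumed convention."""
        return self.likelihood * self.impact


def due_for_review(register, today):
    """Return open or in-progress entries whose review date is today or earlier."""
    return [
        r for r in register
        if r.status in ("open", "in progress")
        and r.next_review is not None
        and r.next_review <= today
    ]


# Hypothetical entries for illustration only.
register = [
    RiskEntry("R-001", "Single-source resin supplier; an outage could halt the molding line",
              "supply chain", 3, 4, "Procurement lead",
              current_controls=["Safety stock"], next_review=date(2024, 3, 1)),
    RiskEntry("R-002", "Shared operator login on HMI; unauthorized setpoint change possible",
              "cybersecurity", 2, 5, "OT security lead", next_review=date(2024, 6, 1)),
    RiskEntry("R-003", "Obsolete label template; mislabeling possible",
              "quality", 1, 2, "QA lead", status="closed", next_review=date(2024, 1, 15)),
]

overdue = due_for_review(register, today=date(2024, 4, 1))
print([r.risk_id for r in overdue])
```

    Keeping review dates and status as explicit fields makes periodic reassessment a simple query instead of a manual scan.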

    Use in industrial and regulated environments

    In manufacturing operations, a risk register commonly supports:

    • Operational risk management, such as equipment failures, OT/IT downtime, or loss of utilities
    • Quality and compliance risks, including process deviations, data integrity issues, or nonconformances
    • Safety and environmental risks, alongside formal hazard analyses and safety studies
    • Cybersecurity and OT risks, for example unauthorized access to control systems or loss of manufacturing data
    • Supply chain and supplier risks, such as single-source materials or long lead times

    The risk register may be maintained as a controlled spreadsheet, a module in a quality or risk management system, or part of broader governance, risk, and compliance (GRC) tooling. It is typically referenced during audits, management reviews, and change control, and is updated when new risks are identified or when controls change.

    Common confusion

    • Risk register vs. risk assessment: A risk assessment is the process and analysis used to evaluate risks. The risk register is the ongoing record where those evaluated risks and their current status are logged.
    • Risk register vs. issue log: A risk register focuses on uncertain future events. An issue log records problems that have already occurred. In practice, issues may trigger new entries in the risk register.
    • Risk register vs. hazard log: A hazard log is often specific to safety or environmental hazards. A risk register is broader and can include safety, quality, cybersecurity, and business risks in one place.

  • Controls

    In industrial and regulated environments, controls are specific, intentional measures used to manage risk, enforce requirements, and ensure that processes and systems operate within defined limits. They can be technical, procedural, or organizational, and they are usually documented, implemented, and monitored as part of a broader management system.

    What controls include

    Controls commonly refer to:

    • Technical controls: Configuration settings, system functions, or automated mechanisms, such as access control rules in a MES, network firewalls for OT systems, sensor interlocks, or automated validation checks in an ERP/MES integration.
    • Procedural controls: Documented procedures, work instructions, standard operating procedures (SOPs), or checklists that direct how tasks must be performed, reviewed, and recorded.
    • Organizational controls: Governance structures, defined roles and responsibilities, segregation of duties, approval workflows, and training requirements.

    In information security standards such as ISO 27001, controls are selected and applied to treat identified risks and to protect the confidentiality, integrity, and availability of information. Examples include password policies, backup procedures, change management processes, and supplier security requirements.

    How controls show up in operations

    Within manufacturing and industrial operations, controls typically appear as:

    • Configured system rules (for example, enforcing electronic signatures or limiting who can release batches).
    • Quality and compliance procedures (for example, in-process inspection steps and deviation handling workflows).
    • Physical safeguards (for example, machine guards, badge readers, or restricted access areas).
    • Monitoring and review mechanisms (for example, audit logs, exception reports, and periodic access reviews).

    Controls are often tied to specific risks, requirements, or standards. They are usually traceable, with evidence that they are defined, implemented, and operating as intended.

    What controls are not

    • They are not guarantees that incidents or nonconformities will never occur.
    • They are not the same as the overall management system; they are elements within that system.
    • They are not limited to IT or cybersecurity; they also cover production, quality, safety, and supplier-related processes.

    Common confusion

    • Controls vs. policies: Policies set high-level intent and expectations. Controls are concrete mechanisms that implement or enforce those policies.
    • Controls vs. procedures: A procedure can be a control when it is specifically designed to manage a defined risk or requirement, but not every procedural step is necessarily a control.
    • Controls vs. control systems: In automation, a control system is the hardware and software that regulates a process (such as a PLC or DCS). Within that system, individual settings, logic blocks, and interlocks act as controls in the risk and compliance sense.

    Relation to ISO 27001 and similar frameworks

    In frameworks like ISO 27001, controls are cataloged options that organizations select and tailor based on risk assessment results. In industrial and OT contexts, this often includes:

    • Access, authentication, and authorization controls across MES, historian, and shop-floor systems.
    • Change and configuration management controls for production recipes, control logic, and system patches.
    • Monitoring controls, such as log collection and review for both IT and OT assets.

    These controls support structured, evidence-based management of information and operational risks across existing systems and suppliers.

  • Criticality Segmentation

    Core meaning

    Criticality segmentation is a structured process for classifying and grouping assets, systems, or processes according to how important they are to safety, regulatory compliance, product quality, security, or business continuity.

    In industrial and manufacturing environments, it is commonly applied to:

    – Physical assets (machines, utilities, infrastructure)
    – OT and IT systems (control systems, MES, historians, ERP interfaces)
    – Production lines or process areas
    – Data flows and network zones

    The outcome is usually a set of defined criticality tiers (for example: critical, high, medium, low) that are used to guide risk assessments, protection measures, and operational priorities.
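
    As a minimal sketch of how tiers could be assigned, the example below rates each asset from 0 to 3 against a few impact criteria and lets the worst-case rating decide the tier. The criteria, the rating scale, the worst-case rule, and the asset names are assumptions for illustration; organizations define their own criteria and weighting.

```python
# Tier names indexed by worst-case impact rating (0 = lowest, 3 = highest); assumed scale.
TIERS = ("low", "medium", "high", "critical")


def criticality_tier(impact_ratings):
    """Return the tier driven by the highest rating across all criteria (assumed rule)."""
    if not impact_ratings:
        raise ValueError("at least one impact rating is required")
    for criterion, rating in impact_ratings.items():
        if rating not in range(len(TIERS)):
            raise ValueError(f"{criterion} rating must be 0-{len(TIERS) - 1}, got {rating}")
    return TIERS[max(impact_ratings.values())]


# Hypothetical assets and ratings for illustration only.
assets = {
    "Safety instrumented system": {"safety": 3, "quality": 1, "throughput": 2},
    "Electronic batch record server": {"safety": 0, "quality": 2, "throughput": 1},
    "Compressed air compressor": {"safety": 1, "quality": 1, "throughput": 1},
    "Office label printer": {"safety": 0, "quality": 0, "throughput": 0},
}

tiers = {name: criticality_tier(ratings) for name, ratings in assets.items()}
for name, tier in tiers.items():
    print(f"{tier:8s}  {name}")
```

    A worst-case rule keeps a single severe impact from being averaged away; a weighted score is an alternative when criteria should trade off against one another.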

    How it is used in industrial operations

    In regulated and manufacturing environments, criticality segmentation typically supports:

    – **Risk and security planning**: Deciding where to apply stricter cyber, safety, or access controls based on the potential impact of failure or compromise.
    – **Maintenance and reliability**: Prioritizing preventive maintenance and spares for highly critical equipment that affects safety, compliance, or major production capacity.
    – **Business continuity planning**: Identifying which systems or lines must be restored first after an outage.
    – **Quality and compliance controls**: Applying more stringent data integrity, traceability, and change control to assets and systems that influence product release decisions or regulatory records.

    Segmentation can be reflected in physical layouts (separate areas or equipment), logical groupings (tags in CMMS, EAM, or MES), or network zones (for example, OT security zones with different control levels).

    What is and is not included

    Criticality segmentation **includes**:

    – Systematic ranking or grouping against defined impact criteria (such as safety, environment, quality, throughput, legal/regulatory, or financial impact)
    – Use of those groups to differentiate controls, monitoring, and response procedures
    – Application to both cyber-physical systems (OT) and supporting IT platforms

    Criticality segmentation **does not automatically include**:

    – The detailed risk assessment itself (that is a separate activity, though it uses the segmentation)
    – The technical implementation of network segmentation, firewalls, or zoning (these are implementation mechanisms informed by the segmentation)
    – A guarantee of compliance with any specific safety, cybersecurity, or regulatory standard

    Common confusion and related concepts

    Criticality segmentation is often confused or intertwined with:

    – **Risk assessment**: Risk assessment evaluates likelihood and impact for specific threats or failure modes. Criticality segmentation is a higher-level categorization of importance that often feeds into, or is refined by, risk assessments.
    – **Network segmentation or zoning**: Network segmentation is a technical design of communication paths and control boundaries. Criticality segmentation is the policy or classification layer that informs how strict those segments should be.
    – **Asset classification**: Asset classification may be based on type, owner, or location. Criticality segmentation specifically focuses on importance and impact.

    In practice, organizations may merge these concepts in their procedures, but they remain conceptually distinct.

    Application in OT, IT, and manufacturing systems

    Within manufacturing and regulated operations, criticality segmentation is commonly applied to:

    – **Control systems**: Grouping PLCs, DCS nodes, safety instrumented systems, and SCADA according to their impact on safety and production continuity.
    – **Manufacturing execution systems (MES)**: Classifying MES components (such as batch management, electronic batch records, quality workflows, and interfaces to ERP or LIMS) by their relevance to product quality and release decisions.
    – **Quality and data systems**: Segmenting historians, LIMS, QMS, and document control systems by how critical their data is for regulatory records, investigations, and audits.
    – **Infrastructure and utilities**: Ranking power supply, HVAC, compressed air, purified water, or cleanroom systems by their direct impact on product quality and regulatory requirements.

    The segmentation results are often encoded into asset registers, CMMS/EAM systems, configuration management databases (CMDB), or MES/ERP master data so that other processes (change control, maintenance, cybersecurity controls, incident response) can use the classification consistently.

    Role in risk and safety management

    In risk and safety management practices, criticality segmentation:

    – Provides a structured basis for focusing more detailed analyses (such as HAZOP, FMEA, cyber risk assessments) on high-criticality items.
    – Supports the definition of differentiated control measures, monitoring frequencies, and escalation workflows.
    – Helps demonstrate a reasoned, documented approach to prioritizing safeguards and resources across complex plants and system landscapes.

    While the exact criteria and tiers differ between organizations and industries, the core idea is to maintain a consistent, traceable mapping of what is most critical and to use that mapping across operations, engineering, IT/OT, and quality functions.