Solution to Disk Failures in Oracle RAC

Hello Friends,

Introduction: Why Disk Failures Are Not Disasters

Disk failures in Oracle Real Application Clusters (RAC) are unavoidable, but with Automatic Storage Management (ASM) they become manageable events rather than crises. This in-depth look at Oracle RAC shows how ASM automates failure detection and recovery while preserving data integrity, and walks through battle-tested recovery steps.

Under the Hood: ASM's Secret Weapons

  1. GMON (Disk Group Monitor) is a background process that maintains disk group membership and drops offlined disks once their repair timer expires.  
  2. OFSD (ASM I/O Servers) provide direct access to storage devices, bypassing OS caches.  
  3. KFFRD (Kernel Files Flash Recovery Data) reconstructs data from mirror copies when a disk read fails.
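A quick way to confirm which background helpers are actually running is to query V$BGPROCESS on the ASM instance. This is a minimal sketch; process names can vary between versions:

SELECT name, description 
FROM V$BGPROCESS 
WHERE paddr != '00'   -- only processes that are actually running  
AND (name = 'GMON' OR name LIKE 'ARB%');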

I. The Anatomy of a Disk Failure: ASM's Auto-Pilot

When a disk fails, Oracle RAC initiates a scripted sequence.

1. Failure Detection: The 60-Second Rule

ASM's Heartbeat Mechanism: every 1.5 seconds, each disk sends out heartbeat blocks. If 40 consecutive heartbeats are missed, ASM declares the disk dead. Separately, the I/O retry protocol is controlled via hidden settings:

_asm_io_retry_count = 5     -- Retry attempts  
_asm_io_retry_delay = 1000  -- Delay (ms) between retries  
After 5 failed I/O retries (roughly 60 seconds in total), ASM takes the disk offline.
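To see what these hidden parameters are set to on your system, you can read the X$ fixed tables as SYSASM on the ASM instance. This is a read-only sketch; underscore parameters should only ever be changed under Oracle Support guidance:

SELECT x.ksppinm AS parameter, y.ksppstvl AS value 
FROM X$KSPPI x, X$KSPPCV y 
WHERE x.indx = y.indx 
AND x.ksppinm LIKE '\_asm\_io\_retry%' ESCAPE '\';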

2. Propagation of Alerts across the Cluster

CSS (Cluster Synchronization Services) broadcasts disk failures to all RAC nodes over the interconnect within milliseconds. ASM instance coordination then marks the disk OFFLINE cluster-wide, which is reflected in GV$ASM_DISK on every node.
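To verify that all instances agree on the disk state, query GV$ASM_DISK, which aggregates V$ASM_DISK across every node:

SELECT inst_id, name, state, mode_status 
FROM GV$ASM_DISK 
ORDER BY name, inst_id;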

3. Automatic Rebalance: Data Rescue Operations

ARBx processes (parallel rebalance slaves) redistribute data from the failing disk using mirrored extent copies (NORMAL/HIGH redundancy).

ALTER DISKGROUP data REBALANCE POWER 11;  -- Maximum power (0-11; up to 1024 with COMPATIBLE.ASM >= 11.2.0.2)  
Intelligent throttling via ASM_POWER_LIMIT balances rebalance speed against the I/O impact on running workloads.
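As a sketch, the instance-wide default can be kept low during business hours and the power of a specific rebalance raised when the system is quiet (the values here are illustrative):

ALTER SYSTEM SET ASM_POWER_LIMIT = 2;       -- Instance-wide default  
ALTER DISKGROUP data REBALANCE POWER 8;     -- Override for the running rebalance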

4. The Repair Timer: ASM's Grace Period (disk_repair_time)

The default countdown is 3.6 hours for Oracle 11g+ and 12 hours for 19c+. Check the current setting:


SELECT group_number, name, value 
FROM V$ASM_ATTRIBUTE 
WHERE name = 'disk_repair_time';  
If the disk is repaired inside the window, bring it back online:

ALTER DISKGROUP data ONLINE DISK DATA_0002;  

If the timer expires first, ASM automatically drops the disk and rebalances to restore redundancy.
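Because disk_repair_time is a disk group attribute, you can widen the grace period ahead of planned maintenance; the '8.5h' value below is only an example:

ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8.5h';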

II. Step-by-Step Recovery: The DBA's Action Plan

Scenario: Disk /dev/sdb1 in disk group DATA fails in a three-node RAC cluster.

Phase 1: Diagnosis

1. Check the ASM Disk State:


SELECT path, name, state, mode_status, failgroup FROM V$ASM_DISK 
WHERE state != 'NORMAL' OR mode_status != 'ONLINE';  

→ MODE_STATUS = 'OFFLINE' for the failed disk
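While the disk is offline, V$ASM_DISK.REPAIR_TIMER shows how many seconds remain before ASM drops it for good:

SELECT name, path, mode_status, repair_timer 
FROM V$ASM_DISK 
WHERE mode_status = 'OFFLINE';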

2. Validate Hardware Failure:


# Linux: check for recent disk errors  
grep -i error /var/log/messages | grep sdb1  
multipath -ll | grep -i failed  
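ASM keeps its own per-disk error counters, which corroborate the OS-level evidence:

SELECT name, path, read_errs, write_errs 
FROM V$ASM_DISK 
WHERE read_errs > 0 OR write_errs > 0;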

Phase 2: Disk Replacement & ASM Repair

1. Offline the Disk (If Not Automatic):


ALTER DISKGROUP DATA OFFLINE DISK DATA_0002;  
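If you already know roughly how long the repair will take, the OFFLINE statement accepts a per-disk override of the repair timer (the 5h interval below is illustrative):

ALTER DISKGROUP DATA OFFLINE DISK DATA_0002 DROP AFTER 5h;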

2. Storage Administration Steps:

Replace the failed physical disk/LUN, then rescan the SCSI bus:

echo 1 > /sys/class/scsi_device/2\:0\:0\:0/device/rescan  
multipath -r   # Refresh multipath maps  
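Before adding the new LUN, confirm the ASM instance can actually see it as a candidate:

SELECT path, header_status, os_mb 
FROM V$ASM_DISK 
WHERE header_status IN ('CANDIDATE', 'PROVISIONED', 'FORMER');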

3. Add the New Disk to ASM:


ALTER DISKGROUP DATA DROP DISK DATA_0002 FORCE;  -- FORCE is required while the disk is offline  
ALTER DISKGROUP DATA ADD DISK '/dev/mapper/newdisk1' 
  NAME DATA_0002;  -- Auto-triggers rebalance  
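On 12.1 and later, a single REPLACE DISK statement can substitute for the drop/add pair while the failed disk is still offline, repopulating the new disk from mirror copies instead of running two full rebalances (a sketch using the same illustrative names as above):

ALTER DISKGROUP DATA REPLACE DISK DATA_0002 
  WITH '/dev/mapper/newdisk1' POWER 8;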

4. Monitor the Rebalance:


SELECT * FROM GV$ASM_OPERATION;  -- Check progress across all nodes  
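For a time-to-completion estimate, the EST_MINUTES column is the figure to watch:

SELECT inst_id, operation, state, power, sofar, est_work, est_minutes 
FROM GV$ASM_OPERATION;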

Phase 3: Post-Recovery Validation

Disk Group Health:

SELECT failgroup, path, state, mode_status, reads, writes 
FROM V$ASM_DISK;  

→ STATE should be NORMAL and MODE_STATUS should be ONLINE for every disk on all nodes.

Redundancy Integrity:

SELECT name, type, offline_disks, usable_file_mb, required_mirror_free_mb 
FROM V$ASM_DISKGROUP;  

→ OFFLINE_DISKS should be 0, and USABLE_FILE_MB should be comfortably positive so the group can tolerate another failure.

Hope this helps
