Hello Friends,
Introduction: Why Disk Failures Are Not Disasters
Disk failures in Oracle Real Application Clusters (RAC) are unavoidable, but with Automatic Storage Management (ASM) they turn from crises into manageable events. This in-depth look at Oracle RAC shows how ASM automates failure recovery while preserving data integrity, and walks through battle-tested recovery steps.

Under the Hood: ASM's Secret Weapons
Several background processes do the heavy lifting here (a quick way to spot them on a node follows this list):
- GMON (Disk Group Monitor) maintains disk group membership and drops offline disks once their repair timer expires.
- OFSD (ASM I/O Servers) provide direct access to storage devices, bypassing OS caches.
- KFFRD (Kernel Files Flash Recovery Data) reconstructs data from mirrors during failed disk reads.
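On a running node you can spot the ASM-side processes in the OS process list; they carry the ASM instance name as a suffix. A minimal check, assuming the instance on this node is named +ASM1 (adjust for your environment):
# List the disk group monitor, rebalance coordinator and rebalance slaves for +ASM1
ps -ef | grep -E 'asm_(gmon|rbal|arb)[0-9a-z]*_\+ASM1' | grep -v grep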
I. The Anatomy of a Disk Failure: ASM's Auto-Pilot
When a disk fails, Oracle RAC initiates a scripted sequence.

1. Failure Detection: The 60-Second Rule
ASM's Heartbeat Mechanism: a heartbeat block is written to each disk every 1.5 seconds. If 40 consecutive heartbeats are missed, ASM declares the disk dead. The I/O retry protocol is controlled by hidden (underscore) parameters, which should only be changed under Oracle Support guidance:
_asm_io_retry_count = 5 -- Retry attempts
_asm_io_retry_delay = 1000 -- Delay (ms) between retries
After 5 failed I/O retries (~60 seconds), ASM offlines the disk.
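If you want to confirm what these retry settings are on your system, the usual (unsupported) trick is to read the hidden parameter views as SYS on the ASM instance; a minimal sketch, assuming a SYSASM connection:
-- Read-only check of the hidden I/O retry parameters
SELECT i.ksppinm AS parameter, v.ksppstvl AS value
FROM   x$ksppi  i
JOIN   x$ksppsv v ON v.indx = i.indx
WHERE  i.ksppinm IN ('_asm_io_retry_count', '_asm_io_retry_delay');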
2. Propagation of Alerts across the Cluster
CSS (Cluster Synchronization Services) broadcasts the disk failure to all RAC nodes over the interconnect within milliseconds. ASM Instance Coordination: each ASM instance updates its metadata, and the disk is marked OFFLINE cluster-wide (visible in GV$ASM_DISK).

3. Automatic Rebalance: Data Rescue Operations
ARBx Processes:
Parallel slaves redistribute data from the failing disk using mirrored extents (NORMAL/HIGH redundancy).
ALTER DISKGROUP data REBALANCE POWER 11; -- Turbo mode (0-11)
Intelligent Throttling: the ASM_POWER_LIMIT initialization parameter caps the default rebalance power, trading rebalance speed against I/O impact on the running workload.
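A minimal sketch of checking and adjusting that cap from SQL*Plus on the ASM instance (the value 4 is just an illustrative choice, not a recommendation):
-- Current instance-wide cap on rebalance power
SHOW PARAMETER asm_power_limit
-- Raise or lower it dynamically
ALTER SYSTEM SET asm_power_limit = 4;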
4. The Repair Timer: ASM's Grace Period (disk_repair_time)
The default countdown is 3.6 hours for Oracle 11g+ and 12 hours for 19c+. Check the current setting:
SELECT group_number, name, value
FROM V$ASM_ATTRIBUTE
WHERE name = 'disk_repair_time';
If the disk is repaired inside the window, bring it back online:
ALTER DISKGROUP data ONLINE DISK DATA_0002;
If the timer expires before the disk comes back, ASM drops it automatically and rebalances to restore redundancy.
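If the default window is too short for your storage team's turnaround, the attribute can be changed per disk group; a minimal sketch (the 8h value is only an example):
-- Extend the grace period before ASM force-drops an offline disk
ALTER DISKGROUP data SET ATTRIBUTE 'disk_repair_time' = '8h';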
II. Step-by-Step Recovery: The DBA's Hands-On Steps
Scenario: Disk /dev/sdb1 in diskgroup DATA fails in a three-node RAC cluster.

Phase 1: Diagnosis
1. Check ASM Alerts:
SELECT path, name, state, mode_status, failgroup
FROM V$ASM_DISK
WHERE mode_status != 'ONLINE' OR state != 'NORMAL';
→ The failed disk shows MODE_STATUS = 'OFFLINE'.
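It is also worth scanning the ASM alert log on the node that reported the failure. The path below assumes a default ADR layout and an instance named +ASM1; adjust both for your environment:
# Look for offline / I/O error messages in the ASM alert log (path is environment-specific)
tail -200 $ORACLE_BASE/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log | grep -iE 'offline|i/o error'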
2. Validate Hardware Failure:
# Linux: Check last disk errors
grep -i error /var/log/messages | grep sdb1
multipath -ll | grep -i failed
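If smartmontools is installed on the host, a quick SMART health check on the suspect device helps confirm a genuine physical failure (device /dev/sdb is from the scenario above):
# Overall SMART health verdict for the suspect disk (requires smartmontools, run as root)
smartctl -H /dev/sdb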
Phase 2: Disk Replacement & ASM Repair
1. Offline the Disk (If Not Automatic):
ALTER DISKGROUP DATA OFFLINE DISK DATA_0002;
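If you already know roughly how long the replacement will take, you can set a disk-specific grace period when you offline it; a sketch (the 5-hour interval is illustrative):
-- Offline the disk and give it its own repair window before ASM drops it
ALTER DISKGROUP DATA OFFLINE DISK DATA_0002 DROP AFTER 5H;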
2. Storage Administration Steps:
Replace the failed disk/LUN, then rescan the SCSI bus (the 2:0:0:0 address below is host-specific) and refresh the multipath maps:
echo 1 > /sys/class/scsi_device/2\:0\:0\:0/device/rescan
multipath -r # Refresh multipath maps
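Before adding anything back, it helps to confirm ASM can actually see the replacement device as a candidate; a minimal check:
-- The new device should appear with a CANDIDATE, PROVISIONED or FORMER header
SELECT path, header_status
FROM   V$ASM_DISK
WHERE  header_status IN ('CANDIDATE', 'PROVISIONED', 'FORMER');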
3. Add New Disk to ASM:
ALTER DISKGROUP DATA DROP DISK DATA_0002; -- Remove failed disk
ALTER DISKGROUP DATA ADD DISK '/dev/mapper/newdisk1'
NAME DATA_0002; -- Auto-triggers rebalance
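Two practical notes, offered as a variant rather than the one true way: a disk that can no longer be read usually has to be dropped with the FORCE clause, and combining the drop and the add in a single statement means paying for only one rebalance. A sketch, assuming the same /dev/mapper/newdisk1 path and a fresh disk name (DATA_0002_NEW) to avoid clashing with the disk being dropped:
-- Single statement: force-drop the dead disk, add its replacement, one rebalance
ALTER DISKGROUP DATA
  DROP DISK DATA_0002 FORCE
  ADD DISK '/dev/mapper/newdisk1' NAME DATA_0002_NEW
  REBALANCE POWER 8;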
4. Monitor Rebalance:
SELECT * FROM GV$ASM_OPERATION; -- Check progress across all nodes
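For a quicker read, a narrower projection gives you the power setting and time estimate directly (columns are from GV$ASM_OPERATION):
-- One row per node running a rebalance; an empty result means the rebalance is finished
SELECT inst_id, operation, state, power, sofar, est_work, est_minutes
FROM   GV$ASM_OPERATION;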
Phase 3: Post-Recovery Validation
Disk Group Health:
SELECT failgroup, path, state, mode_status, reads, writes
FROM V$ASM_DISK;
→ Every disk should show MODE_STATUS = 'ONLINE' and STATE = 'NORMAL', and the disk group must be MOUNTED on every node (see the check below).
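A minimal cross-node check of the mount state, run from any instance:
-- The DATA disk group should report MOUNTED (or CONNECTED) on every node
SELECT inst_id, name, state
FROM   GV$ASM_DISKGROUP
WHERE  name = 'DATA';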
Redundancy Integrity:
SELECT name, type, offline_disks, required_mirror_free_mb, usable_file_mb
FROM V$ASM_DISKGROUP
WHERE name = 'DATA';
→ OFFLINE_DISKS must be back to 0 and USABLE_FILE_MB should stay positive.
Hope this helps