跳到主要内容

某省知名电视台核心计费系统数据库节点驱逐故障处理

提示

本文为站长原创文章,版权所有,未经允许,禁止转载!

信息

数据库环境信息:
硬件:IBM P520 小型机
操作系统:AIX 5300-11
oracle版本:Oracle 10.2.0.4 RAC 2 nodes

故障处理过程

客户告知,其核心计费系统RAC数据库有一节点总是异常宕机,让帮助分析处理一下,抵达客户现场后,查看其中一个节点数据库alert日志,关键信息如下:

Wed Oct  1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 5 - see DBWR trace file
ORA-01110: data file 5: '/dev/rHRP_ORDERD_08'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 6 - see DBWR trace file
ORA-01110: data file 6: '/dev/rHRP_ORDERD_09'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 7 - see DBWR trace file
ORA-01110: data file 7: '/dev/rHRP_ORDERX_06'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 8 - see DBWR trace file
ORA-01110: data file 8: '/dev/rHRP_ORDERX_07'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 9 - see DBWR trace file
ORA-01110: data file 9: '/dev/rHRP_LOGISD_05'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 10 - see DBWR trace file
ORA-01110: data file 10: '/dev/rHRP_LOGISD_06'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 11 - see DBWR trace file
ORA-01110: data file 11: '/dev/rHRP_LOGISX_03'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 12 - see DBWR trace file
ORA-01110: data file 12: '/dev/rHRP_CUSTD_06'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 13 - see DBWR trace file
ORA-01110: data file 13: '/dev/rHRP_CUSTD_07'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 14 - see DBWR trace file
ORA-01110: data file 14: '/dev/rHRP_CUSTX_02'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/xxtv/bdump/xxtv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 15 - see DBWR trace file
ORA-01110: data file 15: '/dev/rHRP_ORDERX_08'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11

另外一个节点alert关键信息如下:

Tue Sep 30 12:22:21 2014
ARC1: Standby redo logfile selected for thread 2 sequence 134858 for destination LOG_ARCHIVE_DEST_3
Tue Sep 30 12:23:15 2014
Trace dumping is performing id=[cdmp_20140930122315]
Tue Sep 30 12:24:36 2014
Trace dumping is performing id=[cdmp_20140930122435]
Tue Sep 30 12:25:54 2014
Trace dumping is performing id=[cdmp_20140930122554]
Tue Sep 30 12:27:12 2014
Trace dumping is performing id=[cdmp_20140930122712]
Tue Sep 30 12:28:46 2014
Trace dumping is performing id=[cdmp_20140930122846]
Tue Sep 30 12:30:39 2014
Trace dumping is performing id=[cdmp_20140930123038]
Tue Sep 30 12:32:32 2014
Trace dumping is performing id=[cdmp_20140930123232]
Tue Sep 30 12:34:39 2014
Trace dumping is performing id=[cdmp_20140930123438]
Tue Sep 30 12:36:58 2014
Trace dumping is performing id=[cdmp_20140930123658]
Tue Sep 30 12:41:58 2014
Thread 2 advanced to log sequence 134860 (LGWR switch)
Current log# 10 seq# 134860 mem# 0: /dev/rHHS_REDO2_4
Tue Sep 30 12:42:07 2014
ARC1: Standby redo logfile selected for thread 2 sequence 134859 for destination LOG_ARCHIVE_DEST_3
Tue Sep 30 12:43:42 2014
Reconfiguration started (old inc 4, new inc 6)
List of nodes:
1
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out

.....
Wed Oct 1 01:24:13 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_622802.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:13 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_737644.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:21 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_410402.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:21 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_389452.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:21 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_713526.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:35 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_806970.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:35 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_356560.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:36 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_656220.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:36 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_639020.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:24:51 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_475960.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:25:06 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_824178.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:25:11 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_889782.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:25:19 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_234074.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:25:21 2014
Errors in file /oracle/admin/xxtv/udump/xxtv2_ora_930324.trc:
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sskgxp_select failed with status: 3
ORA-27301: OS failure message: No such process
ORA-27302: failure occurred at: skgxpvfymmtu
ORA-27303: additional information: MTU could not be verified. Did not receive valid message.
Wed Oct 1 01:30:12 2014
ARC1: Standby redo logfile selected for thread 1 sequence 131337 for destination LOG_ARCHIVE_DEST_3
Wed Oct 1 01:44:05 2014
Thread 2 advanced to log sequence 134900 (LGWR switch)
Current log# 11 seq# 134900 mem# 0: /dev/rHHS_REDO2_5
Wed Oct 1 01:44:08 2014
ARC1: Standby redo logfile selected for thread 2 sequence 134899 for destination LOG_ARCHIVE_DEST_3
Wed Oct 1 02:04:04 2014
Thread 2 advanced to log sequence 134901 (LGWR switch)
Current log# 12 seq# 134901 mem# 0: /dev/rHHS_REDO2_6

根据alert日志分析,node1 DB实例崩溃, node2 DB实例进行recover完成后node2 DB实例随后出现报错,接着看同时间点的node2的CRS ALERT日志信息:

2014-10-01 01:23:38.503
[cssd(545112)]CRS-1612:node node1 (0) at 50% heartbeat fatal, eviction in 0.000 seconds
2014-10-01 01:23:45.506
[cssd(545112)]CRS-1611:node node1 (0) at 75% heartbeat fatal, eviction in 0.000 seconds
2014-10-01 01:23:46.500
[cssd(545112)]CRS-1611:node node1 (0) at 75% heartbeat fatal, eviction in 0.000 seconds
2014-10-01 01:23:50.507
[cssd(545112)]CRS-1610:node node1 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2014-10-01 01:23:51.502
[cssd(545112)]CRS-1610:node node1 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2014-10-01 01:23:52.504
[cssd(545112)]CRS-1610:node node1 (0) at 90% heartbeat fatal, eviction in 0.000 seconds
2014-10-01 01:23:52.567
[cssd(545112)]CRS-1607:CSSD evicting node node1. Details in /oracle/product/10.2/crs/log/node2/cssd/ocssd.log.
[cssd(545112)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node2 .
2014-10-01 01:23:52.679
[crsd(193496)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870336 to 169870336. Details in /oracle/product/10.2/crs/log/node2/crsd/crsd.log.
2014-10-01 01:23:55.261
[crsd(193496)]CRS-1204:Recovering CRS resources for node node1.
[cssd(545112)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node2 .
[cssd(545112)]CRS-1603:CSSD on node node2 shutdown by user.
2014-10-01 06:52:24.773
[cssd(144024)]CRS-1605:CSSD voting file is online: /dev/rvoting. Details in /oracle/product/10.2/crs/log/node2/cssd/ocssd.log.
[cssd(144024)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node2 .
2014-10-01 06:55:45.412
[evmd(376964)]CRS-1401:EVMD started on node node2.
2014-10-01 06:55:45.431
[crsd(135526)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870336 to 169870336. Details in /oracle/product/10.2/crs/log/node2/crsd/crsd.log.
2014-10-01 06:55:45.432
[crsd(135526)]CRS-1012:The OCR service started on node node2.
2014-10-01 06:55:47.564
[crsd(135526)]CRS-1201:CRSD started on node node2.

2014-10-01 01:23:52.567
[cssd(545112)]CRS-1607:CSSD evicting node node1. Details in /oracle/product/10.2/crs/log/node2/cssd/ocssd.log.
[cssd(545112)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node2 .
2014-10-01 01:23:52.679
[crsd(193496)]CRS-1005:The OCR upgrade was completed. Version has changed from 169870336 to 169870336. Details in /oracle/product/10.2/crs/log/node2/crsd/crsd.log.
2014-10-01 01:23:55.261
[crsd(193496)]CRS-1204:Recovering CRS resources for node node1.
[cssd(545112)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node1 node2 .
[cssd(545112)]CRS-1603:CSSD on node node2 shutdown by user.
2014-10-01 06:52:24.773
[cssd(144024)]CRS-1605:CSSD voting file is online: /dev/rvoting. Details in /oracle/product/10.2/crs/log/node2/cssd/ocssd.log.
[cssd(144024)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node2 .

2014-10-01 01:23:52时node2驱逐了node1节点,看看node2的cssd日志发生了什么事情:

[    CSSD]2014-08-06 03:00:57.693 [3857] >TRACE:   clssgmReconfigThread:  completed for reconfig(2), with status(1)
[ CSSD]2014-08-06 03:00:57.860 [2829] >TRACE: clssgmCommonAddMember: clsomon joined (2/0x1000000/#CSS_CLSSOMON)
[ CSSD]2014-10-01 01:23:38.503 [3086] >WARNING: clssnmPollingThread: node node1 (1) at 50 2.481040e-265artbeat fatal, eviction in 14.059 seconds
[ CSSD]2014-10-01 01:23:38.503 [3086] >TRACE: clssnmPollingThread: node node1 (1) is impending reconfig, flag 1039, misstime 15941
[ CSSD]2014-10-01 01:23:38.503 [3086] >TRACE: clssnmPollingThread: diskTimeout set to (27000)ms impending reconfig status(1)
[ CSSD]2014-10-01 01:23:45.506 [3086] >WARNING: clssnmPollingThread: node node1 (1) at 75 2.481040e-265artbeat fatal, eviction in 7.056 seconds
[ CSSD]2014-10-01 01:23:46.500 [3086] >WARNING: clssnmPollingThread: node node1 (1) at 75 2.481040e-265artbeat fatal, eviction in 6.062 seconds
[ CSSD]2014-10-01 01:23:50.507 [3086] >WARNING: clssnmPollingThread: node node1 (1) at 90 2.481040e-265artbeat fatal, eviction in 2.056 seconds
[ CSSD]2014-10-01 01:23:51.502 [3086] >WARNING: clssnmPollingThread: node node1 (1) at 90 2.481040e-265artbeat fatal, eviction in 1.060 seconds
[ CSSD]2014-10-01 01:23:52.504 [3086] >WARNING: clssnmPollingThread: node node1 (1) at 90 2.481040e-265artbeat fatal, eviction in 0.059 seconds
[ CSSD]2014-10-01 01:23:52.567 [3086] >TRACE: clssnmPollingThread: Eviction started for node node1 (1), flags 0x040f, state 3, wt4c 0
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmDoSyncUpdate: Initiating sync 3
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmDoSyncUpdate: diskTimeout set to (27000)ms
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSetupAckWait: Ack message type (11)
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSetupAckWait: node(2) is ALIVE
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSendSync: syncSeqNo(3)
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmWaitForAcks: Ack message type(11), ackCount(1)
[ CSSD]2014-10-01 01:23:52.567 [1801] >TRACE: clssnmHandleSync: diskTimeout set to (27000)ms
[ CSSD]2014-10-01 01:23:52.567 [1801] >TRACE: clssnmHandleSync: Acknowledging sync: src[2] srcName[node2] seq[1] sync[3]
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmWaitForAcks: done, msg type(11)
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmDoSyncUpdate: Terminating node 1, node1, misstime(30004) state(5)
[ CSSD]2014-10-01 01:23:52.567 [1] >USER: NMEVENT_SUSPEND [00][00][00][06]
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSetupAckWait: Ack message type (13)
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmWaitForAcks: Ack message type(13), ackCount(1)
[ CSSD]2014-10-01 01:23:52.567 [1801] >TRACE: clssnmSendVoteInfo: node(2) syncSeqNo(3)
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmWaitForAcks: done, msg type(13)
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmEvict: Start
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmEvict: Evicting node 1, node1, birth 1, death 3, impendingrcfg 1, stateflags 0x40f
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSendShutdown: req to node 1, kill time 3023695098
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmDiscHelper: node1, node(1) connection failed, con (1108667f0), probe(0)
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmWaitOnEvictions: Start
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmWaitOnEvictions: node 1, node1, undead 0
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSetupAckWait: Ack message type (15)
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSetupAckWait: node(2) is ACTIVE
[ CSSD]2014-10-01 01:23:52.567 [3600] >TRACE: clssnmSendUpdate: syncSeqNo(3)
[ CSSD]2014-10-01 01:23:52.568 [3600] >TRACE: clssnmWaitForAcks: Ack message type(15), ackCount(1)
[ CSSD]2014-10-01 01:23:52.568 [1801] >TRACE: clssnmUpdateNodeState: node 0, state (0/0) unique (0/0) prevConuni(0) birth (0/0) (old/new)
[ CSSD]2014-10-01 01:23:52.568 [1801] >TRACE: clssnmUpdateNodeState: node 1, state (5/0) unique (1407264856/1407264856) prevConuni(1407264856) birth (1/1) (old/new)
[ CSSD]2014-10-01 01:23:52.568 [1801] >TRACE: clssnmDeactivateNode: node 1 (node1) left cluster

原来node1节点VOTE DISK磁盘读取超时后节点被驱逐,之后node1重启,重启后接着看node1的crs日志:

2014-10-01 01:20:36.043: [  CRSRES][11719]32ora.node1.LISTENER_NODE1.lsnr target set to OFFLINE before stop action
2014-10-01 01:20:36.043: [ CRSRES][11719]32StopResource: setting CLI values
2014-10-01 01:20:36.075: [ CRSRES][11719]32Attempting to stop `ora.node1.LISTENER_NODE1.lsnr` on member `node1`
2014-10-01 01:20:36.380: [ CRSRES][11719]32Stop of `ora.node1.LISTENER_NODE1.lsnr` on member `node1` succeeded.
2014-10-01 03:05:47.302: [ default][1][ENTER]32
Oracle Database 10g CRS Release 10.2.0.4.0 Production Copyright 1996, 2004, Oracle. All rights reserved
2014-10-01 03:05:47.319: [ default][1]32CRS Daemon Starting
2014-10-01 03:05:47.335: [ CRSMAIN][1]32Checking the OCR device
2014-10-01 03:05:47.339: [ CRSMAIN][1]32Connecting to the CSS Daemon
2014-10-01 03:05:47.693: [ COMMCRS][259]clsc_connect: (110673d50) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_crs))

2014-10-01 03:05:47.693: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

2014-10-01 03:05:47.693: [ CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2014-10-01 03:05:49.037: [ COMMCRS][261]clsc_connect: (110673d50) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_crs))

2014-10-01 03:05:49.038: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

2014-10-01 03:05:49.038: [ CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2014-10-01 03:05:50.375: [ COMMCRS][263]clsc_connect: (110673d50) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_crs))

2014-10-01 03:05:50.376: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

2014-10-01 03:05:50.376: [ CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2014-10-01 03:05:51.716: [ COMMCRS][265]clsc_connect: (110673d50) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_node1_crs))

2014-10-01 03:05:51.716: [ CSSCLNT][1]clsssInitNative: connect failed, rc 9

2014-10-01 03:05:51.716: [ CRSRTI][1]32CSS is not ready. Received status 3 from CSS. Waiting for good status ..

2014-10-01 03:05:53.195: [ CLSVER][1]32Active Version from OCR:10.2.0.4.0
2014-10-01 03:05:53.195: [ CLSVER][1]32Active Version and Software Version are same
2014-10-01 03:05:53.196: [ CRSMAIN][1]32Initializing OCR
2014-10-01 03:05:53.202: [ OCRRAW][1]proprioo: for disk 0 (/dev/rshare), id match (1), my id set (1754974617,1028247821) total id sets (1), 1st set (1754974617,1028247821), 2nd set (0,0) my votes (2), total votes (2)

# crsd日志信息
2014-10-01 03:05:53.477: [ CRSMAIN][1]32Using Authorizer location: /oracle/product/10.2/crs/crs/auth/
2014-10-01 03:05:53.495: [ CRSMAIN][1]32Initializing RTI
2014-10-01 03:05:53.590: [CRSTIMER][8492]32Timer Thread Starting.
2014-10-01 03:05:53.627: [ CRSRES][1]32Parameter SECURITY = 1, running in USER Mode
2014-10-01 03:05:53.631: [ CRSMAIN][1]32Initializing EVMMgr
2014-10-01 03:05:53.832: [ COMMCRS][8749]clsc_connect: (1138cca70) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

2014-10-01 03:05:54.347: [ COMMCRS][8750]clsc_connect: (1138cf5f0) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=SYSTEM.evm.acceptor.auth))

2014-10-01 03:05:55.556: [ CRSMAIN][1]32CRSD locked during state recovery, please wait.
2014-10-01 03:05:55.725: [ CRSMAIN][1]32CRSD recovered, unlocked.
2014-10-01 03:05:55.748: [ CRSMAIN][1]32QS socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))
2014-10-01 03:05:55.748: [ CRSMAIN][1]32QS socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=ora_crsqs))
2014-10-01 03:05:55.761: [ CRSMAIN][1]32CRSD UI socket on: (ADDRESS=(PROTOCOL=ipc)(KEY=CRSD_UI_SOCKET))
2014-10-01 03:05:55.779: [ CRSMAIN][1]32E2E socket on: (ADDRESS=(PROTOCOL=tcp)(HOST=node1_pri)(PORT=49896))
2014-10-01 03:05:55.780: [ CRSMAIN][1]32Starting Threads
2014-10-01 03:05:55.780: [ CRSMAIN][10036]32Starting runCommandServer for (UI = 1, E2E = 0). 0
2014-10-01 03:05:55.780: [ CRSMAIN][10293]32Starting runCommandServer for (UI = 1, E2E = 0). 1
2014-10-01 03:05:55.780: [ CRSMAIN][1]32CRS Daemon Started.
2014-10-01 03:05:55.879: [ CRSRES][1]32 startup = 1
2014-10-01 03:05:55.932: [ CRSRES][1]32 startup = 1
2014-10-01 03:05:55.976: [ CRSRES][1]32 startup = 1
2014-10-01 03:05:56.028: [ CRSRES][1]32 startup = 1
2014-10-01 03:05:56.053: [ CRSRES][1]32 startup = 1
2014-10-01 03:05:56.094: [ CRSRES][1]32 startup = 1
2014-10-01 03:05:56.365: [ CRSRES][11321]32startRunnable: setting CLI values
2014-10-01 03:05:56.388: [ CRSRES][11064]32StopResource: setting CLI values
2014-10-01 03:05:56.434: [ CRSRES][11321]32Attempting to start `ora.HUTV.HUTV1.inst` on member `node1`
2014-10-01 03:05:56.526: [ CRSRES][11064]32Attempting to stop `ora.node1.vip` on member `node2`
2014-10-01 03:05:57.114: [ CRSRES][11064]32Stop of `ora.node1.vip` on member `node2` succeeded.
2014-10-01 03:05:57.146: [ CRSRES][11064]32startRunnable: setting CLI values
2014-10-01 03:05:57.146: [ CRSRES][11064]32Attempting to start `ora.node1.vip` on member `node1`
2014-10-01 03:05:59.572: [ CRSRES][11064]32Start of `ora.node1.vip` on member `node1` succeeded.
2014-10-01 03:05:59.655: [ CRSRES][11064]32startRunnable: setting CLI values
2014-10-01 03:05:59.711: [ CRSRES][11064]32Attempting to start `ora.node1.LISTENER_NODE1.lsnr` on member `node1`
2014-10-01 03:06:01.468: [ COMMCRS][9522]clsc_receive: (114e8eb10) error 2

2014-10-01 03:06:02.298: [ CRSRES][11064]32Start of `ora.node1.LISTENER_NODE1.lsnr` on member `node1` succeeded.
2014-10-01 03:06:02.505: [ CRSRES][12106]32CRS-1002: Resource 'ora.node1.LISTENER_NODE1.lsnr' is already running on member 'node1'

2014-10-01 03:07:31.203: [ CRSD][12121]32SM: Recovery processing done.0
2014-10-01 03:07:31.203: [ CRSD][12121]32SM: Ongoing commands done.0
2014-10-01 03:07:31.223: [ CRSEVT][12121]32Skipping ora.HUTV.HUTV1.inst because it\'s busy
2014-10-01 03:07:31.233: [ CRSD][12121]32SM: stoppedResources: all =0 m_state = 0
2014-10-01 03:07:31.270: [ CRSRES][12378]32StopResource: setting CLI values
2014-10-01 03:07:31.280: [ CRSRES][12378]32Attempting to stop `ora.node1.LISTENER_NODE1.lsnr` on member `node1`
2014-10-01 03:07:37.021: [ CRSMAIN][10293]32UI Cmd received after shutdown began.
2014-10-01 03:07:37.245: [ CRSRES][12378]32Stop of `ora.node1.LISTENER_NODE1.lsnr` on member `node1` succeeded.
2014-10-01 03:07:37.264: [ CRSRES][12378]32StopResource: setting CLI values
2014-10-01 03:07:37.277: [ CRSRES][12378]32StopResource: setting CLI values
2014-10-01 03:07:37.284: [ CRSRES][12378]32Attempting to stop `ora.node1.vip` on member `node1`
2014-10-01 03:07:37.741: [ CRSRES][12378]32Stop of `ora.node1.vip` on member `node1` succeeded.
2014-10-01 03:07:47.804: [ CRSD][11109]32SM: Recovery processing done.0
2014-10-01 03:07:47.804: [ CRSD][11109]32SM: Ongoing commands done.0
2014-10-01 03:07:47.818: [ CRSEVT][11109]32Skipping ora.HUTV.HUTV1.inst because it\'s busy
2014-10-01 03:07:47.828: [ CRSD][11109]32SM: stoppedResources: all =0 m_state = 0
2014-10-01 03:07:57.859: [ CRSD][11111]32SM: Recovery processing done.0
2014-10-01 03:07:57.859: [ CRSD][11111]32SM: Ongoing commands done.0
2014-10-01 03:07:57.873: [ CRSEVT][11111]32Skipping ora.HUTV.HUTV1.inst because it's busy
2014-10-01 03:07:57.883: [ CRSD][11111]32SM: stoppedResources: all =0 m_state = 0
2014-10-01 03:08:07.914: [ CRSD][11113]32SM: Recovery processing done.0

node1 CRS 启动正常后开始把实例拉起来但是失败了,再看下 node1 DB 的 aelrt 日志:

ALTER DATABASE OPEN
Picked broadcast on commit scheme to generate SCNs
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/HUTV/bdump/hutv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 5 - see DBWR trace file
ORA-01110: data file 5: '/dev/rHRP_ORDERD_08'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/HUTV/bdump/hutv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 6 - see DBWR trace file
ORA-01110: data file 6: '/dev/rHRP_ORDERD_09'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/HUTV/bdump/hutv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 7 - see DBWR trace file
ORA-01110: data file 7: '/dev/rHRP_ORDERX_06'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/HUTV/bdump/hutv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 8 - see DBWR trace file
ORA-01110: data file 8: '/dev/rHRP_ORDERX_07'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/HUTV/bdump/hutv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 9 - see DBWR trace file
ORA-01110: data file 9: '/dev/rHRP_LOGISD_05'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/HUTV/bdump/hutv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 10 - see DBWR trace file
ORA-01110: data file 10: '/dev/rHRP_LOGISD_06'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/HUTV/bdump/hutv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 11 - see DBWR trace file
ORA-01110: data file 11: '/dev/rHRP_LOGISX_03'
ORA-27041: unable to open file
IBM AIX RISC System/6000 Error: 6: No such device or address
Additional information: 11
Wed Oct 1 03:06:14 2014
Errors in file /oracle/admin/HUTV/bdump/hutv1_dbw0_144316.trc:
ORA-01157: cannot identify/lock data file 12 - see DBWR trace file
ORA-01110: data file 12: '/dev/rHRP_CUSTD_06'
ORA-27041: unable to open file

综上,可以看到实例启动后读取数据文件对应裸设备读取失败导致实例启动失败,另外一个节点也是类似的报错,然后两个节点互相反复相互驱逐。

故障总结

至此可以判断RAC节点宕机的原因在于操作系统或硬件层面原因。之后,再次跟进这个Cases时,客户告知其联系硬件工程师进行排查后确认其小型机主板存在故障之后更换后问题再未出现,至此故障解决。