Troubleshooting an Oracle Exadata X3-2 Database Machine at a Provincial Department of Human Resources and Social Security
Note
This is an original article by the site owner. All rights reserved; do not reproduce without permission.
Environment
Database environment:
Hardware: Oracle Exadata X3-2 Database Machine
Operating system: Oracle Enterprise Linux 5.8
Oracle version: 11.2.0.4
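Incident Overview
I originally deployed this Exadata machine as technical lead, working together with Oracle's own engineers. My side built the department's supporting infrastructure (NTP servers, DNS servers, database management servers, database disaster recovery servers, and so on) and migrated the data of the department's business systems (enterprise pension, medical insurance, social security, rural insurance, proof-of-life verification, and others). After delivery the whole environment ran stably for more than two years.
Recently the customer reported that one compute node of the Exadata machine kept rebooting unexpectedly and asked for help finding the cause. The troubleshooting process is described below.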
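Troubleshooting Details
Analysis of compute node ex01db02 showed that the most recent incident occurred at about 11:30 on 22 October 2016: ex01db02 was evicted from the cluster by ex01db01, ex01db03, and ex01db04, and then rebooted.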
Node ex01db01 (10.136.8.130), Clusterware alert log:
/u01/app/11.2.0.3/grid/log/ex01db01/alertex01db01.log
2016-10-22 11:29:44.785
[cssd(8527)]CRS-1612:Network communication with node ex01db02 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 29.800 seconds
2016-10-22 11:29:59.790
[cssd(8527)]CRS-1611:Network communication with node ex01db02 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 14.800 seconds
2016-10-22 11:30:08.792
[cssd(8527)]CRS-1610:Network communication with node ex01db02 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 5.800 seconds
2016-10-22 11:30:15.209
[cssd(8527)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ex01db01 ex01db03 ex01db04 .
Log entries from compute node ex01db04:
2016-10-22 11:29:44.926
[cssd(8103)]CRS-1612:Network communication with node ex01db02 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 29.670 seconds
2016-10-22 11:29:59.929
[cssd(8103)]CRS-1611:Network communication with node ex01db02 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 14.670 seconds
2016-10-22 11:30:08.932
[cssd(8103)]CRS-1610:Network communication with node ex01db02 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 5.670 seconds
2016-10-22 11:30:15.096
[cssd(8103)]CRS-1632:Node ex01db02 is being removed from the cluster in cluster incarnation 264175490
2016-10-22 11:30:15.207
[cssd(8103)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ex01db01 ex01db03 ex01db04 .
2016-10-22 11:30:15.223
[crsd(11023)]CRS-5504:Node down event reported for node 'ex01db02'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'Generic'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'ora.dbm'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'ora.xxsbcw'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'ora.nbdb'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'ora.qyyl'.
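The CRS-1612, CRS-1611, and CRS-1610 warnings above each give a timestamp and a remaining countdown, so adding the two projects when CSSD expected to remove the node. The following is a minimal sketch, assuming the alert log keeps the layout shown here (a timestamp line followed by the message line). The function name `projected_evictions` and the `SAMPLE` constant are illustrative, not part of any Oracle tool.

```python
import re
from datetime import datetime, timedelta

# Matches the CSSD heartbeat warnings CRS-1612 / CRS-1611 / CRS-1610.
WARNING = re.compile(
    r"CRS-161[012]:Network communication with node (\S+) \((\d+)\) "
    r"missing for (\d+)% of timeout interval\.\s+"
    r"Removal of this node from cluster in ([\d.]+) seconds"
)

def projected_evictions(alert_log_text):
    """Return (node, percent, projected eviction time) for each warning."""
    results = []
    stamp = None
    for line in alert_log_text.splitlines():
        line = line.strip()
        try:
            # A line holding only a timestamp precedes each message.
            stamp = datetime.strptime(line, "%Y-%m-%d %H:%M:%S.%f")
            continue
        except ValueError:
            pass
        match = WARNING.search(line)
        if match and stamp is not None:
            node, _, percent, remaining = match.groups()
            eviction = stamp + timedelta(seconds=float(remaining))
            results.append((node, int(percent), eviction))
    return results

# The three warnings from the ex01db01 alert log quoted above.
SAMPLE = """
2016-10-22 11:29:44.785
[cssd(8527)]CRS-1612:Network communication with node ex01db02 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 29.800 seconds
2016-10-22 11:29:59.790
[cssd(8527)]CRS-1611:Network communication with node ex01db02 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 14.800 seconds
2016-10-22 11:30:08.792
[cssd(8527)]CRS-1610:Network communication with node ex01db02 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 5.800 seconds
"""

results = projected_evictions(SAMPLE)
for node, percent, eviction in results:
    print(node, percent, eviction.strftime("%H:%M:%S.%f")[:-3])
```

For the three ex01db01 warnings, the projected removal times all fall at about 11:30:14.59, shortly before the reconfiguration logged at 11:30:15.209. The countdown was therefore continuous: heartbeats from ex01db02 never resumed between the first warning and the eviction.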
InfiniBand switch diagnostics collected from ex01db02:
head: cannot open 'VERSION_FILE' for reading: No such file or directory
        [ DB Machine Infiniband Cabling Topology Verification Tool ]
                [Version IBD VER 2.c ]
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
Use of uninitialized value in hash element at /opt/oracle.SupportTools/ibdiagtools/topologies/fetchTopology.pm line 665, <SFILE> line 1.
Use of uninitialized value in hash element at /opt/oracle.SupportTools/ibdiagtools/topologies/fetchTopology.pm line 665, <SFILE> line 2.
External non-Exadata-image nodes found: check for ZFS if on T4-4 - else ignore
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
Bad link:Switch 0x2128f56da3a0a0 Port 6A - nbusrv Port 17A
        Reason : 2.5 Gbps Speed found. Could be 5 Gbps
        Possible cause : Cable isn't fully seated in
Spine switch found: ex01sw-ibs0.xxxx.com (2128f56d3ca0a0)
Leaf switch found: ex01sw-ibb0.xxxx.com (2128f56da3a0a0)
Leaf switch found: ex01sw-iba0.xxxx.com (2128f56bf6a0a0)
Found 2 leaf, 1 spine, 0 top spine switches
Check if all hosts have 2 CAs to different switches...............[SUCCESS]
Leaf switch check: cardinality and even distribution..............[ERROR]
 
Leaf Switch ex01sw-ibb0.xxxx.com with GUID 0x2128f56da3a0a0has fewer than 7 links to storage cells
It has 6 links ( 17A 17B 16A 16B 15A 14B)to storage cells
                                                                [ERROR]
 
Leaf Switch ex01sw-iba0.xxxx.com with GUID 0x2128f56bf6a0a0has fewer than 7 links to storage cells
It has 6 links (17A 17B 16A 16B 15A 14B )to storage cells
                                                                [ERROR]
2 switches did not meet this requirement
Spine switch check: Are any Exadata nodes connected ..............[SUCCESS]
Spine switch check: Any inter spine switch links..................[SUCCESS]
Spine switch check: Any inter top-spine switch links..............[SUCCESS]
Spine switch check: Correct number of spine-leaf links............[SUCCESS]
Leaf switch check: Inter-leaf link check..........................[SUCCESS]
Leaf switch check: Correct number of leaf-spine links.............[SUCCESS]
ibwarn: [85414] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,1,6)
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
ibwarn: [85414] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,1,13,6)
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
Switch 0x002128f56d3ca0a0 SUN DCS 36P QDR ex01sw-ibs0.xxxx.com:
           1    1[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1    2[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1    3[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1    4[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1    5[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1    6[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1    7[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1    8[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1    9[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   10[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   11[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   12[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   13[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   14[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   15[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   16[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   17[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   18[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   19[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       2   32[  ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
           1   20[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   21[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       3   32[  ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
           1   22[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   23[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   24[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   25[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   26[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   27[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   28[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   29[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   30[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   31[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   32[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   33[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   34[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   35[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           1   36[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
Switch 0x002128f56da3a0a0 SUN DCS 36P QDR ex01sw-ibb0.xxxx.com:
           2    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      19    2[  ] "ex01cel02 C 192.168.10.6 HCA-1" ( )
           2    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      21    2[  ] "ex01cel01 C 192.168.10.5 HCA-1" ( )
           2    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       9    2[  ] "ex01cel04 C 192.168.10.8 HCA-1" ( )
           2    4[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       7    2[  ] "ex01cel03 C 192.168.10.7 HCA-1" ( )
           2    5[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      17    2[  ] "ex01cel06 C 192.168.10.10 HCA-1" ( )
           2    6[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>             [  ] "" ( )
           2    7[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      25    2[  ] "ex01db01 S 192.168.10.1 HCA-1" ( )
           2    8[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      11    2[  ] "ex01cel07 C 192.168.10.11 HCA-1" ( )
           2    9[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       5    2[  ] "ex01db03 S 192.168.10.3 HCA-1" ( )
           2   10[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      13    2[  ] "ex01db02 S 192.168.10.2 HCA-1" ( )
           2   11[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   12[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      23    2[  ] "ex01db04 S 192.168.10.4 HCA-1" ( )
           2   13[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       3   14[  ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
           2   14[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       3   13[  ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
           2   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       3   16[  ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
           2   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       3   15[  ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
           2   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       3   18[  ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
           2   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       3   17[  ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
           2   19[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   20[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   21[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   22[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   23[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   24[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   25[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   26[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   27[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   28[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   29[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   30[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   31[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       3   31[  ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
           2   32[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       1   19[  ] "SUN DCS 36P QDR ex01sw-ibs0.xxxx.com" ( )
           2   33[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   34[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   35[  ] ==( 4X 2.5 Gbps Active/  LinkUp)==>      26    1[  ] "nbusrv HCA-1" ( Could be 10.0 Gbps)
           2   36[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
Switch 0x002128f56bf6a0a0 SUN DCS 36P QDR ex01sw-iba0.xxxx.com:
           3    1[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      18    1[  ] "ex01cel02 C 192.168.10.6 HCA-1" ( )
           3    2[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      20    1[  ] "ex01cel01 C 192.168.10.5 HCA-1" ( )
           3    3[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       8    1[  ] "ex01cel04 C 192.168.10.8 HCA-1" ( )
           3    4[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       6    1[  ] "ex01cel03 C 192.168.10.7 HCA-1" ( )
           3    5[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      16    1[  ] "ex01cel06 C 192.168.10.10 HCA-1" ( )
           3    6[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>             [  ] "" ( )
           3    7[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      24    1[  ] "ex01db01 S 192.168.10.1 HCA-1" ( )
           3    8[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      10    1[  ] "ex01cel07 C 192.168.10.11 HCA-1" ( )
           3    9[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       4    1[  ] "ex01db03 S 192.168.10.3 HCA-1" ( )
           3   10[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      12    1[  ] "ex01db02 S 192.168.10.2 HCA-1" ( )
           3   11[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   12[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      22    1[  ] "ex01db04 S 192.168.10.4 HCA-1" ( )
           3   13[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       2   14[  ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
           3   14[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       2   13[  ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
           3   15[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       2   16[  ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
           3   16[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       2   15[  ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
           3   17[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       2   18[  ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
           3   18[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       2   17[  ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
           3   19[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   20[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   21[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   22[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   23[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   24[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   25[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   26[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   27[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   28[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   29[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   30[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   31[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       2   31[  ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
           3   32[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>       1   21[  ] "SUN DCS 36P QDR ex01sw-ibs0.xxxx.com" ( )
           3   33[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   34[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   35[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           3   36[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
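A port listing this long is easier to review when filtered down to the links that look unusual. The following is a minimal sketch, assuming the line layout shown above: the first column is presumably the LID, followed by the port number, link width, per-lane speed, state, and the quoted peer description. The function name `suspicious_links` and the 10.0 Gbps threshold are my own choices for illustration.

```python
import re

# One line of the port listing: LID (presumably), port, width, speed,
# logical/physical state, optional peer LID and port, quoted peer name.
LINK = re.compile(
    r'^\s*(\d+)\s+(\d+)\[\s*\]\s*==\(\s*(\d+)X\s+([\d.]+) Gbps\s+'
    r'([A-Za-z]+)/\s*([A-Za-z]+)\)==>\s*(?:(\d+)\s+(\d+))?\[\s*\]\s*"([^"]*)"'
)

def suspicious_links(listing, expected_gbps=10.0):
    """Return links that are up but slow, or up with no peer description."""
    flagged = []
    for line in listing.splitlines():
        m = LINK.match(line)
        if not m:
            continue  # switch header lines and warnings are skipped
        lid, port = int(m.group(1)), int(m.group(2))
        speed = float(m.group(4))
        phys_state = m.group(6)
        peer = m.group(9)
        if phys_state != "LinkUp":
            continue  # unused ports are reported as Down/Disabled
        reasons = []
        if speed < expected_gbps:
            reasons.append("below expected speed")
        if not peer:
            reasons.append("no peer reported")
        if reasons:
            flagged.append((lid, port, speed, peer, reasons))
    return flagged

# Four lines copied from the ex01sw-ibb0 section of the listing.
SAMPLE = '''
           2    5[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>      17    2[  ] "ex01cel06 C 192.168.10.10 HCA-1" ( )
           2    6[  ] ==( 4X 10.0 Gbps Active/  LinkUp)==>             [  ] "" ( )
           2   11[  ] ==( 4X 2.5 Gbps   Down/Disabled)==>             [  ] "" ( )
           2   35[  ] ==( 4X 2.5 Gbps Active/  LinkUp)==>      26    1[  ] "nbusrv HCA-1" ( Could be 10.0 Gbps)
'''

flagged = suspicious_links(SAMPLE)
for entry in flagged:
    print(entry)
```

Applied to the full listing, this filter should single out port 6 on both leaf switches (link up, but the peer could not be queried) and port 35 on ex01sw-ibb0, where the backup server nbusrv negotiated only 2.5 Gbps. None of the flagged links belongs to ex01db02, whose two ports are both up at 10.0 Gbps.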
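Conclusion
At first I suspected the InfiniBand switches. After collecting and analyzing the OSWatcher logs, however, the cause turned out to be a software bug that leaked memory until the node ran out of memory. Since upgrading the software version was outside my scope of work, I changed two kernel parameters as a temporary workaround: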
sysctl -w vm.zone_reclaim_mode=1 
sysctl -w vm.min_free_kbytes=51200
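Values set with `sysctl -w` apply only to the running kernel, so after the next reboot it is worth confirming that they are still in effect (and adding them to /etc/sysctl.conf if they should persist). The following is a minimal sketch of such a check. It relies on the fact that a sysctl name such as `vm.min_free_kbytes` corresponds to the file `/proc/sys/vm/min_free_kbytes`; the function names and the `EXPECTED` table are illustrative.

```python
from pathlib import Path

def read_sysctl(name, root="/proc/sys"):
    """Read a kernel parameter such as 'vm.min_free_kbytes' from procfs."""
    path = Path(root).joinpath(*name.split("."))
    return path.read_text().strip()

def check_settings(expected, root="/proc/sys"):
    """Return {name: (expected, actual)} for every parameter that differs."""
    mismatches = {}
    for name, want in expected.items():
        try:
            actual = read_sysctl(name, root)
        except OSError:
            actual = None  # parameter not present or not readable
        if actual != str(want):
            mismatches[name] = (str(want), actual)
    return mismatches

# The two workaround values from this article.
EXPECTED = {"vm.zone_reclaim_mode": 1, "vm.min_free_kbytes": 51200}

if __name__ == "__main__":
    for name, (want, actual) in check_settings(EXPECTED).items():
        print(f"{name}: expected {want}, found {actual}")
```

The `root` parameter exists only so the check can be pointed at a test directory; on a real compute node the default `/proc/sys` is what you want. The script prints nothing when both values match.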