
Troubleshooting an Oracle Exadata X3-2 Failure at a Provincial Department of Human Resources and Social Security

Note

This is an original article by the site owner. All rights reserved; do not reproduce it without permission.

Info

Database environment:
Hardware: Oracle Exadata X3-2 Database Machine
Operating system: Oracle Enterprise Linux 5.8
Oracle version: 11.2.0.4

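Incident Overview

I was the technical lead when this Exadata machine was deployed, working alongside the Oracle engineers assigned to the project. My side built the department's supporting infrastructure (NTP server, DNS server, database management server, database disaster recovery server, and so on) and migrated the data of the provincial department's business systems (enterprise pension, medical insurance, social security, rural insurance, proof-of-life verification, and others). After delivery, the whole environment ran stably for more than two years.

Recently the customer reported that one compute node of the Exadata machine kept rebooting unexpectedly and asked for help finding the cause. The troubleshooting process is described below.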

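Troubleshooting Details

Analysis of compute node ex01db02 showed that the most recent incident occurred at about 11:30 on October 22, 2016: ex01db02 was evicted from the cluster by ex01db01, ex01db03, and ex01db04, and then rebooted.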

Node ex01db01 (10.136.8.130)
/u01/app/11.2.0.3/grid/log/ex01db01/alertex01db01.log

2016-10-22 11:29:44.785
[cssd(8527)]CRS-1612:Network communication with node ex01db02 (2) missing for 50% of timeout interval. Removal of this node from cluster in 29.800 seconds
2016-10-22 11:29:59.790
[cssd(8527)]CRS-1611:Network communication with node ex01db02 (2) missing for 75% of timeout interval. Removal of this node from cluster in 14.800 seconds
2016-10-22 11:30:08.792
[cssd(8527)]CRS-1610:Network communication with node ex01db02 (2) missing for 90% of timeout interval. Removal of this node from cluster in 5.800 seconds
2016-10-22 11:30:15.209
[cssd(8527)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ex01db01 ex01db03 ex01db04 .

Log from compute node ex01db04

2016-10-22 11:29:44.926
[cssd(8103)]CRS-1612:Network communication with node ex01db02 (2) missing for 50% of timeout interval. Removal of this node from cluster in 29.670 seconds
2016-10-22 11:29:59.929
[cssd(8103)]CRS-1611:Network communication with node ex01db02 (2) missing for 75% of timeout interval. Removal of this node from cluster in 14.670 seconds
2016-10-22 11:30:08.932
[cssd(8103)]CRS-1610:Network communication with node ex01db02 (2) missing for 90% of timeout interval. Removal of this node from cluster in 5.670 seconds
2016-10-22 11:30:15.096
[cssd(8103)]CRS-1632:Node ex01db02 is being removed from the cluster in cluster incarnation 264175490
2016-10-22 11:30:15.207
[cssd(8103)]CRS-1601:CSSD Reconfiguration complete. Active nodes are ex01db01 ex01db03 ex01db04 .
2016-10-22 11:30:15.223
[crsd(11023)]CRS-5504:Node down event reported for node 'ex01db02'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'Generic'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'ora.dbm'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'ora.xxsbcw'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'ora.nbdb'.
2016-10-22 11:30:17.307
[crsd(11023)]CRS-2773:Server 'ex01db02' has been removed from pool 'ora.qyyl'.
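
The CRS-1612, CRS-1611, and CRS-1610 warnings each report how many seconds remain before the node is removed, so adding that countdown to the message timestamp gives a projected eviction time that can be compared with the CRS-1601 reconfiguration entry. The following is a minimal Python sketch of that arithmetic, not a tool used in the original investigation. It assumes the alert log layout shown above (a timestamp line followed by a message line); the names `ALERT_LOG`, `WARNING`, and `projected_evictions` are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Three warnings copied from the ex01db01 alert log excerpt above.
ALERT_LOG = """\
2016-10-22 11:29:44.785
[cssd(8527)]CRS-1612:Network communication with node ex01db02 (2) missing for 50% of timeout interval. Removal of this node from cluster in 29.800 seconds
2016-10-22 11:29:59.790
[cssd(8527)]CRS-1611:Network communication with node ex01db02 (2) missing for 75% of timeout interval. Removal of this node from cluster in 14.800 seconds
2016-10-22 11:30:08.792
[cssd(8527)]CRS-1610:Network communication with node ex01db02 (2) missing for 90% of timeout interval. Removal of this node from cluster in 5.800 seconds
"""

# Matches the network heartbeat warnings and captures node, percent, countdown.
WARNING = re.compile(
    r"CRS-161[012]:Network communication with node (\S+) \(\d+\) missing for "
    r"(\d+)% of timeout interval\.\s+Removal of this node from cluster in "
    r"([\d.]+) seconds"
)


def projected_evictions(text):
    """Return (node, percent, projected_removal_time) for each warning."""
    results = []
    stamp = None
    for line in text.splitlines():
        line = line.strip()
        try:
            # A timestamp line applies to the message line that follows it.
            stamp = datetime.strptime(line, "%Y-%m-%d %H:%M:%S.%f")
            continue
        except ValueError:
            pass
        match = WARNING.search(line)
        if match and stamp is not None:
            node = match.group(1)
            percent = int(match.group(2))
            remaining = float(match.group(3))
            results.append((node, percent, stamp + timedelta(seconds=remaining)))
    return results


if __name__ == "__main__":
    for node, percent, deadline in projected_evictions(ALERT_LOG):
        print(node, f"{percent}%", deadline.strftime("%H:%M:%S.%f")[:-3])
```

For the ex01db01 excerpt, all three warnings project a removal time within the second 11:30:14, shortly before the CRS-1601 entry logged at 11:30:15.209. This is consistent with ex01db02 losing network heartbeat for the whole timeout interval rather than intermittently.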

InfiniBand switch log information collected for ex01db02:

head: cannot open 'VERSION_FILE' for reading: No such file or directory

[ DB Machine Infiniband Cabling Topology Verification Tool ]
[Version IBD VER 2.c ]
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
Use of uninitialized value in hash element at /opt/oracle.SupportTools/ibdiagtools/topologies/fetchTopology.pm line 665, <SFILE> line 1.
Use of uninitialized value in hash element at /opt/oracle.SupportTools/ibdiagtools/topologies/fetchTopology.pm line 665, <SFILE> line 2.
External non-Exadata-image nodes found: check for ZFS if on T4-4 - else ignore
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port

Bad link:Switch 0x2128f56da3a0a0 Port 6A - nbusrv Port 17A
Reason : 2.5 Gbps Speed found. Could be 5 Gbps
Possible cause : Cable isn't fully seated in
Spine switch found: ex01sw-ibs0.xxxx.com (2128f56d3ca0a0)
Leaf switch found: ex01sw-ibb0.xxxx.com (2128f56da3a0a0)
Leaf switch found: ex01sw-iba0.xxxx.com (2128f56bf6a0a0)

Found 2 leaf, 1 spine, 0 top spine switches

Check if all hosts have 2 CAs to different switches...............[SUCCESS]
Leaf switch check: cardinality and even distribution..............[ERROR]

Leaf Switch ex01sw-ibb0.xxxx.com with GUID 0x2128f56da3a0a0has fewer than 7 links to storage cells
It has 6 links ( 17A 17B 16A 16B 15A 14B)to storage cells
[ERROR]

Leaf Switch ex01sw-iba0.xxxx.com with GUID 0x2128f56bf6a0a0has fewer than 7 links to storage cells
It has 6 links (17A 17B 16A 16B 15A 14B )to storage cells
[ERROR]
2 switches did not meet this requirement

Spine switch check: Are any Exadata nodes connected ..............[SUCCESS]
Spine switch check: Any inter spine switch links..................[SUCCESS]
Spine switch check: Any inter top-spine switch links..............[SUCCESS]
Spine switch check: Correct number of spine-leaf links............[SUCCESS]
Leaf switch check: Inter-leaf link check..........................[SUCCESS]
Leaf switch check: Correct number of leaf-spine links.............[SUCCESS]



ibwarn: [85414] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,1,6)
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,6) failed, skipping port
ibwarn: [85414] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,1,13,6)
src/ibnetdisc.c:507; Query remote node (DR path slid 0; dlid 0; 0,1,13,6) failed, skipping port
Switch 0x002128f56d3ca0a0 SUN DCS 36P QDR ex01sw-ibs0.xxxx.com:
1 1[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 2[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 3[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 4[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 5[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 6[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 7[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 8[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 9[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 10[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 11[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 12[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 13[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 14[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 15[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 16[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 17[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 18[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 19[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 32[ ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
1 20[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 21[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 32[ ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
1 22[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 23[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 24[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 25[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 26[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 27[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 28[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 29[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 30[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 31[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 32[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 33[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 34[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 35[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
1 36[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
Switch 0x002128f56da3a0a0 SUN DCS 36P QDR ex01sw-ibb0.xxxx.com:
2 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 19 2[ ] "ex01cel02 C 192.168.10.6 HCA-1" ( )
2 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 21 2[ ] "ex01cel01 C 192.168.10.5 HCA-1" ( )
2 3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 9 2[ ] "ex01cel04 C 192.168.10.8 HCA-1" ( )
2 4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 7 2[ ] "ex01cel03 C 192.168.10.7 HCA-1" ( )
2 5[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 17 2[ ] "ex01cel06 C 192.168.10.10 HCA-1" ( )
2 6[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> [ ] "" ( )
2 7[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 25 2[ ] "ex01db01 S 192.168.10.1 HCA-1" ( )
2 8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 11 2[ ] "ex01cel07 C 192.168.10.11 HCA-1" ( )
2 9[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 5 2[ ] "ex01db03 S 192.168.10.3 HCA-1" ( )
2 10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 13 2[ ] "ex01db02 S 192.168.10.2 HCA-1" ( )
2 11[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 12[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 23 2[ ] "ex01db04 S 192.168.10.4 HCA-1" ( )
2 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 14[ ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
2 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 13[ ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
2 15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 16[ ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
2 16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 15[ ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
2 17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 18[ ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
2 18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 17[ ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
2 19[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 20[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 21[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 22[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 23[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 24[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 25[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 26[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 27[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 28[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 29[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 30[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 31[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 3 31[ ] "SUN DCS 36P QDR ex01sw-iba0.xxxx.com" ( )
2 32[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 1 19[ ] "SUN DCS 36P QDR ex01sw-ibs0.xxxx.com" ( )
2 33[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 34[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 35[ ] ==( 4X 2.5 Gbps Active/ LinkUp)==> 26 1[ ] "nbusrv HCA-1" ( Could be 10.0 Gbps)
2 36[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
Switch 0x002128f56bf6a0a0 SUN DCS 36P QDR ex01sw-iba0.xxxx.com:
3 1[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 18 1[ ] "ex01cel02 C 192.168.10.6 HCA-1" ( )
3 2[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 20 1[ ] "ex01cel01 C 192.168.10.5 HCA-1" ( )
3 3[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 8 1[ ] "ex01cel04 C 192.168.10.8 HCA-1" ( )
3 4[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 6 1[ ] "ex01cel03 C 192.168.10.7 HCA-1" ( )
3 5[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 16 1[ ] "ex01cel06 C 192.168.10.10 HCA-1" ( )
3 6[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> [ ] "" ( )
3 7[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 24 1[ ] "ex01db01 S 192.168.10.1 HCA-1" ( )
3 8[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 10 1[ ] "ex01cel07 C 192.168.10.11 HCA-1" ( )
3 9[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 4 1[ ] "ex01db03 S 192.168.10.3 HCA-1" ( )
3 10[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 12 1[ ] "ex01db02 S 192.168.10.2 HCA-1" ( )
3 11[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 12[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 22 1[ ] "ex01db04 S 192.168.10.4 HCA-1" ( )
3 13[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 14[ ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
3 14[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 13[ ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
3 15[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 16[ ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
3 16[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 15[ ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
3 17[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 18[ ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
3 18[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 17[ ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
3 19[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 20[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 21[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 22[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 23[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 24[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 25[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 26[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 27[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 28[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 29[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 30[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 31[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 2 31[ ] "SUN DCS 36P QDR ex01sw-ibb0.xxxx.com" ( )
3 32[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 1 21[ ] "SUN DCS 36P QDR ex01sw-ibs0.xxxx.com" ( )
3 33[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 34[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 35[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
3 36[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
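
A port listing this long is easier to review by filtering for anomalies. The sketch below assumes the line layout shown above and flags two conditions: ports that are Active/LinkUp but report a speed below 10.0 Gbps, and ports that are Active/LinkUp but have no identified peer. The function name `suspicious_links`, the `expected_gbps` default, and the reason strings are illustrative choices, not output of any Oracle tool.

```python
import re

# A few lines copied from the leaf switch ex01sw-ibb0 listing above.
SAMPLE = """\
2 5[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> 17 2[ ] "ex01cel06 C 192.168.10.10 HCA-1" ( )
2 6[ ] ==( 4X 10.0 Gbps Active/ LinkUp)==> [ ] "" ( )
2 11[ ] ==( 4X 2.5 Gbps Down/Disabled)==> [ ] "" ( )
2 35[ ] ==( 4X 2.5 Gbps Active/ LinkUp)==> 26 1[ ] "nbusrv HCA-1" ( Could be 10.0 Gbps)
"""

# Captures: switch LID, port, speed, logical state, physical state, peer name.
LINK = re.compile(
    r'^\s*(\d+)\s+(\d+)\[\s*\]\s*==\(\s*\d+X\s+([\d.]+) Gbps\s+'
    r'([A-Za-z]+)/\s*([A-Za-z]+)\)==>\s*(?:\d+\s+\d+)?\[\s*\]\s*"([^"]*)"'
)


def suspicious_links(text, expected_gbps=10.0):
    """Return (lid, port, reason) for links that are up but look wrong."""
    findings = []
    for line in text.splitlines():
        match = LINK.match(line)
        if not match:
            continue  # switch header lines and warnings are skipped
        lid, port = int(match.group(1)), int(match.group(2))
        speed = float(match.group(3))
        state, phys, peer = match.group(4), match.group(5), match.group(6)
        if state != "Active" or phys != "LinkUp":
            continue  # unused ports are reported as Down/Disabled
        if not peer:
            findings.append((lid, port, "link up, peer not identified"))
        if speed < expected_gbps:
            findings.append(
                (lid, port, f"link up at {speed} Gbps, expected {expected_gbps}")
            )
    return findings


if __name__ == "__main__":
    for finding in suspicious_links(SAMPLE):
        print(finding)
```

Run against the full listing, a filter like this would point at port 6 on both leaf switches (link up, no peer described, matching the failed `0,1,6` and `0,1,13,6` queries) and at port 35 on ex01sw-ibb0 (the nbusrv link at 2.5 Gbps). The links to ex01db02 itself on port 10 of both leaf switches are reported as Active/LinkUp at 10.0 Gbps.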

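Conclusion

I initially suspected the InfiniBand switches. After collecting and analyzing the OSWatcher logs, however, I found that a software bug was causing a memory leak that eventually exhausted memory on the node. Because upgrading the software was outside my scope of work, I changed the following kernel parameters as a temporary workaround: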

sysctl -w vm.zone_reclaim_mode=1
sysctl -w vm.min_free_kbytes=51200
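
Values set with `sysctl -w` apply only to the running kernel and are lost on reboot unless they are also written to `/etc/sysctl.conf`; the original notes do not say whether that was done. Since unexpected reboots are the symptom here, it may be worth checking the live values afterwards. The sketch below reads them from `/proc/sys`, where each sysctl key maps to a file path. The names `EXPECTED`, `read_sysctl`, and `mismatches` are hypothetical helpers for illustration.

```python
from pathlib import Path

# Values taken from the two sysctl commands above.
EXPECTED = {"vm.zone_reclaim_mode": "1", "vm.min_free_kbytes": "51200"}


def read_sysctl(key, root="/proc/sys"):
    """Read a sysctl value from procfs, e.g. vm.min_free_kbytes."""
    path = Path(root, *key.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        # The file is absent if the kernel does not expose this parameter.
        return None


def mismatches(expected, read=read_sysctl):
    """Return {key: (wanted, actual)} for every value that differs."""
    result = {}
    for key, want in expected.items():
        have = read(key)
        if have != want:
            result[key] = (want, have)
    return result


if __name__ == "__main__":
    diff = mismatches(EXPECTED)
    if diff:
        for key, (want, have) in diff.items():
            print(f"{key}: expected {want}, found {have}")
    else:
        print("all parameters match")
```

The `read` argument exists so the comparison can be exercised without touching the real `/proc/sys`, for example by passing the `get` method of a dictionary of test values.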