
A single RAC node failure paralyzes the entire cluster because of direct-connected private NICs

A customer's RAC environment recently hit a serious failure that disrupted normal business operation for more than an hour. Over the phone, the customer described it like this: the host of RAC node 2 rebooted by itself, and then they found that node 1 was no longer working; the application could not connect to the database and node 1's IP did not respond to ping. Looking through the KVM, they found node 1's server showing a blank screen: it had hung. By that time node 2's operating system was already up, but its clusterware (CRS) would not start (the customer reported that the ASM disk groups on node 2 could not be mounted). When we arrived on site, we found that node 1 had no CRS or database log entries for the incident at all, and its OS log had a gap with nothing recorded between 10:29 and 11:38. The customer explained that when they found node 1's server hung, they simply pressed the power button; the server then went into its power-on self-check, and when it still had not come up after half an hour they pressed the button again. Only after another half hour or so did the server finally boot. For the whole time node 1 was down, node 2 was unable to start its clusterware; only after node 1's operating system came back up did node 2's cluster stack start successfully.

The customer environment is SUSE Linux with Oracle 10.2.0.5.

Checking node 2's CSS log shows that node 1 went down at 10:29:

[CSSD]2013-01-14 10:29:51.609 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 50% heartbeat fatal, eviction in 29.970 seconds seedhbimpd 0
[CSSD]2013-01-14 10:29:51.609 [1216416064] >TRACE:   clssnmPollingThread: node db3 (1) is impending reconfig, flag 1039, misstime 30030
[CSSD]2013-01-14 10:29:51.610 [1216416064] >TRACE:   clssnmPollingThread: diskTimeout set to (57000)ms impending reconfig status(1)
[CSSD]2013-01-14 10:29:52.614 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 50% heartbeat fatal, eviction in 28.960 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:06.614 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 75% heartbeat fatal, eviction in 14.960 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:07.609 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 75% heartbeat fatal, eviction in 13.970 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:17.609 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 90% heartbeat fatal, eviction in 3.970 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:18.613 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 90% heartbeat fatal, eviction in 2.960 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:19.609 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 90% heartbeat fatal, eviction in 1.970 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:20.613 [1216416064] >WARNING: clssnmPollingThread: node db3 (1) at 90% heartbeat fatal, eviction in 0.960 seconds seedhbimpd 1
[CSSD]2013-01-14 10:30:21.577 [1216416064] >TRACE:   clssnmDiscHelper: db3, node(1) connection failed, con (0x78f860), probe((nil))
[CSSD]2013-01-14 10:30:21.577 [1216416064] >TRACE:   clssnmDiscHelper: node 1 clean up, con (0x78f860), init state 5, cur state 5
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmDoSyncUpdate: Initiating sync 9
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmDoSyncUpdate: diskTimeout set to (57000)ms
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmSetupAckWait: Ack message type (11)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmSetupAckWait: node(2) is ALIVE
[CSSD]2013-01-14 10:30:21.578 [1107310912] >TRACE:   clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243552) LATS(4115761046) Disk lastSeqNo(6243552)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmSetMinMaxVersion: min product/protocol (10.2/1.4)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmSetMinMaxVersion: max product/protocol (10.2/1.4)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmDoSyncUpdate: Terminating node 1, db3, misstime(60000) state(5), seedhbimpd(1)
[CSSD]2013-01-14 10:30:21.578 [1115703616] >TRACE:   clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243552) LATS(4115761046) Disk lastSeqNo(6243552)

So node 1 had already failed at 10:29, which is the opposite of the customer's account. (The 50% warning with roughly 30 seconds left until eviction is consistent with the 60-second CSS misscount that appears later in the log.) Now look at what node 2 logged next:

[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmSetupAckWait: Ack message type (13)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmCheckDskInfo: Checking disk info...
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmCheckDskSleepTime: Node 1, db3, is alive, DHB (1358130621, 1976175180) more than disk timeout of 57000 after the last NHB (1358130562, 1976115440)
[CSSD]2013-01-14 10:30:21.578 [2191697312] >TRACE:   clssgmQueueGrockEvent: groupName(crs_version) count(2) master(0) event(2), incarn 8, mbrc 2, to member 1, events 0x0, state 0x0
[CSSD]2013-01-14 10:30:21.578 [1124096320] >TRACE:   clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243552) LATS(4115761046) Disk lastSeqNo(6243552)
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmCheckDskInfo: My cohort: 2
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmCheckDskInfo: Surviving cohort: 1
[CSSD]2013-01-14 10:30:21.578 [1233201472] >TRACE:   clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, db4, is smaller than cohort of 1 nodes led by node 1, db3, based on map type 2

Node 2 concluded that the loss of contact with node 1 was its own problem; to avoid split-brain, node 2 aborted itself, and its server rebooted.
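To make that last log line concrete, here is a minimal Python sketch of the cohort comparison as it is usually described for CSS: the larger cohort survives, and when the cohorts are the same size, the cohort containing the lowest node number wins. This is only an illustration of the decision visible in the log, not Oracle's actual implementation; the node numbers are taken from this incident.

def surviving_cohort(cohort_a, cohort_b):
    """Illustrative split-brain arbitration: the larger cohort survives;
    on a tie, the cohort that contains the lowest node number survives."""
    if len(cohort_a) != len(cohort_b):
        return max(cohort_a, cohort_b, key=len)
    return cohort_a if min(cohort_a) < min(cohort_b) else cohort_b

# Node 2's view in this incident: its own cohort is {2}, while the voting
# disk suggests node 1 (db3) still heads a cohort of its own, {1}.
survivor = surviving_cohort({2}, {1})
print(survivor)  # {1} -> node 2 is not in the surviving cohort and aborts itself

Node 2 ended up on the losing side of this comparison even though node 1 was the node that had actually failed, which is why it was node 2's server that rebooted.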

Node 2's server finished booting at 10:35, but node 1's server was still not up, and node 2's clusterware could not start:

[CSSD]2013-01-14 10:35:59.336 [62072224] >TRACE:   clssnmReadNodeInfo: added node 1 (db3) to cluster
[CSSD]2013-01-14 10:35:59.343 [62072224] >TRACE:   clssnmReadNodeInfo: added node 2 (db4) to cluster
[CSSD]2013-01-14 10:35:59.344 [62072224] >TRACE:   clssnmInitNMInfo: Initialized with unique 1358130959
[CSSD]2013-01-14 10:35:59.347 [62072224] >TRACE:   clssNMInitialize: Initializing with OCR id (1797159872)
[CSSD]2013-01-14 10:36:00.355 [1107310912] >TRACE:   clssnm_skgxninit: Compatible vendor clusterware not in use
[CSSD]2013-01-14 10:36:00.359 [62072224] >TRACE:   clssnmNMInitialize: misscount set to (60)
[CSSD]2013-01-14 10:36:00.360 [62072224] >TRACE:   clssnmStartNM: reboottime set to (3) sec
[CSSD]2013-01-14 10:36:00.360 [62072224] >TRACE:   clssnmNMInitialize: Network heartbeat thresholds are: impending reconfig 30000 ms, reconfig start (misscount) 60000 ms
[CSSD]2013-01-14 10:36:03.376 [1107310912] >TRACE:   clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243889) LATS(965066) Disk lastSeqNo(6243889)
[CSSD]2013-01-14 10:36:03.384 [1115703616] >TRACE:   clssnmvReadDskHeartbeat: read ALL for Joining
[CSSD]2013-01-14 10:36:03.384 [1115703616] >TRACE:   clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243889) LATS(965066) Disk lastSeqNo(6243889)
[CSSD]2013-01-14 10:36:03.404 [1124096320] >TRACE:   clssnmvReadDskHeartbeat: read ALL for Joining
[CSSD]2013-01-14 10:36:03.404 [1124096320] >TRACE:   clssnmvReadDskHeartbeat: node(1) is down. rcfg(9) wrtcnt(6243889) LATS(965086) Disk lastSeqNo(6243889)
[CSSD]2013-01-14 10:36:03.451 [1208023360] >TRACE:   clssgmWaitOnEventValue: after CmInfo State  val 3, eval 1 waited 0

Node 2's CRS log contains the following:

2013-01-14 10:35:56.811: [ default][2561460576][ENTER]0
Oracle Database 10g CRS Release 10.2.0.5.0 Production Copyright 1996, 2004, Oracle.  All rights reserved
2013-01-14 10:35:56.827: [ default][2561460576]0CRS Daemon Starting
2013-01-14 10:35:56.828: [ CRSMAIN][2561460576]0Checking the OCR device
2013-01-14 10:35:56.840: [ CRSMAIN][2561460576]0Connecting to the CSS Daemon
2013-01-14 10:35:57.063: [ COMMCRS][1082132800]clsc_connect: (0xb7f660) no listener at (ADDRESS=(PROTOCOL=ipc)(KEY=OCSSD_LL_db4_))
2013-01-14 10:35:57.063: [ CSSCLNT][2561460576]clsssInitNative: connect failed, rc 9
2013-01-14 10:35:57.064: [  CRSRTI][2561460576]0CSS is not ready. Received status 3 from CSS. Waiting for good status ..

Until node 1 came back, the CRS log kept repeating "CSS is not ready". After node 1 finally booted at 11:36, node 2's clusterware and database started successfully, and shortly afterwards node 1's clusterware and database came up as well. After analysis and a discussion with the customer, it turned out that the RAC private interconnect NICs were cabled back to back, with no switch in between. So after node 2 rebooted (while node 1 was powered off), node 2 found the heartbeat NIC unusable when it tried to start the clusterware, and the cluster stack could not come up until node 1's operating system was running again and the heartbeat NIC had link. Oracle explicitly recommends against building RAC with direct-connected private NICs; see the MOS note RAC: Frequently Asked Questions [ID 220970.1].
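This failure mode is easy to confirm from the operating system: with a back-to-back cable the interconnect NIC loses link as soon as the peer server powers off, so the surviving node has no usable private network. Below is a minimal sketch that checks the link state via /sys/class/net; the interface name eth1 is an assumption, substitute the actual private interconnect NIC. Behind a switch the link would stay up regardless of the peer's state.

"""Check whether the private-interconnect NIC has link (Linux).

Sketch only: 'eth1' is a placeholder for the real interconnect interface.
With a back-to-back cable this reports DOWN whenever the peer server is off.
"""
IFACE = "eth1"  # assumed name of the private interconnect NIC

def has_link(iface):
    try:
        with open("/sys/class/net/%s/carrier" % iface) as f:
            return f.read().strip() == "1"
    except (IOError, OSError):
        # carrier is not readable while the interface is administratively down
        return False

if __name__ == "__main__":
    print("%s link: %s" % (IFACE, "UP" if has_link(IFACE) else "DOWN"))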


Sometimes the customer's description of a problem is not accurate; you still need to analyze and judge for yourself.

So what caused node 1 to hang? Node 1's operating system log contains the following:

Jan 14 10:29:22 DB3 syslog-ng[5942]: Changing permissions on special file /dev/xconsole
Jan 14 10:29:22 DB3 syslog-ng[5942]: Changing permissions on special file /dev/tty10
Jan 14 10:29:22 DB3 kernel: printk: 8 messages suppressed.
Jan 14 10:29:22 DB3 kernel: The following is only an harmless informational message.
Jan 14 10:29:22 DB3 kernel: Unless you get a _continuous_flood_ of these messages it means
Jan 14 10:29:22 DB3 kernel: everything is working fine. Allocations from irqs cannot be
Jan 14 10:29:22 DB3 kernel: perfectly reliable and the kernel is designed to handle that.
Jan 14 10:29:22 DB3 kernel: oracle: page allocation failure. order:0, mode:0x20
Jan 14 10:29:22 DB3 kernel:
Jan 14 10:29:22 DB3 kernel: Call Trace: <IRQ> <ffffffff80168162>{__alloc_pages+796}
Jan 14 10:29:22 DB3 kernel: klogd 1.4.1, ---------- state change ----------
Jan 14 10:29:22 DB3 kernel:        <ffffffff801837bc>{kmem_getpages+106} <ffffffff80184bbe>{fallback_alloc+304}
Jan 14 10:29:22 DB3 kernel:        <ffffffff801850c3>{__kmalloc+179} <ffffffff8028f362>{__alloc_skb+93}
Jan 14 10:29:22 DB3 kernel:        <ffffffff881a9266>{:netxen_nic:netxen_alloc_rx_skb+39}
Jan 14 10:29:22 DB3 kernel:        <ffffffff881ab594>{:netxen_nic:netxen_process_rcv_ring+1604}
Jan 14 10:29:22 DB3 kernel:        <ffffffff8010f013>{do_gettimeofday+92} <ffffffff881aa9e9>{:netxen_nic:netxen_process_cmd_ring+46}
Jan 14 10:29:22 DB3 kernel:        <ffffffff881a5eb8>{:netxen_nic:netxen_nic_poll+64} <ffffffff80295144>{net_rx_action+165}
Jan 14 10:29:22 DB3 kernel:        <ffffffff8013a76c>{__do_softirq+85} <ffffffff8010c216>{call_softirq+30}
Jan 14 10:29:22 DB3 kernel:        <ffffffff8010d1a4>{do_softirq+44} <ffffffff8010d56e>{do_IRQ+64}
Jan 14 10:29:22 DB3 kernel:        <ffffffff8010b352>{ret_from_intr+0} <EOI>
Jan 14 10:29:22 DB3 kernel: Mem-info:

This is the only log information recorded before the hang on January 14. From it we can infer that node 1 hung because memory pages could not be allocated while an oracle process was on the CPU; the call trace shows the failure was hit in the netxen NIC driver's receive interrupt path.
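To check whether this was a one-off event or a recurring condition, it is enough to scan the OS log for "page allocation failure" messages. A minimal sketch follows, assuming the usual SUSE log location /var/log/messages and the classic syslog timestamp format shown above.

"""Count 'page allocation failure' events per day in a syslog file.

Sketch only: assumes lines like 'Jan 14 10:29:22 DB3 kernel: ...'
and the default /var/log/messages path.
"""
from collections import Counter

LOG = "/var/log/messages"  # point this at the archived copy if the log rotated

def alloc_failures(path):
    per_day = Counter()
    with open(path) as f:
        for line in f:
            if "page allocation failure" in line:
                # key on the 'Mon DD' part of the timestamp, e.g. 'Jan 14'
                per_day[" ".join(line.split()[:2])] += 1
    return per_day

if __name__ == "__main__":
    for day, count in sorted(alloc_failures(LOG).items()):
        print("%-6s %d" % (day, count))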

To summarize the sequence of events: at 10:29, Oracle on node 1 failed to allocate memory pages and node 1's operating system hung. Node 2 detected that node 1 was gone, but it did not win the voting-disk arbitration, so it concluded that it was the one that had lost connectivity and rebooted itself. Because the heartbeat network between the two nodes runs over a direct cable, node 2's heartbeat NIC was unusable for as long as node 1's operating system was down, so node 2's clusterware could not start until node 1's operating system came back up.

 


Comments (4):

  1. Lost:

     The article says: "the RAC private interconnect NICs were cabled back to back, with no switch in between. So after node 2 rebooted (while node 1 was powered off) ..."
     Could you explain what a direct NIC connection actually looks like?

     2013-01-17 23:09
  2. streamsong:

     @Lost
     They used a single network cable to connect the private NICs of the two servers directly, without a switch in between. With that setup, when one server shuts down abnormally, CRS reboots the other node, which is exactly what happened to this customer; and as long as the server that went down has not been started again, the other node's CRS cannot start normally. Oracle no longer supports this kind of direct connection; there are related articles on MOS.

     2013-01-18 21:47
  3. 藏章博客:

     It's been a while since I visited your blog, just dropping by! Let's keep visiting each other.

     2013-01-24 17:32
  4. streamsong:

     @藏章博客
     Your posts are all classics.

     2013-01-25 14:33
