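Judging from the cssd process log, things look fairly normal, so why won't it come up? According to the Oracle note Troubleshooting 10g and 11.1 Clusterware Reboots [ID 265769.1], two processes can cause a node reboot, ocssd.bin and oprocd. But right now the node is not rebooting at all. crsctl start crs does start these processes, yet crsctl check crs just hangs.
After rebooting the node several times I found that the cssd process can come up. Occasionally crsctl check css shows the css process as up, but the status check takes a very long time to return. I vaguely suspected an I/O problem, and the customer also suggested checking I/O first. Reading the OCR file with dd works, so I/O appears to be fine. After a while all the *d.bin processes had stopped, and the node still had not rebooted.....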
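Since a hanging crsctl check crs is the main symptom here, it can help to put a hard time limit on each check and record how long it took. What follows is only a minimal sketch: `run_with_timeout` is a hypothetical helper name, not part of any Oracle tooling, and it assumes Python is available on the node.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s):
    """Run cmd (a list of arguments) with a time limit.

    Returns (status, elapsed_seconds, output). status is the exit code,
    or None if the command did not finish within timeout_s seconds.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
        )
        status = proc.returncode
        output = proc.stdout.decode(errors="replace")
    except subprocess.TimeoutExpired as exc:
        # subprocess.run kills the direct child before raising this.
        status = None
        output = (exc.output or b"").decode(errors="replace")
    elapsed = time.monotonic() - start
    return status, elapsed, output


# Example call (command taken from the post; assumes crsctl is on PATH):
#   status, elapsed, out = run_with_timeout(["crsctl", "check", "crs"], 60)
#   print("hung" if status is None else status, round(elapsed, 1), out)
```

The same wrapper could be used to time the dd read of the OCR file, which says a bit more than "it was readable": a read that succeeds but takes far longer than expected would still point towards I/O.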
No luck so far, so I went to check the two midrange servers that had the hardware problems: errpt -a
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
BFE4C025   0207150811 P H sysplanar0     UNDETERMINED ERROR
BFE4C025   0207150011 P H sysplanar0     UNDETERMINED ERROR
A6DF45AA   0207145811 I O RMCdaemon      The daemon is started.
9DBCFDEE   0207145411 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0207142011 T O errdemon       ERROR LOGGING TURNED OFF
BFE4C025   0207140611 P H sysplanar0     UNDETERMINED ERROR
A6DF45AA   0207140011 I O RMCdaemon      The daemon is started.
2BFA76F6   0207135611 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0207135811 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0201223311 T O errdemon       ERROR LOGGING TURNED OFF
A6DF45AA   0130225711 I O RMCdaemon      The daemon is started.
2BFA76F6   0130225511 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0130225711 T O errdemon       ERROR LOGGING TURNED ON
192AC071   0130225211 T O errdemon       ERROR LOGGING TURNED OF
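The TIMESTAMP column in this listing is packed as MMDDhhmmYY, which makes it hard to read the order of events at a glance. A small sketch that parses a few of the lines above and sorts them chronologically; `ErrptEntry` and `parse_errpt_summary` are names made up for this example.

```python
from datetime import datetime
from typing import List, NamedTuple


class ErrptEntry(NamedTuple):
    identifier: str
    when: datetime
    err_type: str    # the "T" column
    err_class: str   # the "C" column
    resource: str
    description: str


def parse_errpt_summary(text: str) -> List[ErrptEntry]:
    """Parse errpt summary lines and return them oldest first."""
    entries = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[0] == "IDENTIFIER":
            continue  # skip the header and anything incomplete
        ident, stamp, err_type, err_class, resource, description = fields
        when = datetime.strptime(stamp, "%m%d%H%M%y")  # MMDDhhmmYY
        entries.append(
            ErrptEntry(ident, when, err_type, err_class, resource, description.strip())
        )
    return sorted(entries, key=lambda e: e.when)


# A few lines copied from the listing in this post.
SAMPLE = """\
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
BFE4C025   0207150811 P H sysplanar0     UNDETERMINED ERROR
A6DF45AA   0207140011 I O RMCdaemon      The daemon is started.
2BFA76F6   0207135611 T S SYSPROC        SYSTEM SHUTDOWN BY USER
9DBCFDEE   0207135811 T O errdemon       ERROR LOGGING TURNED ON
"""

for entry in parse_errpt_summary(SAMPLE):
    print(entry.when.isoformat(sep=" ", timespec="minutes"),
          entry.identifier, entry.resource, entry.description)
```

Sorted this way, the shutdown at 13:56 comes first and the sysplanar0 errors follow the subsequent boots.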
errpt -aj BFE4C025
LABEL:           SCAN_ERROR_CHRP
IDENTIFIER:      BFE4C025

Date/Time:       Mon Feb 7 15:08:54 BEIST 2011
Sequence Number: 16758
Machine Id:      00CE63EF4C00
Node Id:         secusz
Class:           H
Type:            PERM
Resource Name:   sysplanar0
Resource Class:  planar
Resource Type:   sysplanar_rspc
Location:

Description
UNDETERMINED ERROR

Failure Causes
UNDETERMINED

Recommended Actions
RUN SYSTEM DIAGNOSTICS.

Detail Data
PROBLEM DATA
0644 00E0 0000 05FC 9600 8E00 0000 0000 0000 0000 4942 4D00 5048 0030 0100 3F30
2011 0207 0642 4350 2011 0207 0642 4350 4500 0106 0000 0000 0000 0000 0000 0000
501D 5CD8 501D 5CD8 5548 0018 0100 3F30 6103 4400 0000 0000 0000 A004 0000 0000
5053 00F0 0101 3F30 0201 0002 0000 00E8 003C 0004 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 3131 3030 3135 3130 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 C000 0028 4C2B 4C14
5537 3837 392E 3030 312E 4451 4458 4644 5700 0000 4944 1CCD 5057 5253 504C 5900
0000 0000 0000 0000 0000 0000 0000 0000 5045 1800 3931 3137 2D35 3730 3036 4536
3345 4600 0000 0000
.... ....

The system recommends running diagnostics, so let's run diag. The diag result:
The following informational event was reported by Platform Firmware.
CEC hardware System resources deconfigured by system due to prior error event.
Supporting data:
SRC: B150FD00
Additional Words: 2-010000F0 3-28DA0110 4-C1009002 5-000000FF 6-00000002 7-00000000 8-00000000 9-00000000

Error log information:
Date: Mon Feb 7 13:56:02 BEIST 2011
Sequence number: 2408
Label: SCAN_ERROR_CHRP
Press Enter or Cancel to return to the application.
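Back to the Detail Data block of the errpt entry above: the hex dump looks opaque, but parts of it are plain ASCII. The sketch below pulls the printable runs out of an excerpt of that dump. It makes no assumption about the real layout of the record, which I do not know; `printable_runs` is a made-up helper name, and the truncated ".... ...." part of the dump has to be left out before decoding.

```python
def printable_runs(hex_dump, min_len=5):
    """Return runs of printable ASCII (at least min_len bytes long) in a hex dump."""
    data = bytes.fromhex("".join(hex_dump.split()))
    runs = []
    current = []
    for byte in data:
        if 0x20 <= byte <= 0x7E:
            current.append(chr(byte))
            continue
        if len(current) >= min_len:
            runs.append("".join(current))
        current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    # Drop padding spaces and runs that were nothing but spaces.
    return [run.strip() for run in runs if run.strip()]


# Excerpt copied from the PROBLEM DATA block in this post.
EXCERPT = """
3131 3030 3135 3130 2020 2020 2020 2020 C000 0028 4C2B 4C14
5537 3837 392E 3030 312E 4451 4458 4644 5700 0000 4944 1CCD
5057 5253 504C 5900
"""

print(printable_runs(EXCERPT))
```

On this excerpt the decoded strings are 11001510, U7879.001.DQDXFDW and PWRSPLY, which are the same SRN, location code and FRU name that the power-supply diag report further down in this post shows for sequence number 16758.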
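I googled it: apparently some memory had been deconfigured, and with CEC errors like this you are not supposed to reboot the machine. I asked 農仙, who also said it was a memory problem, so that seems basically confirmed! But we had already rebooted several times and nothing looked wrong, and lsattr -El mem0 shows the correct amount of memory. It looks like things recovered after a reboot. So I cleared the error log, rebooted the machine, and checked again: everything was normal. As for why the memory was deconfigured in the first place, 農仙 said you have to connect to ASMI to find out, which I don't know how to do! I'll leave that one to the host engineers. It has no impact on the system now, at least for the moment!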
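The lsattr -El mem0 check can be made a little more explicit by comparing the two sizes it reports, `size` (total physical memory) and `goodsize` (usable physical memory). A minimal sketch follows; `memory_sizes` is a hypothetical helper, and the sample output is made up for illustration, not taken from the machines in this post.

```python
def memory_sizes(lsattr_output):
    """Extract the 'size' and 'goodsize' values (in MB) from `lsattr -El mem0` output."""
    values = {}
    for line in lsattr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("size", "goodsize") and fields[1].isdigit():
            values[fields[0]] = int(fields[1])
    return values


# Made-up sample values, for illustration only.
SAMPLE_OUTPUT = """\
goodsize 30720 Amount of usable physical memory in Mbytes False
size     32768 Total amount of physical memory in Mbytes  False
"""

sizes = memory_sizes(SAMPLE_OUTPUT)
missing_mb = sizes["size"] - sizes["goodsize"]
if missing_mb > 0:
    print("usable memory is", missing_mb, "MB below the installed total")
else:
    print("usable memory matches the installed total")
```

If the two values match, as they apparently did here after the reboot, this check has nothing more to say, and the reason for the earlier deconfiguration still has to come from ASMI.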
Then I checked the system log on the other machine and ran a diag:

The Service Request Number(s)/Probable Cause(s) (causes are listed in descending order of probability):

11001510: Power/Cooling subsystem Unrecovered Error, bypassed with loss of redundancy. Refer to the system service documentation for more information.
Additional Words: 2-003C0004 3-00000000 4-00000000 5-00000000 6-00000000 7-00000000 8-00000000 9-00000000

Error log information:
Date: Mon Feb 7 15:08:54 BEIST 2011
Sequence number: 16758
Label: SCAN_ERROR_CHRP

Priority: L  FRU: PWRSPLY  Location: U7879.001.DQDXFDW
Priority: L  FRU: 10N8505  S/N: YL11C7157160  CCIN: 28EA  Location: U7879.001.DQDYBNR-P1-C8

Use Enter to continue.

So it is a power problem. I assumed a power supply had died, but after talking with the on-site engineers it turned out that one power supply's plug had not been seated properly when they powered the machine on. It is fine now, but the yellow light on the system panel still had not gone out, so I cleared it with a command and then rebooted the host. The boot was painfully slow; an hour later it still had not finished!! .... While waiting for the reboot I went back to the RAC problem, and this time I found something new :)