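AIX volume group failure: troubleshooting and recovery
Fault description:
In the morning I noticed that the daily report had not been sent, so I logged in to the standby database server to find out why. Command used to view the system error log: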
errpt |more
Nothing looked seriously wrong, but the log did contain the following entries:
F3931284 0410055009 I H ent2 ETHERNET NETWORK RECOVERY MODE
F3931284 0410055009 I H ent0 ETHERNET NETWORK RECOVERY MODE
173C787F 0410053709 I S topsvcs Possible malfunction on local adapter
173C787F 0410053709 I S topsvcs Possible malfunction on local adapter
EC0BCCD4 0410053709 T H ent2 ETHERNET DOWN
EC0BCCD4 0410053709 T H ent0 ETHERNET DOWN
These timestamps fall exactly in the period when a colleague was replacing the Ethernet switch.
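The second column of each errpt summary line is a timestamp in MMDDhhmmYY form, which is easy to misread. A minimal sketch of a decoder follows; the function name errpt_time is made up for illustration, and it assumes the two-digit year belongs to the 2000s.

```shell
# Decode an errpt summary timestamp (MMDDhhmmYY) into "YYYY-MM-DD hh:mm".
# Assumption: the two-digit year is 20YY.
errpt_time() {
    ts=$1
    mo=$(printf '%s' "$ts" | cut -c1-2)   # month
    dd=$(printf '%s' "$ts" | cut -c3-4)   # day
    hh=$(printf '%s' "$ts" | cut -c5-6)   # hour
    mi=$(printf '%s' "$ts" | cut -c7-8)   # minute
    yy=$(printf '%s' "$ts" | cut -c9-10)  # year
    printf '20%s-%s-%s %s:%s\n' "$yy" "$mo" "$dd" "$hh" "$mi"
}

# errpt_time 0410053709   # ETHERNET DOWN entries
# errpt_time 0410055009   # NETWORK RECOVERY MODE entries
```

Run against the two timestamps in the listing, this makes it easier to line the adapter events up with the time of the switch replacement.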
Output of the database synchronization script:
# sh /home/oracle/sh/rmanres.sh
[YOU HAVE NEW MAIL]
0516-040 lqueryvg: Unable to read the specified physical volume
descriptor area.
0516-932 /usr/sbin/syncvg: Unable to synchronize volume group backvg.
[YOU HAVE NEW MAIL]
restoring datafile 00058 to /u01/oracle/product/9.2.0/oradata/orcl/yy33.dbf
restoring datafile 00059 to /u01/oracle/product/9.2.0/oradata/orcl/yy34.dbf
released channel: ch1
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of restore command at 04/20/2009 12:06:25
ORA-19501: read error on file "/u03/orabackup/rman/orcl_db_684391660_523_1", blockno 8192001 (blocksize=8192)
ORA-27063: skgfospo: number of bytes read/written is incorrect
IBM AIX RISC System/6000 Error: 12: Not enough space
Additional information: -1
Additional information: 1048576
ORA-19501: read error on file "/u03/orabackup/rman/orcl_db_684391660_523_1", blockno 8191873 (blocksize=8192)
ORA-27063: skgfospo: number of bytes read/written is incorrect
Recovery Manager complete.
[YOU HAVE NEW MAIL]
SQL*Plus: Release 9.2.0.1.0 - Production on Mon Apr 20 12:06:26 2009
Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved.
SP2-0640: Not connected
SP2-0640: Not connected
ERROR:
ORA-12500: TNS:listener failed to start a dedicated server process
SP2-0640: Not connected
SP2-0640: Not connected
System state (process list):
# ps -ef |more
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Dec 16 - 0:55 /etc/init
root 61572 78170 0 Dec 16 - 359:56 dtgreet
root 69798 1 0 Dec 16 - 0:00 /usr/lib/errdemon
root 73882 1 0 Dec 16 - 71:56 /usr/sbin/syncd 60
root 90242 1 0 Dec 16 - 0:00 /usr/dt/bin/dtlogin -daemon
root 102438 344388 0 13:18:46 pts/7 0:00 -ksh
root 118898 102438 0 13:19:03 pts/7 0:00 ps -ef
root 127086 1 0 Dec 16 - 0:00 /usr/ccs/bin/shlap64
root 143514 106918 0 Dec 16 - 0:00 /usr/sbin/rsct/bin/IBM.ERrmd
root 155816 106918 0 Dec 16 - 2:24 /usr/sbin/rsct/bin/IBM.CSMAgentRMd
root 159976 106918 0 Dec 16 - 3:08 /usr/sbin/rsct/bin/rmcd -a IBM.LPCommands -r
root 164070 352610 0 Dec 16 - 37:11 /usr/sbin/rsct/bin/hats_nim
daemon 168160 106918 0 Dec 16 - 0:00 /usr/sbin/rpc.statd -d 0 -t 50
oracle 180262 1 0 Dec 16 - 0:02 ora_reco_rmandb
root 184400 106918 0 Dec 16 - 1:01 /usr/sbin/gsclvmd
oracle 205000 1 0 11:26:43 - 0:00 ora_pmon_orcl
root 233570 106918 0 Dec 16 - 7:56 /usr/sbin/rsct/bin/IBM.HostRMd
oracle 237696 1 0 12:29:22 - 0:00 oracleorcl (LOCAL=NO)
root 241712 352610 0 Dec 16 - 50:29 /usr/sbin/rsct/bin/hats_rs232_nim
root 245830 106918 0 Dec 16 - 0:00 /usr/sbin/muxatmd
root 278610 352610 0 Dec 16 - 30:31 /usr/sbin/rsct/bin/hats_nim
oracle 307362 1 0 Dec 16 - 0:06 ora_d000_rmandb
root 315394 106918 0 Dec 16 - 0:10 /usr/sbin/aixmibd
root 352384 106918 0 Dec 16 - 0:05 /usr/sbin/snmpmibd
root 372834 1 0 12:13:02 - 0:00 lsvg -o
oracle 389264 1 0 11:26:43 - 0:00 ora_ckpt_orcl
root 393248 1 0 12:11:24 - 0:00 lsvg -o
root 397368 1 0 12:21:43 - 0:00 lsvg -o
root 405556 1 0 12:15:51 - 0:00 lspv
root 417854 450810 0 12:06:28 - 0:00 lqueryvg -g 00c64e4b00004c000000011dbddadf95 -CX
root 426226 1 0 12:47:15 - 0:00 lsvg statvg
oracle 434210 1 0 12:07:13 - 0:00 oracleorcl (LOCAL=NO)
oracle 442388 1 0 11:26:43 - 0:00 ora_lgwr_orcl
oracle 446680 1 0 11:26:43 - 0:00 ora_dbw0_orcl
root 450810 1 0 12:06:28 - 0:00 /usr/bin/ksh /usr/sbin/varyoffvg backvg
root 61802 90242 0 Dec 16 - 8:20 /usr/lpp/X11/bin/X -D /usr/lib/X11//rgb -T -force :0 -auth /var/dt/A:0-ozyiia
root 74076 106918 0 Dec 16 - 1:34 /usr/sbin/snmpd
root 78170 90242 0 Dec 16 - 0:00 dtlogin <:0> -daemon
root 86416 106918 0 Dec 16 - 0:02 /usr/sbin/syslogd
root 94582 106918 0 Dec 16 - 0:00 /usr/sbin/inetd
root 98768 106918 0 Dec 16 - 13:14 /usr/es/sbin/cluster/clcomd -d
root 106918 1 0 Dec 16 - 0:00 /usr/sbin/srcmstr
root 115134 106918 0 Dec 16 - 0:00 /usr/sbin/portmap
root 119210 1 0 Dec 16 - 0:22 /usr/sbin/cron
root 131516 1 0 Dec 16 - 0:00 /usr/sbin/uprintfd
root 139680 1 0 Dec 16 lft0 0:00 /usr/sbin/getty /dev/console
root 143754 102438 0 13:19:03 pts/7 0:00 more
root 151986 106918 0 Dec 16 - 0:00 /usr/sbin/rsct/bin/IBM.ServiceRMd
root 156076 106918 0 Dec 16 - 0:00 /usr/sbin/rsct/bin/IBM.AuditRMd
oracle 168230 1 0 11:26:43 - 0:00 ora_d000_orcl
oracle 172368 1 0 11:26:43 - 0:00 ora_arc0_orcl
oracle 287158 1 0 11:26:43 - 0:00 ora_smon_orcl
oracle 299364 1 0 11:26:43 - 0:00 ora_reco_orcl
root 319924 1 0 11:51:24 - 0:00 lspv hdisk5
root 332234 106918 0 Dec 16 - 5:53 hagsd grpsvcs
oracle 336330 1 0 Dec 16 - 5:07 ora_dbw0_rmandb
root 344388 94582 0 13:18:45 - 0:00 telnetd -a
root 352610 106918 0 Dec 16 - 55:44 /usr/sbin/rsct/bin/hatsd -n 1 -o deadManSwitch
oracle 356856 1 0 Dec 16 - 11:53 ora_ckpt_rmandb
oracle 360852 1 0 Dec 16 - 5:24 ora_smon_rmandb
root 369086 106918 0 Dec 16 - 51:38 /usr/es/sbin/cluster/clstrmgr
root 389556 106918 0 Dec 16 - 11:02 /usr/es/sbin/cluster/clinfo
oracle 393484 1 0 Dec 16 - 4:17 ora_pmon_rmandb
oracle 418112 1 0 Dec 16 - 0:04 /home/oracle/product/9.2.0/bin/tnslsnr LISTENER -inherit
root 422200 106918 0 Dec 16 - 0:08 haemd HACMP 1 Cluster SECNOSUPPORT
root 438682 106918 0 Dec 16 - 0:05 /usr/sbin/qdaemon
root 442776 106918 0 Dec 16 - 0:00 /usr/sbin/rpc.lockd -d 0
root 446934 106918 0 Dec 16 - 0:00 /usr/sbin/writesrv
root 451032 106918 0 Dec 16 - 0:00 /usr/sbin/biod 6
root 471540 106918 0 Dec 16 - 0:21 sendmail: accepting connections
oracle 479602 1 0 Dec 16 - 1:33 ora_lgwr_rmandb
root 491900 106918 0 Dec 16 - 0:05 /usr/sbin/hostmibd
oracle 495908 1 0 11:26:43 - 0:00 ora_arc1_orcl
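The stuck LVM commands are scattered through a listing like the one above. A small filter can pull them out; this is only a sketch, the function name lvm_procs and the list of command names are my own choices, and it assumes the `ps -ef` column layout shown here (STIME is either a clock time or a two-word date such as "Dec 16").

```shell
# Read `ps -ef` output on stdin and print "PID PPID COMMAND" for LVM commands.
lvm_procs() {
    awk '
        $2 !~ /^[0-9]+$/ { next }    # skip the header line
        {
            # STIME is one field ("12:06:28") or two ("Dec 16"),
            # so the command starts at field 8 or field 9.
            first = ($5 ~ /^[A-Z][a-z][a-z]$/) ? 9 : 8
            cmd = $first
            for (i = first + 1; i <= NF; i++) cmd = cmd " " $i
            # Illustrative list of LVM command names; extend as needed.
            pat = "(^|[/ ])(lsvg|lspv|lqueryvg|varyoffvg|varyonvg|exportvg|importvg|chvg|syncvg)( |$)"
            if (cmd ~ pat) print $2, $3, cmd
        }'
}

# Usage on the affected node:
#   ps -ef | lvm_procs
```

A PPID of 1 in the second output column marks a command whose parent has gone away and which now belongs to init.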
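Environment: two midrange AIX servers and one storage array; the two machines form an HACMP cluster.
There are three volume groups: dbvg, statvg, and backvg.
Primary node volume group: dbvg
Standby node volume group: statvg
backvg is accessible from both nodes and is used for backups.
Problem description: on the standby node, any command that touches a volume group or PV now hangs and never returns.
From the process list I could tell that the volume group backvg was locked.
What I tried on the standby node: chvg -u backvg. After 3 hours it still had not returned; it was simply hung.
Then I ran exportvg backvg on the standby node. More than an hour later it was still hung as well.
How can I resolve this and unlock backvg? When I run varyonvg backvg on the primary node, I get: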
# varyonvg backvg
0516-013 varyonvg: The volume group cannot be varied on because
there are no good copies of the descriptor area.
Command: failed stdout: yes stderr: no
Before command completion, additional instructions may appear below.
0516-024 lqueryvg: Unable to open physical volume.
Either PV was not configured or could not be opened. Run
diagnostics.
0516-024 lqueryvg: Unable to open physical volume.
Either PV was not configured or could not be opened. Run
diagnostics.
0516-1140 importvg: Unable to read the volume group descriptor area
on specified physical volume.
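Cause of the problem: backvg is a shared volume group (not a concurrent one). Every day between 04:00 and 05:40
the database backs up to backvg, and each time a volume group is used the timestamps in its VGDA and VGSA
are updated. A colleague replaced the switch during that window, which left the two servers with inconsistent
VGDA information for the volume group, and that produced this error.
Solution:
Primary goal: get the standby node to release the processes holding the PVs and the volume group, so that I can manage the standby node's volume group information again.
For various reasons I had forcibly killed the related LVM commands. As a result those processes were taken over by the system (reparented to init, PPID 1 in the process list above) and could no longer be killed,
not even with kill -9.
At the time I saw two possible ways out:
1. Find some special method that can kill these processes.
2. Reboot the machine so that it releases all resources.
I asked a lot of people and spent a long time on Google, but found no way to kill those processes.
In the end I decided to reboot.
Because the two servers run HACMP, I decided to do the maintenance on site in the early hours of the 23rd, just in case, so that any problem could be handled on the spot.
My main worry was that a network adapter would go down and I would lose remote access.
Once at the site, I worked from the maintenance room outside (the machine room is freezing!! I stay out of it whenever I can).
My HACMP is configured in cascading mode with priorities: resources are taken over in priority order, and when the higher-priority node recovers it pulls its resources back.
Since I only planned to reboot the standby node, the primary would not be affected (I checked this with an experienced IBM engineer, many thanks).
Procedure:
Standby node:
Run the following command: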
# reboot
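Then I waited. In my experience a reboot takes about 5 minutes, but after more than 20 minutes the machine still was not up. Good thing I had come to the machine room. I went in and connected a monitor:
no response. The hardware showed no errors either, so I pressed the reset button. After a while the system came up. A quick check showed that backvg was fine: it could be
varied on, and lspv and lslv reported no problems. However, the primary node could no longer vary on this volume group. I also found a problem with statvg:
lsvg -l statvg showed question marks, although after varyon the file systems mounted and worked normally. To avoid trouble later, I
fixed it anyway.
This is usually caused by the volume group information in the ODM not matching the VGDA on the PVs. Exporting the volume group and importing it again is enough to fix it: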
exportvg statvg
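importvg -y statvg hdisk4 (name any PV that belongs to statvg, hdisk4 or hdisk5 in the lspv listing further down; alternatively use smit importvg)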
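After that, the problem was resolved!
The second problem was making backvg accessible from the primary node as well.
On the primary node,
clear the backvg information from the primary node's ODM: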
#exportvg backvg
Then run the following.
On the standby node:
# ls -l /dev/backvg
crw-rw---- 1 root system 53, 0 Nov 24 22:58 /dev/backvg
On the primary node:
#smit importvg
[Entry Fields]
VOLUME GROUP name [backvg]
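* PHYSICAL VOLUME name [hdisk6] + --- any PV that belongs to backvg
Volume Group MAJOR NUMBER [53] +# --- 53 is the major number of /dev/backvg on the standby node (from ls -l above); importing with the same value keeps the major number of backvg identical on both nodes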
Result: OK.
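The major number can also be read from the `ls -l` output and passed on the command line instead of through smit. A minimal sketch: the helper name vg_major is made up for illustration, `-V` is the importvg flag for the major number, and the AIX commands are left as comments because they have to be run by hand on the respective nodes.

```shell
# Extract the device major number from one line of `ls -l /dev/<vg>` output,
# e.g. "crw-rw---- 1 root system 53, 0 Nov 24 22:58 /dev/backvg".
vg_major() {
    awk '{ sub(/,$/, "", $5); print $5 }'   # field 5 is "53," -> strip the comma
}

# On the standby node:
#   ls -l /dev/backvg | vg_major
# On the primary node, with the number obtained above:
#   importvg -V 53 -y backvg hdisk6
```

Using the value read from the other node avoids typing the wrong number into the smit field.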
Finally, restart the HACMP software:
#smit clstart
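I checked errpt on both nodes for errors. There were none, so I took a full backup of the primary database.
This incident was a scare with no real damage, and the fix went quite smoothly.
In fact I had prepared several fallback plans for it:
1. Reboot the system; ideally the volume group is recognized afterwards --- and it was.
2. If it cannot be varied on, force the varyon: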
#varyonvg -f backvg
If that varies it on, fine; if not, rebuild backvg with recreatevg.
3. If recreatevg does not help either, the only option left is to delete backvg and create it again, which of course loses the data in it:
#smit mkvg
#smit mklv
#smit mkjfs
Note
The PVID is stored in three places.
In the ODM:
# lspv
hdisk0 00c64e4bd07d52a8 rootvg active
hdisk1 00c64e4bd0e61501 rootvg active
hdisk2 00c64e4bbdd3e449 dbvg
hdisk3 00c64e4bbdd3e75f dbvg
hdisk4 00c64e4bbdd91029 statvg active
hdisk5 00c64e4bbdd91370 statvg active
hdisk6 00c64e4bbddabdb3 backvg
hdisk7 00c64e4bbddac0e8 backvg
#
In the VGDA:
# lqueryvg -Atp hdisk6
Max LVs: 256
PP Size: 27
Free PPs: 5
LV count: 2
PV count: 2
Total VGDAs: 3
Conc Allowed: 0
MAX PPs per PV 32768
MAX PVs: 1024
Quorum Setting 1
Auto Varyon ?: 0
Conc Autovaryo 0
Varied on Conc 0
Logical: 00c64e4b00004c000000011dbddadf95.1 loglv02 1
00c64e4b00004c000000011dbddadf95.2 fslv07 1
Physical: 00c64e4bbddabdb3 2 0
00c64e4bbddac0e8 1 0
Total PPs: 2206
LTG size: 128
HOT SPARE: 0
AUTO SYNC: 0
VG PERMISSION: 0
SNAPSHOT VG: 0
IS_PRIMARY VG: 0
PSNFSTPP: 7168
VARYON MODE: 0
VG Type: 2
Max PPs: 32768
In the PV header:
# lquerypv -H /dev/hdisk6
00c64e4bbddabdb30000000000000000
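Since a mismatch between these copies is what makes a volume group misbehave, it can help to compare them mechanically. A rough sketch, assuming the output formats shown above; the helper names header_pvid and odm_pvid are invented for illustration.

```shell
# PVID as stored in the PV header: the first 16 hex digits of the
# `lquerypv -H /dev/hdiskN` output read from stdin.
header_pvid() {
    cut -c1-16
}

# PVID recorded in the ODM for disk $1: column 2 of the `lspv` listing on stdin.
odm_pvid() {
    awk -v disk="$1" '$1 == disk { print $2 }'
}

# Usage on a node:
#   a=$(lquerypv -H /dev/hdisk6 | header_pvid)
#   b=$(lspv | odm_pvid hdisk6)
#   [ "$a" = "$b" ] && echo "PVID consistent" || echo "PVID mismatch"
```

For hdisk6 in the listings above both values are 00c64e4bbddabdb3, which also appears under "Physical:" in the lqueryvg output.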