
Thursday, June 21, 2007

/etc/oratab or /etc/rc.d/rc3.d/S99gcstartup?


A few days ago I updated Oracle Grid Control and its repository database on a RedHat AS4 server. Everything worked fine until the server was rebooted, when I had the “nice” surprise of seeing the repository database restarted under the old ORACLE_HOME. At this point it is worth reviewing the steps I had followed as part of the upgrade (10ag, 10gr and 10db below are local shell aliases which set the environment for the agent, OMS and database homes):


1. stop agent

10ag
emctl stop agent

2. stop opmn stack

10gr
$ORACLE_HOME/opmn/bin/opmnctl stopall

3. stop listener

10db
lsnrctl stop

4. stop database

sqlplus / as sysdba
shutdown immediate

5. backup agent, database, oms homes

cp -r /opt/oracle/product/10.2.0/agent /opt/oracle/backup/
cp -r /opt/oracle/product/10.2.0/db10g /opt/oracle/backup/
cp -r /opt/oracle/product/10.2.0/oms10g /opt/oracle/backup/

6. apply patch for 4329444

oracle@aut-vie-racman:kits$ unzip p4329444_10104_LINUX.zip
Archive: p4329444_10104_LINUX.zip
creating: 4329444/
creating: 4329444/files/
creating: 4329444/files/lib/
creating: 4329444/files/lib/libserver10.a/
inflating: 4329444/files/lib/libserver10.a/qerix.o
creating: 4329444/etc/
creating: 4329444/etc/config/
inflating: 4329444/etc/config/inventory
inflating: 4329444/etc/config/actions
creating: 4329444/etc/xml/
inflating: 4329444/etc/xml/GenericActions.xml
inflating: 4329444/etc/xml/ShiphomeDirectoryStructure.xml
inflating: 4329444/README.txt
oracle@aut-vie-racman:kits$ cd 4329444/
oracle@aut-vie-racman:4329444$ /opt/oracle/product/10.2.0/db10g/OPatch/opatch apply
Invoking OPatch 10.2.0.1.0
Oracle interim Patch Installer version 10.2.0.1.0
Copyright (c) 2005, Oracle Corporation. All rights reserved..
Oracle Home : /opt/oracle/product/10.2.0/db10g
Central Inventory : /opt/oracle/oraInventory
from : /opt/oracle/product/10.2.0/db10g/oraInst.loc
OPatch version : 10.2.0.1.0
OUI version : 10.2.0.1.0
OUI location : /opt/oracle/product/10.2.0/db10g/oui
Log file location : /opt/oracle/product/10.2.0/db10g/cfgtoollogs/opatch/opatch-2007_Jun_19_16-16-19-CEST_Tue.log
ApplySession applying interim patch '4329444' to OH
'/opt/oracle/product/10.2.0/db10g'
Invoking fuser to check for active processes.
Invoking fuser on "/opt/oracle/product/10.2.0/db10g/bin/oracle"
OPatch detected non-cluster Oracle Home from the inventory and will patch the local
system only.
Please shutdown Oracle instances running out of this ORACLE_HOME on the local
system.
(Oracle Home = '/opt/oracle/product/10.2.0/db10g')
Is the local system ready for patching?
Do you want to proceed? [y|n]
y
User Responded with: Y
Backing up files and inventory (not for auto-rollback) for the Oracle Home
Backing up files affected by the patch '4329444' for restore. This might take a
while...
Backing up files affected by the patch '4329444' for rollback. This might take a
while...
Patching component oracle.rdbms, 10.1.0.4.0...
Updating archive file "/opt/oracle/product/10.2.0/db10g/lib/libserver10.a" with
"lib/libserver10.a/qerix.o"
Running make for target ioracle
ApplySession adding interim patch '4329444' to inventory
The local system has been patched and can be restarted.
OPatch succeeded.

7. Start db listener

oracle@aut-vie-racman:4329444$ 10db
oracle@aut-vie-racman:4329444$ lsnrctl start
LSNRCTL for Linux: Version 10.1.0.4.0 - Production on 19-JUN-2007 16:18:42
Copyright (c) 1991, 2004, Oracle. All rights reserved.
Starting /opt/oracle/product/10.2.0/db10g/bin/tnslsnr: please wait...
TNSLSNR for Linux: Version 10.1.0.4.0 - Production
System parameter file is /opt/oracle/product/10.2.0/db10g/network/admin/listener.ora
Log messages written to /opt/oracle/product/10.2.0/db10g/network/log/listener.log
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC)))
Listening on: (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=aut-vie-racman.bankgutmann.co.at)(PORT=1521)))
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC)))
STATUS of the LISTENER
------------------------
Alias LISTENER
Version TNSLSNR for Linux: Version 10.1.0.4.0 - Production
Start Date 19-JUN-2007 16:18:43
Uptime 0 days 0 hr. 0 min. 0 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File
/opt/oracle/product/10.2.0/db10g/network/admin/listener.ora
Listener Log File /opt/oracle/product/10.2.0/db10g/network/log/listener.log
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=EXTPROC)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=aut-vie-racman.bankgutmann.co.at)(PORT=1521)))
Services Summary...
Service "PLSExtProc" has 1 instance(s).
Instance "PLSExtProc", status UNKNOWN, has 1 handler(s) for this service...
The command completed successfully

8. start emrep database

oracle@aut-vie-racman:4329444$ sqlplus / as sysdba
SQL*Plus: Release 10.1.0.4.0 - Production on Tue Jun 19 16:20:27 2007
Copyright (c) 1982, 2005, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup
ORACLE instance started.
Total System Global Area 536870912 bytes
Fixed Size 780056 bytes
Variable Size 275781864 bytes
Database Buffers 260046848 bytes
Redo Buffers 262144 bytes
Database mounted.
Database opened.

9. Patch OMS.
10. Patch agent
11. Start agent

10ag
emctl start agent

12. Start oms

10gr
$ORACLE_HOME/opmn/bin/opmnctl startall


Grid Control was patched and worked just fine. The only problem was that the repository database remained at version 10.1.0.4.0, so I decided to upgrade it too, following the steps below:


1. stop agent
2. stop oms
3. stop database
4. stop listener
5. install 10.2.0.1 Oracle software into a new home
6. upgrade to 10.2.0.3 the Oracle software
7. upgrade the database using “dbua”
8. start oms
9. start agent
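Condensed into commands, the sequence above looks roughly like this. This is a sketch only: the OUI and DBUA steps are interactive, the path of the new home is not shown, and 10ag/10gr/10db are the same local aliases used earlier in this post.

```shell
# 1-2: stop the agent and the OMS stack
10ag; emctl stop agent
10gr; $ORACLE_HOME/opmn/bin/opmnctl stopall

# 3-4: stop the repository database and the listener
10db
sqlplus / as sysdba <<EOF
shutdown immediate
EOF
lsnrctl stop

# 5-6: install 10.2.0.1 into a new home, then apply the 10.2.0.3
# patch set to it (both interactive OUI sessions)
./runInstaller

# 7: upgrade the repository database (interactive DBUA session)
dbua

# 8-9: restart the OMS and the agent
10gr; $ORACLE_HOME/opmn/bin/opmnctl startall
10ag; emctl start agent
```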


Everything was normal until today, when we had to shut down the server for maintenance. Unfortunately, on reboot the repository database was again started under the previous 10.1.0.4 Oracle home and, of course, OEM could not be used. I had a look at the /etc/oratab file and the entries were fine: the database was supposed to be started from the new, upgraded ORACLE_HOME! Furthermore, changing /etc/oratab seemed to have no effect.

In the end I found the reason. Under the /etc/rc.d/rc3.d/ directory there is a script called S99gcstartup, which calls the startup scripts for the repository database, the OMS and the agent. It is enough to copy the database startup script under the new ORACLE_HOME directory and to change the ORACLE_HOME variable declared in that script. After these changes the repository database is started with the correct ORACLE_HOME on every reboot.
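The fix can be sketched as below. The new home path and script name are assumptions (your S99gcstartup layout may differ), and for illustration the old startup script is simulated with a temporary file; in real life you would copy the existing script under the new home and edit that copy.

```shell
#!/bin/sh
# Hypothetical new ORACLE_HOME -- adapt to your installation.
NEW_HOME=/opt/oracle/product/10.2.0/db10g_2

# Simulate the database startup script that S99gcstartup calls
# (hypothetical name; normally you would cp the real script here).
cat > /tmp/gcstartup_db.sh <<'EOF'
ORACLE_HOME=/opt/oracle/product/10.2.0/db10g
EOF

# Point the script's ORACLE_HOME variable at the new home.
sed -i "s|^ORACLE_HOME=.*|ORACLE_HOME=$NEW_HOME|" /tmp/gcstartup_db.sh

grep '^ORACLE_HOME=' /tmp/gcstartup_db.sh
# prints ORACLE_HOME=/opt/oracle/product/10.2.0/db10g_2
```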

Friday, June 08, 2007

Losing One Voting Disk

Voting disks are used in a RAC configuration for maintaining node membership. They are critical pieces of a cluster configuration. Starting with Oracle 10gR2 it is possible to mirror the OCR and the voting disks. Using the default mirroring template, the minimum number of voting disks necessary for normal functioning is two.
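As a sketch, mirrors are managed with crsctl; these commands only make sense on a node running Oracle Clusterware (run as root), and in 10gR2 adding or deleting a voting disk normally requires the CRS stack to be down plus the -force flag. The device path is the one used in the scenario below.

```shell
# List the currently configured voting disks.
crsctl query css votedisk

# Add a voting disk mirror (CRS stack down, as root).
crsctl add css votedisk /dev/raw/raw3 -force

# A mirror can be removed the same way.
crsctl delete css votedisk /dev/raw/raw3 -force
```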

Scenario Setup

This scenario simulates the crash of one voting disk through the following steps:

  1. identify the voting disks:

crsctl query css votedisk

0. 0 /dev/raw/raw1

1. 0 /dev/raw/raw2

2. 0 /dev/raw/raw3

  2. corrupt one of the voting disks (as root):

    dd if=/dev/zero of=/dev/raw/raw3 bs=1M

Recoverability Steps

  1. check the “$CRS_HOME/log/[hostname]/alert[hostname].log” file. A message like the one below should be written there, allowing us to determine which voting disk became corrupted:

    [cssd(9120)]CRS-1604:CSSD voting file is offline: /opt/oracle/product/10.2.0/crs_1/Voting1. Details in /opt/oracle/product/10.2.0/crs_1/log/aut-arz-ractest1/cssd/ocssd.log.

  2. According to the listing above, Voting1 is the corrupted disk. Shut down the CRS stack:

    srvctl stop database -d fitstest -o immediate

    srvctl stop asm -n aut-vie-ractest1

    srvctl stop asm -n aut-arz-ractest1

    srvctl stop nodeapps -n aut-vie-ractest1

    srvctl stop nodeapps -n aut-arz-ractest1

    crs_stat -t

    On every node as root:

    crsctl stop crs

  3. Pick a good voting disk from the remaining ones and copy it over the corrupted one:

    dd if=/dev/raw/raw2 of=/dev/raw/raw3 bs=1M

  4. Start CRS (on every node as root):

      crsctl start crs

  5. Check log file “$CRS_HOME/log/[hostname]/alert[hostname].log”. It should look like shown below:

    [cssd(14463)]CRS-1601:CSSD Reconfiguration complete. Active nodes are aut-vie-ractest1 aut-arz-ractest1 .

    2007-05-31 15:19:53.954

    [crsd(14268)]CRS-1012:The OCR service started on node aut-vie-ractest1.

    2007-05-31 15:19:53.987

    [evmd(14228)]CRS-1401:EVMD started on node aut-vie-ractest1.

    2007-05-31 15:19:55.861 [crsd(14268)]CRS-1201:CRSD started on node aut-vie-ractest1.

  6. After a couple of minutes check the status of the whole CRS stack:

    [oracle@aut-vie-ractest1 ~]$ crs_stat -t

    Name Type Target State Host

    ------------------------------------------------------------

    ora....SM2.asm application ONLINE ONLINE aut-...est1

    ora....T1.lsnr application ONLINE ONLINE aut-...est1

    ora....st1.gsd application ONLINE ONLINE aut-...est1

    ora....st1.ons application ONLINE ONLINE aut-...est1

    ora....st1.vip application ONLINE ONLINE aut-...est1

    ora....SM1.asm application ONLINE ONLINE aut-...est1

    ora....T1.lsnr application ONLINE ONLINE aut-...est1

    ora....st1.gsd application ONLINE ONLINE aut-...est1

    ora....st1.ons application ONLINE ONLINE aut-...est1

    ora....st1.vip application ONLINE ONLINE aut-...est1

    ora....test.db application ONLINE ONLINE aut-...est1

    ora....t1.inst application ONLINE ONLINE aut-...est1

    ora....t2.inst application ONLINE ONLINE aut-...est1


Note: It is also possible to recover a lost voting disk from an old voting disk backup, and the “dd” copy can even be performed without shutting down the CRS stack.
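A minimal sketch of such a backup and restore with dd follows. The real device and backup locations are assumptions, and for illustration the variables point at temporary files so the commands can be run end to end; on a real cluster VOTEDISK would be the raw device (e.g. /dev/raw/raw3).

```shell
#!/bin/sh
# Hypothetical locations; on a real system VOTEDISK=/dev/raw/raw3 and
# BACKUP would live somewhere safe like /opt/oracle/backup/.
VOTEDISK=/tmp/votedisk_demo
BACKUP=/tmp/votedisk_demo.bak

# Create a fake "voting disk" so the sketch is runnable end to end.
dd if=/dev/zero of=$VOTEDISK bs=1k count=4 2>/dev/null

# Back up the voting disk while the cluster is healthy.
dd if=$VOTEDISK of=$BACKUP bs=1M 2>/dev/null

# After a corruption, restore from the backup and verify the copy.
dd if=$BACKUP of=$VOTEDISK bs=1M 2>/dev/null
cmp -s $VOTEDISK $BACKUP && echo "restore verified"
# prints restore verified
```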