Some Considerations for the RAC Installation

Troubleshooting
Confirm the RAC Node Name is Not Listed in Loopback Address
Ensure that the node names (linux1 or linux2) are not included for the loopback address in the /etc/hosts file. If the machine name is listed in the loopback address entry as below:
127.0.0.1 linux1 localhost.localdomain localhost
it will need to be removed as shown below:
127.0.0.1 localhost.localdomain localhost
If the RAC node name is listed for the loopback address, you will receive the following error during the RAC installation:
ORA-00603: ORACLE server session terminated by fatal error
or
ORA-29702: error occurred in Cluster Group Service operation
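As a quick sanity check, the rule above can be scripted. The following is a minimal sketch; the check_loopback helper and the sample /etc/hosts lines are illustrative (not part of any Oracle tooling), and on a real node you would run the grep against the actual /etc/hosts and hostname:

```shell
# check_loopback: print a WARNING if the given node name appears on the
# 127.0.0.1 line of the supplied hosts-file content. Hypothetical helper
# for illustration only.
check_loopback() {
    node="$1"
    hosts="$2"
    if printf '%s\n' "$hosts" | grep -E "^127\.0\.0\.1[[:space:]].*${node}" >/dev/null; then
        echo "WARNING: ${node} is listed on the loopback line -- remove it"
    else
        echo "OK: loopback line does not contain ${node}"
    fi
}

# A bad entry (node name on the loopback line) and a good one:
check_loopback linux1 "127.0.0.1 linux1 localhost.localdomain localhost"
check_loopback linux1 "127.0.0.1 localhost.localdomain localhost"
```

On a live node, the equivalent one-off check is simply `grep "^127\.0\.0\.1" /etc/hosts` followed by a visual inspection for the node name.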
Confirm localhost is defined in the /etc/hosts file for the loopback address
Ensure that entries for localhost.localdomain and localhost are included for the loopback address in the /etc/hosts file on each of the Oracle RAC nodes:
127.0.0.1        localhost.localdomain localhost
If an entry does not exist for localhost in the /etc/hosts file, Oracle Clusterware will be unable to start the application resources — notably the ONS process. The error would indicate "Failed to get IP for localhost" and will be written to the log file for ONS. For example:
CRS-0215 could not start resource 'ora.linux1.ons'. Check log file
"/u01/app/crs/log/linux1/racg/ora.linux1.ons.log"
for more details.
The ONS log file will contain lines similar to the following:
Oracle Database 10g CRS Release 10.2.0.1.0 Production Copyright 1996, 2005 Oracle. All rights reserved.
2007-04-14 13:10:02.729: [ RACG][3086871296][13316][3086871296][ora.linux1.ons]: Failed to get IP for localhost (1)
Failed to get IP for localhost (1)
Failed to get IP for localhost (1)
onsctl: ons failed to start
...
Setting the Correct Date and Time on All Cluster Nodes
During the installation of Oracle Clusterware, the Database, and the Companion CD, the Oracle Universal Installer (OUI) first installs the software to the local node running the installer (i.e. linux1). The software is then copied remotely to all of the remaining nodes in the cluster (i.e. linux2). During the remote copy process, the OUI will execute the UNIX "tar" command on each of the remote nodes to extract the files that were archived and copied over. If the date and time on the node performing the install is greater than that of the node it is copying to, the OUI will throw an error from the "tar" command indicating it is attempting to extract files stamped with a time in the future:
Error while copying directory 
    /u01/app/crs with exclude file list 'null' to nodes 'linux2'.
[PRKC-1002 : All the submitted commands did not execute successfully]
---------------------------------------------
linux2:
   /bin/tar: ./bin/lsnodes: time stamp 2009-07-28 09:21:34 is 735 s in the future
   /bin/tar: ./bin/olsnodes: time stamp 2009-07-28 09:21:34 is 735 s in the future
   ...(more errors on this node)
Please note that although this would seem like a severe error from the OUI, it can safely be disregarded as a warning. The "tar" command DOES actually extract the files; however, when you perform a listing of the files (using ls -l) on the remote node, they will be missing the time field until the time on the server is greater than the timestamp of the file.
Before starting any of the above noted installations, ensure that each member node of the cluster is set as closely as possible to the same date and time. Oracle strongly recommends using the Network Time Protocol feature of most operating systems for this purpose, with both Oracle RAC nodes using the same reference Network Time Protocol server.
Accessing a Network Time Protocol server, however, may not always be an option. In this case, when manually setting the date and time for the nodes in the cluster, ensure that the date and time of the node you are performing the software installations from (linux1) is earlier than that of all other nodes in the cluster (linux2). I generally use a 20 second difference as shown in the following example:
Setting the date and time from linux1:
# date -s "7/28/2009 23:00:00"
Setting the date and time from linux2:
# date -s "7/28/2009 23:00:20"
The two-node RAC configuration described in this article does not make use of a Network Time Protocol server.
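To verify the skew before starting the installer, you can compare the epoch time (`date +%s`) on both nodes. The sketch below simply reuses the two example timestamps above; on a real cluster the two values would be collected from each node (for example over ssh), and MAX_SKEW is my own illustrative limit, not an Oracle-mandated value:

```shell
# Illustrative skew check between the two example clock settings above.
MAX_SKEW=20    # assumption: the ~20 second difference used in this article

T_LINUX1=$(date -d "2009-07-28 23:00:00" +%s)   # installer node
T_LINUX2=$(date -d "2009-07-28 23:00:20" +%s)   # remote node

SKEW=$((T_LINUX2 - T_LINUX1))
if [ "$SKEW" -lt 0 ]; then SKEW=$((-SKEW)); fi
echo "Clock skew between nodes: ${SKEW} seconds"

# The node running the installer should be behind (or equal to) the others,
# so that "tar" never sees timestamps from the future.
if [ "$T_LINUX1" -le "$T_LINUX2" ]; then
    echo "OK: installer node clock is not ahead of the remote node"
fi
```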
Openfiler - Logical Volumes Not Active on Boot
One issue that I have run into several times occurs when using a USB drive connected to the Openfiler server. When the Openfiler server is rebooted, the system is able to recognize the USB drive; however, it is not able to load the logical volumes and writes the following message to /var/log/messages (also available through dmesg):
iSCSI Enterprise Target Software - version 0.4.14
iotype_init(91) register fileio
iotype_init(91) register blockio
iotype_init(91) register nullio
open_path(120) Can't open /dev/rac1/crs -2
fileio_attach(268) -2
open_path(120) Can't open /dev/rac1/asm1 -2
fileio_attach(268) -2
open_path(120) Can't open /dev/rac1/asm2 -2
fileio_attach(268) -2
open_path(120) Can't open /dev/rac1/asm3 -2
fileio_attach(268) -2
open_path(120) Can't open /dev/rac1/asm4 -2
fileio_attach(268) -2
Please note that I am not suggesting that this only occurs with USB drives connected to the Openfiler server. It may occur with other types of drives; however, I have only seen it with USB drives!
If you do receive this error, you should first check the status of all logical volumes using the lvscan command from the Openfiler server:
# lvscan
inactive          '/dev/rac1/crs' [2.00 GB] inherit
inactive          '/dev/rac1/asm1' [115.94 GB] inherit
inactive          '/dev/rac1/asm2' [115.94 GB] inherit
inactive          '/dev/rac1/asm3' [115.94 GB] inherit
inactive          '/dev/rac1/asm4' [115.94 GB] inherit
Notice that the status for each of the logical volumes is set to inactive - (the status for each logical volume on a working system would be set to ACTIVE).
I currently know of two methods to get Openfiler to automatically load the logical volumes on reboot, both of which are described below.
Method 1
The first step is to shut down both of the Oracle RAC nodes in the cluster (linux1 and linux2). Then, from the Openfiler server, manually set each of the logical volumes to ACTIVE. Note that with this method, this must be repeated after every reboot:
# lvchange -a y /dev/rac1/crs
# lvchange -a y /dev/rac1/asm1
# lvchange -a y /dev/rac1/asm2
# lvchange -a y /dev/rac1/asm3
# lvchange -a y /dev/rac1/asm4
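The five lvchange commands above can also be written as a loop. This is just a sketch using this article's /dev/rac1 volume names; the DRY_RUN=echo prefix prints the commands instead of executing them, so remove it (or set DRY_RUN to empty) on the real Openfiler server:

```shell
# Loop form of the per-volume lvchange commands. With DRY_RUN=echo this
# only prints what would be run -- safe to try anywhere.
DRY_RUN=echo
for lv in crs asm1 asm2 asm3 asm4; do
    $DRY_RUN lvchange -a y "/dev/rac1/${lv}"
done
```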
Another method to set the status to active for all logical volumes is to use the Volume Group change command as follows:
# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "rac1" using metadata type lvm2

# vgchange -ay
  5 logical volume(s) in volume group "rac1" now active
After setting each of the logical volumes to active, use the lvscan command again to verify the status:
# lvscan
  ACTIVE            '/dev/rac1/crs' [2.00 GB] inherit
  ACTIVE            '/dev/rac1/asm1' [115.94 GB] inherit
  ACTIVE            '/dev/rac1/asm2' [115.94 GB] inherit
  ACTIVE            '/dev/rac1/asm3' [115.94 GB] inherit
  ACTIVE            '/dev/rac1/asm4' [115.94 GB] inherit
As a final test, reboot the Openfiler server to ensure each of the logical volumes will be set to ACTIVE after the boot process. After you have verified that each of the logical volumes will be active on boot, check that the iSCSI target service is running:
# service iscsi-target status
ietd (pid 2668) is running...
Finally, restart each of the Oracle RAC nodes in the cluster - (linux1 and linux2).
Method 2
This method was kindly provided by Martin Jones. His workaround includes amending the /etc/rc.sysinit script to basically wait for the USB disk (/dev/sda in my example) to be detected. After making the changes to the /etc/rc.sysinit script (described below), verify the external drives are powered on and then reboot the Openfiler server.
The following is a small portion of the /etc/rc.sysinit script on the Openfiler server with the changes proposed by Martin (delimited by the "MJONES - Customisation" comment markers):
..............................................................
# LVM2 initialization, take 2
        if [ -c /dev/mapper/control ]; then
                if [ -x /sbin/multipath.static ]; then
                        modprobe dm-multipath >/dev/null 2>&1
                        /sbin/multipath.static -v 0
                        if [ -x /sbin/kpartx ]; then
                                /sbin/dmsetup ls --target multipath --exec "/sbin/kpartx -a"
                        fi
                fi
 

                if [ -x /sbin/dmraid ]; then
                        modprobe dm-mirror > /dev/null 2>&1
                        /sbin/dmraid -i -a y
                fi

#-----
#-----  MJONES - Customisation Start
#-----

       # Check if /dev/sda is ready
         while [ ! -e /dev/sda ]
         do
             echo "Device /dev/sda for first USB Drive is not yet ready."
             echo "Waiting..."
             sleep 5
         done
         echo "INFO - Device /dev/sda for first USB Drive is ready."

#-----
#-----  MJONES - Customisation END
#-----
                if [ -x /sbin/lvm.static ]; then
                        if /sbin/lvm.static vgscan > /dev/null 2>&1 ; then
                                action $"Setting up Logical Volume Management:" /sbin/lvm.static vgscan --mknodes --ignorelockingfailure && /sbin/lvm.static vgchange -a y --ignorelockingfailure
                        fi
                fi
        fi
 

# Clean up SELinux labels
if [ -n "$SELINUX" ]; then
   for file in /etc/mtab /etc/ld.so.cache ; do
      [ -r $file ] && restorecon $file  >/dev/null 2>&1
   done
fi
..............................................................
Finally, restart each of the Oracle RAC nodes in the cluster - (linux1 and linux2).
OCFS2 - o2cb_ctl: Unable to access cluster service while creating node
While configuring the nodes for OCFS2 using ocfs2console, it is possible to run into the error:
o2cb_ctl: Unable to access cluster service while creating node
This error does not show up when you start ocfs2console for the first time. It comes up when there is a problem with the cluster configuration, or if you did not save the cluster configuration initially while setting it up with ocfs2console. This is a bug!
The work-around is to exit from ocfs2console, unload the o2cb module, and remove the ocfs2 cluster configuration file /etc/ocfs2/cluster.conf. I also like to remove the /config directory. After removing the ocfs2 cluster configuration file, restart the ocfs2console program.
For example:
# /etc/init.d/o2cb offline ocfs2
# /etc/init.d/o2cb unload
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK

# rm -f /etc/ocfs2/cluster.conf
# rm -rf /config

# ocfs2console &
This time, it will add the nodes!
OCFS2 - Adjusting the O2CB Heartbeat Threshold
With previous versions of this article (using FireWire as opposed to iSCSI for the shared storage), I was able to install and configure OCFS2, format the new volume, and finally install Oracle Clusterware (with its two required shared files, the voting disk and OCR file, located on the new OCFS2 volume). While I was able to install Oracle Clusterware and see the shared drive using FireWire, I was receiving many lock-ups and hangs after about 15 minutes when the Clusterware software was running on both nodes. It always varied which node would hang (either linux1 or linux2 in my example). It also didn't matter whether there was a high I/O load or none at all for it to crash (hang).
After looking through the trace files for OCFS2, it was apparent that access to the voting disk was too slow (exceeding the O2CB heartbeat threshold) and causing the Oracle Clusterware software (and the node) to crash. On the console would be a message similar to the following:
...
Index 0: took 0 ms to do submit_bio for read
Index 1: took 3 ms to do waiting for read completion
Index 2: took 0 ms to do bio alloc write
Index 3: took 0 ms to do bio add page write
Index 4: took 0 ms to do submit_bio for write
Index 5: took 0 ms to do checking slots
Index 6: took 4 ms to do waiting for write completion
Index 7: took 1993 ms to do msleep
Index 8: took 0 ms to do allocating bios for read
Index 9: took 0 ms to do bio alloc read
Index 10: took 0 ms to do bio add page read
Index 11: took 0 ms to do submit_bio for read
Index 12: took 10006 ms to do waiting for read completion
(13,3):o2hb_stop_all_regions:1888 ERROR: stopping heartbeat on all active regions.
Kernel panic - not syncing: ocfs2 is very sorry to be fencing this system by panicing
The solution I used was to increase the O2CB heartbeat threshold from its default value of 31 (which used to be 7 in previous versions of OCFS2), to a value of 61. Some setups may require an even higher setting. This is a configurable parameter that is used to compute the time it takes for a node to "fence" itself. During the installation and configuration of OCFS2, we adjusted this value in the section "Configure O2CB to Start on Boot and Adjust O2CB Heartbeat Threshold". If you encounter a kernel panic from OCFS2 and need to increase the heartbeat threshold, use the same procedures described in the section "Configure O2CB to Start on Boot and Adjust O2CB Heartbeat Threshold".
The following describes how to manually adjust the O2CB heartbeat threshold.
First, let's see how to determine what the O2CB heartbeat threshold is currently set to. This can be done by querying the /proc file system as follows:
# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
31
We see that the value is 31, but what does this value represent? Well, it is used in the formula below to determine the fence time (in seconds):
[fence time in seconds] = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2
So, with an O2CB heartbeat threshold of 31, we would have a fence time of:
(31 - 1) * 2 = 60 seconds
If we want a larger threshold (say 120 seconds), we would need to adjust O2CB_HEARTBEAT_THRESHOLD to 61 as shown below:
(61 - 1) * 2 = 120 seconds
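The arithmetic above is easy to script. The helper names below (fence_seconds, threshold_for) are my own for illustration; the formulas are exactly the ones given in this section:

```shell
# fence_seconds: fence time in seconds for a given O2CB_HEARTBEAT_THRESHOLD
#   [fence time] = (threshold - 1) * 2
fence_seconds() { echo $(( ($1 - 1) * 2 )); }

# threshold_for: the inverse -- threshold needed for a desired fence time
#   threshold = desired_seconds / 2 + 1
threshold_for() { echo $(( $1 / 2 + 1 )); }

echo "threshold 31 -> $(fence_seconds 31) s"     # the default: 60 seconds
echo "threshold 61 -> $(fence_seconds 61) s"     # 120 seconds
echo "a 120 s fence time needs threshold $(threshold_for 120)"
```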
Let's see now how to manually increase the O2CB heartbeat threshold from 31 to 61. This task will need to be performed on all Oracle RAC nodes in the cluster. We first need to modify the file /etc/sysconfig/o2cb and set O2CB_HEARTBEAT_THRESHOLD to 61:
#
# This is a configuration file for automatic startup of the O2CB
# driver.  It is generated by running /etc/init.d/o2cb configure.
# Please use that method to modify this file
#

# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=61

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000
After modifying the file /etc/sysconfig/o2cb, we need to alter the o2cb configuration. Again, this should be performed on all Oracle RAC nodes in the cluster.
# umount /u02
# /etc/init.d/o2cb offline ocfs2
# /etc/init.d/o2cb unload
# /etc/init.d/o2cb configure
Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot.  The current values will be shown in brackets ('[]').  Hitting
<ENTER> without typing an answer will keep that current value.  Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [n]: y
Cluster to start on boot (Enter "none" to clear) [ocfs2]: ocfs2
Specify heartbeat dead threshold (>=7) [31]: 61
Specify network idle timeout in ms (>=5000) [30000]: 30000
Specify network keepalive delay in ms (>=1000) [2000]: 2000
Specify network reconnect delay in ms (>=2000) [2000]: 2000
Writing O2CB configuration: OK
Loading module "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading module "ocfs2_nodemanager": OK
Loading module "ocfs2_dlm": OK
Loading module "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster ocfs2: OK
We can now check again to make sure the settings took effect for the o2cb cluster stack:
# cat /proc/fs/ocfs2_nodemanager/hb_dead_threshold
61
It is important to note that the value of 61 I used for the O2CB heartbeat threshold may not work for all configurations. In some cases, the O2CB heartbeat threshold value may have to be increased to as high as 601 in order to prevent OCFS2 from panicking the kernel.
Oracle Clusterware Installation: Running root.sh Fails on the Last Node
After the Oracle Clusterware install process, running root.sh on the last node will fail while attempting to configure vipca at the end of the script:
Oracle CRS stack installed and running under init(1M)
Running vipca(silent) for configuring nodeapps
/u01/app/crs/jdk/jre//bin/java: error while loading
shared libraries: libpthread.so.0: 
cannot open shared object file: No such file or directory
After receiving this error, please leave the OUI up. Do not hit the OK button on the "Execute Configuration Scripts" dialog until all of the issues described in this section have been resolved.
Note that srvctl will produce similar output until the workaround described below is performed.
This error occurs because these releases of the Linux kernel fix an old bug in Linux threading that Oracle worked around using LD_ASSUME_KERNEL settings in both vipca and srvctl. That workaround is no longer valid on OEL5, RHEL5, or SLES10, hence the failures.
To workaround this issue, edit vipca (in the CRS bin directory on all nodes) to undo the setting of LD_ASSUME_KERNEL. After the IF statement around line 120, add an unset command to ensure LD_ASSUME_KERNEL is not set as follows:
if [ "$arch" = "i686" -o "$arch" = "ia64" ]
then
    LD_ASSUME_KERNEL=2.4.19
    export LD_ASSUME_KERNEL
fi

unset LD_ASSUME_KERNEL    <<== Line to be added

Similarly for srvctl (in both the CRS and, when installed, the RDBMS and ASM bin directories on all nodes), unset LD_ASSUME_KERNEL by adding one line; the code around line 168 should look like this:
LD_ASSUME_KERNEL=2.4.19
export LD_ASSUME_KERNEL

unset LD_ASSUME_KERNEL    <<== Line to be added

Note: Remember to re-edit these files on all nodes:
<CRS_HOME>/bin/vipca
<CRS_HOME>/bin/srvctl
<RDBMS_HOME>/bin/srvctl
<ASM_HOME>/bin/srvctl    # (if exists)
after applying the 10.2.0.2 or 10.2.0.3 patch sets, as these patch sets will still include those settings, which are unnecessary on OEL5, RHEL5, and SLES10. This issue was raised with development and is fixed in the 10.2.0.4 patch set.
Also note that we are explicitly unsetting LD_ASSUME_KERNEL and not merely commenting out its setting to handle a case where the user has it set in their environment (login shell).
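If you prefer to script the edit rather than making it by hand, a sed one-liner can append the unset line immediately after the export. The sketch below works on a throwaway copy in /tmp; on a real node you would point it at the actual vipca and srvctl files on each node (and back them up first):

```shell
# Build a tiny stand-in for the relevant srvctl lines (illustrative copy).
cat > /tmp/srvctl.demo <<'EOF'
LD_ASSUME_KERNEL=2.4.19
export LD_ASSUME_KERNEL
EOF

# Append "unset LD_ASSUME_KERNEL" after the export line (GNU sed).
sed -i '/^export LD_ASSUME_KERNEL/a unset LD_ASSUME_KERNEL' /tmp/srvctl.demo

cat /tmp/srvctl.demo
```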
After working around the LD_ASSUME_KERNEL issue above, vipca will now fail to run with the following error if the VIP IPs are in a non-routable range [10.x.x.x, 172.(16-31).x.x, or 192.168.x.x]:
[root@linux2 ~]# $ORA_CRS_HOME/bin/vipca
Error 0(Native: listNetInterfaces:[3])
  [Error 0(Native: listNetInterfaces:[3])]
There are several ways to work around this issue. The goal of this workaround is to get the output of "$ORA_CRS_HOME/bin/oifcfg getif" to include both public and cluster_interconnect interfaces. If you try to run the above command, you will notice that it returns nothing, which means we have some work to do!
The first step is to identify the current interfaces and IP addresses:
[root@linux2 ~]# $ORA_CRS_HOME/bin/oifcfg iflist
eth1  192.168.2.0
eth0  192.168.1.0
Remember during the Oracle Clusterware install that 192.168.1.0 is my public interface while 192.168.2.0 is the cluster_interconnect interface.
Using this information, we can manually set the public / private interfaces accordingly using the setif option of the $ORA_CRS_HOME/bin/oifcfg command:
# $ORA_CRS_HOME/bin/oifcfg setif -global eth0/192.168.1.0:public
# $ORA_CRS_HOME/bin/oifcfg setif -global eth1/192.168.2.0:cluster_interconnect
Let's now run the "$ORA_CRS_HOME/bin/oifcfg getif" command again to verify its output:
[root@linux2 ~]# $ORA_CRS_HOME/bin/oifcfg getif
eth0  192.168.1.0  global  public
eth1  192.168.2.0  global  cluster_interconnect
After resolving all of the issues above, manually re-run vipca (GUI) as root from the last node on which the errors occurred. Please keep in mind that vipca is a GUI, so you will need to set your DISPLAY variable to point to your X server:
$ORA_CRS_HOME/bin/vipca
