My last few cases were cluster related. It's funny, but some customers (and even some engineers) forget how to manage NBU once it's under cluster control. They also believe they can simply bring NBU up with the same old start/stop scripts while the cluster is having a problem. That won't work. As soon as we place NBU under cluster control it is managed by the cluster Agent, and the Agent's duties are not limited to start/stop/monitor routines.

Let's look at a possible scenario. We've upgraded both nodes and want to switch over to the other node as a check. The switchover fails. What do we do? Everyone knows that first of all we need to understand why the application won't start in the cluster: is it an NBU problem or a cluster issue? We can disable AutoFailover and switch the NBU service group to the problem node again (the commands are sketched after the short list below). It will end up on that node in the OFFLINE|FAULTED state. What's next?
1. Mount the shared volume.
2. Bring up the virtual IP.
3. Start NBU manually.
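For illustration, here is roughly how that whole attempt looks from the command line in my lab. This is only a sketch, not the Agent's own logic: the group name nbu, node nbu-node1, virtual IP 10.1.5.188 and mount point /opt/VRTSnbu are my lab values (you will see them again in NBU_RSP below), while the disk group nbudg, volume nbuvol, interface bge0 and netmask are just placeholders to replace with your own.

# haconf -makerw
# hagrp -modify nbu AutoFailOver 0
# haconf -dump -makero
# hagrp -clear nbu -sys nbu-node1
# hagrp -switch nbu -to nbu-node1
# hagrp -state nbu

Once the group has faulted on nbu-node1 we stay on that node and go through the three manual steps:

# vxdg import nbudg
# mount -F vxfs /dev/vx/dsk/nbudg/nbuvol /opt/VRTSnbu
# ifconfig bge0 addif 10.1.5.188 netmask 255.255.255.0 up
# /usr/openv/netbackup/bin/bp.start_all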
But NBU will not start.
# bp.start_all
Starting nbatd...
Starting vnetd...
Starting bpcd...
Starting nbftclnt...
NetBackup will not run without /usr/openv/db/bin/NB_dbsrv running.
Starting nbazd...
Starting nbevtmgr...
Starting nbaudit...
Starting spad...
Starting spoold...
Starting nbemm...
Starting nbrb...
Starting ltid...
Starting bprd...
Starting bpcompatd...
Starting nbjm...
Starting nbpem...
Starting nbstserv...
Starting nbrmms...
Starting nbkms...
Starting nbsl...
Starting nbars...
Starting bmrd...
Starting nbvault...
Starting nbsvcmon...
Starting bmrbd...
Is it a problem with configuration? No.
When we install NBU as a clustered master server, at the last step the installer calls a configuration script that guides us through service group setup and also creates the /usr/openv/netbackup/bin/cluster/NBU_RSP file, which keeps some cluster-related information. Let's have a look at the NBU_RSP file from my lab:
#DO NOT DELETE OR EDIT THIS FILE!!!
NBU_GROUP=nbu
SHARED_DISK=/opt/VRTSnbu
NODES=nbu-node1
VNAME=nbu-srv
VIRTUAL_IP=10.1.5.188
CLUTYPE=VCS
START_PROCS=NB_dbsrv nbevtmgr nbemm nbrb ltid vmd bpcompatd nbjm nbpem nbstserv nbrmms nbsl nbvault nbsvcmon bpdbm bprd bptm bpbrmds bpsched bpcd bpversion bpjobd nbproxy vltcore acsd tl8cd odld tldcd tl4d tlmd tshd rsmd tlhcd pbx_exchange nbkms nbaudit nbatd nbazd
PRODUCT_CODE=NBU
DIR=netbackup mkdir
DIR=netbackup/db mv
DIR=var mkdir
DIR=var/global mv
DIR=volmgr mkdir
DIR=volmgr/misc mkdir
DIR=volmgr/misc/robotic_db mv
DIR=kms mv
LINK=volmgr/misc/robotic_db
LINK=netbackup/db
LINK=var/global
PROBE_PROCS=nbevtmgr nbstserv vmd bprd bpdbm nbpem nbjm nbemm nbrb NB_dbsrv nbaudit
DIR=netbackup/vault mkdir
DIR=netbackup/vault/sessions mv
LINK=netbackup/vault/sessions
As we can see, there are some LINK records, and they are the root cause of the manual start problem. After a cluster installation some folders are just symlinks to folders on the shared disk, and normally it is the Agent that recreates them when it brings NBU online. But we're in the middle of troubleshooting, so we have to recreate them by hand; after that NBU will start normally.
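If you don't want to pick the links out by eye, one possible shortcut (just a sketch of the idea, assuming the shared volume is already mounted at /opt/VRTSnbu, the SHARED_DISK value from the file, and that nothing else is sitting at the link paths) is to read the LINK entries straight out of NBU_RSP:

# cd /usr/openv
# for l in `grep '^LINK=' netbackup/bin/cluster/NBU_RSP | cut -d= -f2`
> do
> ln -s /opt/VRTSnbu/$l $l
> done

In my file that covers exactly four links; with a different NBU version the list may differ, so trust your own NBU_RSP rather than my example.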
Which symlinks need to be recreated? This is how they look in my lab:
# ls -la /usr/openv/netbackup/
total 170
drwxr-xr-x 12 root bin 512 Mar 27 23:56 .
drwxr-xr-x 16 root bin 512 Mar 27 23:29 ..
drwxr-xr-x 3 root bin 512 Mar 27 23:13 baremetal
drwxr-xr-x 14 root bin 4096 Mar 27 23:57 bin
-rw-r--r-- 1 root root 295 Mar 27 23:56 bp.conf
drwxr-xr-x 16 root root 512 Mar 27 23:14 client
lrwxrwxrwx 1 root root 25 Mar 27 23:56 db -> /opt/VRTSnbu/netbackup/db
drwxr-xr-x 4 root bin 512 Mar 27 23:14 db.org
drwxr-xr-x 2 root bin 512 Mar 27 23:28 dbext
drwxr-xr-x 3 root bin 512 Mar 27 23:28 ext
drwxr-xr-x 6 root bin 512 Mar 27 23:28 help
drwxr-xr-x 5 root bin 512 Mar 27 23:52 logs
-rw-r--r-- 1 root bin 8957 Mar 27 23:29 nblog.conf
-rw-r--r-- 1 root bin 8957 Feb 4 2011 nblog.conf.template
-rw-r--r-- 1 root bin 1071 Feb 4 2011 nblu.conf.template
-rw-r--r-- 1 root root 1913 Mar 27 23:57 nbsvcmon.conf
drwxr-xr-x 4 root bin 512 Mar 27 23:28 sec
drwxr-xr-x 3 root bin 512 Mar 27 23:56 vault
-r--r--r-- 1 root bin 101 Feb 4 2011 version
-rw-r--r-- 1 root bin 20379 Feb 4 2011 vfm.conf
-r--r--r-- 1 root bin 25232 Feb 4 2011 vfm_master.conf

# ls -la /usr/openv/var/
total 26
drwxr-xr-x 6 root bin 512 Mar 27 23:57 .
drwxr-xr-x 16 root bin 512 Mar 27 23:29 ..
-rw-r--r-- 1 root root 11 Mar 27 23:56 clear_cache_time.txt
lrwxrwxrwx 1 root root 23 Mar 27 23:56 global -> /opt/VRTSnbu/var/global
drwxr-xr-x 3 root bin 512 Mar 27 23:14 global.org
drwxr-xr-x 11 root root 512 Mar 27 23:54 host_cache
-rw-r--r-- 1 root root 848 Mar 27 23:17 license.txt
-rw------- 1 root root 903 Mar 27 23:57 nbproxy_jm.ior
-rw------- 1 root root 903 Mar 27 23:57 nbproxy_pem.ior
-r--r--r-- 1 root bin 543 Feb 4 2011 resource_limits_template.xml
-rw-r--r-- 1 root root 11 Mar 27 23:56 startup_time.txt
drwx------ 4 root bin 1024 Mar 27 23:57 vnetd
drwxr-xr-x 5 root bin 512 Mar 27 23:47 vxss

# ls -la /usr/openv/netbackup/vault/
total 8
drwxr-xr-x 3 root bin 512 Mar 27 23:56 .
drwxr-xr-x 12 root bin 512 Mar 27 23:56 ..
lrwxrwxrwx 1 root root 37 Mar 27 23:56 sessions -> /opt/VRTSnbu/netbackup/vault/sessions
drwxr-xr-x 2 root bin 512 Mar 27 23:14 sessions.org

# ls -la /usr/openv/volmgr/misc/
total 12
drwxr-xr-x 3 root bin 512 Mar 28 00:43 .
drwxr-xr-x 6 root bin 512 Mar 27 23:56 ..
-rw-r--r-- 1 root root 0 Mar 28 00:43 .ltisymlinks
-r--r--r-- 1 root bin 340 Feb 4 2011 README
lrwxrwxrwx 1 root root 35 Mar 28 00:42 robotic_db -> /opt/VRTSnbu/volmgr/misc/robotic_db
drwxr-xr-x 2 root bin 512 Mar 27 23:14 robotic_db.org
-rw------- 1 root root 16 Mar 28 00:43 vmd.lock
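Once the links are back, a quick sanity check before starting NBU again (just my habit; the paths are the four link targets from the listings above): ls with -L follows the symlinks, so it complains immediately if any of them still points nowhere.

# ls -ldL /usr/openv/netbackup/db /usr/openv/var/global
# ls -ldL /usr/openv/netbackup/vault/sessions /usr/openv/volmgr/misc/robotic_db
# /usr/openv/netbackup/bin/bp.start_all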
Hope this saves someone's time.