Performance problems - what do you do, and what do you need to get help? By Erwin van Londen
So you've got everything calculated, done your due diligence on capacity and the number of hosts/HBAs, made a best guess at the IO profiles, installed and configured your hosts, switches, array etc., and everything is running pretty smoothly. Your boss compliments you on your work, users are pretty satisfied with response times and your management software hasn't reported any problems.
Then at 03:00 on the dog-watch you get a phone call from one of your helpdesk colleagues: performance is dropping, users in the other time-zone are getting intermittent errors when sending emails and opening files, and retrieving database records feels like being a little child waiting for Santa to bring him a present on Christmas Eve while it's only November 30th. Ohh, did I forget to mention that the end-of-year financial batch run is about to start, and that the CFO is expecting a report on his desk at 09:00?
This generally means one (or more) of three things:
1. You've run into a software/hardware problem you didn't know about, or
2. There has been a change in the infrastructure w.r.t. physical or logical settings, or
3. Someone made a change in the IO profiles.
So now what?
You get dressed and rush to your workplace (if you're lucky that's just behind your bed because you have remote access). No, there's no time for brushing your teeth or a shower. You start looking around: you log into your management software to check if any traps were received, and do the same on the switches to check for any obvious hardware/software failures. All of a sudden you think you've pinpointed a host that looks very suspicious and tell the server admins to shut it down or reboot it. But the situation stays the same.
Then you log into Storage Navigator and activate Performance Monitor to see if anything is running hot on the array. Nothing really obvious shows up there either.
In the meantime two or three hours have passed and you decide to call the HDS support centre and lodge a severity 1 call. The person on the other side will do his/her due diligence, checking entitlement and doing a first analysis.
Then he or she will ask for a list of events, what happened, what the current status is, etc. etc. This may seem frustrating: why does all of this have to happen while you are facing a crisis and your boss is calling you every 5 minutes (believe me, I've seen it, done that, been there) asking what the heck is going on and when you will have the issue resolved? But the process is important, to verify that the equipment is covered by HDS maintenance and to check which specialist is available to dispatch the call to. (Be aware that a lot of customers have equipment from multiple vendors, and those vendors all have contractual obligations not to maintain, repair or even troubleshoot each other's equipment.)
Another hour passes. During your call the support person asks you for that detailed description and the actions taken, which you have to write down, plus gives you a whole set of instructions to retrieve a bunch of log information, which will then be dispatched to the respective specialists. This will probably take another 30 to 60 minutes.
OK, so what am I trying to say with all this?
By the time the support specialist has all the data he needs, there is a fair chance that numerous logs have already wrapped. Be aware that most devices in a SAN only have a certain amount of space to save internal trace logs, after which the oldest entries are overwritten FIFO-style. If this happens, and believe me it does, there is no way you will get a root cause analysis of the event. Especially in large-scale fabrics, where a lot of internal traffic is going on as well, some logfiles wrap within 10 to 15 minutes.
Secondly, be aware that the HDS support centre is also covering multiple customers at the same time, so it takes some time to assign the available specialists.
As you can see, before the support specialists can even start looking at this, a lot of time has been "wasted", and the data they have to look at might already be useless.
To prevent this, here are some tips that you need to follow:
1. Have a look at https://tuf.hds.com. This site has a lot of information that HDS needs to be able to help you as quickly as possible.
2. Treat your environment as a "crime scene". Don't start issuing all sorts of commands that will convolute the data you have to send to the HDS support centre. The first thing you should do is gather all the evidence that is appropriate for your environment. Again, have a look at the HDS TUF site mentioned before.
3. Make a plan/workflow with commands and procedures and work out a system that enables you to collect all this information as quickly as possible. Also make sure that your colleagues know how this procedure works.
4. Try to automate this process as much as possible.
SWITCHES:
E.g. when you need to collect Brocade switch logs you only have to log into a switch and type the command "supportsave". Normally this command enters interactive mode and starts asking where to store everything. If however you add some parameters like "-n -u 'uname' -p 'pwd' -h 'host ftp server'", you can create a script that logs into your switches, fires off this command, and drops all support files on that FTP server.
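As a minimal sketch of what such a script could look like, assuming Python with the paramiko SSH library and made-up switch names, credentials and FTP details (verify the exact supportsave parameters against your FOS release's command reference, e.g. whether a remote directory or protocol flag is also required):

#!/usr/bin/env python
# Minimal sketch: run "supportsave" non-interactively on a list of Brocade switches.
# Switch names, credentials and the FTP target are placeholders - adjust for your site.
import paramiko

SWITCHES = ["switch01", "switch02"]                            # hypothetical switch hostnames
FTP_HOST, FTP_USER, FTP_PWD = "10.0.0.5", "ftpuser", "ftppwd"  # hypothetical FTP drop box

for switch in SWITCHES:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(switch, username="admin", password="secret")   # placeholder switch login
    # -n suppresses the interactive prompts; -u/-p/-h point supportsave at the FTP server
    cmd = "supportsave -n -u %s -p %s -h %s" % (FTP_USER, FTP_PWD, FTP_HOST)
    stdin, stdout, stderr = ssh.exec_command(cmd)
    print(switch, stdout.read().decode())
    ssh.close()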
From the Cisco side you only have to log in and enter two commands:
1. terminal length 0
2. show tech-support details
and capture that output from your terminal program.
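If you want to automate that capture as well, here is a minimal sketch along the same lines (Python with paramiko; the hostname and credentials are placeholders):

#!/usr/bin/env python
# Minimal sketch: capture "show tech-support details" from a Cisco MDS switch into a file.
# Hostname and credentials are placeholders.
import time
import paramiko

HOST = "mds01"
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(HOST, username="admin", password="secret")
shell = ssh.invoke_shell()
shell.send("terminal length 0\n")            # disable paging so the output isn't truncated
shell.send("show tech-support details\n")

with open(HOST + "_tech-support.txt", "wb") as out:
    quiet = 0
    while quiet < 30:                        # stop once the switch has been silent for ~30 seconds
        if shell.recv_ready():
            out.write(shell.recv(65535))
            quiet = 0
        else:
            time.sleep(1)
            quiet += 1
ssh.close()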
Performance monitoring at the switch level is very difficult. What the support specialists look for first is the overall health of the switch. Then they look for rogue ports (i.e. are there any ports having problems with frame handling, or do they have link issues). This covers about 80% of all performance and/or connectivity problems, and of that 80% a lot of cases are caused by cabling problems, dust, or SFPs that have gone bad. Fibre cables that are under tension or not properly seated can have a devastating effect on an entire SAN. If you have Brocade's Fabric Watch utility licensed, please have a look at the port-fencing feature. It can save you a lot of problems. Cisco has a similar tool available.
There might also be a problem in the firmware. If problems arise like the cabling one mentioned above, chances are that internal routing tables are not properly updated due to a timing issue or otherwise. Even when the fabric stabilises, this error in the routing tables might persist, and although you don't see any obvious reason why a host is having problems, this might be one of them. Especially in large fabrics it takes extensive knowledge to find this. Very often the easiest and fastest way, if possible, is to restart the host(s) that are affected. If however it is a storage port whose frames are being dropped, you have to take that port offline and online again. This re-creates the internal nameserver entries and related routing tables in the fabric and often solves the problem.
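On a Brocade switch, bouncing a storage port like that typically comes down to a portdisable followed by a portenable on that port (check the command reference for your FOS release). A minimal sketch, with a made-up switch name and port index:

# Minimal sketch: bounce a suspect storage port so its nameserver/routing entries are rebuilt.
# Switch name, credentials and port index are placeholders.
import time
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("switch01", username="admin", password="secret")
ssh.exec_command("portdisable 14")   # take the suspect port offline (port index is hypothetical)
time.sleep(5)                        # give the fabric a moment to settle
ssh.exec_command("portenable 14")    # bring it back online; the nameserver entry is re-registered
ssh.close()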
One more tip on Brocade switches:
Port statistics are cumulative. They just add up until the switch is restarted or someone clears the stats. If a support specialist is looking for port problems he cannot see which port is having problems this very minute, since all ports will probably show a large number and there is no way to find out over what period these numbers accumulated. There are two undocumented commands that clear and show both the back-end and front-end counters. I don't expect you to know what these counters mean, but for the support organisation they are invaluable.
1. slotstatsclear - clears the counters (duhhh)
2. slotstatsshow - shows the counters (ahumm, seems pretty obvious)
If possible, script these commands on a daily basis, e.g. at 00:00 midnight, and save the output to a separate file each day (a sketch of such a script follows below):
porterrshow
sloterrshow -c 1 -a # Shows the front-end and back-end asic port error counters.
slotstatsshow -c 1 -a # The -c 1 takes one snapshot and -a reads all counters (otherwise it runs continuously and reads only the counters that have a non-zero value). For semi-long or long-term differential comparison that is a bit difficult, so I always prefer the -a option.
statsclear
slotstatsclear
The last two commands clear all stats again so the next run captures a new day's worth of data.
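A minimal sketch of such a daily run, assuming Python with paramiko, one switch and placeholder credentials (the command list is the one above; schedule it from cron at midnight, e.g. 0 0 * * * /usr/local/bin/collect_brocade_stats.py):

#!/usr/bin/env python
# Minimal sketch: collect the daily Brocade counters listed above into a dated file,
# then clear them so tomorrow's run captures a fresh day of data.
# Switch name and credentials are placeholders.
import datetime
import paramiko

COMMANDS = [
    "porterrshow",
    "sloterrshow -c 1 -a",
    "slotstatsshow -c 1 -a",
    "statsclear",
    "slotstatsclear",
]

switch = "switch01"
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(switch, username="admin", password="secret")

outfile = "%s_stats_%s.txt" % (switch, datetime.date.today().isoformat())
with open(outfile, "w") as out:
    for cmd in COMMANDS:
        stdin, stdout, stderr = ssh.exec_command(cmd)
        out.write("### %s\n%s\n" % (cmd, stdout.read().decode()))
ssh.close()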
On Cisco switches you use the following commands:
clear counters interface all # for FC interfaces.
clear ips stats all # for FCIP and etherchannel stats
Some other commands that might be requested (a note on automating these follows after this list):
show ips stats ip interface gig #/# det
show ips stats tcp interface gig #/# det
show ips stats dma-bridge interface gig #/#
show ips stats hw-comp all
show interface counters
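The same daily-collection approach as in the Brocade sketch above works here too; you would just swap in the Cisco command list, something like the fragment below (the per-interface ips commands need your actual GigE interface numbers filled in before they can be added):

# Hypothetical daily command list for the Cisco variant of the collection sketch above.
COMMANDS = [
    "show interface counters",
    "clear counters interface all",
    "clear ips stats all",
]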
HOSTS:
Then there is the host connectivity problem. Everyone assumes that when multiple HBAs are connected to disk via different physical paths, the MPIO(-like) software will take care of everything and any connectivity problem will be handled by it. !!!!WRONG!!!! Let me explain why. Multipath software is designed to handle physical path problems. If a path to a disk goes offline, the HBA will notify the multipath software and instruct it not to direct any IO via this path to those particular volumes. This has been designed and working since the late 1990's and has done so very well with all vendors. The problem comes when there is not a physical problem, but an end-to-end connection to a disk is having problems at a logical level. Most of the time you will see this as SCSI timeout errors. These sorts of errors are, by design, not covered by multipath software, but they still have a huge effect on application availability, performance and data integrity. To circumvent this, most multipath software has a threshold setting where you can instruct it to place a path offline if X path errors have occurred in Y amount of time. The right X and Y values depend on the application's resiliency: if you set them higher than the application can sustain, they are of no use to you. So you have to check this.
Also be aware that multipath software can only detect these sorts of errors if there is actually IO going over that path.
What happens if you're not using some sort of load-balancing algorithm but instead use a true fail-over and fail-back mechanism? The above still applies: if there is no IO going over a path but there is an end-to-end connectivity problem, the multipath software will not detect it. If for some reason the active path then goes offline (for a valid reason or not), it cannot fail over to the other path, since that one was having a problem as well. To circumvent that, the multipath software has some sort of polling mechanism (in HDLM it's called IEM, or Intermittent Error Monitoring). Basically it sends a SCSI inquiry to the disk, and if it gets a valid response the path is online. If it times out, the threshold settings mentioned above start to kick in: if the polling fails X times in Y minutes, the path is placed in offline status and removed from the failback list, if failback is configured. That last point is very important if you have more than two HBAs in your servers.
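To make the X-errors-in-Y-minutes idea concrete, here is a small conceptual sketch of that kind of threshold logic. This is an illustration only, not how HDLM implements IEM internally, and the numbers are made up:

# Conceptual sketch of an "X errors in Y seconds" path threshold - not HDLM's actual implementation.
from collections import deque
import time

class PathErrorMonitor:
    def __init__(self, max_errors=3, window_seconds=600):   # e.g. 3 errors in 10 minutes (made-up values)
        self.max_errors = max_errors
        self.window = window_seconds
        self.errors = deque()
        self.online = True

    def record_error(self, now=None):
        now = time.time() if now is None else now
        self.errors.append(now)
        # discard errors that fell outside the sliding window
        while self.errors and now - self.errors[0] > self.window:
            self.errors.popleft()
        if len(self.errors) >= self.max_errors:
            # threshold exceeded: take the path out of the IO (and failback) candidate list
            self.online = False

If your application gives up after, say, 60 seconds of SCSI timeouts, a 10-minute window like this is of no use to you, which is exactly why the X and Y values have to match the application's resiliency.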
The HDLM manual is very clear about the internals, so I advise you to have a look at it.
DONE.
At least for this part. :-)
Performance is not an exact science. The definition of what is good and what is bad is often in the hands of the end-user. A support organisation can never define what is good or bad performance; they only know what the equipment is able to handle and whether the performance metrics fall within those boundaries. Maybe they can pinpoint hotspots within your environment and, if you're lucky, give you some advice on how to change something in your configuration to optimize the behaviour. If, however, the equipment has reached the physical limits of what it can do, then you really have to talk to your boss and ask for a budget increase. (Fast. Ohh, and keep the phone number of your HDS sales rep at hand. :-))
Very often when a performance problem happens, the above-mentioned scenario plays out and no performance data is captured. Sometimes, when SAN admins manage to "solve" or circumvent the problem, they extract a dump or trace from the array and expect the support specialist to come up with a root cause of the problem. Unfortunately it doesn't work this way. We have to split the behaviour of system dumps between modular and enterprise arrays.
1. Modular. (AMS series)
Although the array keeps a certain amount of performance data, embedded in a so-called "simple trace", this is often not sufficient to deeply analyse performance problems. Especially for the modular arrays, the amount of data that is saved is very limited because of the physical space limitations of the controller memory chips where this data is stored.
2. Enterprise. (USP(-V(M)) and NSC)
These arrays behave a bit differently. There are two ways the array captures and saves performance data.
1. Via a system option mode. This mode is turned off by default since it impacts CPU utilisation on the front-end and back-end ports, as these CPUs now also have to manage the performance capturing and handling. So if this mode is turned on all the time you might experience a performance drop. Although not significant, most of the time it's not needed, so why leave it on.
2. Via Performance Monitor (PerfMon). PerfMon is a tool that resides on the SVP (service processor). It captures the performance metrics from all CPUs over a short term (1-minute interval/24h) and a long term (15-minute interval/3 months). (Note: these numbers differ per array type, consult your manual.)
For the exact procedures of what is required and how to obtain the data, again refer to https://tuf.hds.com and select the "Performance" link on the left-hand side. Then follow the equipment type that is applicable to you.
!!!!!!!!!! Very Important !!!!!!!!!!
!!!!!!!!!! TIME, TIME, TIME, TIME !!!!!!!!!!
!!!!!!!!!! LOOKS, LOOKS, LOOKS, LOOKS !!!!!!!!!!
One of the most important and most difficult things technical support always faces is trying to stitch the SAN together and relate events to each other. To be able to do this it is of the utmost importance that all systems have the correct time set. Do yourself and the techsupport guys a favour and hook all systems up to an NTP (Network Time Protocol) server so all clocks have the correct time. If one piece of equipment shows an increase of IO at 15:00 and that system runs in GMT while the remote system runs in GMT+5 but its clock is set incorrectly, there is no way the techsupport guys can figure this out unless you tell them the exact time difference. Also, if you capture Performance Monitor data as described above, mention the time difference between your local PC/server and the primary and secondary sites where the respective MCU/RCU are located. Otherwise they will still be looking at incorrect times/dates.
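One quick way to sanity-check the clocks, and to document the offsets for techsupport, is a small script that records each device's idea of the time next to your own. A minimal sketch with placeholder hostnames and credentials (Brocade FOS prints its clock with "date", a Cisco MDS with "show clock"; verify the commands for your firmware release):

# Minimal sketch: record each switch's clock next to the local clock so time offsets are documented.
# Hostnames, credentials and the per-device clock commands are placeholders.
import datetime
import paramiko

DEVICES = [
    ("brocade01", "date"),       # hypothetical Brocade switch
    ("mds01", "show clock"),     # hypothetical Cisco MDS switch
]

for host, clock_cmd in DEVICES:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username="admin", password="secret")
    stdin, stdout, stderr = ssh.exec_command(clock_cmd)
    print("local=%s  %s=%s" % (datetime.datetime.now().isoformat(), host, stdout.read().decode().strip()))
    ssh.close()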
As for the "LOOKS" HDS techsupport does not know your SAN and array configuration as you do. They have no idea when and why certain configuration decisions were made. They are very well capable of stitching a fabric together based on the logfiles you provided but this takes time. If this is a large fabric with multiple interdependent pieces of equipment it can take a lot of time depending on the problem, symptoms and completeness and accuracy of the information provided. If possible create a drawing of your SAN and how everything is connected. MS Visio is preferred but if you use OpenOffice Draw or any other tool please save-as/convert-to a high-res JPEG file.
Now, what if you run into a performance problem at the array level? As mentioned in tip 4 above, automate as much as possible. The procedures above can be wrapped in a script and executed on a daily basis. This makes sure you always have the latest data available.
When problems start, or you get that phone call at 03:00 in the morning, you know you have everything in one place. Call the HDS support centre, ask for the case ID and, while you're still on the phone with the helpdesk gentleman/lady, you can already start uploading your logs and data, which you know is the correct information the HDS specialist needs to help you with the problem.
As Jerry Maguire said to Cuba Gooding Jr. in the 1996 movie of the same name: Help me help you.
Always refer to the information listed on https://tuf.hds.com for collecting data. Procedures and requirements might change over time so check it out regularly.
I hope this was useful to you.
Thanks to Erwin van Londen