It’s the little things….

It’s the little things that make a successful deployment. Getting caught up in planning the perfect design without looking at the fine detail can undo all the time and effort you have put in.

To VMware’s credit, the default install generally runs just fine out of the box with little or no additional work required.

That being said, we recently ran into an issue with one of our drivers which caused us a huge headache.

We are running 10GbE on all our hosts. This works well for us and was actually a request from our network department. I did a large amount of testing and settled on an Intel dual-port card. It’s fast, efficient and tidy, and since we are only using two cards we have four fibre cables running out of the back of each server instead of many copper cables.

So on to the headache. I was on the way home when I received a call from a very distressed admin. One of our ESXi hosts had lost its storage, taking 56 VMs down. The network department reported that the two 10GbE ports were flapping, and nobody really knew what was happening.

After plugging in a copper cable and disabling the fibre, we were back up and running. The host was put into maintenance mode, log files were exported and a case was raised.
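
For anyone who has to do the same in a hurry, the fibre ports can also be taken down from the ESXi shell. This is just a rough sketch, assuming ESXi 5.x, and the vmnic names are only examples from our host:

  # Check which vmnics are the 10GbE ports and their link state
  esxcli network nic list

  # Administratively down the flapping fibre uplinks
  esxcli network nic down -n vmnic15
  esxcli network nic down -n vmnic17

  # Bring them back once the driver has been sorted out
  esxcli network nic up -n vmnic15
  esxcli network nic up -n vmnic17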

As it turns out, the driver we were using was a couple of releases out of date. VMware suggested we update to the latest (obviously). I did query why Update Manager hadn’t presented me with the latest driver, so hopefully I’ll get an answer for that soon too.
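
If you want to check what you are running yourself, the driver and firmware versions are easy to pull from the ESXi shell. A quick sketch, assuming ESXi 5.x (vmnic15 is just one of our uplinks):

  # Which driver each uplink is using
  esxcli network nic list

  # Driver, driver version and firmware version for a specific uplink
  esxcli network nic get -n vmnic15

  # The ixgbe driver VIB actually installed on the host
  esxcli software vib list | grep ixgbe

  # ethtool reports the loaded driver version too
  ethtool -i vmnic15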

EDIT (02/07/2012): Just had a Twitter conversation with Andre Beukes (@eabeukes), who let me know that only the e1000 and bge drivers are in VUM at the moment. As far as he knows, they should be coming soon.
Full conversation:

  • Andre Beukes @eabeukes I dont think the ixgbe driver is in the Update Manager repository – its a FCoE card right?
  • Carel Maritz @carelmaritz The card doesn’t support FCoE but the driver is ixgbe. I was surprised that it isn’t updated by VUM.
  • Andre Beukes @eabeukes Only e1000 and bge are in VUM – theres no 10GB cards in there (yet) AFAIK as that will come with soft-FCoE when its released
  • Carel Maritz @carelmaritz ah, thanks for the info. I’ll update my post. Do you mind if I quote this conversation?
  • Andre Beukes @eabeukes Sure no worries

END EDIT
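
Since the ixgbe driver isn’t in the VUM repository yet, the update has to be done by hand. Roughly what that looks like, assuming ESXi 5.x and that you have downloaded the driver from VMware as an offline bundle; the datastore path and bundle name below are made up:

  # Put the host into maintenance mode first
  vim-cmd hostsvc/maintenance_mode_enter

  # Install the driver VIB from the offline bundle copied to a datastore
  esxcli software vib install -d /vmfs/volumes/datastore1/ixgbe-offline-bundle.zip

  # Reboot so the new driver loads, then take the host out of maintenance mode
  reboot
  vim-cmd hostsvc/maintenance_mode_exit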

I have updated the driver and it all seems good. I do, however, like to understand exactly what happened, so I went to our syslog server (which is something everybody who uses ESXi should have).
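
If you don’t already have one, pointing a host at a remote syslog collector only takes a minute. A rough sketch, assuming ESXi 5.x; syslog.example.local is a made-up hostname:

  # Send the host logs to a remote collector (tcp:// and ssl:// also work)
  esxcli system syslog config set --loghost='udp://syslog.example.local:514'
  esxcli system syslog reload

  # ESXi 5.x blocks outbound syslog by default, so open the firewall ruleset
  esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true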

Logs as follows:

vmkernel: 345:02:11:53.704 cpu23:73434691)<3>ixgbe: vmnic15: ixgbe_check_tx_hang: Detected Tx Unit Hang
vmkernel:   Tx Queue             <6>
vmkernel:   TDH, TDT             <0>, <3>
vmkernel:   next_to_use          <3>
vmkernel:   next_to_clean        <0>
vmkernel: tx_buffer_info[next_to_clean]
vmkernel:   time_stamp  
vmkernel: 345:02:11:53.704 cpu15:64444910)<6>ixgbe: vmnic15: ixgbe_clean_tx_irq: tx hang 7 detected, resetting adapter

vmkernel: 345:02:11:57.823 cpu1:15580)<3>ixgbe: vmnic17: ixgbe_check_tx_hang: Detected Tx Unit Hang
vmkernel:   Tx Queue             <4>
vmkernel:   TDH, TDT             ,
vmkernel:   next_to_use          
vmkernel:   next_to_clean        
vmkernel: tx_buffer_info[next_to_clean]
vmkernel:   time_stamp    
vmkernel: 345:02:11:57.823 cpu1:15580)<6>ixgbe: vmnic17: ixgbe_clean_tx_irq: tx hang 1 detected, resetting adapter

It was quite interesting. The first NIC (vmnic15) stopped transmitting, and the driver reset the port. All the port groups and VMkernel ports failed over to the other NIC (vmnic17). By this stage the driver had gotten itself into a loop and reset vmnic17 as well. Everything failed over again and… well, you get the picture.
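
For what it’s worth, all the pieces of that picture can be checked from the shell: the uplink link state, the vSwitch failover order, and the resets themselves in the vmkernel log. A rough sketch, assuming ESXi 5.x, a standard vSwitch and our vmnic names (vSwitch1 is just an example):

  # Link state of the 10GbE uplinks
  esxcli network nic list

  # Which uplink is active and which is standby on the vSwitch
  esxcli network vswitch standard policy failover get -v vSwitch1

  # The Tx hangs and adapter resets as logged on the host itself
  grep ixgbe /var/log/vmkernel.log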

So to cut a long story short, it pays to sweat the small stuff in any deployment.