It’s the little things….

It’s the little things that make a successful deployment. Getting caught up in planning the perfect design without looking at the fine detail can undermine all the time and effort you have put in.

To VMware’s credit, the default install generally runs just fine out of the box with little or no additional work required.

That being said, we recently ran into an issue with one of our drivers which caused us a huge amount of headache.

We are running 10GbE on all our hosts. This works well for us and was actually a request from our network department. I did a large amount of testing and settled on an Intel dual-port card. It’s fast, efficient and tidy, and since we are only using two cards per host we have four fibre cables running out of the back of each server instead of many copper cables.

So on to the headache. I was on the way home when I received a call from a very distressed admin. Apparently one of our ESXi hosts had lost its storage, taking 56 VMs down. The network department reported that the two 10GbE ports were flapping and nobody really knew what was happening.

After plugging in a copper cable and disabling the fibre, we were back up and running. The host was put into maintenance mode, log files were exported and a case was raised.

As it turns out, the driver we were using was a couple of releases out of date. VMware suggested we update to the latest (obviously). I did query why Update Manager hadn’t presented me with the latest driver, so hopefully I’ll get an answer for that soon too.

EDIT (02/07/2012): Just had a Twitter conversation with Andre Beukes (@eabeukes), who let me know that only the e1000 and bge drivers are in VUM at the moment. As far as he knows the others should be coming soon.
Full conversation:

  • Andre Beukes @eabeukes I dont think the ixgbe driver is in the Update Manager repository – its a FCoE card right?
  • Carel Maritz @carelmaritz The card doesn’t support FCoE but the driver is igbxe. I was surprised that it isn’t updated by VUM.
  • Andre Beukes @eabeukes Only e1000 and bge are in VUM – theres no 10GB cards in there (yet) AFAIK as that will come with soft-FCoE when its released
  • Carel Maritz @carelmaritz ah, thanks for the info. I’ll update my post. Do you mind if I quote this conversation?
  • Andre Beukes @eabeukes Sure no worries

END EDIT

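Since VUM wasn’t going to flag it, it is worth checking by hand which driver build a host is actually running. Something along these lines does the job (ethtool should be present on both 4.x and 5.x hosts, the vib query assumes ESXi 5.x, and vmnic15 is just one of our port names):

  ethtool -i vmnic15
  esxcli software vib list | grep -i ixgbe
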
I have updated the driver and it all seems good. I do, however, like to understand exactly what happened, so I went to our syslog server (which is something everybody who uses ESXi should have).

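For anyone who hasn’t set one up, pointing a host at a remote syslog server only takes a minute. On ESXi 5.x it is a couple of esxcli calls (the hostname below is a placeholder; on 4.1 the equivalent is the Syslog.Remote.Hostname advanced setting):

  esxcli system syslog config set --loghost='udp://syslog.example.local:514'
  esxcli system syslog reload
  esxcli network firewall ruleset set --ruleset-id=syslog --enabled=true
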
Logs as follows:

vmkernel: 345:02:11:53.704 cpu23:73434691)<3>ixgbe: vmnic15: ixgbe_check_tx_hang: Detected Tx Unit Hang
vmkernel:   Tx Queue             <6>
vmkernel:   TDH, TDT             <0>, <3>
vmkernel:   next_to_use          <3>
vmkernel:   next_to_clean        <0>
vmkernel: tx_buffer_info[next_to_clean]
vmkernel:   time_stamp  
vmkernel: 345:02:11:53.704 cpu15:64444910)<6>ixgbe: vmnic15: ixgbe_clean_tx_irq: tx hang 7 detected, resetting adapter

vmkernel: 345:02:11:57.823 cpu1:15580)<3>ixgbe: vmnic17: ixgbe_check_tx_hang: Detected Tx Unit Hang
vmkernel:   Tx Queue             <4>
vmkernel:   TDH, TDT             ,
vmkernel:   next_to_use          
vmkernel:   next_to_clean        
vmkernel: tx_buffer_info[next_to_clean]
vmkernel:   time_stamp    
vmkernel: 345:02:11:57.823 cpu1:15580)<6>ixgbe: vmnic17: ixgbe_clean_tx_irq: tx hang 1 detected, resetting adapter

It was quite interesting. The first NIC (vmnic15) registered a stop in TX packets and the driver reset the port. All the port groups and VMkernel ports failed over to the other NIC (vmnic17). By this stage the driver had gotten itself into a loop and reset vmnic17 as well. Everything failed over again and… well, you get the picture.

So to cut a long story short, it pays to sweat the small stuff in any deployment.

Heading down the VCDX Highway

Once again, I had hoped to blog more this year but as these things happen I haven’t been able to.
However, I have been working hard at getting my certs done. I passed the VCAP4-DCA, which was an amazing feeling. I had almost given up on the process as it was very time-consuming. My wife has been very supportive, which has really helped.
The exam itself was not easy. There are lots of scenarios and a wide variety of tasks, so you really need to know your stuff. A lab is a must and you should give yourself a good three months to study. My home lab was inspired by Simon Gallagher’s vTardis and ran on my laptop. There are plenty of resources available to help you get going. I would personally recommend Edward Grigson’s guide, which can be found at http://www.vexperienced.co.uk.
So with the release of the VCAP5-DCD and the imminent release of the VCAP5-DCA I have decided to go for the VCDX on the vSphere 5 track. I will probably use the design that I put together for my office.
For those of you wanting to go for the VCDX certification, have a look at http://vcdxwannabe.com/. It looks like it is going to be a good community for people after the same thing, and when I last checked there were a few VCDXs on there too.
The VCDXs I have met have been really friendly and have always been approachable, answering any questions I have (and there have been plenty).
There is also the question of whether or not to try to be recognised as a vExpert. The vExpert is not a certification but more a recognition by VMware of an individual’s contribution to the community.
So anyway, ramblings aside, I have reading to do.

Hostd and ESXi

So I recently had an issue where we had to put a host into maintenance mode quickly to accommodate an emergency change for the network team.

Now, I personally prefer scaling up to scaling out. There are pros and cons to both.
We run DL580s in a stretched cluster and each host holds about 60 VMs. The hosts are equipped with 10Gb cards, which helps.
So putting the host into maintenance mode kicked off a bit of a storm, and since we are using 10Gb cards the evacuation can suck up to 8Gb of the bandwidth.
Shortly after kicking off this process the host became disconnected from the vCenter server.
Right-click -> Connect didn’t work. The VMs were up and the host was responding to pings.
I also couldn’t connect directly using the client.
Connecting to the console through the iLO I was presented with the familiar yellow and grey screen.
I logged in and turned on local tech support mode.
Neither /sbin/services.sh restart nor /etc/init.d/hostd restart got the host back. Some googling and KB surfing later, I came across VMware KB 1005566, which discusses manually killing the hostd process and then running /sbin/services.sh restart and /etc/init.d/hostd restart again. And like magic the host was back.
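
For reference, the rough sequence from Tech Support Mode looked something like this (a sketch of the idea rather than a copy of the KB; the PID is whatever ps gives you):

  ps | grep hostd              # find the stuck hostd process
  kill -9 <hostd_pid>          # force it to stop
  /sbin/services.sh restart    # restart the management agents
  /etc/init.d/hostd restart    # make sure hostd itself comes back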

VMTN

So about a week ago Mike Laverick posted on the VMware Communities forums here.

Basically it’s asking VMware to restart a program they ran several years ago called VMTN, similar to the Microsoft TechNet subscriptions. I can say that without my TechNet subscription my job would be a lot more difficult. I am able to test out different configurations in my home lab, both for studying and for proof of concept work.

If you support the idea please go over to http://communities.vmware.com/thread/335123 and post your support.

I would, however, ask that you don’t just put +1 in your comment, but also put down why you think it would be good or a suggestion of how it could be done.

My post:
“This is a really good idea and VMware could really push their products forward.
I think a layered approach/pricing would be the most successful/fair, for example:
VMTN – VIRT: Only the hypervisor and related products.
  • vSphere, Update Manager, ESXi Enterprise Plus (2 or 4 sockets and current memory limits), vShield, workstation/Fusion, vCloud Products
VMTN – VIRT + DT: The hypervisor plus the end-user focused products.
  • vSphere, Update Manager, ESXi Enterprise Plus (2 or 4 sockets and current memory limits), Workstation/Fusion, View, Ace, vCloud Products
VMTN – ALL:
  • Access to all the VMware products, maybe access to the beta products.
This could help people at different stages in their certification/careers.
VMware, let’s get this done. It will help you and the community.
Carel”

#UKVMUG

So that’s my first VMUG over.
The night before (2nd of November) Veeam were holding a vCurry (networking) event. It was a lot of fun. I also met Duncan Epping of yellow-bricks fame. I have been reading his blog for a while, so I went over and introduced myself and chatted with him and Andre Beukes (http://www.virtualiseeverything.com) for a few minutes before heading off.
It was a really good day, held at the Motorcycle Museum just outside Birmingham.

The keynote by Joe Baguley was quite informative. I will definitely be looking into Project Octopus. It’s interesting to see how VMware are always trying to look forward and what they think will be next.
Having a few minutes after the keynote, I wandered over to Simon Gallagher to ask if I could watch some of the VCDX hopefuls test drive their presentations, and managed to get myself talked into test driving my own.
After that I attended a presentation by Xangati (https://www.xangati.com). The product came across really well and I’ll be asking the network team to have a look. Then on to a Q&A session with Duncan Epping and Lee Dilworth.
When the Q&A finished I made my way down to lunch and presented my design to Duncan Epping and Simon Gallagher. I don’t think the questions were unfair but I did slip up at one point when I wrongly thought something was a supported configuration.
I skipped the next round of talks as they were vendor focused and spent some time walking around the exhibition floor talking to the reps at the various stands.
The last session I attended was by Simon Gallagher. He talked about his lab setup and why it was good to have one.

I would recommend attending a VMUG to any IT person with even a passing interest in virtualisation.

vSphere book and vbeers

Just a quick post:
Firstly I would like to recommend the book VMware vSphere Design. I have been using it to study for my VCAP-DCD exam and so far it’s been a real help. The writing style is easy to read and the authors obviously know their stuff. It is available in eBook format too, which has been really convenient for me.

Also, there is a vBeers in London soon: http://www.vbeers.org/2011/08/19/vbeers-london-thurs-1st-sept-2011/. Hope to see you there. If you aren’t in London, have a look at the website http://www.vbeers.org/; there is a listing of vBeers around the world.

Cheers

Carel

Failed vMotion, host busy, White Spaces and Directory Names.

So recently I came across an issue where a space at the end of a VM’s directory name causes it to fail to vMotion.

The error is typically cryptic, stating that the host is too busy.
If you try to browse the directory using the vSphere Client it looks empty. You also can’t rename the directory using the client.

For us the problem came about when we were creating VMs. We typically created quite long machine names in vSphere: “machine name – OS – description”.
As you can see, we tend to add the description of the VM to the name. Now, what vSphere does is take the first 32 characters of the name and use them as the directory name. If the 32nd character is a space… well, you get the picture.

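A quick way to spot any other directories that have been caught out is to list the datastore from the vMA (or the host console) and look for names ending in a space. The mount point below is just how we have ours set up:

  ls /mnt/nfs01/ | grep ' $'
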
I solved the issue using the vMA (this can also be done from the ESX console). I use the vMA a fair amount and have all of our NFS shares mounted under /mnt. I would recommend this, as it is just good to have that kind of access to the virtual machine files.

Anyway, the fix (example commands below):
1. Shut down the VM and remove it from the inventory (don’t delete it from disk!).
2. Using the vMA, browse to the location of the folder and rename it using the mv command: mv "folder with space " "folder with space".
3. Re-register the VM with vSphere or the ESX server.
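
As a rough sketch of what that looked like for us (the names and paths are made up, and the vim-cmd step assumes you are re-registering from the host console rather than through the client):

  mv "/mnt/nfs01/WEBSRV01 - W2K8R2 - intranet " "/mnt/nfs01/WEBSRV01 - W2K8R2 - intranet"
  vim-cmd solo/registervm "/vmfs/volumes/nfs01/WEBSRV01 - W2K8R2 - intranet/WEBSRV01 - W2K8R2 - intranet.vmx"

Adding the .vmx back to the inventory through the datastore browser works just as well.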

If anybody else has had this problem and solved it, drop me a line.

The Virtues of Standardisation.

So we are currently moving to 10Gb fibre in our production cluster, and then on to the Nexus 1000V.

It’s been a bit of a trial and I thought I would share the experience to date. We purchased 12 x QLogic fibre cards and 12 x HP fibre cards. This would allow us to take advantage of the maximum number of 10Gb ports allowed (4 per server), and we could then get rid of 12 copper cables per server, a total of 144 cables. We would also be following VMware’s own best practice of using cards from two different vendors.

We ran into this issue, which was fixed with a firmware upgrade, and later, after upgrading to 4.1, ran into the issue discussed in VMware KB 1026431.

I decided to get rid of the HP cards, gave them to the DBAs and replaced them with Intel cards. The HP cards run really well under Windows and have made the DBAs happy.

The whole ordeal was very frustrating but made me take a closer look at the servers we have. It looks like our operations department hadn’t been very thorough when installing my servers. Of the 12 servers, half had different BIOS revisions. This is a major issue, as the latest firmware can really make a difference where performance is concerned. It took a whole weekend to comb through the servers and make sure they were identical. This also included making sure the cards were in the same slots across all servers. Most of the time was spent waiting for hosts to enter maintenance mode.

Now I know that if I look at one host it is going to be the same as every other host. I also spoke to the operations team and gave them an exact how-to for setting up a new host, including which firmware it needs and the exact hardware layout.

VCAP-DCD

So a few weeks ago I got an invitation to the VCAP4-DCD beta exam. This was the first time I had been invited to a beta exam, so I jumped at the chance.

Because of the NDA, I can’t really say too much about the content except what’s already known. The format is pretty much what was expected.

Exam length, structure and topics are outlined in the exam blueprint.

I really like the format of the in-exam design tool. You can see a demo here: I think it needs a bit more polishing, as I got the occasional ghost image on the screen, but it seemed to work well. I guess that’s what the betas are for. I tried to comment where I could, but for the most part the exam seemed almost complete.

I really can’t say how I did; suffice it to say it’s no walk in the park.

It’s been a good experience. Studying for the VCAP-DCA/DCD exams has really improved me as a virtualisation engineer. I have read countless whitepapers and technical docs, and many of the new things I learnt I have applied directly to my job, making the environments I administer better.

A few of the resources which have been invaluable are:

So that’s it really. I am knackered, but happy.

HA scaremongering

I came back from lunch to find this charming error message: “HA initiated a failover action in cluster prod in datacenter example”.

This is not the kind of thing you want to see on a Friday afternoon.

A quick check revealed that there was in fact no HA incident happening or waiting to happen.

The quick fix was to remove HA from the cluster and then re-enable it. However, if this happens again I will definitely be raising it with VMware support.