New Patch for vSphere

Things got a bit weird a couple of weeks ago while trying to upgrade my home lab from 4.1 to 5.1.

The install of SSO went fine but upgrading the Inventory Service threw up the following error:
Error 29102. Could not contact Lookup Service. Please check VM_ssoreg.log in system temporary folder for details.

So I dutifully checked the log file and googled it. It turns out that if the FQDN has the word “port” in it anywhere, the install/upgrade will fail. REALLY?!??
This is fixed in 5.1.0a (Build 880471). You can find the details here.
Upgrading to 5.1.0a worked. How that bug got in there is mind-boggling.
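
If you want to check your own host names before attempting the upgrade, something like the following would do. This is only a minimal sketch: the function name and the example host names are made up for illustration, and treating the match as case-insensitive is an assumption on my part rather than something stated in the fix details.

```python
def fqdn_contains_port(fqdn: str) -> bool:
    """Return True if the substring 'port' appears anywhere in the FQDN.

    Case-insensitive matching is an assumption, chosen to be on the safe side.
    """
    return "port" in fqdn.lower()


# Hypothetical host names, for illustration only.
for name in ["vc01.lab.example.com", "support-vc.lab.example.com"]:
    status = "affected" if fqdn_contains_port(name) else "ok"
    print(f"{name}: {status}")
```

Any name flagged as affected is a candidate for going straight to 5.1.0a.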

What is the value of a certification?

Interesting question, as the true value of a certification differs depending on how you present it.

For me the true value of being certified has realised itself in two ways:
  1. Got me a second interview with the company I work for.
  2. Improved my skills.
A few years ago I was looking for a new job and had just taken the first of the two NetApp exams required to get NCDA certified (I believe only one is required now).
I had submitted my CV and managed to get a first interview. On my CV I had stated that I had passed the first exam and intended to take the second to get the certification. By the time the interview rolled around I had taken and passed the second exam, so during the interview I mentioned that I now had the cert.
This went down well and they felt it showed that I was committed.
Other interviews I have been granted have usually been down to the fact that I had certifications in the relevant areas as well as experience. In other words, it often helps to get a foot in the door. Once you are sitting in front of the interview panel it’s up to you to show you know your stuff.
But mostly I do it now to improve my skills.
The VCP certification is VMware’s entry-level certification. To get VCP certified you need to attend an official VMware course and then sit the exam. This for me is quite interesting: VMware are trying to ensure that all VCPs have had exposure to the same or similar material, but if your company won’t sponsor you it can be quite expensive. It’s also fairly easy and quite common these days, but many jobs require you to at least have this cert before you even apply. (Foot in the door!)
The VCAP exams are much more difficult and, I would expect, don’t have a high first-time pass rate. I have taken the VCAP-DCD exam once and missed out by 18 points. Very frustrating.
I took the VCAP4-DCA exam twice before I passed and more recently I took the VCAP5-DCA beta and passed. I guess practice makes perfect. I felt that the DCA exams were similar in format to Red Hat’s exams, which give an objective and let the candidate get on with the lab.
The DCD exams are really about understanding designs. It is important to note that, while reading the recommended white papers and books will help, you really do need to understand how to make the decisions required for those designs. Anybody can get a basic VMware setup up and running in next to no time, but does it meet the requirements? That’s what you need to get for this exam.
Now, studying for these two exams has helped me as a VMware admin more than the cert has helped my career. People who have taken the exams are very cagey about giving out any info about the questions asked, and for good reason. Apart from the fact that VMware will find you, remove your certification and possibly ban you from taking VMware exams in the future, the exams require a fair degree of study, which means a not insignificant time sacrifice.
VCDX? Well I haven’t done that…. Yet.
But it’s all about what you want from your studies. Can it open doors? Yes, but then it’s up to you to prove yourself. Can it help you improve your skills? Yes, but only in proportion to how high you aim and how much time you invest.
When all’s said and done, for me it’s been worth it.

It’s the little things….

It’s the little things that make a successful deployment. Getting caught up in planning the perfect design without looking at the fine detail can discredit the time and effort you have put in.

To VMware’s credit, the default install generally runs just fine out of the box with little or no additional work required.

That being said, we recently ran into an issue with one of our drivers which caused a huge headache.

We are running 10GbE on all our hosts. This works well for us and was actually a request from our network department. I did a large amount of testing and settled on an Intel dual-port card. It’s fast, efficient and tidy, and since we are only using two cards we have four fibre cables running out of the back of the servers instead of many copper cables.

So on to the headache. I was on the way home when I received a call from a very distressed admin. Apparently one of our ESXi hosts had lost its storage, taking 56 VMs down. The network department reported that the two 10GbE ports were flapping and nobody really knew what was happening.

After plugging in a copper cable and disabling the fibre, we were back up and running. The host was put into maintenance mode, log files exported and a case raised.

As it turns out the driver we were using was a couple of releases out of date. VMware suggested we update to the latest (obviously). I did query why Update Manager hadn’t presented me with the latest driver so hopefully I’ll get an answer for that soon too.

EDIT (02/07/2012): Just had a Twitter conversation with Andre Beukes (@eabeukes), who let me know that only the e1000 and bge drivers are in VUM at the moment. As far as he knows the others should be coming soon.
Full conversation:

  • Andre Beukes @eabeukes I dont think the ixgbe driver is in the Update Manager repository – its a FCoE card right?
  • Carel Maritz @carelmaritz The card doesn’t support FCoE but the driver is igbxe. I was surprised that it isn’t updated by VUM.
  • Andre Beukes @eabeukes Only e1000 and bge are in VUM – theres no 10GB cards in there (yet) AFAIK as that will come with soft-FCoE when its released
  • Carel Maritz @carelmaritz ah, thanks for the info. I’ll update my post. Do you mind if I quote this conversation?
  • Andre Beukes @eabeukes Sure no worries

END EDIT

I have updated the driver and it all seems good. I do, however, like to understand exactly what happened. I went to our syslog server (which is something everybody who uses ESXi should have).

Logs as follows:




vmkernel: 345:02:11:53.704 cpu23:73434691)<3>ixgbe: vmnic15: ixgbe_check_tx_hang: Detected Tx Unit Hang
vmkernel:   Tx Queue             <6>
vmkernel:   TDH, TDT             <0>, <3>
vmkernel:   next_to_use          <3>
vmkernel:   next_to_clean        <0>
vmkernel: tx_buffer_info[next_to_clean]
vmkernel:   time_stamp  
vmkernel: 345:02:11:53.704 cpu15:64444910)<6>ixgbe: vmnic15: ixgbe_clean_tx_irq: tx hang 7 detected, resetting adapter



vmkernel: 345:02:11:57.823 cpu1:15580)<3>ixgbe: vmnic17: ixgbe_check_tx_hang: Detected Tx Unit Hang
vmkernel:   Tx Queue             <4>
vmkernel:   TDH, TDT             ,
vmkernel:   next_to_use          
vmkernel:   next_to_clean        
vmkernel: tx_buffer_info[next_to_clean]
vmkernel:   time_stamp    
vmkernel: 345:02:11:57.823 cpu1:15580)<6>ixgbe: vmnic17: ixgbe_clean_tx_irq: tx hang 1 detected, resetting adapter

It was quite interesting. The first NIC (vmnic15) registered a stop in tx packets and the driver reset the port. All the portgroups and vmk ports failed over to the other NIC (vmnic17). The driver at this stage had gotten itself into a loop and reset vmnic17. Everything failed over and ... well, you get the picture.
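
To spot this kind of ping-pong on a syslog server without reading every line, the hang and reset messages can be counted per NIC. The sketch below is a rough illustration only: the regular expressions are written against the two ixgbe messages in the excerpt above, the function name is made up, and other driver versions may well word things differently.

```python
import re
from collections import Counter

# Patterns based on the two ixgbe messages in the log excerpt above.
HANG_RE = re.compile(
    r"ixgbe: (vmnic\d+): ixgbe_check_tx_hang: Detected Tx Unit Hang"
)
RESET_RE = re.compile(
    r"ixgbe: (vmnic\d+): ixgbe_clean_tx_irq: tx hang \d+ detected, resetting adapter"
)


def summarise(lines):
    """Count Tx hang detections and adapter resets per vmnic."""
    hangs, resets = Counter(), Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs[match.group(1)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
    return hangs, resets


# The four relevant lines from the excerpt above.
sample = [
    "vmkernel: 345:02:11:53.704 cpu23:73434691)<3>ixgbe: vmnic15: ixgbe_check_tx_hang: Detected Tx Unit Hang",
    "vmkernel: 345:02:11:53.704 cpu15:64444910)<6>ixgbe: vmnic15: ixgbe_clean_tx_irq: tx hang 7 detected, resetting adapter",
    "vmkernel: 345:02:11:57.823 cpu1:15580)<3>ixgbe: vmnic17: ixgbe_check_tx_hang: Detected Tx Unit Hang",
    "vmkernel: 345:02:11:57.823 cpu1:15580)<6>ixgbe: vmnic17: ixgbe_clean_tx_irq: tx hang 1 detected, resetting adapter",
]

hangs, resets = summarise(sample)
for nic in sorted(resets):
    print(nic, "hangs:", hangs[nic], "resets:", resets[nic])
```

Resets showing up on both uplinks of the same host within a few seconds of each other is the pattern to worry about, since there is nothing left to fail over to.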

So to cut a long story short, it pays to sweat the small stuff in any deployment.

Heading down the VCDX Highway

Once again, I had hoped to blog more this year but, as these things happen, I haven’t been able to.
However I have been working hard at getting my certs done. I passed the VCAP4-DCA which was an amazing feeling. I had almost given up on the process as it was very time consuming. My wife has been very supportive which has really helped.
The exam itself was not easy. There are lots of scenarios and a wide variety of tasks, so you really need to know your stuff. A lab is a must and you should give yourself a good three months to study. My home lab was inspired by Simon Gallagher’s vTardis and runs on my laptop. There are plenty of resources available to help you get going. I would personally recommend Edward Grigson’s guide, which can be found at http://www.vexperienced.co.uk.
So with the release of the VCAP5-DCD and the imminent release of the VCAP5-DCA I have decided to go for the VCDX on the vSphere 5 track. I will probably use the design that I put together for my office.
For those of you wanting to go for the VCDX certification, have a look at http://vcdxwannabe.com/. It looks like it is going to be a good community for people after the same thing, and when I last checked there were a few VCDXs on there too.
The VCDXs I have met have been really friendly and have always been approachable, answering any questions I have (and there have been plenty).
There is also the question of whether or not to try to be recognised as a vExpert. The vExpert is not a certification but more of a recognition by VMware of an individual’s contribution to the community.
So anyway ramblings aside, I have reading to do.

Hostd and ESXi

So I recently had an issue where we had to put a host into maintenance mode quickly to accommodate an emergency change for the network team.

Now, I personally prefer scaling up to scaling out, though there are pros and cons to both.
We run DL580s in a stretched cluster and each host holds about 60 VMs. The hosts are equipped with 10Gb cards, which help.
So putting the host into maintenance mode kicked off a bit of a storm, and since we are using 10Gb cards it can suck up to 8Gb of the bandwidth.
Shortly after kicking off this process the host became disconnected from the vCenter server.
Right Click -> connect didn’t work. The VMs were up and the host was responding to pings.
I also couldn’t connect directly using the client.
Connecting to the console through the iLO I was presented with the familiar yellow and grey screen.
I logged in and turned on local tech support mode.
Neither /sbin/services.sh restart nor /etc/init.d/hostd restart got the host back. Some googling and KB surfing later, I came across VMware KB 1005566, which discusses manually killing the hostd process and then running /sbin/services.sh restart and /etc/init.d/hostd restart again. And like magic the host was back.

VMTN

So about a week ago Mike Laverick posted on the VMware communities forums here.

Basically it’s asking for VMware to restart a program they had several years ago called VMTN, similar to the Microsoft TechNet subscriptions. I can say that without my TechNet subscription my job would be a lot more difficult. I am able to test out different configurations in my home lab, both for studying and for proof of concept.

If you support the idea please go over to http://communities.vmware.com/thread/335123 and post your support.

I would, however, ask that you don’t just put +1 in your comment but also put down why you think it would be good, or a suggestion of how it could be done.

My post:
“This is a really good idea and VMware could really push their products forward.
I think a layered approach/pricing would be the most successful/fair, for example:
VMTN – VIRT: Only the hypervisor and related products.
  • vSphere, Update Manager, ESXi Enterprise Plus (2 or 4 sockets and current memory limits), vShield, Workstation/Fusion, vCloud Products
VMTN – VIRT + DT: The hypervisor and end user focused.
  • vSphere, Update Manager, ESXi Enterprise Plus (2 or 4 sockets and current memory limits), Workstation/Fusion, View, ACE, vCloud Products
VMTN – ALL:
  • Access to all the VMware products, maybe access to the beta products.
This could help people at different stages in their certification/careers.
VMware, let’s get this done. It will help you and the community.
Carel”

The Virtues of Standardisation.

So we are currently moving to 10Gb fibre in our production cluster and then to the Nexus 1000V.

It’s been a bit of a trial and I thought I would share the experience to date. We purchased 12 x QLogic fibre cards and 12 x HP fibre cards. This would allow us to take advantage of the maximum number of 10Gb ports allowed (4 per server), and we could then get rid of 12 copper cables per server, a total of 144 cables. We would also be following VMware’s own best practice of using cards from two different vendors.

We ran into this issue which was fixed with a firmware upgrade and later, after upgrading to 4.1, ran into the issue discussed in VMware KB 1026431.

I decided to get rid of the HP cards, gave them to the DBAs, and replaced them with Intel cards. The HP cards run really well under Windows and have made the DBAs happy.

The whole ordeal was very frustrating but made me take a closer look at the servers we have. It looks like our operations department hadn’t been very thorough when installing my servers. Of the 12 servers, half had different BIOS revisions. This is a major issue, as the latest firmware can really make a difference where performance is concerned. It took a whole weekend to comb through the servers and make sure they were identical. This also included making sure the cards were in the same slots across all servers. Most of the time was spent waiting for hosts to enter maintenance mode.

Now I know if I look at a host it is going to be the same as every other host. I also spoke to the operations team and gave them an exact how-to for setting up a new host including which firmware the host needs and the exact hardware layout.
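
Once the BIOS revisions (or any other per-host setting) have been collected by whatever means, finding the odd ones out is easy to script. This is a minimal sketch with invented host names and placeholder revision strings. It assumes the most common value is the baseline, which will not hold if the majority of hosts are the ones that are wrong.

```python
from collections import defaultdict


def group_by_value(inventory):
    """Map each distinct value (e.g. a BIOS revision) to the hosts that have it."""
    groups = defaultdict(list)
    for host, value in inventory.items():
        groups[value].append(host)
    return dict(groups)


def outliers(inventory):
    """Return the hosts whose value differs from the most common one.

    Assumption: the most common value is the desired baseline. On a tie,
    the value encountered first is treated as the baseline.
    """
    groups = group_by_value(inventory)
    baseline = max(groups, key=lambda value: len(groups[value]))
    return sorted(
        host
        for value, hosts in groups.items()
        if value != baseline
        for host in hosts
    )


# Hypothetical inventory: host name -> BIOS revision (placeholder strings).
bios_revisions = {
    "esx01": "rev-B",
    "esx02": "rev-A",
    "esx03": "rev-B",
    "esx04": "rev-B",
}

print(outliers(bios_revisions))
```

The same two functions work for card slot positions or NIC firmware; only the inventory dictionary changes.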

VCAP-DCD

So a few weeks ago I got an invitation to the VCAP4-DCD beta exam. This is the first time I have been invited to a beta exam so I jumped at the chance.

Because of the NDA, I can’t really say too much about the content except what’s already known. The format is pretty much what was expected.

Exam length, structure and topics are outlined in the exam blueprint.

I really like the format of the in-exam design tool. You can see a demo here. I think it needs a bit more polishing as I got the occasional ghost image on the screen, but it seemed to work well. I guess that’s what the betas are for. I tried to comment where I could, but for the most part the exam seemed almost complete.

I really can’t say how I did; suffice to say it’s no walk in the park.

It’s been a good experience. Studying for the VCAP-DCA/DCD exams has really improved me as a virtualization engineer. I have read countless whitepapers and technical docs, and many of the new things I learnt I have applied directly to my job, making the environments I administer better.

A few of the resources which have been invaluable are:

So that’s it really. I am knackered, but happy.

HA scaremongering

Came back from lunch to find this charming error message: HA initiated a failover action in cluster prod in datacenter example.

This is not the kind of thing you want to see on a Friday afternoon.

A quick check revealed that there was in fact no HA incident happening or waiting to happen.

The quick fix was to remove HA from the cluster and then re-enable it. However, if this happens again I will definitely be raising it with VMware support.

VMworld and Exam registration

Registration opened today for the VCAP-DCA. I have been furiously trying to get as much study in as I can.
The blueprint guide is huge but I am ploughing through.

A few people have also been working through the blueprint and posting study guides online. Two in particular are Sean Crookston at vFAIL.net and Simon Long at http://simonlong.co.uk/blog/vcap4-dca-study-notes.

The amount of info is huge so I will be looking at getting a tablet sometime soon.

On a different note I have asked work if they will send me to VMworld in Copenhagen. If I get to go I will try to sit the exam at the conference.