Things got a bit weird a couple of weeks ago while trying to upgrade my home lab from 4.1 to 5.1.
What is the value of a certification?
- Got me a second interview with the company I work for.
- Improved my skills.
It’s the little things….
It’s the little things that make a successful deployment. Getting caught up in planning the perfect design without looking at the fine detail can discredit the time and effort you have put in.
To VMware’s credit, the default install generally runs just fine out of the box with little or no additional work required.
That said, we recently ran into an issue with one of our drivers that caused a huge headache.
We are running 10GbE on all our hosts. This works well for us and was actually a request from our network department. I did a large amount of testing and settled on an Intel dual-port card. It’s fast, efficient and tidy, and since we are only using two cards we have four fibre cables running out of the back of each server instead of many copper cables.
So on to the headache. I was on the way home when I received a call from a very distressed admin. Apparently one of our ESXi hosts had lost its storage, taking 56 VMs down. The network department reported that the two 10GbE ports were flapping and nobody really knew what was happening.
After plugging in a copper cable and disabling the fibre, we were back up and running. The host was put into maintenance mode, the log files were exported and a case was raised.
As it turns out the driver we were using was a couple of releases out of date. VMware suggested we update to the latest (obviously). I did query why Update Manager hadn’t presented me with the latest driver so hopefully I’ll get an answer for that soon too.
EDIT (02/07/2012): Just had a Twitter conversation with Andre Beukes (@eabeukes), who let me know that only the e1000 and bge drivers are in VUM at the moment. As far as he knows, the 10GbE drivers should be coming soon.
Full conversation:
- Andre Beukes @eabeukes I dont think the ixgbe driver is in the Update Manager repository – its a FCoE card right?
- Carel Maritz @carelmaritz The card doesn’t support FCoE but the driver is igbxe. I was surprised that it isn’t updated by VUM.
- Andre Beukes @eabeukes Only e1000 and bge are in VUM – theres no 10GB cards in there (yet) AFAIK as that will come with soft-FCoE when its released
- Carel Maritz @carelmaritz ah, thanks for the info. I’ll update my post. Do you mind if I quote this conversation?
- Andre Beukes @eabeukes Sure no worries
END EDIT
I have updated the driver and it all seems good. I do, however, like to understand exactly what happened. I went to our syslog server (which is something everybody who uses ESXi should have).
Logs as follows:
vmkernel: 345:02:11:53.704 cpu23:73434691)<3>ixgbe: vmnic15: ixgbe_check_tx_hang: Detected Tx Unit Hang
vmkernel: Tx Queue <6>
vmkernel: TDH, TDT <0>, <3>
vmkernel: next_to_use <3>
vmkernel: next_to_clean <0>
vmkernel: tx_buffer_info[next_to_clean]
vmkernel: time_stamp
vmkernel: 345:02:11:53.704 cpu15:64444910)<6>ixgbe: vmnic15: ixgbe_clean_tx_irq: tx hang 7 detected, resetting adapter
vmkernel: 345:02:11:57.823 cpu1:15580)<3>ixgbe: vmnic17: ixgbe_check_tx_hang: Detected Tx Unit Hang
vmkernel: Tx Queue <4>
vmkernel: TDH, TDT
vmkernel: next_to_use
vmkernel: next_to_clean
vmkernel: tx_buffer_info[next_to_clean]
vmkernel: time_stamp
vmkernel: 345:02:11:57.823 cpu1:15580)<6>ixgbe: vmnic17: ixgbe_clean_tx_irq: tx hang 1 detected, resetting adapter
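With logs like these on a syslog server, a short script can show at a glance which uplinks were hanging and how often the driver reset them. The following is only a rough sketch: the two patterns are based purely on the lines quoted above, the function name summarise_tx_hangs is made up for illustration, and other driver releases may well word these messages differently.

```python
import re
from collections import Counter

# Patterns derived only from the log lines quoted in this post (assumption:
# other ixgbe driver releases may format these messages differently).
HANG_RE = re.compile(
    r"ixgbe: (vmnic\d+): ixgbe_check_tx_hang: Detected Tx Unit Hang"
)
RESET_RE = re.compile(
    r"ixgbe: (vmnic\d+): ixgbe_clean_tx_irq: tx hang (\d+) detected, resetting adapter"
)


def summarise_tx_hangs(lines):
    """Count hang reports and adapter resets per vmnic."""
    hangs = Counter()
    resets = Counter()
    for line in lines:
        match = HANG_RE.search(line)
        if match:
            hangs[match.group(1)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
    return hangs, resets


# A few of the lines from the excerpt above, used as sample input.
SAMPLE = [
    "vmkernel: 345:02:11:53.704 cpu23:73434691)<3>ixgbe: vmnic15: ixgbe_check_tx_hang: Detected Tx Unit Hang",
    "vmkernel: Tx Queue <6>",
    "vmkernel: 345:02:11:53.704 cpu15:64444910)<6>ixgbe: vmnic15: ixgbe_clean_tx_irq: tx hang 7 detected, resetting adapter",
    "vmkernel: 345:02:11:57.823 cpu1:15580)<3>ixgbe: vmnic17: ixgbe_check_tx_hang: Detected Tx Unit Hang",
    "vmkernel: 345:02:11:57.823 cpu1:15580)<6>ixgbe: vmnic17: ixgbe_clean_tx_irq: tx hang 1 detected, resetting adapter",
]

if __name__ == "__main__":
    hangs, resets = summarise_tx_hangs(SAMPLE)
    for nic in sorted(hangs):
        print(f"{nic}: {hangs[nic]} hang(s), {resets[nic]} reset(s)")
```

In practice the sample list would be replaced by the lines of the exported log file, for example by passing an open file object to summarise_tx_hangs.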
So to cut a long story short, it pays to sweat the small stuff in any deployment.
Heading down the VCDX Highway
Hostd and ESXi
So I recently had an issue where we had to put a host into maintenance mode quickly to accommodate an emergency change for the network team.
VMTN
- vSphere, Update Manager, ESXi Enterprise Plus (2 or 4 sockets and current memory limits), vShield, Workstation/Fusion, vCloud Products
- vSphere, Update Manager, ESXi Enterprise Plus (2 or 4 sockets and current memory limits), Workstation/Fusion, View, ACE, vCloud Products
- Access to all the VMware products, maybe access to the beta products.
The Virtues of Standardisation.
So we are currently moving to 10Gb fibre in our production cluster and then the Nexus 1000V.
It’s been a bit of a trial and I thought I would share the experience to date. We purchased 12 QLogic fibre cards and 12 HP fibre cards. This would allow us to take advantage of the maximum number of 10Gb ports allowed (4 per server), and we could then get rid of 12 copper cables per server, a total of 144 cables. We would also be following VMware’s own best practice of using cards from two different vendors.
We ran into this issue, which was fixed with a firmware upgrade, and later, after upgrading to 4.1, ran into the issue discussed in VMware KB 1026431.
I decided to get rid of the HP cards, gave them to the DBAs, and replaced them with Intel cards. The HP cards run really well under Windows and have made the DBAs happy.
The whole ordeal was very frustrating but made me take a closer look at the servers we have. It looks like our operations department hadn’t been very thorough when installing my servers. Of the 12 servers, half had different BIOS revisions. This is a major issue, as the latest firmware can really make a difference where performance is concerned. It took a whole weekend to comb through the servers and make sure they were identical. This also included making sure the cards were in the same slots across all servers. Most of the time was spent waiting for hosts to enter maintenance mode.
Now I know if I look at a host it is going to be the same as every other host. I also spoke to the operations team and gave them an exact how-to for setting up a new host including which firmware the host needs and the exact hardware layout.
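To keep hosts from drifting apart again, the build standard can be written down as data and each host checked against it. What follows is a minimal sketch under stated assumptions: the host names, BIOS revision strings and slot labels are all invented for illustration, and the inventory would in reality have to be collected from the hosts by whatever tooling you already use.

```python
# Hypothetical build standard; the values are placeholders, not real firmware revisions.
STANDARD = {"bios": "A.02", "nic_slots": ("slot2", "slot4")}

# Hypothetical inventory, as it might look after collecting facts from each host.
INVENTORY = {
    "esx01": {"bios": "A.02", "nic_slots": ("slot2", "slot4")},
    "esx02": {"bios": "A.01", "nic_slots": ("slot2", "slot4")},
    "esx03": {"bios": "A.02", "nic_slots": ("slot1", "slot4")},
}


def audit(inventory, standard):
    """Return {host: [description of each deviation]} for non-compliant hosts."""
    report = {}
    for name, facts in inventory.items():
        problems = [
            f"{key}: expected {want!r}, found {facts.get(key)!r}"
            for key, want in standard.items()
            if facts.get(key) != want
        ]
        if problems:
            report[name] = problems
    return report


if __name__ == "__main__":
    for host, problems in sorted(audit(INVENTORY, STANDARD).items()):
        for problem in problems:
            print(f"{host}: {problem}")
```

Comparing against a written standard rather than against "whatever most hosts have" matters here, because when half the hosts are on a different BIOS revision the majority is not necessarily the one you want.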
VCAP-DCD
So a few weeks ago I got an invitation to the VCAP4-DCD beta exam. This is the first time I have been invited to a beta exam so I jumped at the chance.
Because of the NDA, I can’t really say too much about the content except what’s already known. The format is pretty much what was expected.
Exam length, structure and topics are outlined in the exam blueprint.
I really like the format of the in-exam design tool. You can see a demo here: I think it needs a bit more polishing, as I got the occasional ghost image on the screen, but it seemed to work well. I guess that’s what the betas are for. I tried to comment where I could, but for the most part the exam seemed almost complete.
I really can’t say how I did; suffice to say it’s no walk in the park.
It’s been a good experience. Studying for the VCAP-DCA/DCD exams has really improved me as a virtualization engineer. I have read countless whitepapers and technical docs, and many of the new things I learnt I have applied directly to my job, making the environments I administer better.
A few of the resources which have been invaluable are:
- The brown bags at Professional VMware
- Scott Lowe’s blog
- Duncan Epping’s blog, particularly the HA Deep Dive post
- The VMware Communities Podcast.
- The VMware communities pages
- Obviously the VMware website.
- And many more, which I will review at a later stage.
So that’s it really. I am knackered, but happy.
HA scaremongering
Came back from lunch to find this charming error message: HA initiated a failover action in cluster prod in datacenter example.
This is not the kind of thing you want to see on a Friday afternoon.
A quick check revealed that there was in fact no HA incident happening or waiting to happen.
The quick fix was to remove HA from the cluster and then re-enable it. However if this happens again I will definitely be raising it with VMware support.
VMworld and Exam registration
Registration opened today for the VCAP-DCA. I have been furiously trying to get as much study in as I can.
The blueprint guide is huge but I am ploughing through.
A few people have also been working through the blueprint and posting study guides online. Two in particular are Sean Crookston at vFAIL.net and Simon Long at http://simonlong.co.uk/blog/vcap4-dca-study-notes.
The amount of info is huge so I will be looking at getting a tablet sometime soon.
On a different note I have asked work if they will send me to VMworld in Copenhagen. If I get to go I will try to sit the exam at the conference.