I finally figured it out…
By john on Sep 9, 2008 in Apple, featured
A long while back I acquired an Apple Xserve for use in my home network. It was secondhand (sadly, not an Intel Xserve), but still quite the powerful (and loud) machine! My buddy urged me to set it up as my gateway, providing VPN termination, firewall, DNS, NAT, DHCP and Open Directory service. I agreed and we got Leopard Server installed and fully configured. It was cool, I will admit. Until it started crashing. Regularly.
First it was AFP: it kept locking up and requiring a restart of the process, which in turn caused all sorts of problems with my portable home directory on my MacBook Pro. Eventually syncing stopped working completely, and nothing I did could get it working again.
That’s about when I gave up trying.
Then I moved, and moved the Xserve with me, and decided to wipe the machine and give it another go - figuring the problem to be something corrupt in the installation. I did that, and at first things seemed to be running fine. The next morning, though, I realized I was wrong. The system had locked up hard and required a hard reset. Over the next few days I experienced random lockups, kernel panics, and services dying. Then the boot volume RAID array (RAID 1, software, and yes I am aware it’s generally not considered good form to use a RAID for your boot volume - I don’t care about good form) degraded. It was listing one of the drives as “FAILED”, but the drive hardware was reporting itself as fine. I opted to wipe the drive and try rebuilding the array. I did that, and after a few failures during the rebuild process it finally took. Then it degraded again a few hours later.
I looked through the logs to see if I could find a cause for the degraded array (and the failed rebuilds) and was somewhat surprised to find that the cause was an I/O error reading from the “good” non-failed drive. Interesting.
I replaced the “FAILED” drive with a known-good drive of the same capacity and tried another rebuild. Again it failed - same error message. Aha! I have the culprit!
It turns out that the problem was with the supposedly good drive in the array, and not the one showing up as failed (like you’d expect). I replaced that drive and did a full clean install of the OS (I hadn’t spent much time populating OD or anything since the system had been so unstable). The install took a LOT less time than it had on previous attempts, and the system itself is behaving a lot more stably. AFP has yet to crash, and syncing works fine for the most part (though I don’t recommend doing it over a 54 Mbps Wi-Fi connection).
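For anyone who wants to do a similar log check, a rough Python sketch might look like the following. It is only an illustration: the log line shape it looks for (something like "disk0s2: I/O error") and the function name count_io_errors are assumptions, not taken from the actual logs on this machine, so adjust the pattern to whatever your own logs contain.

```python
import re
from collections import Counter

# Assumed message shape, e.g. "kernel: disk0s2: I/O error."
# This pattern is a guess for illustration; tune it to your own logs.
IO_ERROR = re.compile(r"\b(disk\d+)(?:s\d+)?\b.*I/O error", re.IGNORECASE)


def count_io_errors(lines):
    """Tally I/O error lines per physical disk (disk0, disk1, ...)."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            # Group 1 is the whole-disk name, with any slice suffix dropped.
            counts[match.group(1).lower()] += 1
    return counts
```

Pass it any iterable of log lines (an open log file works) and compare the counts per disk. If the drive the array calls healthy is the one piling up read errors, that is the one to suspect.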
Also as a side note: Apple’s Xgrid Administration and High Performance Computing guide has great instructions on building a gateway device (look at the section for “Setting up Cluster Controller”), but there’s one thing you really should change from how they do it: use AFP instead of NFS for home directories. NFS caused all sorts of problems for me (logins would not work) while AFP worked perfectly (and solved the problems). Otherwise it’s a pretty good guide to follow, though I made some modifications to my approach, given that I’m on a dynamic IP address instead of a static one. If you want to know how I got Kerberos and OD to work with that, email me.