Tuesday, March 29, 2011

Ubuntu 10.04 on a SunFire X4500

A while back, I inherited command of a small cluster with a monster of a disk: the Sun Microsystems SunFire X4500 storage server. Two dual-core AMD Opterons, 16 GB of ECC memory, and -- count 'em -- 48 hard disks on six LSI SATA controllers. X4500s sold with drives up to 1 TB; mine has 500 GB drives for a total of 24 TB of raw storage, less the two drives reserved for the OS.

This seemed like a pretty rad place to host my users' home directories and data, but serving ZFS over NFS turned out to be unusably slow, apparently because the Solaris/ZFS/NFS combination is too picky about synchronous IO. Later SunFire servers were sold with a solid state drive as a separate intent log (slog, i.e. journal) device. Writes committed to the slog are good enough for ZFS, and the flushes are virtually instantaneous.

I added a new PCI-X LSI SATA controller (since the X4500 has zero spare SATA ports) and an Intel X25-E SSD as a slog, and saw about an order of magnitude improvement in write performance. This was still worse than any of my ext4 Linux NFS servers, but usable.

On account of this poor performance and the cloud of FUD surrounding Sun/Oracle these days, the time has come for the X4500 to run Linux. And not one of the distributions it shipped with -- that would be boring.

My other NFS servers run Ubuntu 10 Server, so I chose 10.04 LTS as the new OS. Confident that it was possible since a guy on the internet did this once, I set out to Ubuntify the beast.

The X4500 wasn't feeling my external USB CD drive, so I started ILOM remote console redirection from the service processor's web interface dealie. CD and CD image redirection actually worked, and the X4500 is configured by default to boot to the redirected virtual CD drive. Ubuntu installer achieved.

This proceeded exactly as it would on a real computer, even using a monitor and USB keyboard plugged right into the X4500. The partitioner took forever to scan, but found all 48 disks plus my SSD which it thought was a swap partition. Annoyingly, the OS sees the 48 internal disks as if each were on its own controller.

Not wanting to go scorched-earth just yet, I had hoped to leave the two UFS-formatted Solaris root disks and the ZFS slog SSD untouched. So, I popped two new drives for Ubuntu into my external enclosure next to the SSD.

Unfortunately, there is some hardcoded magic in the internal SATA controller that only spins up two devices for the BIOS to see. So you *have* to put the boot partition on one of these. More on that later. Not knowing this, I proceeded with the install.

I set up software RAID-1 (mirroring) in the installer, then started the install.

This was going swimmingly until it failed to install GRUB with a horrible red error screen. This is what forum guy (n6mod) had warned us about.

I finished the install and rebooted (leaving the CD attached).

When I got back to the installer, I chose rescue mode and followed the prompts, then selected the root device (/dev/md0) as the environment for the shell. I installed smartmontools and used smartctl --all /dev/device-name to get the serial numbers of the Solaris disks. They turned out to be in physical slots 0 and 1... go figure.

I shut down, popped those guys out and replaced them with my own two disks (right out of the external enclosure). I rebooted again and returned to the recovery console, where I installed GRUB2 with apt-get install grub2. After telling it to install the boot stuff on my boot partition (now /dev/sdz1, was /dev/sday1 when I partitioned it), I rebooted once again and it Just Worked. Bam.
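With this many disks, matching device names to serial numbers by hand gets tedious. Here is a minimal Python sketch of the serial-number lookup, assuming smartctl prints a "Serial Number:" line for each drive (it does for ATA disks). The function names and the sample text are made up for illustration; this is not what I actually ran in the rescue shell.

```python
import re
import subprocess

def parse_serial(smartctl_output):
    """Return the serial number found in smartctl text output, or None."""
    match = re.search(r"^Serial Number:\s*(\S+)", smartctl_output, re.MULTILINE)
    return match.group(1) if match else None

def serial_for(device):
    """Run smartctl on one device (needs root and smartmontools installed)."""
    result = subprocess.run(["smartctl", "--all", device],
                            capture_output=True, text=True)
    return parse_serial(result.stdout)

# Invented sample in the layout smartctl uses for ATA drives.
SAMPLE = """\
Device Model:     EXAMPLE-DISK-500
Serial Number:    ABC123XYZ
User Capacity:    500,107,862,016 bytes
"""

print(parse_serial(SAMPLE))
```

On the real box, calling serial_for() for each /dev/sd* device and comparing against the labels on the drive sleds would give the slot mapping.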

Now, I had to handle the disks -- how to RAID these things up!? I cobbled them into seven 6-disk RAID5 arrays striped together into a single RAID0 array (i.e. RAID5+0). With this architecture, recovering from a single disk failure only involves the 6 drives in the affected array, not all 48. I wrote a Python script to do the partitioning and md creation, as 48 is a big number. This hierarchy leaves a few disks unused, which I am ignoring for now but could swap in if a device fails (if only hot spares could be shared... I guess these are warm spares??).
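The original script is not reproduced here, but the md creation step can be sketched roughly as follows. Treat everything in it as an assumption: the device names, the partition suffix "1", the md numbers md10 to md16 for the RAID5 groups, and the count of 46 data disks (48 minus the two OS disks). Only /dev/md20 as the top-level stripe comes from my actual setup. It only builds the mdadm command lines; it does not run them.

```python
import string

def disk_names(count):
    """Yield Linux-style names /dev/sda../dev/sdz, /dev/sdaa.. for `count` disks."""
    letters = string.ascii_lowercase
    for i in range(count):
        if i < 26:
            suffix = letters[i]
        else:
            suffix = letters[i // 26 - 1] + letters[i % 26]
        yield f"/dev/sd{suffix}"

def plan_raid50(parts, group_size=6, groups=7, chunk_kib=64, first_md=10):
    """Return (mdadm command strings, leftover partitions) for a RAID5+0 layout."""
    needed = group_size * groups
    if len(parts) < needed:
        raise ValueError(f"need {needed} partitions, got {len(parts)}")
    commands, members = [], []
    for g in range(groups):
        md = f"/dev/md{first_md + g}"  # hypothetical numbering
        group = parts[g * group_size:(g + 1) * group_size]
        commands.append(
            f"mdadm --create {md} --level=5 --chunk={chunk_kib} "
            f"--raid-devices={group_size} " + " ".join(group))
        members.append(md)
    # Stripe the RAID5 arrays together into the one big volume.
    commands.append(
        f"mdadm --create /dev/md20 --level=0 --chunk={chunk_kib} "
        f"--raid-devices={groups} " + " ".join(members))
    return commands, parts[needed:]

# Hypothetical: 46 data disks, one partition each.
parts = [f"{name}1" for name in disk_names(46)]
commands, spares = plan_raid50(parts)
for line in commands:
    print(line)
print("unused:", " ".join(spares))
```

Under those assumptions, 7 groups of 6 use 42 disks and leave 4 over, and each group contributes 5 disks of capacity, so 7 × 5 × 500 GB = 17.5 TB usable.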

I had planned to format this with ext4 like my other servers, but it turns out that while ext4 volumes can be up to 1 EB in size, e2fsprogs (including mkfs.ext4) only supports up to 16 TB.

After perusing some flame wars on Phoronix and the Wikipedia comparison of file systems for a bit and learning that all filesystems suck and can't be shrunk, I settled on XFS.

XFS performance is best if blocks are aligned with RAID chunks, which mdadm makes 64k by default. So I formatted /dev/md20 (the giant stripey thing) with:

mkfs.xfs -f -d su=65536,sw=7 /dev/md20

where su is the RAID chunk size in bytes (64k) and sw is the number of members in the RAID-0 stripe (the seven RAID5 arrays). I then mounted the volume and fired up bonnie++ for some benchmarking goodness. The results were pretty amazing:
Version  1.96       -------Sequential Output-------- --Sequential Input--- --Random--
Concurrency   1     -Per Chr-- --Block--- -Rewrite-- -Per Chr-- --Block--- --Seeks---
Machine         Size  K/sec %CP  K/sec %CP  K/sec %CP  K/sec %CP  K/sec %CP   /sec %CP
vorton       31976M    847  94 387044  84 174342  69   1715  97 534121  99  578.6  63
Latency                24186us    32591us    62407us    10213us    50292us    48030us

Version  1.96       -------Sequential Create-------- ---------Random Create----------
vorton              -Create--- --Read---- -Delete--- -Create--- --Read---- -Delete---
              files   /sec %CP   /sec %CP   /sec %CP   /sec %CP   /sec %CP   /sec %CP
                 16   2098  23  +++++ +++   1741  19   2186  24  +++++ +++   1978  23
Latency                53673us     2197us    56708us    36312us       56us    46485us


Yup: sequential block writes at 387 MB/s and reads at 534 MB/s (bonnie++ calls writes "Sequential Output" and reads "Sequential Input"). Not too bad for a bunch of 7200 RPM drives.
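Since the bonnie++ columns are easy to mix up, here is a small Python sketch that labels the fields of the first results row. The field names are my own invention, and dividing K/sec by 1000 is just the same rounding used above; if bonnie++ means 1024-byte K, the figures shift by a couple of percent.

```python
# Column labels for the first bonnie++ results row (names are made up here).
FIELDS = [
    "machine", "size",
    "out_chr_kps", "out_chr_cpu",      # Sequential Output, per char (write)
    "out_block_kps", "out_block_cpu",  # Sequential Output, block (write)
    "rewrite_kps", "rewrite_cpu",      # Sequential Output, rewrite
    "in_chr_kps", "in_chr_cpu",        # Sequential Input, per char (read)
    "in_block_kps", "in_block_cpu",    # Sequential Input, block (read)
    "seeks_ps", "seeks_cpu",           # Random seeks
]

ROW = "vorton 31976M 847 94 387044 84 174342 69 1715 97 534121 99 578.6 63"

def parse_row(row):
    """Map a whitespace-separated bonnie++ results row onto FIELDS."""
    values = row.split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected number of columns")
    return dict(zip(FIELDS, values))

def mb_per_s(kps):
    """Convert a K/sec string to MB/s, treating K as 1000 bytes (an assumption)."""
    return int(kps) / 1000

result = parse_row(ROW)
print("block write MB/s:", mb_per_s(result["out_block_kps"]))
print("block read  MB/s:", mb_per_s(result["in_block_kps"]))
```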

To conclude: it works. Should you be so blessed/afflicted as to have a SunFire X4500 lying around, don't give up hope -- this is still a blazing fast disk server and can be migrated to Ubuntu Server in a matter of hours.

Basically, best. enclosure. ever. (Though this thing is cool too).

UPDATE: The X4500 has 4 Intel e1000 gigabit network ports. One I'm using for the interwebs, leaving the other three for the cluster. I used ifenslave to bond those three together (mode 6, adaptive load balancing), and used iperf to saturate the bond at 2.56 Gbit/s.

Pretty speedy, and you could probably get closer to the theoretical limit (3 Gbit/s) using bond mode 4 (802.3ad dynamic link aggregation).
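For the record, the arithmetic behind "closer to the theoretical limit" is simple enough to sketch in Python. The function name is made up, and the ceiling ignores Ethernet and TCP overhead, so the real-world limit sits a bit below it.

```python
def bond_efficiency(measured_gbps, links, link_gbps=1.0):
    """Return (theoretical ceiling in Gbit/s, measured fraction of that ceiling)."""
    ceiling = links * link_gbps
    return ceiling, measured_gbps / ceiling

# Three bonded gigabit ports, 2.56 Gbit/s measured with iperf.
ceiling, fraction = bond_efficiency(2.56, links=3)
print(f"{fraction:.0%} of {ceiling:g} Gbit/s")
```

That works out to roughly 85% of the raw line rate across the three links.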