Wednesday, November 9, 2011

Hitachi Deskstar + RocketRAID 232x

Spent the morning tracking down a stupid RAID problem, so hopefully this helps someone...

I use RocketRAID 2320 cards as HBAs in a few Ubuntu servers. I use software RAID (mdadm), so no need for the RAID features.

The trick to using the rr232x as an HBA is to configure each drive as a separate single-disk JBOD "array" in the RocketRAID BIOS. In the servers I've set up before, I've used eight Western Digital (Green) 1.5 TB and 2.0 TB drives, did the JBOD thing, and they appeared to the OS without issue.

My new server has 12 3.0 TB Hitachi Deskstars (H3IK3000), and when I installed the rr232x driver, I saw no drives. The RAID card could see the drives, and the OS could see the RAID card, but the OS couldn't see the drives.

When the drives that had been attached to the RR2320 were plugged directly into the motherboard, I noticed that they didn't spin up. The OS saw them and tried to talk, but timed out and gave up. Identical drives not exposed to the RR2320 worked fine.

There turned out to be two problems:

  1. The staggered spin-up feature of the RR2320. It's apparently not compatible with these Deskstars: when enabled, it flips a bit telling the drives not to spin up at all. The solution is to enable-then-disable the staggered spin-up setting (Settings menu in the BIOS utility).
  2. The RR2320 just doesn't seem to like H3IK3000s. Solution: run the drives in "Legacy" mode rather than as JBOD arrays, a tidbit I picked up from here.
To run in Legacy mode, put an empty partition on a drive (plug into mobo, "mklabel" and "mkpart" in parted). The RR2320 will ignore it and pass it through to the OS.

Recap:
  1. Enable-then-disable staggered spin-up
  2. Plug each drive into the motherboard and fire up parted (scripted sketch below)
    1. mklabel gpt
    2. mkpart (enter: null, xfs, 1, 3tb)
    3. set 1 raid on
  3. Plug drives back into the RR2320 -- should see them on boot
  4. mdadm --create /dev/md0 --level=6 --raid-devices=12 /dev/sdb1 /dev/sdc1 ... /dev/sdm1
  5. mkfs.xfs /dev/md0
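If you'd rather not babysit parted's prompts a dozen times, the parted steps in 2 can be scripted, something like this (the device name is a placeholder for whichever disk is currently plugged into the mobo):

# repeat for each drive
parted -s /dev/sdX mklabel gpt
parted -s /dev/sdX mkpart primary xfs 1MiB 100%
parted -s /dev/sdX set 1 raid on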

Thursday, October 27, 2011

This Is Not the DBMS You Are Looking For.

NoSQL databases such as CouchDB can be great tools for some jobs. Their concept of "denormalization" shouldn't be thought of as a rebellion against repressive traditional SQL data structures, but as a formalization of the usual workarounds in the special cases where SQL sucks. If you are trying to represent nested objects with possibly unknown fields, SQL is a nightmare. Before, you might have given up, cursing Boyce and Codd, and rolled your own object store (store blobs in fields, or [gasp] a text file). Now, CouchDB gives you a way to store your complicated data and query it efficiently.

That efficiency comes with a price, however. The problem, and the reason CouchDB won't ever be useful for anything but playing around and special-purpose internal storage for applications, is security, which basically can't be done. A user can see an entire database or none of it, a fact fundamental to the design of NoSQL document stores like CouchDB, and one many a n00b on Stack Overflow seems unwilling to accept.

The reason is simple. CouchDB views are efficient because the server precomputes the answer, sorted by the view keys. A request grabs a contiguous block of memory, and the server doesn't have to do anything. Incidentally, this is why you cannot filter by one thing and sort by another. The server is lazy (and hence fast) -- it gets to "relax" while putting the work of sorting, etc., on you. The problem is that the server can't do any reorganization of the rows it returns, including checking which rows a given user is actually allowed to see. That would be work, and would ruin the magical storage paradigm and the performance. Since you can't filter rows by user, any user who can read the database can read the entire thing -- they can use the /_all_docs view, they can scan document IDs, etc. You can stop users from writing by implementing document update validation, but everyone reads everything.
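To make that concrete (host, database, and credentials are placeholders): any user who can read the database at all can dump every document in one request, no view required:

curl -u alice:password "http://localhost:5984/somedb/_all_docs?include_docs=true"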

This means you cannot do per-document authentication with lists, shows, or views. You cannot rely on a "valid_user" key emitted in a view. Attempting to use URL rewriting to hide parts of the HTTP API is asking for trouble. You can only set per-database security, with so-called "readers" who may in general read and write everything but design documents, and "admins" who can read and write everything.

This is fine if you trust all your users and have nothing to hide, but CouchDB likes to posture as a real web framework for the real internet. It has a web server, lists and shows can effectively be used for templating, and it has module support, user authentication, support for running middleware daemons, and more. CouchApps -- web applications contained in a database's design documents -- are a thing.

These are only practical in totally controlled situations: namely, localhost. You can't really deploy them anywhere. Even the user authentication that is built into CouchDB is a red herring -- it is fundamentally flawed because it is implemented at the server level and not the database level. You can't write Twitter as a CouchApp. You can make a blog, but you can't make a Blogger. Nor a Facebook, nor a chat server, nor much of anything else.

Consider the chat server, since "Toast" is a famous example described at this page with an ugly URL: http://jchrisa.net/drl/_design/sofa/_list/post/post-page?startkey=%5B%22Simple-Wins%22%5D. If you really wanted to deploy Toast, you would quickly find yourself scratching your head, thinking that there surely must be a way to make two separate chatrooms, readable only by the participants, without making a separate database for each. There is not a way. The "official" solution is to create a separate database for each user and copy the chat content into each user's db with filtered replication. This doesn't exactly bespeak scalability. Despite Couchbase co-founder J. Chris Anderson's 1999-esque claims that "disk is free," (few bytes * n_tweets * n_users!) is not free if you're Twitter-scale.

With CouchDB, the only way to implement per-document security is in middleware -- you can add security objects to your documents (like those at the DB level), intercept requests to the CouchDB server, and filter CouchDB's responses there. There are a few problems with this approach:

  1. It sacrifices all the performance of CouchDB, since you have to iterate over all the rows.
  2. You can't do anything about list functions. Lists operate on view results, and it's all done internally -- you can't intercept the view rows before they get to the list.
  3. URL rewrites have to be handled upstream of the middleware, since the middleware needs to know which resource is actually being requested.

The right answer is for NoSQL document-based database systems to recognize real-world problems and evolve. The DBMS of the future should do the above-described middleware's function internally. It may not be as fast as CouchDB, but at least it would be usable.

Wednesday, September 21, 2011

Cross-cultural chocolate bar taste-off

Particle physics is a highly international field, which means you get to work with people from all over the world and hear about how their country's chocolate is superior to yours.
Today we put these claims to experimental test, pitting English candy bars against their American cousins. Our contestants were the English Mars vs US Milky Way in the nougat+caramel category, and English Milky Way vs. US Three Musketeers in nougat-only.

                   UK           USA
Caramel & Nougat   Mars         Milky Way
Just Nougat        Milky Way    Three Musketeers


UK Milky Way
  Team USA:     soft marshmallowey nougat, lacking in body, "like a chocolatey molten peep"
  Team England: not as good as they used to be, which was like the American 3M but "with decent chocolate"

US Three Musketeers
  Team USA:     much firmer than UK Milky Way, much more chocolatey
  Team England: <grimaces>

UK Mars
  Team USA:     creamy chocolate, well-balanced caramel and nougat
  Team England: best in show

US Milky Way
  Team USA:     firmer than UK Mars, more chocolatey, best in show
  Team England: at first, indistinguishable from UK Mars; after some thought, decided the Mars was better.

summary:

English people like English candy bars, while Americans like American ones.

"the england one is weak and insubstantial. the american one has some hair on its chest. chocolate hair. egh..."

Friday, September 16, 2011

Setting up the environment in a non-interactive SSH session (Ubuntu)

Problem:

You are using SSH to run commands on a remote Ubuntu host, like
$ ssh remotehost command args

and require that the remote host set up some environment (set PATH, source a script, etc.)

The remote host ignores .bashrc, .bash_profile, .ssh/environment, /etc/profile, etc.

Solution:

At the top of the default .bashrc in Ubuntu, you will find the lines
# If not running interactively, don't do anything
[ -z "$PS1" ] && return

SSH does not execute an interactive shell when called with a command argument, so .bashrc bails immediately. Comment out the second line:
# If not running interactively, don't do anything
# [ -z "$PS1" ] && return

and set up your environment in .bashrc as usual.
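For example (the paths and the sourced script here are placeholders for whatever your remote jobs actually need):

# near the top of ~/.bashrc, after the commented-out return
export PATH="$HOME/local/bin:$PATH"
export LD_LIBRARY_PATH="$HOME/local/lib:$LD_LIBRARY_PATH"
. /opt/somepackage/setup.sh   # some script that sets up more environment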

Caveat:

Your local shell expands $PATH inside the double quotes before ssh ever sends the command. So running
ssh remotehost "echo $PATH"

will fill in the path on the current machine and tell the remote machine to echo that, so it appears that your current PATH has been forwarded to the remote host. Use
ssh remotehost echo \$PATH

instead.
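Single quotes also do the trick, since the local shell leaves $PATH alone inside them:

ssh remotehost 'echo $PATH'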

Friday, August 26, 2011

Really Super Quick Start Guide to Setting Up SLURM

SLURM is the awesomely-named Simple Linux Utility for Resource Management written by the good people at LLNL. It's basically a smart task queuing system for clusters. My cluster has always run Sun Grid Engine, but it looks like SGE is more or less dead in the post-Oracle Sun software apocalypse. In light of this and since SGE recently looked at me the wrong way, I'm hoping to ditch it for SLURM. I like pop culture references and software that works.

The "Super Quick Start Guide" for LLNL SLURM has a lot of words, at least one of which is "make." If you're lazy like me, just do this:

0. Be using Ubuntu
1. Install: # apt-get install slurm-llnl
2. Create key for MUNGE authentication: /usr/sbin/create-munge-key
3a. Make config file: https://computing.llnl.gov/linux/slurm/configurator.html
3b. Put config file in: /etc/slurm-llnl/slurm.conf
4. Start master: # slurmctld
5. Start node: # slurmd
6. Test that fool: $ srun -N1 /bin/hostname

Bam.

(In my config file, I specified "localhost" as the master and the node. Probably a good place to start.)
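For reference, the interesting lines in that localhost-only slurm.conf come out roughly like this from the configurator (a sketch, not a complete config):

ControlMachine=localhost
AuthType=auth/munge
NodeName=localhost State=UNKNOWN
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP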

Thursday, August 18, 2011

CouchDB replicate with authorization

(Parts of this are documented elsewhere, but the more on the googs the better...)

The CouchDB Replicator in Futon is a handy tool, but doesn't seem to work when the target is anything but an Admin Party (everyone is admin). This is a bummer, since presumably no production systems are such a party.

No matter: you can do the same thing with a new document in the _replicator database. Just make a new doc (POST or in Futon):
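Something along these lines, with the credentials for the secured target embedded right in the URL (the hosts, database names, and password here are placeholders):

{
  "source": "http://example.com:5984/source_db",
  "target": "http://admin:sekrit@localhost:5984/target_db_name"
}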



There are some options:

  "continuous": true       // keep replicating forever as new documents arrive
  "create_target": true    // allow creation of target_db_name if necessary

When you (re)load this document, you'll see that CouchDB has added some new fields, _replication_id, _replication_state, _replication_state_time. If it worked, the _replication_state should be "triggered".

If the _replication_state is "error", you'll have to go look in the CouchDB server logs for the reason.
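If you'd rather skip Futon, the same thing from the command line is roughly (again, URLs and credentials are placeholders):

curl -X POST http://admin:sekrit@localhost:5984/_replicator \
     -H "Content-Type: application/json" \
     -d '{"source": "http://example.com:5984/source_db", "target": "target_db_name", "create_target": true}'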

Saturday, June 4, 2011

Evil C++ #2: Using GCC's -ftrapv flag to debug integer overflows

In C++, overflowing a signed integer type won't cause an exception and can result in weird numbers propagating through your program. GCC's -ftrapv flag has your back.
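A minimal way to see it in action (this little program is just an illustration, not the original snippet):

cat > trapv_demo.cpp <<'EOF'
#include <climits>
#include <iostream>

int main() {
  int x = INT_MAX;
  x = x + 1;  // signed overflow: undefined behavior
  std::cout << x << std::endl;
  return 0;
}
EOF

g++ -O0 trapv_demo.cpp -o trapv_demo && ./trapv_demo
# typically prints a wrapped-around number and exits happily

g++ -O0 -ftrapv trapv_demo.cpp -o trapv_demo && ./trapv_demo
# should abort at the overflow instead of carrying on with a bogus value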

Thursday, June 2, 2011

Evil C++ #1: Brackets and "at" for accessing STL vector elements

This is the first in a series of code snippets that demonstrate C/C++ pitfalls.

(For a thorough explanation of the many ways C++ is out to get you, see Yossi Kreinin's excellent C++ FQA).
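The pitfall in a nutshell (a sketch, not the original snippet): operator[] does no bounds checking, while at() throws std::out_of_range on a bad index.

cat > at_demo.cpp <<'EOF'
#include <iostream>
#include <vector>

int main() {
  std::vector<int> v(3, 42);
  std::cout << v[10] << std::endl;     // no bounds check: quietly reads garbage (undefined behavior)
  std::cout << v.at(10) << std::endl;  // bounds checked: throws std::out_of_range
  return 0;
}
EOF
g++ -W -Wall at_demo.cpp -o at_demo && ./at_demo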

Ignoring GCC warnings on a per-file basis

In most cases, ignoring GCC warnings is a Bad Idea. Treating warnings as errors results in better code.

However, sometimes we are forced to deal with other people's code. For instance, a project I work on relies on JsonCpp. We include this in our source tree so that users don't have to go get the JsonCpp source code themselves in order to compile this thing.

Such dependencies can be a problem if you want really strict compiler options, since libraries will often be slightly incompatible with your particular standard (ANSI, C++0x, ...) or not be up to your lofty expectations. In my case, JsonCpp gives me a couple of warnings with GCC options -W, -Wall, -ansi, -pedantic. This means I can't compile my code with -Werror, which makes me sad. I certainly don't want to modify these external libraries.

Fortunately, recent GCC versions have added ways to selectively disable warnings. If your problems are confined to headers, you can replace -I/path/to/headers with -isystem/path/to/headers and GCC will treat them as system headers, ignoring warnings.
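For example, if the JsonCpp headers live under external/jsoncpp/include (a made-up path), the change is just:

# before: warnings from the JsonCpp headers, which -Werror turns into errors
g++ -W -Wall -ansi -pedantic -Werror -Iexternal/jsoncpp/include -c myfile.cpp

# after: same headers treated as system headers, so their warnings are ignored
g++ -W -Wall -ansi -pedantic -Werror -isystem external/jsoncpp/include -c myfile.cpp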

Another less-desirable solution is to use pragmas. Headers can be marked as system headers by putting at the top:

#pragma GCC system_header


If the problems lie in the source files themselves, neither of these tricks works. We can, however, add something like this to the top of the files causing the warnings:

#pragma GCC diagnostic ignored "-Wunused-parameter"
#pragma GCC diagnostic ignored "-Woverflow"


to disable specific warnings generated by that file.

To figure out the names of the warnings causing the problems, recompile with the -fdiagnostics-show-option option on the g++ line. This is especially useful in the case of default warnings (i.e. those which aren't optional) like -Woverflow since they are harder to find in the documentation.

This isn't a great solution, since it does require some modification of the libraries. However, you can easily generate a patch from your changes and apply it to any new library versions should you decide later to upgrade them. Hopefully someday GCC will include an "ignore warnings from this file or subdirectory" option, but until then... it works.

Saturday, April 30, 2011

SNO+ Explained

In the spirit of the comic book and/or children's story I wrote to explain the miniCLEAN dark matter experiment (http://deapclean.org/about/), I have attempted to summarize the SNO+ experiment on a single awesome page.

SNO+ is a multi-purpose particle physics experiment studying all things neutrino. Neutrinos are very light elementary particles. They come from the Sun, their antiparticles (antineutrinos) come from nuclear reactors, and these two things (neutrinos and antineutrinos) might in fact be the same thing.

It sounds like we know shamefully little about neutrinos, which is more or less true. Hence SNO+, which is studying all of the above to figure this stuff out.

SNO+ Explained




Tuesday, March 29, 2011

Ubuntu 10.04 on a SunFire X4500

A while back, I inherited command of a small cluster with a monster of a disk: the Sun Microsystems SunFire X4500 storage server. Two dual-core AMD Opterons, 16GB of ECC memory, and -- count 'em -- 48 hard disks on six LSI SATA controllers. X4500s sold with drives up to 1 TB; mine has 500 GB drives for a total of 24 TB of raw storage, less the two disks reserved for the OS.
This seemed like a pretty rad place to host my users' home directories and data, but serving ZFS over NFS turned out to be unusably slow due to some issue with Solaris/ZFS/NFS being too picky about synchronous IO. Later SunFire servers were sold with a solid state drive as a separate intent log (slog) (i.e. journal) device. Writes committed to this were good enough for ZFS, and the flushes are virtually instantaneous.

I added a new PCI-X LSI SATA controller (since the X4500 has zero spare SATA ports) and an Intel X-25E SSD as a slog, and saw about an order of magnitude improvement in write performance. This was still worse than any of my ext4 Linux NFS servers, but usable.

On account of this poor performance and the cloud of FUD surrounding Sun/Oracle these days, the time has come for the X4500 to run Linux. And not one of the distributions it shipped with -- that would be boring.

My other NFS servers run Ubuntu 10 Server, so I chose 10.04 LTS as the new OS. Confident that it was possible since a guy on the internet did this once, I set out to Ubuntify the beast.

The X4500 wasn't feeling my external USB CD drive, so I started ILOM remote console redirection from the service processor's web interface dealie. CD and CD image redirection actually worked, and the X4500 is configured by default to boot to the redirected virtual CD drive. Ubuntu installer achieved.



This proceeded exactly as it would on a real computer, even using a monitor and USB keyboard plugged right into the X4500. The partitioner took forever to scan, but found all 48 disks plus my SSD which it thought was a swap partition. Annoyingly, the OS sees the 48 internal disks as if each were on its own controller.

Not wanting to go scorched-earth just yet, I had hoped to leave the two UFS-formatted Solaris root disks and the ZFS slog SSD untouched. So, I popped two new drives for Ubuntu into my external enclosure next to the SSD.

Unfortunately, there is some hardcoded magic in the internal SATA controller that only spins up two devices for the BIOS to see. So you *have* to put the boot partition on one of these. More on that later. Not knowing this, I proceeded with the install.


I set up software RAID-1 (mirroring) in the installer, like so:


Then, started the install:


This was going swimmingly until it failed to install GRUB with a horrible red error screen. This is what forum guy (n6mod) had warned us about.


I finished the install and rebooted (leaving the CD attached).

When I got back to the installer, I chose rescue mode and followed the prompts, then selected the root device (/dev/md0) as the environment for the shell. I installed smartmontools and used smartctl --all /dev/device-name to get the serial numbers of the Solaris disks. They turned out to be in physical slots 0 and 1... go figure.
I shut down, popped those guys out and replaced them with my own two disks (right out of the external enclosure). I rebooted again and returned to the recovery console, where I installed GRUB2 with apt-get install grub2. After telling it to install the boot stuff on my boot partition (now /dev/sdz1, was /dev/sday1 when I partitioned it), I rebooted once again and it Just Worked. Bam.

Now, I had to handle the disks -- how to RAID these things up!? I cobbled them into seven 6-disk RAID5 arrays chained up into a single RAID0 array (i.e. RAID5+0). With this architecture, recovering from a single disk failure means reading the other five drives in its RAID5 set, not all 48. I wrote a python script to do the partitioning and md creation as 48 is a big number. This hierarchy leaves a few disks over, which I am ignoring for now but could press into service as spares if a device fails (if only hot spares could be shared... I guess these are warm spares??).
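The layout boils down to something like this; the partition names and md numbers below (other than /dev/md20, which shows up again later) are illustrative:

# seven 6-disk RAID5 arrays...
mdadm --create /dev/md10 --level=5 --raid-devices=6 /dev/sd[b-g]1
mdadm --create /dev/md11 --level=5 --raid-devices=6 /dev/sd[h-m]1
# ...and so on through /dev/md16, then stripe them together as RAID0:
mdadm --create /dev/md20 --level=0 --raid-devices=7 /dev/md1[0-6]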


I had planned to format this with ext4 like my other servers, but it turns out that while ext4 volumes can be up to 1 EB in size, e2fsprogs (including mkfs.ext4) only supports up to 16 TB.

After perusing some flame wars on Phoronix and the Wikipedia comparison of file systems for a bit and learning that all filesystems suck and can't be shrunk, I settled on XFS.

XFS performance is best if blocks are aligned with RAID chunks, which mdadm makes 64k by default. So I formatted /dev/md20 (the giant stripey thing) with:

mkfs.xfs -f -d su=65536,sw=7 /dev/md20

where su = RAID chunk size, sw = number of disks in RAID-0. I then mounted the volume and fired up bonnie++ for some benchmarking goodness. The results were pretty amazing:
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
vorton       31976M   847  94 387044  84 174342  69  1715  97 534121  99 578.6  63
Latency             24186us   32591us   62407us   10213us   50292us   48030us

Version  1.96       ------Sequential Create------ --------Random Create--------
vorton              -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  2098  23 +++++ +++  1741  19  2186  24 +++++ +++  1978  23
Latency             53673us    2197us   56708us   36312us      56us   46485us

1.96,1.96,vorton,1,1301449025,31976M,,847,94,387044,84,174342,69,1715,97,534121,
99,578.6,63,16,,,,,2098,23,+++++,+++,1741,19,2186,24,+++++,+++,1978,23,24186us,
32591us,62407us,10213us,50292us,48030us,53673us,2197us,56708us,36312us,56us,46485us

Yup: sequential block writes at 387 MB/s and reads at 534 MB/s. Not too bad for a bunch of 7200 RPM drives.

To conclude: it works. Should you be so blessed/afflicted as to have a SunFire X4500 laying around, don't give up hope -- this is still a blazing fast disk server and can be migrated to Ubuntu Server in a matter of hours.

Basically, best. enclosure. ever. (Though this thing is cool too).

UPDATE: The X4500 has 4 Intel e1000 gigabit network ports. One I'm using for the interwebs, leaving the other three for the cluster. I used ifenslave to bond those three together (mode 6, adaptive load balancing), and used iperf to saturate the bond at 2.56 Gbit/s:

Pretty speedy, and you could probably get closer to the theoretical limit (3 Gbit/s) using bond mode 4 (802.3ad dynamic link aggregation).
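For reference, the bond amounts to roughly this in /etc/network/interfaces with the ifenslave package (interface names and addresses are placeholders, and your mileage may vary):

auto bond0
iface bond0 inet static
    address 192.168.0.10
    netmask 255.255.255.0
    bond-slaves eth1 eth2 eth3
    bond-mode balance-alb
    bond-miimon 100

iperf with a few parallel client streams (-P) from another cluster node is an easy way to check the aggregate throughput.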