Friday, August 26, 2011

Really Super Quick Start Guide to Setting Up SLURM

SLURM is the awesomely named Simple Linux Utility for Resource Management written by the good people at LLNL. It's basically a smart task queuing system for clusters. My cluster has always run Sun Grid Engine, but it looks like SGE is more or less dead in the post-Oracle Sun software apocalypse. In light of this and since SGE recently looked at me the wrong way, I'm hoping to ditch it for SLURM. I like pop culture references and software that works.

The "Super Quick Start Guide" for LLNL SLURM has a lot of words, at least one of which is "make." If you're lazy like me, just do this:

0. Be using Ubuntu
1. Install: # apt-get install slurm-llnl
2. Create key for MUNGE authentication: # /usr/sbin/create-munge-key
3a. Make config file: https://computing.llnl.gov/linux/slurm/configurator.html
3b. Put config file in: /etc/slurm-llnl/slurm.conf
4. Start master: # slurmctld
5. Start node: # slurmd
6. Test that fool: $ srun -N1 /bin/hostname

Bam.

(In my config file, I specified "localhost" as the master and the node. Probably a good place to start.)
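For reference, here's a minimal sketch of what a single-machine slurm.conf along those lines might look like. It's a hand-written approximation, not output from the configurator: ControlMachine, AuthType, NodeName and PartitionName are standard slurm.conf keys, but the partition name "debug" and CPUs=1 are placeholder assumptions, and newer SLURM releases replace ControlMachine with SlurmctldHost. The script writes to ./slurm.conf so you can look it over before copying it to /etc/slurm-llnl/slurm.conf.

```shell
#!/bin/sh
# Minimal single-machine slurm.conf sketch: "localhost" is both the
# controller (master) and the only compute node.
# Written to ./slurm.conf; copy it to /etc/slurm-llnl/slurm.conf afterwards.
cat > slurm.conf <<'EOF'
# Controller host (newer SLURM releases call this key SlurmctldHost).
ControlMachine=localhost
# Authenticate with MUNGE, using the key created by create-munge-key.
AuthType=auth/munge
# One node with one CPU (placeholder); raise CPUs to match your machine.
NodeName=localhost CPUs=1 State=UNKNOWN
# A default partition containing that node ("debug" is a placeholder name).
PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP
EOF
```

Everything else is left at SLURM's defaults here, which keeps the file short; the configurator will fill in many more settings explicitly.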

9 comments:

  1. Is it possible to install it on a 32-core workstation? If yes, how different will the steps be?

  2. Hi, Sreejith, and thanks for the comment.

    That is totally possible, and it's a great use of a cluster system like SLURM to get the most out of a multi-core system. Just be sure that other resources (memory, memory bandwidth, network and disk I/O, etc.) are scaled to match your usage patterns, so you don't end up with a bottleneck somewhere else.

    The steps don't change at all in your scenario. It's all in the configuration file created in step 3a: you should just set "localhost" as the master and set up one node called "localhost" as well.
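    On a multi-core box, the config detail that matters most is the node's CPU count: if the node is declared with CPUs=1, SLURM treats it as a single-CPU machine. Here's a rough sketch, assuming GNU coreutils' nproc is available, that prints node and partition lines sized to the machine; the partition name "debug" is a placeholder. Note that nproc --all counts logical CPUs, so a hyperthreaded 32-core machine may report 64.

```shell
#!/bin/sh
# Sketch: size the single "localhost" node to this machine's core count.
# nproc --all (GNU coreutils) reports installed logical CPUs, so hyperthreads count too.
cores=$(nproc --all)

node_line="NodeName=localhost CPUs=${cores} State=UNKNOWN"
# Placeholder partition name; any name works as long as Nodes= matches the node.
partition_line="PartitionName=debug Nodes=localhost Default=YES MaxTime=INFINITE State=UP"

# Print the lines to paste into /etc/slurm-llnl/slurm.conf
# in place of any existing NodeName/PartitionName entries.
printf '%s\n%s\n' "$node_line" "$partition_line"
```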

  3. This was a big help. I just wanted to be able to test some scripts I was writing on my home computer. I did have a small problem with MUNGE: I needed to manually create /var/run/munge and change the ownership of /var/log/munge to root.

  4. Thanks for a nice super-quick guide. Would it be possible for you to share the slurm.conf you generated? Cheers!

  5. Interesting. It seems that this is broken on Ubuntu 16.04.

    1. That is certainly my experience: https://github.com/superphy/semantic/issues/93. Any ideas?

  6. I followed the exact same steps, but when I run the "srun -N1 /bin/hostname" command I get "srun: error: Unable to allocate resources: Unable to contact slurm controller (connect failure)". Any idea why I'm getting this error?
    Thanks a lot.

    1. https://github.com/Azure/azure-quickstart-templates/issues/1796

      Following the instructions at the URL above helped me solve the error you mentioned.

  7. Sadly, with a non-homogeneous set of nodes, this won't quite do it for setup. Each OS had a different version, and both were out of date because of "security vulnerabilities". Awesome guide for setting up my VM test cluster, though.
