[Update: A related post has been added, where I explain how to set up additional boxes as submission hosts.]
[Update Nov 2016: I have since confirmed that this method works without change on 16.04 LTS as well, and have updated the post to note this. I have also added a few instructions for installation on a machine which has no official FQDN, since this was a common stumbling block for a lot of people.]
Last week I found myself needing to do something I really hate: solving a problem I know I solved before. What made this not a fun experience was that I couldn’t find any notes from what I did the last time, which is rather unusual. You see, every time I encounter a problem with a non-obvious solution I like to write a blog post about it. That way, not only do I have a single place to look for notes, but it may also prove useful to others. In fact, the most-visited posts on this blog tend to be just those.
Anyway, the problem at hand was the installation of the Torque/PBS job scheduler on a Ubuntu 14.04 LTS box [Note: this has been confirmed to work without change on 16.04 LTS]. The plan was to initially install the scheduler on a single box, acting as server, scheduler, compute node, and submission host. Eventually, job submission would be extended to other machines, adding them also as compute nodes on additional queues.
To help myself if I ever need to do this again, and to help anyone else in the same situation, I’ll detail below what I did. Most importantly, I will highlight where I ran into problems, what the problems were, and how I resolved them. I hope this will be useful to others.
All the following needs to be done as root on the box that will act as the single-node cluster. First, of course, one needs to install the necessary packages. This can be done easily, with the caveat that you get Torque v2.4.16, which at this point is at its end of life. I do not want to bother with non-packaged installs, as that would make my life harder later, so here goes.
apt-get install torque-server torque-client torque-mom torque-pam
Installing the packages also sets up Torque with a default configuration that is in no way helpful. So next you'll need to stop all Torque services and recreate a clean setup.
/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
pbs_server -t create
You’ll need to answer ‘yes’ here to overwrite the existing database. Next, kill the just-started server instance so we can set a few things manually.
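Since the create-mode server was started by hand rather than through init, there is no service command to stop it; assuming no other pbs_server instance you care about is running, the bluntest approach works:

```shell
# pbs_server -t create leaves a server running in the background;
# kill it before editing the configuration files by hand
killall pbs_server
```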
If you don’t kill the server, many things you do below will be overwritten the next time the server stops. Next, let’s set up the server process; in the following, replace ‘SERVER.DOMAIN’ with your box’s fully-qualified domain name [Note: see just below if your machine doesn’t have an official FQDN]. I prefer to use FQDN’s so that it’s easier later to add other compute nodes, job submission nodes, etc. The following also sets up the server process to allow user ‘root’ to change configurations in the database. This bit seemed missing from the default install, and it took me a while to figure it out (again).
echo SERVER.DOMAIN > /etc/torque/server_name
echo SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo root@SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/operators
echo root@SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/managers
If the machine you’re installing Torque on doesn’t have an official FQDN, a simple work-around is to invent one and assign it to the machine’s network IP. For example, if eth0 is assigned to 192.168.1.1, we can add the following line to /etc/hosts.
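For instance, assuming the invented name torque.fakedomain.nil (any made-up FQDN will do), the /etc/hosts entry would look like this:

```
192.168.1.1    torque.fakedomain.nil    torque
```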
The FQDN itself can be anything you want, but ideally choose something that cannot exist in reality, so something with a non-existent top-level domain.
A cluster is nothing without some compute nodes, so next we tell the server process that the box itself is a compute node (with 4 cores, below – change this to suit your requirements).
echo "SERVER.DOMAIN np=4" > /var/spool/torque/server_priv/nodes
We also need to tell the MOM process (i.e. the compute node handler) which server to contact for work.
echo SERVER.DOMAIN > /var/spool/torque/mom_priv/config
Once this is done, we can restart all processes again.
/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start
If you get any errors at this point, I'd suggest stopping any running processes and restarting them one by one in this order. Check the logs (under /var/spool/torque) for whatever is failing. Otherwise, the next step is to enable scheduling on the server.
# set scheduling properties
qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 300'
qmgr -c 'set server mom_job_sync = true'
At this point, if you get the dreaded Unauthorized Request error, it’s critical to figure out why this is happening. Usually it is because the commands look like they’re coming from an unauthorized user/machine, that is anything different from the string ‘root@SERVER.DOMAIN’. You can check this with the following command.
grep Unauthorized /var/spool/torque/server_logs/*
We also need to create a default queue (here called ‘batch’ – you can change this to whatever you want). We’ll set it up with a default one-hour walltime limit and a single-node requirement, but you don’t have to.
# create default queue
qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set server default_queue = batch'
Finally, we need to configure the server to allow submissions from itself. This one stumped me for a while. Note that the submit_hosts list cannot be made up of FQDNs! Submission will not work if you do that, as the comparison is done after truncating the name of the submitting host!
# configure submission pool
qmgr -c 'set server submit_hosts = SERVER'
qmgr -c 'set server allow_node_submit = true'
For example, if above you use SERVER.DOMAIN instead of just SERVER, you’ll get an error like the following the next time you try to submit:
qsub: Bad UID for job execution MSG=ruserok failed validating USER/USER from SERVER.DOMAIN
where ‘USER’ will be the username of whichever (non-root) user you try to submit the job as. The solution is simply to list the submission hosts as unqualified names. To test the system, try to submit a job (here an interactive one) as a non-root user from the same box.
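An interactive submission is the quickest check; run it as a regular user, not root:

```shell
# request an interactive job on the default queue;
# the scheduler should drop you into a shell on the allocated node
qsub -I
```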
If this works, you’ll get into a shell on the same box (as if you ssh’ed into itself). Note that you’ll need authorised SSH keys set up for this user to allow password-less ssh.
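Setting those keys up, if you haven't already, is the usual key-pair dance (this sketch assumes an RSA key and that the user's home directory is on the same box):

```shell
# create the user's .ssh directory if it doesn't exist yet
mkdir -p ~/.ssh && chmod 700 ~/.ssh
# generate a passphrase-less key pair for the submitting user
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# authorise the new public key for logins back into this same box
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```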
Eventually I’ll need to set up additional boxes as submission hosts. I’ll write about that process once it’s done. [Note: you can now find this here.] Meanwhile, if you have any questions about the above, or if there’s anything that could do with some clarification, feel free to let me know in the comments.