Installing Torque/PBS job scheduler on Ubuntu 14.04 LTS / 16.04 LTS

[Update: A related post has been added, where I explain how to set up additional boxes as submission hosts.]

[Update Nov 2016: I have since confirmed that this method works without change on 16.04 LTS as well, and have updated the post to note this. I have also added a few instructions for installation on a machine which has no official FQDN, since this was a common stumbling block for a lot of people.]

Last week I found myself needing to do something I really hate: solving a problem I know I solved before. What made this not a fun experience was that I couldn’t find any notes from what I did the last time, which is rather unusual. You see, every time I encounter a problem with a non-obvious solution I like to write a blog post about it. That way, not only do I have a single place to look for notes, but it may also prove useful to others. In fact, the most-visited posts on this blog tend to be just those.

Anyway, the problem at hand was the installation of the Torque/PBS job scheduler on an Ubuntu 14.04 LTS box [Note: this has been confirmed to work without change on 16.04 LTS]. The plan was to initially install the scheduler on a single box, acting as server, scheduler, compute node, and submission host. Eventually, job submission would be extended to other machines, adding them also as compute nodes on additional queues.

To help myself if I ever need to do this again, and to help anyone else in the same situation, I’ll detail below what I did. Most importantly, I will highlight where I ran into problems, what the problems were, and how I resolved them. I hope this will be useful to others.

All the following needs to be done as root on the box that will act as a single-node cluster. First, of course, one needs to install the necessary packages. This can be done easily, with the caveat that you get Torque v2.4.16, which at this point is at end of life. I do not want to bother with non-packaged installs, as that would make my life harder later, so here goes.

apt-get install torque-server torque-scheduler torque-client torque-mom torque-pam

Installing the packages also sets up torque with a default setup that is in no way helpful. So next you’ll need to stop all torque services and recreate a clean setup.

/etc/init.d/torque-mom stop
/etc/init.d/torque-scheduler stop
/etc/init.d/torque-server stop
pbs_server -t create

You’ll need to answer ‘yes’ here to overwrite the existing database. Next, kill the just-started server instance so we can set a few things manually.

killall pbs_server

If you don’t kill the server, many things you do below will be overwritten the next time the server stops. Next, let’s set up the server process; in the following, replace ‘SERVER.DOMAIN’ with your box’s fully-qualified domain name [Note: see just below if your machine doesn’t have an official FQDN]. I prefer to use FQDNs so that it’s easier later to add other compute nodes, job submission nodes, etc. The following also sets up the server process to allow user ‘root’ to change configurations in the database. This bit seemed to be missing from the default install, and it took me a while to figure it out (again).

echo SERVER.DOMAIN > /etc/torque/server_name
echo SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/acl_hosts
echo root@SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/operators
echo root@SERVER.DOMAIN > /var/spool/torque/server_priv/acl_svr/managers
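
Since all four files are derived from the same name, they can also be written by one small helper. What follows is only a sketch, assuming the stock Ubuntu package paths used above; write_server_identity and its optional second argument (a prefix directory, useful for a dry run in a scratch tree) are names made up for illustration and are not part of Torque.

```shell
# Hypothetical helper (not part of Torque): write the server name and the
# ACL files for a given FQDN. The optional second argument is a prefix
# directory, so the result can be inspected in a scratch tree first.
write_server_identity() (
    fqdn="$1"
    root="${2:-}"
    acl="$root/var/spool/torque/server_priv/acl_svr"
    mkdir -p "$root/etc/torque" "$acl" || exit 1
    echo "$fqdn"      > "$root/etc/torque/server_name"
    echo "$fqdn"      > "$acl/acl_hosts"
    echo "root@$fqdn" > "$acl/operators"
    echo "root@$fqdn" > "$acl/managers"
)
```

For a dry run, write_server_identity SERVER.DOMAIN /tmp/torque-test writes the same tree under /tmp/torque-test; as root, leaving out the second argument writes the real files.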

If the machine you’re installing Torque on doesn’t have an official FQDN, a simple work-around is to invent one and assign it to the machine’s network IP. For example, if eth0 is assigned to 192.168.1.1, we can add the following line to /etc/hosts.

192.168.1.1 SERVER.DOMAIN

The FQDN itself can be anything you want, but ideally choose something that cannot exist in reality, so something with a non-existent top-level domain.
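
To confirm that the invented name maps to the network address rather than to a loopback address, a small check along these lines may help. It is only a sketch: hosts_ip_for is a hypothetical helper that reads a hosts-format file directly (by default /etc/hosts), so it ignores DNS and any other name service.

```shell
# Hypothetical helper: print the IP address that a hosts-format file
# assigns to NAME (first match wins); prints nothing if NAME is absent.
hosts_ip_for() (
    name="$1"
    file="${2:-/etc/hosts}"
    awk -v name="$name" '
        { sub(/#.*/, "") }                 # drop comments before matching
        { for (i = 2; i <= NF; i++)        # field 1 is the IP, the rest are names
              if ($i == name) { print $1; exit } }
    ' "$file"
)
```

With the example line above in place, hosts_ip_for SERVER.DOMAIN should print 192.168.1.1. If it prints nothing, the entry is missing or misspelt.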

A cluster is nothing without some compute nodes, so next we tell the server process that the box itself is a compute node (with 4 cores, below – change this to suit your requirements).

echo "SERVER.DOMAIN np=4" > /var/spool/torque/server_priv/nodes
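
If you simply want one slot per core, the np value can be taken from the machine itself. A small sketch, assuming nproc from coreutils is available (it is on stock Ubuntu); nodes_line is a hypothetical helper name.

```shell
# Hypothetical helper: print a line for the nodes file.
# $1: node name; $2: number of slots (defaults to the available core count)
nodes_line() {
    echo "$1 np=${2:-$(nproc)}"
}
```

As root, nodes_line SERVER.DOMAIN > /var/spool/torque/server_priv/nodes then has the same effect as the command above, with the core count filled in automatically.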

We also need to tell the MOM process (i.e. the compute node handler) which server to contact for work.

echo '$pbsserver SERVER.DOMAIN' > /var/spool/torque/mom_priv/config

Once this is done, we can restart all processes again.

/etc/init.d/torque-server start
/etc/init.d/torque-scheduler start
/etc/init.d/torque-mom start

If you get any errors at this point, I’d suggest stopping any running processes and restarting them one by one in the order above. Check the logs (under /var/spool/torque) for whatever is failing. Otherwise, the next step is to enable scheduling on the server.

# set scheduling properties
qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 300'
qmgr -c 'set server mom_job_sync = true'

At this point, if you get the dreaded Unauthorized Request error, it’s critical to figure out why this is happening. Usually it is because the commands look like they’re coming from an unauthorized user/machine, that is anything different from the string ‘root@SERVER.DOMAIN’. You can check this with the following command.

grep Unauthorized /var/spool/torque/server_logs/*
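
When tracking this down, it can also help to compare the identity you expect to be seen as against what is actually listed in the ACL files. A rough sketch, assuming those files still hold one user@host entry per line, as written earlier; is_listed is a hypothetical helper, not a Torque command.

```shell
# Hypothetical helper: succeed only if the exact identity appears as a
# whole line in the given ACL file.
# $1: identity such as root@SERVER.DOMAIN; $2: ACL file to search
is_listed() {
    grep -qxF -- "$1" "$2"
}
```

For example, is_listed "$(whoami)@$(hostname -f)" /var/spool/torque/server_priv/acl_svr/managers && echo listed shows whether the name this machine reports for itself matches the managers entry. This is only an approximation, since what matters in the end is the name the server resolves for the incoming connection.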

We also need to create a default queue (here called ‘batch’ – you can change this to whatever you want). We’ll set this up with a default 1-hour time limit and a single-node requirement, but you don’t have to.

# create default queue
qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 1:00:00'
qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set server default_queue = batch'
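
qmgr also reads directives from standard input, so the same kind of queue definition can be generated for any queue name and piped in. A sketch along those lines; queue_directives is a hypothetical helper, and its output simply mirrors the queue commands above.

```shell
# Hypothetical helper: print qmgr directives for an execution queue.
# $1: queue name; $2: default walltime (hh:mm:ss)
queue_directives() {
    cat <<EOF
create queue $1
set queue $1 queue_type = execution
set queue $1 started = true
set queue $1 enabled = true
set queue $1 resources_default.walltime = $2
set queue $1 resources_default.nodes = 1
EOF
}
```

For instance, queue_directives short 0:10:00 | qmgr would add a second queue with a 10-minute default limit. The helper deliberately leaves default_queue alone, as that is a server setting rather than a queue one.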

Finally, we need to configure the server to allow submissions from itself. This one stumped me for a while. Note that the submit_hosts list cannot be made up of FQDNs! This will not work if you do that, as the comparison is done after truncating the name of the submitting host!

# configure submission pool
qmgr -c 'set server submit_hosts = SERVER'
qmgr -c 'set server allow_node_submit = true'
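
The unqualified name is just the FQDN up to the first dot, which the shell can derive directly. A minimal sketch; short_name is a hypothetical helper.

```shell
# Hypothetical helper: strip everything from the first dot onwards,
# e.g. SERVER.DOMAIN -> SERVER. Names without a dot are left unchanged.
short_name() {
    echo "${1%%.*}"
}
```

This makes it possible to write qmgr -c "set server submit_hosts = $(short_name SERVER.DOMAIN)" and keep a single FQDN variable in a setup script.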

For example, if above you use SERVER.DOMAIN instead of just SERVER, you’ll get an error like the following the next time you try to submit:

qsub: Bad UID for job execution MSG=ruserok failed validating USER/USER from SERVER.DOMAIN

where ‘USER’ will be the username of whichever (non-root) user you try to submit the job as. The solution is simply to list the submission hosts as unqualified names. To test the system you need to try to submit a job (here an interactive one) as a non-root user from the same box.

qsub -I

If this works, you’ll get into a shell on the same box (as if you ssh’ed into itself). Note that you’ll need authorised SSH keys set up for this user to allow password-less ssh.
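
A non-interactive job makes a useful second test, since it also exercises the default queue and the handling of output files. Here is a sketch of a trivial job script; write_test_job, the job name and the file name are made up, while the #PBS lines use standard qsub options (-N for the job name, -q for the queue, -l for the resource list, and -j oe to merge stderr into stdout).

```shell
# Hypothetical helper: create a minimal Torque job script.
# $1: path of the job script to create
write_test_job() {
    cat > "$1" <<'EOF'
#!/bin/sh
#PBS -N hello
#PBS -q batch
#PBS -l nodes=1,walltime=00:01:00
#PBS -j oe
echo "Running on $(hostname)"
EOF
}
```

As the non-root user, run write_test_job hello.sh followed by qsub hello.sh, then follow the job with qstat. If all is well, the output should end up in the submission directory in a file named hello.o followed by the job number.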

Eventually I’ll need to set up additional boxes as submission hosts. I’ll write about that process once it’s done. [Note: you can now find this here.] Meanwhile, if you have any questions about the above, or if there’s anything that could do with some clarification, feel free to let me know in the comments.


57 comments

  1. Very nice tutorial! Worked like a charm. I had some problems before, but then I realised that I was running 12.04. Now let’s see how to continue with Galaxy

    • Thank you jasper, glad this was useful! Don’t think there was much difference in 12.04 – what was your experience? And good luck with galaxy 🙂

  2. I could not get past the qmgr part; no matter what I did, I got:
    qmgr obj= svr=default: Unauthorized Request
    Upgraded and then everything worked. Maybe there is a version difference in the packaging of ubuntu 12/14 for torque/pbs as I read 10.04 should be installed from source.

      • Yeah in that way a pity indeed… but I am glad it is working now, also submitting jobs to a second server is working now playing around with all the options and security as eventually I want to scale this to a dozen computers.

    • I had the same issue with 12.04
      I had to change the name in /etc/torque/server_name to just the name of the server without domain. The qmgr commands work then.
      I’m now stuck at configuring the clients 😀

      • Setting up the clients proved rather simple. I need to find some time to write about what it involved. Good luck with your clients!

  3. Thanks for the post!
    For those with the “Unauthorized Request” problem, this worked for me: open ‘/etc/hosts’ and check the IP of your hostname. If it is ‘127.0.1.1’, write ‘127.0.0.1’. Hope it helps!

    • Thanks Fran! Re: the Unauthorized Request problem, that should not appear due to the lines adding the necessary permissions to acl_hosts, operators, and managers. If it still does, odds are that the host cannot reverse-lookup its own FQDN. This can easily be solved with an entry in /etc/hosts, but without changing the IP number. Your suggested change goes against Debian (and therefore Ubuntu) convention; I’m not sure what it could break, so I’d advise against it. (c.f. https://www.debian.org/doc/manuals/debian-reference/ch05.en.html#_the_hostname_resolution)

      • These were great instructions. Thank you for taking the time to do this. I also had issues with domain name, and had to make a similar (but not identical) change in /etc/hosts. First, here’s a reference where someone else found that their hostname in /etc/hosts had to correspond to their eth0 address — https://cmayes.wordpress.com/2012/12/15/single-host-torque-pbs/ — According to the link that you provided above, that 127.0.1.1 IP is used in /etc/hosts for systems that may not have a permanent IP address. In my case, the installation of Ubuntu had defaulted to that 127.0.1.1 for my host, but I needed to replace it with my IP. Once I did that, it all worked perfectly.

      • Thanks for stopping by, Dan! Replacing 127.0.1.1 with the eth0 address will work, but it’s still the wrong approach IMHO. There is an alternative simple fix (using FQDN in Torque and writing that in /etc/hosts with the eth0 address), so there is no reason to break a Debian convention.

  4. I’m using Ubuntu 14.04 also. Mine seems to be almost working but get this in the server_logs… I’ve been rotating between server.domain, server (without the domain from the hostname command), and the normal

    05/27/2015 15:55:43;0040;PBS_Server;Svr;<SERVER.DOMAIN>;Scheduler was sent the command scheduler_first
    05/27/2015 15:55:48;0040;PBS_Server;Req;ping_nodes;ping attempting to contact 1 nodes
    05/27/2015 15:55:48;0040;PBS_Server;Req;ping_nodes;successful ping to node (stream 1)
    05/27/2015 15:55:48;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 2.4.16, loglevel = 0
    05/27/2015 15:55:52;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::stream_eof, connection to is bad, remote service may be down, message may be corrupt, or connection may have been dropped remotely (Premature end of message). setting node state to down

    I’m thinking it’s because in my /etc/hosts, my second line is 127.0.1.1 for the hostname and hostname -f name. What do you think? Apparently it should just work but not for me…

    • From the log it looks like the compute node is having problems contacting the server; seems to me that’s where you need to debug. Does the compute node know how to resolve SERVER.DOMAIN? In any case, if you use SERVER.DOMAIN on the server config (as on my post), the second line in your /etc/hosts on the server is irrelevant. I would not touch that as that’s the current Debian recommended setup.

      • I tried it with the SERVER.DOMAIN, but I get the same problem….. I changed the /etc/hosts (I know it’s not recommended) to my eth0 address, and now it works……………….. (why always me…)

        Now the problem is that even with the qmgr setup, I can’t for the life of me get qsub to run my batch scripts, since it keeps them in state Q forever (all fixes say to run qmgr -c ‘set server scheduling = true’ but obviously this didn’t work). I’m kinda stumped…

        I’ll see what I can do again to get SERVER.DOMAIN working with 127.0.1.1 though

      • The two problems could be related; if jobs are staying queued, what is the reason? You can check this with ‘qstat -f jobid’, where jobid is the actual job id.

  5. I followed your post and everything works well, except that I don’t have torque-scheduler as a service. The file /etc/init.d/torque-scheduler does not exist. Is it mandatory to have torque-scheduler running for any job submission? How can I fix this issue? Thank you Johann for this post.

    • The file /etc/init.d/torque-scheduler is installed as part of the torque-scheduler package. Did you install that? The scheduler service needs to be installed on the scheduler node, in the example I gave this is the same as the compute node (i.e. the one with torque-mom and the same one which runs the jobs). I hope this helps.

  6. Hey Johann,

    I followed the tutorial but I am receiving the following error once I start torque-mom. I am using Ubuntu14.04 on amazon’s web services aka AWS. It’s a free tier so I have only one CPU.

    Starting Torque Mom torque-mom
    pbs_mom: LOG_ERROR::Permission denied (13) in chk_file_sec, Security violation with “/var/spool/torque” – /var/spool/torque is world writable and not sticky

    Do you think you could help me?

    Thank you for your time! Great work!
    Burak

  7. Hi Johann,

    I’m very new to Ubuntu, and have been wanting to set up a single-machine multicore queue system (i.e. single node) with TORQUE. I was wondering whether I needed an FQDN (the idea seems to scare me), or is there another way to set up TORQUE without this?

    Thank you for your time,

    Atreyu

    • Hi Atreyu! You definitely don’t need an FQDN – as long as the various components know how to find each other that’s enough. If it’s all on a single machine, just the name of the machine should be sufficient, as there’s a translation to its IP number in ‘hosts’. Hope this helps! Johann

  8. I got several ssh errors when error files have been written (host key verification failed, permission denied publickey, …). The thing that worked in the end was telling torque to use cp instead of scp.
    You need to add the following line in /var/spool/torque/mom_priv/config:

    $usercp *:/home /home

    • Thanks for pointing this out; whether this is necessary depends on many other factors. Could you give a few more details about your setup? E.g. is it a single-node system, and if not how do you share home directories and user details? Also, do you have ssh set up network-wide (e.g. with certs signed by a CA that all nodes trust).

  9. Awesome tutorial, this worked like a charm on my 5-node pine64 picocluster with ubuntu xenial arm64 and packages from standard repo. I needed to modify /etc/hosts to use real IP rather than 127.0.1.1 then qmgr worked.

    • Thanks, and glad it’s also working on Xenial – will make upgrades this summer that much easier 🙂

      On a separate note, as I mentioned in an earlier comment, replacing 127.0.1.1 with the ‘real’ IP address will work, but it’s still the wrong approach IMHO. There is an alternative simple fix (using FQDN in Torque and writing that in /etc/hosts with the eth0 address), so there is no reason to break a Debian convention.

      • Hi Johann! I followed your write up and it worked perfectly for the version on APT, but unfortunately I needed something that supports slot limits. This is for a single node/server configuration. I installed torque 6.X, but ran into a similar DNS error whenever qmgr is called:

        Unable to communicate with localhost(127.0.0.1)
        Cannot connect to specified server host ‘localhost’

        I can manually create the serverdb file (pbs_server -t create), then create a server_name file with the entry “localhost”. But when I go to run a qmgr command, I get the same issue. Similarly, when I run pbs_server or /etc/init.d/pbs_server start, it says [ok], but there is no pbs server running (checked with ps and netstat).

        Any idea?

        my /etc/hosts file looks like;
        lanikai# cat /etc/hosts
        127.0.0.1 localhost
        127.0.1.1 lanikai

        Putting in my actual FQDN is not an option (unless its just the localhost.localdomain).

        Keep up the good work!

      • Hi Ashton, thanks for your comments! What you describe sounds odd. I’d start by checking the logs of the server process, to find out where it’s bombing out. I’d also question the use of localhost as the connection point – especially if you may want to connect to it from another machine on the network.

  10. Hello,

    I’d like to use OpenMPI with Torque.

    OpenMPI 1.10.3 works perfectly on its own, but when I run a script with PBS I get these errors in the file test.e33:

    7: /var/spool/torque/mom_priv/jobs/33.server.SC: module: not found
    6: /var/spool/torque/mom_priv/jobs/33.server.SC: mpirun: not found

    or find the modules:

    module purge
    module add torque
    module add openmpi/gcc

    Thanks in advance for your help.

    Steph FRANCE

    • Hi Steph, thanks for stopping by. I have not tested Torque with OpenMPI myself, though it should work. From the errors above, it seems that you’re using Torque on an existing cluster with modular software components. If you’re a user of an existing installation, you’ll really need to ask the administrator. I hope this helps.

  11. Could you install the Torque package on my computer if I give you access through TeamViewer or ssh? I’m weak in the Linux operating system.

    • Dear Ali, it seems to me you want to find a consultant to set up a system to your specifications. This generally comes at a (possibly substantial) cost.

  12. Hi Johann

    Unfortunately I am stuck on setting the domain name. Since I want to use only one node I tried to put “localhost”, but it does not work. Should I add a port or something to this? (I am getting the message “Address already in use”.) What should I do?

    Thanks in advance !

    Lukasz

  13. Great blog! This saved me a ton of time and frustration, as this is my first go-round configuring a scheduler for a compute grid (or setting up a grid for that matter). Your solution still works for Ubuntu 16.04 LTS.

    In case any other newbies out there like me are curious to know, this works just as well on a virtual node as a physical node (including Xen/XenServer PV guests).

    So, on to configuring more compute nodes and a proper login node…

    Btw, I did run into the dreaded (qmgr) Unauthorized Request problem. Unfortunately, the only viable workaround I found was to change the fqdn ip to 127.0.0.1, as others have already mentioned. Interestingly, I ran across this: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=629596. I’ve seen spurious bug reports before, but it does beg the question if the root problem might not lie with the Debian/Ubuntu packages.

    • Thanks for the details Josh; it’s good to see more confirmations that the same method works for 16.04. Re: the Unauthorized Request problem, the workaround you used works (obviously) but I still think it’s not the best approach. Assuming the machine is connected to a network, the ideal approach is to use its actual FQDN, or if this is not possible, to add a line in /etc/hosts with the eth0 IP and an invented FQDN. I’ll update the post to say something about this, as it’s a common problem.

  14. Hi Johann, Thanks for your tutorial. I followed all these steps until the scheduling properties settings, but I get the same error described above ( qmgr obj= svr=default: Unauthorized Request).

    I have read the comments, but this message still appears. I tried changing /etc/hosts (127.0.1.1 to 127.0.0.1) because I did not understand your previous answer (which are the necessary steps to do it?):

    “the Unauthorized Request problem, that should not appear due to the lines adding the necessary permissions to acl_hosts, operators, and managers. If it still does, odds are that the host cannot reverse-lookup its own FQDN. This can easily be solved with an entry in /etc/hosts, but without changing the IP number”

    Changing the IP this way did not solve anything, but then I changed it to my IP 192.168.. and it works. With this configuration I tried to launch a Torque job, but it remains queued forever. I used the pbsnodes -a command to see the node status, and it appears as “down”.

    Can you help me with this problem, please?

    Best wishes

    Sebastian

    • Thanks for the comments, Sebastian. Re: the Unauthorized Request problem, I’ll add a few lines in the post to explain what can be done in greater detail, as it’s a common problem. For the node appearing as ‘down’ problem, best thing is to check the mom logs, to find out what the problem is. Most probably it’s a failure of the MOM process connecting to the server process, or the other way around. This can be due to the addresses used, which is one reason why localhost is not ideally used.

  15. On Ubuntu 16.04 LTS, I ran into a problem following your route:

    The following packages have unmet dependencies:
    libroot-graf2d-graf5.34 : Depends: libroot-core5.34 (>= 5.34.00) but it is not going to be installed
    Depends: libroot-hist5.34 (>= 5.34.00) but it is not going to be installed
    Depends: libroot-io5.34 (>= 5.34.00) but it is not going to be installed
    Depends: libroot-math-mathcore5.34 (>= 5.34.00) but it is not going to be installed
    torque-client : Depends: libtorque2 (>= 2.4.16+dfsg) but it is not going to be installed
    Depends: libcurses-perl but it is not going to be installed
    Depends: gawk
    Depends: torque-common but it is not going to be installed
    torque-mom : Depends: libtorque2 (>= 2.4.16+dfsg) but it is not going to be installed
    Depends: torque-common but it is not going to be installed
    E: Unmet dependencies. Try ‘apt-get -f install’ with no packages (or specify a solution).

    Then I tried to install these but without success. Any idea how to fix this. Thanks.

    • Hi, I have no idea how those libroot dependencies got pulled in, they do not seem to have anything to do with torque. I suspect your system had a broken package dependency before you started trying to install torque. I suggest you fix that first, and then proceed with torque.

  16. Hi Johann,
    thanks for this excellent post, it is quite practical. I have no experience with job schedulers. I have a single Linux box with 2 processors/20 cores in total, and an application that uses PBS for parallelization. Do you know if it could work by installing PBS on my box and using the multiple cores in this single box, instead of running on several boxes?
    Thanks a lot in advance
    Gonzalo

    • Hi Gonzalo, thanks for stopping by and for your kind words! Sure, you can set this up on a single machine if you want, to facilitate use of multiple cores. Only difference is that the same machine acts as all the roles.

  17. Hi Johann,

    Thanks for posting such a great blog. I have successfully installed the server and am able to submit my jobs. Can you please suggest how to add more compute nodes to the server? Suppose I have 5 boxes and I want to make them a 5-node cluster.

    Thanks,
    sandip
