Adding client nodes to a Torque/PBS system on Ubuntu 14.04 LTS

A few weeks ago I wrote a blog post on installing a single-node Torque/PBS job scheduler on an Ubuntu 14.04 LTS system. The node served as scheduler, queue manager, compute node, and submission node. For my application, this node was installed in a machine room, and was primarily meant to act as a shared compute node; users would need to submit their jobs remotely from their workstations. So an additional part of the job involved setting up a number of those workstations as job submission nodes. It proved to be rather simpler than I expected, and I’m finally writing about how I did that.

As in my earlier post, the following commands need to be issued as root, either on the machine to be added as a job submission node (I’ll call this the client from now on) or on the machine that’s already installed as scheduler, queue manager, compute node, and submission node (I’ll call this the server from now on).

First, of course, the client machine needs to have the necessary packages installed. This is easily done.

apt-get install torque-client torque-mom

We’ll be installing the client as both a job submission node and a compute node; we won’t necessarily want it to act as a compute node but this makes it easier if we do. So, we first need to stop the compute node process.

/etc/init.d/torque-mom stop

Next, we configure the client to point to the already configured server.

echo SERVER.DOMAIN > /etc/torque/server_name

This does two things: it lets the client know where any submitted jobs need to go for scheduling, and it also lets the compute node process know where to get work from. All we need to do after this is to start the compute node process again.

/etc/init.d/torque-mom start
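
The server_name step above can be wrapped in a small helper. This is only a sketch: the function name configure_torque_client and the TORQUE_CONF_DIR override are hypothetical, and the override exists purely so the function can be tried on a scratch directory instead of /etc/torque.

```shell
# Hypothetical helper: point this machine at an existing Torque server.
# Usage: configure_torque_client SERVER.DOMAIN
# TORQUE_CONF_DIR is an assumed override, meant only for dry runs.
configure_torque_client() {
    server="$1"
    conf_dir="${TORQUE_CONF_DIR:-/etc/torque}"
    if [ -z "$server" ]; then
        echo "usage: configure_torque_client SERVER.DOMAIN" >&2
        return 1
    fi
    # Same effect as: echo SERVER.DOMAIN > /etc/torque/server_name
    printf '%s\n' "$server" > "$conf_dir/server_name"
}
```

On a real client this would be run as root, between the torque-mom stop and start commands.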

Now, we simply need to let the server know that it should accept any jobs coming from this client. So, on the server, we tell the queue manager that our new client is a valid job submission node.

qmgr -c 'set server submit_hosts += CLIENT'

Note that, as when we added the server machine as a job submission node, this client address cannot be an FQDN (see the previous post for an explanation). Also, if the server cannot resolve the client name from its IP (i.e. if you don’t have reverse DNS lookup on the client’s domain) then you’ll need to add the client IP and name (qualified, if you want) to /etc/hosts on the server. This allows the server to do the necessary name lookup from the client IP.
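
To make these two notes concrete, here is a small sketch. The helper names short_host and hosts_entry, the client name and the address are all hypothetical; the only things assumed from the text are that submit_hosts wants the unqualified name and that an /etc/hosts line has the form IP, name, optional aliases.

```shell
# Strip the domain part, e.g. client1.example.org -> client1.
short_host() {
    printf '%s\n' "${1%%.*}"
}

# Build an /etc/hosts line: IP, name as given, plus the short
# name as an alias when the given name was qualified.
hosts_entry() {
    ip="$1"
    name="$2"
    short=$(short_host "$name")
    if [ "$short" = "$name" ]; then
        printf '%s %s\n' "$ip" "$name"
    else
        printf '%s %s %s\n' "$ip" "$name" "$short"
    fi
}

# Example values (made up). The printed qmgr command is the one
# to run on the server as root.
client=client1.example.org
echo "qmgr -c 'set server submit_hosts += $(short_host "$client")'"
hosts_entry 192.0.2.10 "$client"
```

The second printed line is what would be appended to /etc/hosts on the server if reverse lookup is not available.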

That’s basically it. To test, just submit a job from the client machine, and it should work.
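
One way to run that test is with a trivial job script. The sketch below writes one to a file; the helper name, file name, job name and resource request are arbitrary choices for illustration.

```shell
# Hypothetical helper: write a minimal test job to the given path.
write_test_job() {
    job_file="${1:-test_job.pbs}"
    cat > "$job_file" <<'EOF'
#!/bin/sh
#PBS -N client_test
#PBS -l nodes=1:ppn=1,walltime=00:01:00
# Report which machine actually ran the job.
hostname
EOF
}

# Then, as a regular user on the client:
#   write_test_job test_job.pbs
#   qsub test_job.pbs
#   qstat
```

If everything is set up correctly, qsub should print a job identifier and the job should show up in qstat.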

33 thoughts on “Adding client nodes to a Torque/PBS system on Ubuntu 14.04 LTS”

Pingback: Installing Torque/PBS job scheduler on Ubuntu 14.04 LTS | Johann A. Briffa
Fedele Stabile says:

Jul 27, 2015 at 14:19

Hello,
I have a client node but have not installed torque-mom: actually I can submit jobs, but with qstat I can’t see queued jobs other than mine.
Is this normal?
Fedele

1. Johann Briffa says:
  
  Jul 27, 2015 at 17:09
  
  Hi Fedele! If I remember correctly it’s normal, and to see other users’ jobs you need to do a qstat -u '*'. Let me know if this works. Re: torque-mom I vaguely recall it made things easier for me because compute nodes automatically get submission privileges – without it you may have needed to add separate permissions. Did you need to do anything else, or were the remaining instructions enough?
  
evalinesees says:

Sep 15, 2015 at 21:33

Hi, thanks so much for this set of tutorials, it’s exactly what I want to do with my multicore server. My only problem is that I’m such a beginner to batch processing that I’m confused over some basic definitions — like what exactly a ‘node’ is (I understand node != processor, but not much else). Are there any intro guides you are aware of?

1. Johann Briffa says:
  
  Sep 16, 2015 at 08:29
  
  Hi Eileen, thanks for stopping by and for your kind words! Think of a node as a system with independent memory – for example a workstation or desktop PC which has its own RAM. This may have one or more processors / cores. I’m really not sure what to suggest as an intro guide – at least I don’t know of any particularly good ones on the net. It also depends a lot on what you intend to use the system for (batch scheduling, parallel jobs, etc). If you have access to a cluster, the admins usually put up some introductory pages on how to use it, and that’s normally the best place to start. Hope this helps, at least a little.
  
Andrew says:

Oct 9, 2015 at 22:43

Hi, thank you for this tutorial. I have a question for you and I hope that you can answer. I followed the tutorials to configure Torque on the “server” and the “client” machine, then I added the client node on the server machine with the command:
qmgr -c "create node node-name"
I’m using PBS as the scheduler for Globus Toolkit 6.0, and I can submit jobs to the server machine. All works fine, but every job is performed by the server. How can I configure the server to schedule jobs on the client’s resources as well?

1. Johann Briffa says:
  
  Oct 12, 2015 at 17:10
  
  Hi Andrew, thanks for stopping by and for your kind words. To use the client as a resource as well you need to:
  1) set up the MOM process on the client node (this is what accepts and runs jobs from the server); this involves adding the server name to /var/spool/torque/mom_priv/config on the client, and
  2) add the client node as a usable resource; to do this add a line for the client in /var/spool/torque/server_priv/nodes on the server.
  That should be it.
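
As a rough illustration of those two steps, the sketch below prints the lines that would go into each file. The helper names, the example host names and the np value are assumptions; np should reflect the number of cores to be offered on the client.

```shell
# Line for /var/spool/torque/mom_priv/config on the client.
mom_config_line() {
    printf '$pbsserver %s\n' "$1"
}

# Line for /var/spool/torque/server_priv/nodes on the server.
# The second argument is the core count (defaults to 1).
nodes_line() {
    printf '%s np=%s\n' "$1" "${2:-1}"
}

mom_config_line SERVER.DOMAIN   # prints: $pbsserver SERVER.DOMAIN
nodes_line CLIENT 4             # prints: CLIENT np=4
```

After editing the files, the mom on the client and the server process would typically need a restart to pick up the changes.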
  
Jeremy Chien says:

Apr 2, 2016 at 03:27

Johann,
Thanks for the instruction. I have slightly different setup, and I could not configure correctly.

The goal is to submit jobs from server1 and have the jobs run on six computenodes.

I have server1 that is running torque_server torque_scheduler torque_mom. In the resource file (/var/spool/torque/server_priv/nodes), I have hostnames of six compute nodes. I am running torque_mom on these data nodes. In each computenode, I have the hostname of server1 in /var/spool/torque/mom_priv/config.

When I perform interactive job with qsub -I, it is not working: waiting for job 1.server1 to start.

What am I doing wrong?

1. Johann Briffa says:
  
  Apr 2, 2016 at 08:14
  
  Hi Jeremy, it’s hard to tell from the above alone what’s wrong. Probably there’s a communication problem between the compute nodes and the server, but this can be an issue in either direction. Start by querying the server for the reason the job isn’t starting (while it’s waiting to start, of course). This should point you in the right direction for what to check next. You can also check the mom logs on the compute nodes to see if they’re starting correctly and connecting with the server.
  
David says:

Apr 7, 2017 at 18:03

Hello,

I’m glad I found a post such as yours. However, you do not talk about shared SSH keys. What are the steps to make the compute nodes communicate with each other?

Regards,
David

1. Johann A. Briffa says:
  
  Apr 7, 2017 at 21:05
  
  Thanks, David! I’m not sure what you mean by the shared SSH key – could you elaborate?
  
sbenkorichi says:

Nov 14, 2017 at 16:25

Hi,
I’ve followed your steps to link a server with another node. All goes well, starting from the server to the client setting. However, when I reboot and try to use qsub script.pbs as usual, all the jobs get queued as shown in qstat -a, but none of them runs. I tried to look for useful info but in vain. Could you possibly check why this is happening?

System: Using Ubuntu

Thanks,

let me know if you need any further info.

1. Johann A. Briffa says:
  
  Nov 15, 2017 at 07:10
  
  Best thing is to check the job specific status, to determine why it has not started, and take it from there.
  
  1. sbenkorichi says:
    
    Nov 15, 2017 at 10:05
    
    Hi, thanks for the feedback.
    Well, when I install it first and then add the other node, everything works fine and I can submit jobs to use both machines. But after I reboot and do qsub jobscript, nothing happens. Checking the status with qstat -a,
    I find the status is Q. I left it for a while and nothing changed. I was thinking maybe it needs some time, but no, it just gets stuck there; even easy jobs like qsub -I don’t work.
  2. Johann A. Briffa says:
    
    Nov 15, 2017 at 10:39
    
    You’ll need to check with qstat -f to see why the job remains queued. Most likely one of the processes didn’t start properly after reboot, but you’ll need to figure out which and why.
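
The field to look for in that output is comment. A small sketch that pulls it out follows; job_comment is a made-up helper name, and it assumes only the "comment = ..." line format that qstat -f prints. Long comments that wrap onto continuation lines would be cut at the first line.

```shell
# Print the comment field from `qstat -f` output read on stdin.
job_comment() {
    sed -n 's/^[[:space:]]*comment = //p'
}

# Usage on a live system: qstat -f JOBID | job_comment
# Self-contained demonstration with two sample lines:
printf '    job_state = Q\n    comment = Not Running: Not enough of the right type of nodes are available\n' | job_comment
```

For the sample lines, this prints the text after "comment = ", which is the scheduler’s stated reason for not running the job.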
  3. sbenkorichi says:
    
    Nov 15, 2017 at 12:06
    
    Well, I’ve checked with qstat -f
    but I can’t understand why there are not sufficient nodes for this. It ran fine before rebooting; after a reboot it gives this behaviour.
    $ qstat -f
    Job Id: 3.sbenkorichi
    Job_Name = SD1_test
    Job_Owner = salah@sbenkorichi
    job_state = Q
    queue = batch
    server = sbenkorichi
    Checkpoint = u
    ctime = Wed Nov 15 11:03:50 2017
    Error_Path = sbenkorichi:/nfs/fds-examples/original2.2.1/SD1_test.err
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Wed Nov 15 11:03:50 2017
    Output_Path = sbenkorichi:/nfs/fds-examples/original2.2.1/SD1_test.log
    Priority = 0
    qtime = Wed Nov 15 11:03:50 2017
    Rerunable = True
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 05:00:00
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/salah,
    PBS_O_LOGNAME=salah,
    PBS_O_PATH=/home/salah/FDS/FDS6/bin:/home/salah/FDS/FDS6/bin:/home/sa
    lah/bin:/home/salah/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbi
    n:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games,
    PBS_O_SHELL=/bin/bash,PBS_O_LANG=en_GB.UTF-8,
    PBS_O_WORKDIR=/nfs/fds-examples/original2.2.1,PBS_O_HOST=sbenkorichi,
    PBS_O_SERVER=sbenkorichi
    euser = salah
    egroup = salah
    queue_type = E
    comment = Not Running: Not enough of the right type of nodes are available
    
    etime = Wed Nov 15 11:03:50 2017
    submit_args = script.pbs
    fault_tolerant = False
    job_radix = 0
    submit_host = sbenkorichi
    request_version = 1
  4. Johann A. Briffa says:
    
    Nov 15, 2017 at 14:02
    
    Have you checked whether the compute nodes are online after a reboot? (With pbsnodes.)
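
A sketch of how that check could be scripted is shown below. The helper name unavailable_nodes is hypothetical, and it assumes the usual pbsnodes -a layout: an unindented node name followed by indented "key = value" lines.

```shell
# List nodes whose state includes "down" or "offline",
# reading `pbsnodes -a` style output on stdin.
unavailable_nodes() {
    awk '
        # An unindented, non-empty line starts a new node record.
        /^[^[:space:]]/ { node = $1 }
        # Print the current node if its state line mentions down/offline.
        /^[[:space:]]+state = / && /down|offline/ { print node }
    '
}

# Usage on the server: pbsnodes -a | unavailable_nodes
```

pbsnodes -l gives a similar list directly; the function is mainly useful when the result is needed inside a larger script.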
sbenkorichi says:

Nov 15, 2017 at 15:25

Johann,
I’m sorry, I’ve messed up my system. I can no longer get torque to work. I was trying to install the latest release, so I compiled it and followed the steps here: https://paramagnetism.co.uk/2017/05/torque-pbs-ubuntu-16-04mint-18/
But then it didn’t work. I tried to remove it and go back to your steps, but I can’t get that to work either. I pass all the steps, but when I just use the master node, before adding a client, jobs still get status Q.
Do you know any way I can purge and remove all of the compiled torque? I tried sudo make uninstall, but somehow I still can’t get torque working. Any suggestion is appreciated.

sbenkorichi says:

Nov 15, 2017 at 15:28

Okay, luckily I found why. I just had to stop the server, and then restart it with the scheduler; I think the scheduler wasn’t starting by default. Here is what made it work:
sudo qterm
sudo pbs_server
sudo pbs_sched

1. Johann A. Briffa says:
  
  Nov 15, 2017 at 15:50
  
  OK, glad to see you solved this. The scheduler should start automatically, you’ll need to check the startup system you’re using.
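
A quick way to see which of the three daemons failed to come up after a reboot is sketched below. The helper name missing_daemons is hypothetical; it just compares a process list against the daemon names used in this thread.

```shell
# Report which Torque daemons are absent from a process list on stdin
# (one command name per line, as printed by `ps -e -o comm=`).
missing_daemons() {
    procs=$(cat)
    for d in pbs_server pbs_sched pbs_mom; do
        # -x matches whole lines only, -q suppresses output.
        printf '%s\n' "$procs" | grep -qx "$d" || echo "$d"
    done
}

# Usage on the head node: ps -e -o comm= | missing_daemons
```

Anything it prints is a daemon that is not running and whose startup configuration is worth checking.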
  
sbenkorichi says:

Nov 15, 2017 at 15:52

It’s not set in my .bashrc file. Do you recommend putting it there? Yeah, I think that’s what was missing. Maybe you need to highlight this for other users to do as well. I found your instructions quite easy to follow and up to date. Thanks a lot.
Cheers,
Salah

1. Johann A. Briffa says:
  
  Nov 15, 2017 at 16:23
  
  I was referring to the system-wide startup system (systemd for recent Ubuntu). I did not add anything because the relevant processes should start automatically; I suspect there is something wrong with your setup that is causing problems.
  
sbenkorichi says:

Nov 27, 2017 at 21:02

I have no idea why it’s not. I’m using Linux Mint 18.2. I first tried with your explanations and everything works fine, but after a reboot the processes don’t restart automatically, and I have to reload them each time. Then I compiled torque 6.02 but had the same issue. I don’t know where I can put them locally so that they start after a reboot without doing it manually each time…

sbenkorichi says:

Jan 10, 2018 at 10:29

Johann,

I have installed torque on a fresh system and I haven’t had this issue of the processes not being restarted, except for pbs_sched, which is the default. I found a way to get them enabled after a reboot. This is done by adding the following lines to the /etc/rc.local file:
/usr/sbin/pbs_mom
/usr/sbin/pbs_server
/usr/sbin/pbs_sched

Then just reboot, it should fix it.

This might help users who run into similar issue.

Salah

1. Johann A. Briffa says:
  
  Jan 10, 2018 at 13:57
  
  Glad you found a solution that worked for you, and thanks for coming back to report on it! Somehow I still feel there must be a more appropriate way to start the services than to stick them into rc.local, but if it’s working, why mess with it?
  
sbenkorichi says:

Jan 10, 2018 at 16:38

Well, now it’s working without messing with it, but before it wasn’t. However, now pbs_sched isn’t enabled by default after a reboot. I do understand why. But sometimes you have jobs scheduled, and suppose you rebooted the headnode: the jobs won’t start running unless you enable the scheduler. So, to make it easy, I just let rc.local take care of it rather than having to do $ sudo pbs_sched each time.
Hope this makes sense now.

Luciano T.Costa says:

Jan 24, 2018 at 02:51

Hi, thanks. I believe everything with the client and server is fine, or almost everything. I have nodes running in my cluster. However, I had to add a new node, and after I set it as free it becomes down after a while. I couldn’t get it to stay free and stable. It is there, in principle configured, but it is not working. How can I fix that?

1. Johann A. Briffa says:
  
  Jan 24, 2018 at 20:27
  
  You’ll need to figure out what is causing it to go down. Often it’s a communication problem between the node and the server. The logs for the mom and the server processes should point you in the right direction.
  
Liviu says:

Apr 7, 2018 at 16:11

Hello Johann,

Thank you for this new post. I have followed your first one to set up a torque system on a single computer and it was very helpful to get everything up and running. Very good tutorial. I am running Torque version 2.4.16.

Now I want to add a compute node and I am struggling with the mom server on the node. I just installed Ubuntu on that node and torque-mom (without the torque-client since I do not want to submit jobs from there) with

sudo apt-get install torque-mom

After that I edited “/etc/torque/server_name” and wrote the name of my server (ClimeForceOne) in it.

The two computers (server and node) can ssh each other without a password and the firewall (ufw) on the server is disabled. I have also admin accounts on both computers (same user name).

My problem is very basic: I cannot even start the torque-mom server on the compute node.

For example this is what I get with “service torque-mom status” after restarting the mom

● torque-mom.service - LSB: Start and stop the PBS Mom
Loaded: loaded (/etc/init.d/torque-mom; bad; vendor preset: enabled)
Active: active (exited) since Sat 2018-04-07 09:37:34 EDT; 5s ago
Docs: man:systemd-sysv-generator(8)
Process: 5157 ExecStop=/etc/init.d/torque-mom stop (code=exited, status=0/SUCCESS)
Process: 5208 ExecStart=/etc/init.d/torque-mom start (code=exited, status=0/SUCCESS)

Apr 07 09:37:34 ClimeForceTwo systemd[1]: Starting LSB: Start and stop the PBS Mom...
Apr 07 09:37:34 ClimeForceTwo torque-mom[5208]: * Starting Torque Mom torque-mom
Apr 07 09:37:34 ClimeForceTwo torque-mom[5208]: ...fail!
Apr 07 09:37:34 ClimeForceTwo systemd[1]: Started LSB: Start and stop the PBS Mom.

ClimeForceTwo is the name I gave to the compute node. There are no logs in the related mom directory.

Thank you in advance for your help.

1. Liviu says:
  
  Apr 7, 2018 at 18:41
  
  Another (hopefully) useful piece of information. If I start the mom server manually (sudo pbs_mom) all I get is
  
  pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with “/var/spool/torque/pbs_environment” – /var/spool/torque/pbs_environment cannot be lstat’d – errno=2, No such file or directory
  
  1. Liviu says:
    
    Apr 8, 2018 at 13:38
    
    To start the mom and have it in a stable “running” state I have to manually create an empty “/var/spool/torque/pbs_environment” file…..looks like a bug to me…..
  2. Johann A. Briffa says:
    
    Apr 11, 2018 at 08:49
    
    Hi Liviu, thanks for your kind words and for your comments. I haven’t seen that problem before, but then it’s only very recently that we needed to set up a mom-only machine. But then this was migrated from being a server, so we kept the packages installed anyway, just disabled the services we didn’t need any more.
    
    From your comments it’s clear that the problem is that you had the “/var/spool/torque/pbs_environment” file missing. I have checked the package details, and this is provided by torque-common, which is a dependency of all the other torque packages, so should have been installed with your torque-mom. I suspect that something odd happened in your installation process, and that some of your packages are not correctly installed. I’d suggest reinstalling at least torque-mom and torque-common (after deleting your “/var/spool/torque/pbs_environment”); see if the problem is fixed. You didn’t say which Ubuntu version you’ve installed, though given the Torque version I suspect it’s 16.04 LTS.
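
A minimal sketch of that check is shown below. The helper name check_pbs_environment is hypothetical, and the optional path argument exists only so the function can be tried against a scratch file.

```shell
# Check that the environment file pbs_mom expects is present.
check_pbs_environment() {
    env_file="${1:-/var/spool/torque/pbs_environment}"
    if [ -e "$env_file" ]; then
        echo "found: $env_file"
    else
        echo "missing: $env_file" >&2
        return 1
    fi
}

# Possible follow-up as root if the file is missing:
#   apt-get install --reinstall torque-common torque-mom
# To see which installed package claims the file:
#   dpkg -S /var/spool/torque/pbs_environment
```

Running check_pbs_environment with no argument tests the default path from the error message above.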