Adding client nodes to a Torque/PBS system on Ubuntu 14.04 LTS

A few weeks ago I wrote a blog post on installing a single-node Torque/PBS job scheduler on an Ubuntu 14.04 LTS system. The node served as scheduler, queue manager, compute node, and submission node. For my application, this node was installed in a machine room, and was primarily meant to act as a shared compute node; users would need to submit their jobs remotely from their workstations. So the remaining part of the job was to set up a number of those workstations as job submission nodes. It proved to be rather simpler than I expected, and I’m finally writing up how I did it.

As in my earlier post, the following commands need to be issued as root, either on the machine to be added as a job submission node (I’ll call this the client from now on) or on the machine that’s already installed as scheduler, queue manager, compute node, and submission node (I’ll call this the server from now on).

First, of course, the client machine needs to have the necessary packages installed. This is easily done.

apt-get install torque-client torque-mom

We’ll be installing the client as both a job submission node and a compute node; we won’t necessarily want it to act as a compute node but this makes it easier if we do. So, we first need to stop the compute node process.

/etc/init.d/torque-mom stop

Next, we configure the client to point to the already configured server.

echo SERVER.DOMAIN > /etc/torque/server_name

This does two things: it lets the client know where any submitted jobs need to go for scheduling, and it also lets the compute node process know where to get work from. All we need to do after this is to start the compute node process again.

/etc/init.d/torque-mom start
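
The configuration step above can also be wrapped in a small helper. This is only a sketch: the function name set_torque_server and its optional second argument (an alternative configuration directory, handy for a dry run) are made up for illustration and are not part of Torque.

```shell
# Sketch: point a Torque client at its server by writing server_name.
# set_torque_server and the optional directory argument are hypothetical
# helpers for illustration, not part of Torque itself.
set_torque_server() {
    server="$1"
    conf_dir="${2:-/etc/torque}"   # default location used in this post

    if [ -z "$server" ]; then
        echo "usage: set_torque_server SERVER.DOMAIN [CONF_DIR]" >&2
        return 1
    fi

    # Same effect as: echo SERVER.DOMAIN > /etc/torque/server_name
    printf '%s\n' "$server" > "$conf_dir/server_name"
}
```

Run as root between stopping and starting torque-mom, set_torque_server SERVER.DOMAIN does the same thing as the echo command, but refuses to write an empty name.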

Now, we simply need to let the server know that it should accept any jobs coming from this client. So, on the server, we tell the queue manager that our new client is a valid job submission node.

qmgr -c 'set server submit_hosts += CLIENT'

Note that, as when we added the server machine as a job submission node, this client address cannot be an FQDN (see the previous post for an explanation). Also, if the server cannot resolve the client name from its IP (i.e. if you don’t have reverse DNS lookup on the client’s domain) then you’ll need to add the client IP and name (qualified, if you want) to /etc/hosts on the server. This allows the server to do the necessary name lookup from the client IP.
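
As a rough sketch of the two points above, the helpers below derive the unqualified client name and build a matching /etc/hosts line. The names short_host and hosts_line are hypothetical, and the address and host name used in the note after the block are placeholders.

```shell
# Sketch: derive the unqualified client name and an /etc/hosts line.
# short_host and hosts_line are hypothetical helper names.
short_host() {
    # Strip everything from the first dot onwards: client.example.org -> client
    printf '%s\n' "${1%%.*}"
}

hosts_line() {
    ip="$1"
    fqdn="$2"
    # Format: IP address, then the qualified name, then the short alias
    printf '%s\t%s\t%s\n' "$ip" "$fqdn" "$(short_host "$fqdn")"
}
```

On the server, hosts_line 192.0.2.10 client.example.org >> /etc/hosts would add the lookup entry, and the output of short_host client.example.org is the name to pass to the qmgr command in place of CLIENT.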

That’s basically it. To test, just submit a job from the client machine, and it should work.
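
For the test, a trivial job script is enough. The sketch below writes one; the helper name write_test_job, the job name, and the one-minute walltime are illustrative choices rather than anything Torque requires.

```shell
# Sketch: write a trivial PBS job script for a smoke test.
# write_test_job, the job name and the walltime are illustrative choices.
write_test_job() {
    script="${1:-test.pbs}"
    cat > "$script" <<'EOF'
#!/bin/sh
#PBS -N smoke_test
#PBS -l nodes=1:ppn=1,walltime=00:01:00
# Report where the job actually ran
hostname
EOF
}
```

As an ordinary user on the client, write_test_job test.pbs followed by qsub test.pbs should print a job ID, and qstat should then show the job. Once it finishes, the output should normally come back as a file named smoke_test.o followed by the job number.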

33 comments

  1. Hello,
    I have a client node without torque-mom installed: actually I can submit jobs, but with qstat I can’t see queued jobs other than mine.
    Is this normal?
    Fedele

    • Hi Fedele! If I remember correctly it’s normal, and to see other users’ jobs you need to do a qstat -u '*'. Let me know if this works. Re: torque-mom I vaguely recall it made things easier for me because compute nodes automatically get submission privileges – without it you may have needed to add separate permissions. Did you need to do anything else, or were the remaining instructions enough?

  2. Hi, thanks so much for this set of tutorials, it’s exactly what I want to do with my multicore server. My only problem is that I’m such a beginner to batch processing that I’m confused over some basic definitions — like what exactly a ‘node’ is (I understand node != processor, but not much else). Are there any intro guides you are aware of?

    • Hi Eileen, thanks for stopping by and for your kind words! Think of a node as a system with independent memory – for example a workstation or desktop PC which has its own RAM. This may have one or more processors / cores. I’m really not sure what to suggest as an intro guide – at least I don’t know of any particularly good ones on the net. It also depends a lot on what you intend to use the system for (batch scheduling, parallel jobs, etc). If you have access to a cluster, the admins usually put up some introductory pages on how to use it, and that’s normally the best place to start. Hope this helps, at least a little.

  3. Hi, thank you for this tutorial. I have a question for you and I hope that you can answer. I followed the tutorials to configure Torque on the “server” and the “client” machine, then I added the client node on the server machine with the command:
    – qmgr -c "create node node-name"
    I’m using PBS as the scheduler for Globus Toolkit 6.0, and I can submit jobs to the server machine. All works fine, but every job is run by the server. How can I configure the server to schedule jobs on the client’s resources as well?

    • Hi Andrew, thanks for stopping by and for your kind words. To use the client as a resource as well you need to:
      1) set up the MOM process on the client node (this is what accepts and runs jobs from the server); this involves adding the server name to /var/spool/torque/mom_priv/config on the client, and
      2) add the client node as a usable resource; to do this add a line for the client in /var/spool/torque/server_priv/nodes on the server.
      That should be it.
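
A minimal sketch of the two lines involved follows. The helper names node_entry and mom_config_entry are hypothetical, and the np= and $pbsserver syntax is given from memory of the Torque documentation, so check it against the version you have installed.

```shell
# Sketch: print the two configuration lines described above.
# node_entry and mom_config_entry are hypothetical helper names, and the
# core count passed to node_entry is whatever your client actually has.
node_entry() {
    # A line for server_priv/nodes: host name plus number of processors
    printf '%s np=%s\n' "$1" "$2"
}

mom_config_entry() {
    # A line for mom_priv/config naming the server the MOM reports to
    printf '$pbsserver %s\n' "$1"
}
```

On the server, node_entry CLIENT 4 >> /var/spool/torque/server_priv/nodes would register a four-core client; on the client, mom_config_entry SERVER >> /var/spool/torque/mom_priv/config names the server. The server and MOM processes most likely need a restart to pick up the changes.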

  4. Johann,
    Thanks for the instructions. I have a slightly different setup, and I could not configure it correctly.

    The goal is to submit jobs from server1 and have the jobs run on six computenodes.

    I have server1 that is running torque_server torque_scheduler torque_mom. In the resource file (/var/spool/torque/server_priv/nodes), I have hostnames of six compute nodes. I am running torque_mom on these data nodes. In each computenode, I have the hostname of server1 in /var/spool/torque/mom_priv/config.

    When I run an interactive job with qsub -I, it does not work: it just says “waiting for job 1.server1 to start”.

    What am I doing wrong?

    • Hi Jeremy, it’s hard to tell from the above alone what’s wrong. Probably there’s a communication problem between the compute nodes and the server, but this can be an issue in either direction. Start by querying the server for the reason the job isn’t starting (while it’s waiting to start, of course). This should point you in the right direction for what to check next. You can also check the mom logs on the compute nodes to see if they’re starting correctly and connecting with the server.

  5. Hello,

    I’m glad I found a post such as yours. However, you do not talk about shared SSH keys. What are the steps to make your compute nodes communicate with each other?

    Regards,
    David

  6. Hi,
    I’ve followed your steps to link a server with another node. All goes well, starting from the server through to the client setup. However, when I reboot and try to use qsub script.pbs as usual, all the jobs get queued, as shown in qstat -a, but none of them runs. I tried to look for useful info, but in vain. Is there any chance you could check why this is happening?

    System: Using Ubuntu

    Thanks,

    let me know if you need any further info.

    • Best thing is to check the job specific status, to determine why it has not started, and take it from there.

      • Hi, thanks for the feedback.
        Well, when I first install it and then add the other node, everything works fine and I can submit jobs that use both machines. But after I reboot and do qsub jobscript, nothing happens. Checking the status with qstat -a,
        I find the status is Q. I left it for a while thinking it might need some time, but nothing changed; it just gets stuck there. Even simple jobs like qsub -I don’t work.

      • You’ll need to check with qstat -f to see why the job remains queued. Most likely one of the processes didn’t start properly after reboot, but you’ll need to figure out which and why.
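
The reason usually appears in the comment field of the qstat -f output. As a small sketch, the hypothetical helper job_comment below extracts just that field from output piped into it.

```shell
# Sketch: pull the scheduler's comment out of `qstat -f` output.
# job_comment is a hypothetical helper; it reads the output on stdin.
job_comment() {
    # Print whatever follows "comment = " on the line that carries it
    sed -n 's/^[[:space:]]*comment = //p'
}
```

Used as qstat -f JOBID | job_comment, where JOBID is the ID reported by qsub. A long comment may wrap onto a continuation line, in which case only the first part is printed. Running pbsnodes -a on the server alongside this shows the state of each compute node.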

      • Well, I’ve checked with qstat -f
        but I can’t understand why there aren’t sufficient nodes for this. It ran fine before rebooting; after the reboot it behaves like this.
        $ qstat -f
        Job Id: 3.sbenkorichi
        Job_Name = SD1_test
        Job_Owner = salah@sbenkorichi
        job_state = Q
        queue = batch
        server = sbenkorichi
        Checkpoint = u
        ctime = Wed Nov 15 11:03:50 2017
        Error_Path = sbenkorichi:/nfs/fds-examples/original2.2.1/SD1_test.err
        Hold_Types = n
        Join_Path = n
        Keep_Files = n
        Mail_Points = a
        mtime = Wed Nov 15 11:03:50 2017
        Output_Path = sbenkorichi:/nfs/fds-examples/original2.2.1/SD1_test.log
        Priority = 0
        qtime = Wed Nov 15 11:03:50 2017
        Rerunable = True
        Resource_List.nodect = 1
        Resource_List.nodes = 1:ppn=1
        Resource_List.walltime = 05:00:00
        Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/home/salah,
        PBS_O_LOGNAME=salah,
        PBS_O_PATH=/home/salah/FDS/FDS6/bin:/home/salah/FDS/FDS6/bin:/home/sa
        lah/bin:/home/salah/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbi
        n:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games,
        PBS_O_SHELL=/bin/bash,PBS_O_LANG=en_GB.UTF-8,
        PBS_O_WORKDIR=/nfs/fds-examples/original2.2.1,PBS_O_HOST=sbenkorichi,
        PBS_O_SERVER=sbenkorichi
        euser = salah
        egroup = salah
        queue_type = E
        comment = Not Running: Not enough of the right type of nodes are available

        etime = Wed Nov 15 11:03:50 2017
        submit_args = script.pbs
        fault_tolerant = False
        job_radix = 0
        submit_host = sbenkorichi
        request_version = 1

  7. Johann,
    I’m sorry, I’ve messed up my system and can no longer get Torque to work. I was trying to install the latest release, so I compiled it and followed the steps here: https://paramagnetism.co.uk/2017/05/torque-pbs-ubuntu-16-04mint-18/
    But that didn’t work. I tried to remove it and go back to your steps, but I can’t get that to work either. I pass all the steps, but even when I use just the master node, before adding a client, jobs still stay in status Q.
    Do you know how I can purge and remove all of the compiled Torque? I tried sudo make uninstall, but somehow I still can’t get Torque working. Any suggestion is appreciated.

  8. Okay, luckily I found out why. I just had to stop the server and then restart it together with the scheduler; I think the scheduler wasn’t starting by default. Here is what made it work:
    sudo qterm
    sudo pbs_server
    sudo pbs_sched

    • OK, glad to see you solved this. The scheduler should start automatically, you’ll need to check the startup system you’re using.

  9. It’s not set in my .bashrc file. Do you recommend putting it there? Yeah, I think that’s what was missing. Maybe you should highlight this for other users as well. I found your instructions quite easy to follow and up to date. Thanks a lot.
    Cheers,
    Salah

    • I was referring to the system-wide startup system (systemd for recent Ubuntu). I did not add anything because the relevant processes should start automatically; I suspect there is something wrong with your setup that is causing problems.

  10. I have no idea why not. I’m using Linux Mint 18.2. I first tried following your explanations and everything worked fine, but after a reboot the services don’t restart automatically and I have to start them again each time. Then I compiled Torque 6.02, but I get the same issue. I don’t know where I can put them locally so that they start after a reboot without my doing it manually each time…

  11. Johann,

    I have installed Torque on a fresh system and I haven’t had the issue of the services not restarting, except for pbs_sched, which is the default. I found a way to get them all started after a reboot, by adding the following lines to the /etc/rc.local file:
    /usr/sbin/pbs_mom
    /usr/sbin/pbs_server
    /usr/sbin/pbs_sched

    Then just reboot, it should fix it.

    This might help users who run into similar issue.

    Salah

    • Glad you found a solution that worked for you, and thanks for coming back to report on it! Somehow I still feel there must be a more appropriate way to start the services than to stick them into rc.local, but if it’s working, why mess with it?
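
Whichever startup route is used, it helps to confirm after a reboot which daemons actually came up. A rough sketch follows; missing_daemons is a hypothetical helper that reads process names, one per line, and lists the Torque daemons it does not find.

```shell
# Sketch: report which Torque daemons are absent from a process list.
# missing_daemons is a hypothetical helper; it reads one process name
# per line on stdin, e.g. from `ps -e -o comm=`.
missing_daemons() {
    procs=$(cat)
    for daemon in pbs_server pbs_sched pbs_mom; do
        # -x matches the whole line, so pbs_mom does not match pbs_mom_x
        printf '%s\n' "$procs" | grep -qx "$daemon" || echo "$daemon"
    done
}
```

Used as ps -e -o comm= | missing_daemons on the head node. Anything it prints is a service that did not start. With the Ubuntu packages, something like update-rc.d torque-scheduler defaults may be the more conventional fix, assuming the init script carries the package name.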

  12. Well, now it’s working without messing with it, but before it wasn’t. However, pbs_sched still isn’t enabled by default after a reboot (I do understand why). Sometimes, though, you have jobs scheduled and need to reboot the head node, and the jobs won’t start running until you start the scheduler. So, to make it easy, I just let rc.local take care of it rather than having to run $ sudo pbs_sched each time.
    Hope this makes sense now.

  13. Hi, thanks. I believe everything with the client and server is fine, or almost everything. I have nodes running in my cluster. However, I had to add a new node, and a while after I set it as free it goes down. I haven’t managed to keep it free and stable. It is there, configured in principle, but it is not working. How can I fix that?

    • You’ll need to figure out what is causing it to go down. Often it’s a communication problem between the node and the server. The logs for the mom and the server processes should point you in the right direction.

  14. Hello Johann,

    Thank you for this new post. I have followed your first one to set up a torque system on a single computer and it was very helpful to get everything up and running. Very good tutorial. I am running Torque version 2.4.16.

    Now I want to add a compute node and I am struggling with the mom server on the node. I just installed Ubuntu on that node, along with torque-mom (without torque-client, since I do not want to submit jobs from there), using

    sudo apt-get install torque-mom

    After that I edited “/etc/torque/server_name” and wrote the name of my server (ClimeForceOne) in it.

    The two computers (server and node) can ssh each other without a password and the firewall (ufw) on the server is disabled. I have also admin accounts on both computers (same user name).

    My problem is very basic: I cannot even start the torque-mom server on the compute node.

    For example this is what I get with “service torque-mom status” after restarting the mom

    ● torque-mom.service – LSB: Start and stop the PBS Mom
    Loaded: loaded (/etc/init.d/torque-mom; bad; vendor preset: enabled)
    Active: active (exited) since Sat 2018-04-07 09:37:34 EDT; 5s ago
    Docs: man:systemd-sysv-generator(8)
    Process: 5157 ExecStop=/etc/init.d/torque-mom stop (code=exited, status=0/SUCCESS)
    Process: 5208 ExecStart=/etc/init.d/torque-mom start (code=exited, status=0/SUCCESS)

    Apr 07 09:37:34 ClimeForceTwo systemd[1]: Starting LSB: Start and stop the PBS Mom…
    Apr 07 09:37:34 ClimeForceTwo torque-mom[5208]: * Starting Torque Mom torque-mom
    Apr 07 09:37:34 ClimeForceTwo torque-mom[5208]: …fail!
    Apr 07 09:37:34 ClimeForceTwo systemd[1]: Started LSB: Start and stop the PBS Mom.

    ClimeForceTwo is the name I gave to the compute node. There are no logs in the related mom directory.

    Thank you in advance for your help.

    • Another (hopefully) useful piece of information. If I start the mom server manually (sudo pbs_mom) all I get is

      pbs_mom: LOG_ERROR::No such file or directory (2) in chk_file_sec, Security violation with “/var/spool/torque/pbs_environment” – /var/spool/torque/pbs_environment cannot be lstat’d – errno=2, No such file or directory

      • To start the mom and keep it in a stable “running” state I have to manually create an empty “/var/spool/torque/pbs_environment” file… looks like a bug to me.

      • Hi Liviu, thanks for your kind words and for your comments. I haven’t seen that problem before, but then it’s only very recently that we needed to set up a mom-only machine. But then this was migrated from being a server, so we kept the packages installed anyway, just disabled the services we didn’t need any more.

        From your comments it’s clear that the problem is that you had the “/var/spool/torque/pbs_environment” file missing. I have checked the package details, and this is provided by torque-common, which is a dependency of all the other torque packages, so should have been installed with your torque-mom. I suspect that something odd happened in your installation process, and that some of your packages are not correctly installed. I’d suggest reinstalling at least torque-mom and torque-common (after deleting your “/var/spool/torque/pbs_environment”); see if the problem is fixed. You didn’t say which Ubuntu version you’ve installed, though given the Torque version I suspect it’s 16.04 LTS.
