Adding client nodes to a Torque/PBS system on Ubuntu 14.04 LTS

A few weeks ago I wrote a blog post on installing a single-node Torque/PBS job scheduler on an Ubuntu 14.04 LTS system. The node served as scheduler, queue manager, compute node, and submission node. For my application, this node was installed in a machine room, and was primarily meant to act as a shared compute node; users would need to submit their jobs remotely from their workstations. So an additional part of the job was setting up a number of those workstations as job submission nodes. It proved to be rather simpler than I expected, and I’m finally writing up how I did it.

As in my earlier post, the following commands need to be issued as root, either on the machine to be added as a job submission node (I’ll call this the client from now on) or on the machine that’s already installed as scheduler, queue manager, compute node, and submission node (I’ll call this the server from now on).

First, of course, the client machine needs to have the necessary packages installed. This is easily done.

apt-get install torque-client torque-mom

We’ll be installing the client as both a job submission node and a compute node; we won’t necessarily want it to act as a compute node, but this makes it easier if we do. Before reconfiguring it, we first need to stop the compute node process.

/etc/init.d/torque-mom stop

Next, we configure the client to point to the already configured server.

echo SERVER.DOMAIN > /etc/torque/server_name

This does two things: it lets the client know where any submitted jobs need to go for scheduling, and it also lets the compute node process know where to get work from. All we need to do after this is to start the compute node process again:

/etc/init.d/torque-mom start
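
A minimal sketch of the client-side steps above as one script, for anyone who wants to try them safely first. The variable names SERVER_NAME, TORQUE_DIR and DRY_RUN are only illustrative (they are not Torque settings), server.example.org is a placeholder for SERVER.DOMAIN, and by default the script writes to a scratch directory and only prints the service commands instead of running them.

```shell
# Sketch only: SERVER_NAME, TORQUE_DIR and DRY_RUN are illustrative names.
SERVER_NAME="${SERVER_NAME:-server.example.org}"  # placeholder for SERVER.DOMAIN
TORQUE_DIR="${TORQUE_DIR:-$(mktemp -d)}"          # would be /etc/torque on a real client
DRY_RUN="${DRY_RUN:-1}"                           # 1 = only print the service commands

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run /etc/init.d/torque-mom stop
printf '%s\n' "$SERVER_NAME" > "$TORQUE_DIR/server_name"
run /etc/init.d/torque-mom start

# Read the file back to confirm which server the client will talk to.
echo "server_name now contains: $(cat "$TORQUE_DIR/server_name")"
```

To apply it for real, it would be run as root with TORQUE_DIR=/etc/torque and DRY_RUN=0.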

Now, we simply need to let the server know that it should accept any jobs coming from this client. So, on the server, we tell the queue manager that our new client is a valid job submission node.

qmgr -c 'set server submit_hosts += CLIENT'
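
Since the name given to qmgr here is the short host name rather than the fully qualified one, a small sketch like the following can derive it and print the commands to run on the server. CLIENT_FQDN is a placeholder, and the script deliberately only prints the commands; the second one simply dumps the server configuration and filters it for submit_hosts as a way of checking the result.

```shell
# Sketch only: client.example.org is a placeholder name.
CLIENT_FQDN="client.example.org"
CLIENT_SHORT="${CLIENT_FQDN%%.*}"   # strip everything from the first dot onwards

QMGR_CMD="set server submit_hosts += ${CLIENT_SHORT}"

# Print the commands rather than running them, so this is safe to try anywhere.
echo "qmgr -c '${QMGR_CMD}'"
echo "qmgr -c 'print server' | grep submit_hosts"
```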

Note that, as when we added the server machine as a job submission node, this client address cannot be an FQDN (see the previous post for an explanation). Also, if the server cannot resolve the client name from its IP (i.e. if you don’t have reverse DNS lookup on the client’s domain), you’ll need to add the client IP and name (qualified, if you want) to /etc/hosts on the server, so that the server can do the necessary name lookup from the client IP.
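
As an illustration of the /etc/hosts entry, here is a hedged sketch that adds a line for the client only if its IP is not already listed. The address 192.0.2.10 and the names client.example.org and client are made up, and HOSTS_FILE defaults to a scratch file; on the server it would be /etc/hosts.

```shell
# Sketch only: the IP and host names below are placeholders.
HOSTS_FILE="${HOSTS_FILE:-$(mktemp)}"   # would be /etc/hosts on the server
CLIENT_IP="192.0.2.10"
CLIENT_FQDN="client.example.org"
CLIENT_SHORT="client"

add_host_entry() {
    # Append the entry only if no existing line has this IP as its first field.
    if ! awk -v ip="$CLIENT_IP" '$1 == ip { found = 1 } END { exit !found }' "$HOSTS_FILE"; then
        printf '%s\t%s %s\n' "$CLIENT_IP" "$CLIENT_FQDN" "$CLIENT_SHORT" >> "$HOSTS_FILE"
    fi
}

add_host_entry
add_host_entry   # second call is a no-op, the entry is already there
cat "$HOSTS_FILE"
```

Once the entry is in place on the server, running getent hosts with the client IP should show whether the lookup now resolves.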

That’s basically it. To test, just submit a job from the client machine, and it should work.
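
A trivial test job could look something like the sketch below. The job name clienttest and the resource request are arbitrary choices, and the script only submits if qsub is actually found on the machine. Torque normally refuses jobs submitted by root, so this part is best tried as an ordinary user.

```shell
# Sketch only: a throwaway job that reports which host it ran on.
JOB_SCRIPT="$(mktemp)"
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/sh
#PBS -N clienttest
#PBS -l nodes=1
hostname
EOF

if command -v qsub >/dev/null 2>&1; then
    qsub "$JOB_SCRIPT"   # prints the job ID on success
    qstat                # the new job should be listed here
else
    echo "qsub not found; job script left in $JOB_SCRIPT"
fi
```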


12 comments

  1. Hello,
    I have a client node without torque-mom installed: actually I can submit jobs, but with qstat I can’t see queued jobs other than mine.
    Is this normal?
    Fedele

    • Hi Fedele! If I remember correctly it’s normal, and to see other users’ jobs you need to do a qstat -u '*'. Let me know if this works. Re: torque-mom I vaguely recall it made things easier for me because compute nodes automatically get submission privileges – without it you may have needed to add separate permissions. Did you need to do anything else, or were the remaining instructions enough?

  2. Hi, thanks so much for this set of tutorials, it’s exactly what I want to do with my multicore server. My only problem is that I’m such a beginner to batch processing that I’m confused over some basic definitions — like what exactly a ‘node’ is (I understand node != processor, but not much else). Are there any intro guides you are aware of?

    • Hi Eileen, thanks for stopping by and for your kind words! Think of a node as a system with independent memory – for example a workstation or desktop PC which has its own RAM. This may have one or more processors / cores. I’m really not sure what to suggest as an intro guide – at least I don’t know of any particularly good ones on the net. It also depends a lot on what you intend to use the system for (batch scheduling, parallel jobs, etc). If you have access to a cluster, the admins usually put up some introductory pages on how to use it, and that’s normally the best place to start. Hope this helps, at least a little.

  3. Hi, thank you for this tutorial. I have a question for you and I hope that you can answer. I followed the tutorials to configure Torque on the “server” and the “client” machine, then I added the client node on the server machine with the command:
    – qmgr -c "create node node-name"
    I’m using PBS as the scheduler for Globus Toolkit 6.0, and I can submit jobs to the server machine. All works fine, but every job is run by the server. How can I configure the server to also schedule jobs on the client’s resources?

    • Hi Andrew, thanks for stopping by and for your kind words. To use the client as a resource as well you need to:
      1) set up the MOM process on the client node (this is what accepts and runs jobs from the server); this involves adding the server name to /var/spool/torque/mom_priv/config on the client, and
      2) add the client node as a usable resource; to do this add a line for the client in /var/spool/torque/server_priv/nodes on the server.
      That should be it.
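
      As a rough illustration of what those two files might contain, the sketch below writes example versions into a scratch directory. The names server.example.org and client and the core count of 4 are placeholders; on a real system the lines would go into the two paths mentioned above, and the MOM and server processes would most likely need a restart afterwards.

```shell
# Sketch only: host names and np value are placeholders.
SCRATCH="$(mktemp -d)"
MOM_CONFIG="$SCRATCH/mom_priv_config"    # stands in for /var/spool/torque/mom_priv/config (client)
NODES_FILE="$SCRATCH/server_priv_nodes"  # stands in for /var/spool/torque/server_priv/nodes (server)

# On the client: tell the MOM which server it belongs to.
printf '$pbsserver %s\n' "server.example.org" > "$MOM_CONFIG"

# On the server: list the client as a node with 4 usable cores.
printf '%s np=%s\n' "client" "4" > "$NODES_FILE"

cat "$MOM_CONFIG" "$NODES_FILE"
```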

  4. Johann,
    Thanks for the instructions. I have a slightly different setup, and I could not configure it correctly.

    The goal is to submit jobs from server1 and have the jobs run on six compute nodes.

    I have server1 running torque_server, torque_scheduler, and torque_mom. In the resource file (/var/spool/torque/server_priv/nodes), I have the hostnames of six compute nodes. I am running torque_mom on these data nodes. On each compute node, I have the hostname of server1 in /var/spool/torque/mom_priv/config.

    When I start an interactive job with qsub -I, it does not work; it just reports: waiting for job 1.server1 to start.

    What am I doing wrong?

    • Hi Jeremy, it’s hard to tell from the above alone what’s wrong. Probably there’s a communication problem between the compute nodes and the server, but this can be an issue in either direction. Start by querying the server for the reason the job isn’t starting (while it’s waiting to start, of course). This should point you in the right direction for what to check next. You can also check the mom logs on the compute nodes to see if they’re starting correctly and connecting with the server.
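
      A sketch of the kind of checks meant here, with some assumptions: the job ID 1.server1 is taken from the question, the MOM log location assumes the default /var/spool/torque layout with one log file per day, and the Torque commands are only run if they are installed.

```shell
# Sketch only: JOB_ID and MOM_LOG_DIR are illustrative defaults.
JOB_ID="${JOB_ID:-1.server1}"
MOM_LOG_DIR="${MOM_LOG_DIR:-/var/spool/torque/mom_logs}"
TODAY_LOG="$MOM_LOG_DIR/$(date +%Y%m%d)"   # assumes one log file per day

# On the server: full job details (state and any comment), then node states.
if command -v qstat >/dev/null 2>&1; then
    qstat -f "$JOB_ID" | grep -i -E 'job_state|comment|exec_host'
    pbsnodes -a
else
    echo "Torque client commands not found on this machine"
fi

# On a compute node: the tail of today's MOM log, if it exists.
if [ -r "$TODAY_LOG" ]; then
    tail -n 20 "$TODAY_LOG"
else
    echo "no readable MOM log at $TODAY_LOG"
fi
```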

  5. Hello,

    I’m glad I found a post such as yours. However, you do not talk about shared SSH keys. What are the steps to make your compute nodes communicate with each other?

    Regards,
    David
