[Edited 2019-08-01: changed Debian release to ‘stretch’]
This blog post has been a long time coming. For many years I have used the Torque/PBS job scheduler on machines that I administer or am otherwise responsible for. My blog post on setting up Torque on Ubuntu topped this site’s most-visited pages by a wide margin. I’m also a creature of habit, avoiding changes when there is no real problem to solve. So I was rather disappointed when I saw that the upgrade to 18.04 LTS would take away the Torque packages. I knew what it would mean, and avoided facing the problem for as long as I could. At this stage we’re in the process of migrating some computational resources to 18.04 LTS, so the issue had to be faced.
For those of us using Ubuntu 18.04 LTS, there are fundamentally two options:
- Stick with Torque/PBS. This requires another mechanism for installing the necessary packages. I am told (check the comments on the blog post I mentioned earlier) that it is possible to add the xenial (16.04 LTS) repositories on an 18.04 LTS system and install the packages that way. Configuration should be the same as for 16.04 LTS (see the blog post). Please note that I have not verified this.
- Choose another scheduler. After some research, I decided to go with Son of Grid Engine, a fork of the Sun Grid Engine project (which eventually was bought by Oracle). Unfortunately, I found that the packages in the official Ubuntu repositories have a problem (cf. here and here), which requires the use of other repositories (e.g. Debian) anyway. This makes the choice between the two options rather less straightforward. In the end we chose to go with SGE anyway, mostly because it’s newer code and more likely to remain supported in future Ubuntu versions.
In the rest of this blog post I’ll document what one needs to do to install the packages and start setting up the scheduler. This is a work in progress at my end, so I’ll update this blog post in due course, with complete instructions on setting up a simple queue (similar to what I had done for Torque).
Adding the Debian repositories
The objective here is to add the Debian repositories so that we can install the Grid Engine packages from that source (instead of the official Ubuntu ones). Specifically, we want the repositories for the ‘stretch’ release, which thankfully uses library versions compatible with those in Ubuntu 18.04. We also want to set things so that only the Grid Engine packages are taken from Debian. This avoids the possibility that various packages in your Ubuntu installation start ‘upgrading’ to their Debian equivalent. Thankfully, APT makes this easy with the right settings.
To start with, we add the Debian repositories with:
sudo tee /etc/apt/sources.list.d/debian.list > /dev/null <<EOL
deb http://ftp.debian.org/debian/ stretch main contrib non-free
deb http://security.debian.org/debian-security/ stretch/updates main contrib non-free
EOL
Next, we set the APT preferences so that a) the Grid Engine packages from Debian get priority over the official Ubuntu repositories, and b) any other Debian package is considered only when Ubuntu offers no version of it. This can be done with:
sudo tee /etc/apt/preferences.d/debian > /dev/null <<EOL
Package: gridengine-*
Pin: release o=Debian
Pin-Priority: 1000

Package: *
Pin: release o=Debian
Pin-Priority: 10
EOL
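As a quick sanity check before running APT, the pin priorities can be read back from the preferences file. What follows is only a minimal sketch and not part of the original setup: the helper name pin_priority_for is hypothetical, and it assumes the file consists of stanzas separated by blank lines, as written above.

```shell
# Hypothetical helper: print the Pin-Priority of the stanza whose
# "Package:" field equals $2 in the APT preferences file $1.
pin_priority_for() {
    awk -v pkg="$2" '
        BEGIN { RS = ""; FS = "\n" }   # paragraph mode: one stanza per record
        {
            name = ""; prio = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^Package:/)      { name = $i; sub(/^Package:[ \t]*/, "", name) }
                if ($i ~ /^Pin-Priority:/) { prio = $i; sub(/^Pin-Priority:[ \t]*/, "", prio) }
            }
            if (name == pkg) print prio
        }
    ' "$1"
}

# Report the two priorities if the file from the previous step exists.
prefs=/etc/apt/preferences.d/debian
if [ -r "$prefs" ]; then
    echo "gridengine-*: $(pin_priority_for "$prefs" 'gridengine-*')"
    echo "*: $(pin_priority_for "$prefs" '*')"
fi
```

Once the package cache has been updated, running apt-cache policy gridengine-master should show which repository APT would actually take the package from.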
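Once these files are set up, update the package cache and install the Debian signing keys, then refresh the cache once more so that the Debian package lists are read with the new keys in place:
sudo apt-get update
sudo apt-get install debian-archive-keyring
sudo apt-key add /usr/share/keyrings/debian-archive-keyring.gpg
sudo apt-get update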
Installing the Grid Engine Packages
Assuming a single-node cluster, we can install the packages needed for the execution node, master node, and queue management using:
sudo apt-get install gridengine-exec gridengine-master gridengine-qmon
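To see whether the daemons came up after installation, a check along the following lines may help. This is a sketch under assumptions: daemon_running is a hypothetical helper, and sge_qmaster and sge_execd are the process names I would expect from the master and execution packages respectively.

```shell
# Hypothetical helper: succeeds if a process whose command name is
# exactly $1 appears in the process table.
daemon_running() {
    ps -e -o comm= | grep -qx "$1"
}

# Assumed daemon names for the master and execution packages.
for daemon in sge_qmaster sge_execd; do
    if daemon_running "$daemon"; then
        echo "$daemon: running"
    else
        echo "$daemon: not running"
    fi
done
```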
Now the queue management program (qmon) won’t work because the Debian package installs the pixmaps in a different folder from where qmon will look. This can be fixed with the following commands:
sudo mkdir /var/lib/gridengine/qmon
sudo ln -s /usr/share/gridengine/pixmaps /var/lib/gridengine/qmon/PIXMAPS
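To confirm that the link points somewhere sensible before starting qmon, something like the following can be used. It is a minimal sketch: pixmaps_link_ok is a hypothetical helper name, and the path is the one used in the commands above.

```shell
# Hypothetical helper: succeeds if $1 is a symlink that resolves to a directory.
pixmaps_link_ok() {
    [ -L "$1" ] && [ -d "$1" ]
}

link=/var/lib/gridengine/qmon/PIXMAPS
if pixmaps_link_ok "$link"; then
    echo "$link -> $(readlink "$link")"
else
    echo "$link is missing or dangling" >&2
fi
```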
At this point, the software should be in working order. The next step is to configure Grid Engine by adding the necessary queue. (To be continued.)
Thanks for the tutorial. Any chance for how to set up a queue?
We’re working on it; the project just got pushed back by several months. It should be getting priority soon, and I will write it up when it’s done.
Greetings!
I am new to SGE and am experiencing difficulty setting up my local configuration. To break down the issue I began by only installing:
$ sudo apt-get install gridengine-master gridengine-qmon
I can get the sge_qmaster running, but executing the qmon command triggers an error:
reresolve hostname failed: can't resolve hostname
unable to send message to qmaster using port 6444 on host "/var/lib/gridengine": got unexpected parameters.
I have tried troubleshooting many different ways. I’m including as much relevant information from my troubleshooting attempts as I can think of in the scripts below. If there is anything else that will help determine where the error is please let me know.
Thank you for any assistance you can offer!
-Matt
cnv18@cnv18:~$ ps -ef | grep sge
root 2783 1 0 11:50 ? 00:00:00 /usr/lib/gridengine/sge_qmaster
cnv18 2815 2806 0 11:51 pts/0 00:00:00 grep --color=auto sge
cnv18@cnv18:~$ qmon
reresolve hostname failed: can't resolve host name
unable to send message to qmaster using port 6444 on host "/var/lib/gridengine": got unexpected parameters
cnv18@cnv18:~$ hostname
cnv18
cnv18@cnv18:~$ hostname --fqdn
local.cnv.18
cnv18@cnv18:~$ nslookup cnv18
Server: 127.0.0.53
Address: 127.0.0.53#53
Non-authoritative answer:
Name: cnv18
Address: 127.0.1.1
cnv18@cnv18:~$ ifconfig
enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.0.2.15 netmask 255.255.255.0 broadcast 10.0.2.255
inet6 fe80::9282:ade3:3774:a8f5 prefixlen 64 scopeid 0x20
ether 08:00:27:e4:18:30 txqueuelen 1000 (Ethernet)
RX packets 673 bytes 733313 (733.3 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 328 bytes 40241 (40.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10
loop txqueuelen 1000 (Local Loopback)
RX packets 208 bytes 18250 (18.2 KB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 208 bytes 18250 (18.2 KB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
cnv18@cnv18:~$ cat /etc/hostname
cnv18
cnv18@cnv18:~$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 local.cnv.18 cnv18
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
cnv18@cnv18:~$ cat '/var/lib/gridengine/default/common/act_qmaster'
cnv18
cnv18@cnv18:~$ '/usr/lib/gridengine/gethostbyaddr'
Version: 8.1.9
usage: gethostbyaddr [-help|-name|-aname|-all]x.x.x.x
cnv18@cnv18:~$ '/usr/lib/gridengine/gethostbyname'
Version: 8.1.9
usage: gethostbyname [-help|-name|-aname|-all]
cnv18@cnv18:~$ '/usr/lib/gridengine/gethostname'
Hostname: local.cnv.18
Aliases: cnv18
Host Address(es): 127.0.1.1
cnv18@cnv18:~$ '/usr/lib/gridengine/getservbyname'
Version: 8.1.9
usage:
getservbyname [-help|-number] service | -check port_number
get number of a tcp service
cnv18@cnv18:~$ cat '/var/spool/gridengine/qmaster/messages'
11/19/2020 11:34:16| main|cnv18|W|local configuration cnv18 not defined - using global configuration
11/19/2020 11:34:16| main|cnv18|E|global configuration not defined
11/19/2020 11:34:16| main|cnv18|C|setup failed
cnv18@cnv18:~$
Hello, did you manage to install SGE? I had the same problem.
Hi Jose, apologies for the delayed response, things have been a bit hectic. In the end we abandoned SGE in favour of Slurm, which seems more actively supported, and supports everything we needed. I still intend to write up how we did it, but as I was less involved in the technical work, it will take some time.