Installing (Son of) Grid Engine on Ubuntu 18.04 LTS

[Edited 2019-08-01: changed Debian release to ‘stretch’]

This blog post has been a long time coming. For many years I have used the Torque/PBS job scheduler on machines that I administer or am otherwise responsible for. My blog post on setting up Torque on Ubuntu topped this site’s most-visited pages by a wide margin. I’m also a creature of habit, avoiding changes when there is no real problem to solve. So I was rather disappointed when I saw that the upgrade to 18.04 LTS would take away the Torque packages. I knew what it would mean, and avoided facing the problem for as long as I could. At this stage we’re in the process of migrating some computational resources to 18.04 LTS, so the issue had to be faced.

For those of us using Ubuntu 18.04 LTS, there are fundamentally two options:

  1. Stick with Torque/PBS. This requires another mechanism for installing the necessary packages. I am told (check the comments on the blog post I mentioned earlier) that it is possible to add the xenial (16.04 LTS) repositories on an 18.04 LTS system and install the packages that way. Configuration should be the same as for 16.04 LTS (see the blog post). Please note that I have not verified this.
  2. Choose another scheduler. After some research, I decided to go with Son of Grid Engine, a fork of the Sun Grid Engine project (which eventually was bought by Oracle). Unfortunately, I found that the packages in the official Ubuntu repositories have a problem (cf. here and here), which requires the use of other repositories (e.g. Debian) anyway. This makes the choice between the two options rather less straightforward. In the end we chose to go with SGE anyway, mostly because it’s newer code, and more likely to remain supported in future Ubuntu versions.

In the rest of this blog post I’ll document what one needs to do to install the packages and start setting up the scheduler. This is a work in progress at my end, so I’ll update this blog post in due course, with complete instructions on setting up a simple queue (similar to what I had done for Torque).

Adding the Debian repositories

The objective here is to add the Debian repositories so that we can install the Grid Engine packages from that source (instead of the official Ubuntu ones). Specifically, we want the repositories for the ‘stretch’ release, which thankfully uses library versions compatible with those in Ubuntu 18.04. We also want to configure APT so that only the Grid Engine packages are taken from Debian. This avoids the possibility that various packages in your Ubuntu installation start ‘upgrading’ to their Debian equivalents. APT makes this easy with the right settings.

To start with, we add the Debian repositories with:

sudo tee /etc/apt/sources.list.d/debian.list > /dev/null <<EOL
deb http://ftp.debian.org/debian/ stretch main contrib non-free
deb http://security.debian.org/debian-security/ stretch/updates main contrib non-free
EOL
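
When writing files under /etc/apt from a normal user account, keep in mind that a `>` redirection is carried out by the calling shell, not by the command run under sudo. One way to sidestep the question, sketched below, is to stage the file in a temporary location, review it, and only then copy it into place as root. The function name `stage_debian_list` is made up for illustration, and the final install step is left commented out so that the sketch can be run without root.

```shell
#!/bin/sh
# Hypothetical helper: write the two Debian 'stretch' entries to the file given as $1.
stage_debian_list() {
    cat > "$1" <<'EOL'
deb http://ftp.debian.org/debian/ stretch main contrib non-free
deb http://security.debian.org/debian-security/ stretch/updates main contrib non-free
EOL
}

staged=$(mktemp)
stage_debian_list "$staged"

# Review what would be installed.
cat "$staged"

# Copy into place as root (uncomment once the contents look right):
# sudo install -m 644 "$staged" /etc/apt/sources.list.d/debian.list
```

Staging first also leaves a copy to inspect if APT later complains about a malformed entry.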

Next, we set the APT preferences so that a) the Grid Engine packages from Debian take priority over those in the official Ubuntu repositories, and b) every other Debian package gets a very low priority, so that it is not picked over its Ubuntu equivalent. This can be done with:

sudo tee /etc/apt/preferences.d/debian > /dev/null <<EOL
Package: gridengine-*
Pin: release o=Debian
Pin-Priority: 1000

Package: *
Pin: release o=Debian
Pin-Priority: 10
EOL
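
To double-check which priority each stanza assigns, a small summary of the preferences file can help. The following is only a sketch: `list_pins` is a made-up helper name, and it assumes each stanza has a `Package:` line followed later by a `Pin-Priority:` line, as in the file above. It runs on a scratch copy so that it needs no root access.

```shell
#!/bin/sh
# Hypothetical helper: print "<package pattern> <priority>" for each stanza in file $1.
list_pins() {
    awk '
        /^Package:/      { pkg = $2 }
        /^Pin-Priority:/ { print pkg, $2 }
    ' "$1"
}

# Demonstrate on a scratch copy of the preferences shown above.
prefs=$(mktemp)
cat > "$prefs" <<'EOL'
Package: gridengine-*
Pin: release o=Debian
Pin-Priority: 1000

Package: *
Pin: release o=Debian
Pin-Priority: 10
EOL

list_pins "$prefs"
# On the real system: list_pins /etc/apt/preferences.d/debian
```

Once the package cache has been updated, `apt-cache policy gridengine-master` should also report which version APT would install and the priority of each source.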

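Once these files are set up, update the package cache, install the Debian signing keys, and then update the cache once more so that the Debian repositories are verified against the new keys:

sudo apt-get update
sudo apt-get install debian-archive-keyring
sudo apt-key add /usr/share/keyrings/debian-archive-keyring.gpg
sudo apt-get update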

Installing the Grid Engine Packages

Assuming a single-node cluster, we can install the packages needed for the execution node, master node, and queue management using:

sudo apt-get install gridengine-exec gridengine-master gridengine-qmon
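
After installation it can be useful to check whether the Grid Engine daemons are actually running. The sketch below assumes the daemon process names are `sge_qmaster` and `sge_execd`; `daemon_status` is a made-up helper name, and it relies on `pgrep` from the procps package being available.

```shell
#!/bin/sh
# Hypothetical helper: report whether a process with exactly the name $1 is running.
daemon_status() {
    if pgrep -x "$1" > /dev/null 2>&1; then
        echo "$1: running"
    else
        echo "$1: not running"
    fi
}

daemon_status sge_qmaster
daemon_status sge_execd
```

If either daemon is reported as not running, the Grid Engine log files are the next place to look.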

Now the queue management program (qmon) won’t work because the Debian package installs the pixmaps in a different folder from where qmon will look. This can be fixed with the following commands:

sudo mkdir /var/lib/gridengine/qmon
sudo ln -s /usr/share/gridengine/pixmaps /var/lib/gridengine/qmon/PIXMAPS
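
If these commands may be run more than once (for example from a provisioning script), a repeatable variant is worth considering. This is a minimal sketch, not a tested part of the procedure: `link_pixmaps` is a made-up helper name, and the demonstration uses a scratch directory rather than the real paths so that it needs no root access.

```shell
#!/bin/sh
# Hypothetical helper: create directory $2 if needed and point $2/PIXMAPS at $1.
# -p lets mkdir succeed if the directory exists; -sfn replaces an existing link.
link_pixmaps() {
    mkdir -p "$2" && ln -sfn "$1" "$2/PIXMAPS"
}

# Dry run in a scratch directory; calling it twice shows it is repeatable.
scratch=$(mktemp -d)
mkdir "$scratch/pixmaps"
link_pixmaps "$scratch/pixmaps" "$scratch/qmon"
link_pixmaps "$scratch/pixmaps" "$scratch/qmon"
readlink "$scratch/qmon/PIXMAPS"

# On the real system the same mkdir and ln commands need sudo, with
# /usr/share/gridengine/pixmaps as the target and /var/lib/gridengine/qmon as the directory.
```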

At this point, the software should be in working order. The next step is to configure Grid Engine by adding the necessary queue. (To be continued.)

7 thoughts on “Installing (Son of) Grid Engine on Ubuntu 18.04 LTS”

    1. We’re working on it, it’s just that this project got pushed back by several months. Should be getting priority soon, and will write up when it’s done.

    1. Greetings!

      I am new to SGE and am experiencing difficulty setting up my local configuration. To break down the issue I began by only installing:
      $ sudo apt-get install gridengine-master gridengine-qmon

      I can get the sge_qmaster running, but executing the qmon command triggers an error:
      reresolve hostname failed: can’t resolve hostname
      unable to send message to qmaster using port 6444 on host “/var/lib/gridengine”: got unexpected parameters.

      I have tried troubleshooting many different ways. I’m including as much relevant information from my troubleshooting attempts as I can think of in the scripts below. If there is anything else that will help determine where the error is please let me know.

      Thank you for any assistance you can offer!
      -Matt

      cnv18@cnv18:~$ ps -ef | grep sge
      root 2783 1 0 11:50 ? 00:00:00 /usr/lib/gridengine/sge_qmaster
      cnv18 2815 2806 0 11:51 pts/0 00:00:00 grep --color=auto sge
      cnv18@cnv18:~$ qmon
      reresolve hostname failed: can't resolve host name
      unable to send message to qmaster using port 6444 on host "/var/lib/gridengine": got unexpected parameters

      cnv18@cnv18:~$ hostname
      cnv18
      cnv18@cnv18:~$ hostname --fqdn
      local.cnv.18
      cnv18@cnv18:~$ nslookup cnv18
      Server: 127.0.0.53
      Address: 127.0.0.53#53

      Non-authoritative answer:
      Name: cnv18
      Address: 127.0.1.1

      cnv18@cnv18:~$ ifconfig
      enp0s3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
      inet 10.0.2.15 netmask 255.255.255.0 broadcast 10.0.2.255
      inet6 fe80::9282:ade3:3774:a8f5 prefixlen 64 scopeid 0x20<link>
      ether 08:00:27:e4:18:30 txqueuelen 1000 (Ethernet)
      RX packets 673 bytes 733313 (733.3 KB)
      RX errors 0 dropped 0 overruns 0 frame 0
      TX packets 328 bytes 40241 (40.2 KB)
      TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
      inet 127.0.0.1 netmask 255.0.0.0
      inet6 ::1 prefixlen 128 scopeid 0x10<host>
      loop txqueuelen 1000 (Local Loopback)
      RX packets 208 bytes 18250 (18.2 KB)
      RX errors 0 dropped 0 overruns 0 frame 0
      TX packets 208 bytes 18250 (18.2 KB)
      TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

      cnv18@cnv18:~$ cat /etc/hostname
      cnv18
      cnv18@cnv18:~$ cat /etc/hosts
      127.0.0.1 localhost
      127.0.1.1 local.cnv.18 cnv18

      # The following lines are desirable for IPv6 capable hosts

      ::1 ip6-localhost ip6-loopback
      fe00::0 ip6-localnet
      ff00::0 ip6-mcastprefix
      ff02::1 ip6-allnodes
      ff02::2 ip6-allrouters
      cnv18@cnv18:~$ cat '/var/lib/gridengine/default/common/act_qmaster'
      cnv18
      cnv18@cnv18:~$ '/usr/lib/gridengine/gethostbyaddr'
      Version: 8.1.9
      usage: gethostbyaddr [-help|-name|-aname|-all]x.x.x.x
      cnv18@cnv18:~$ '/usr/lib/gridengine/gethostbyname'
      Version: 8.1.9
      usage: gethostbyname [-help|-name|-aname|-all]
      cnv18@cnv18:~$ '/usr/lib/gridengine/gethostname'
      Hostname: local.cnv.18
      Aliases: cnv18
      Host Address(es): 127.0.1.1
      cnv18@cnv18:~$ '/usr/lib/gridengine/getservbyname'
      Version: 8.1.9
      usage:
      getservbyname [-help|-number] service | -check port_number

      get number of a tcp service
      cnv18@cnv18:~$ cat '/var/spool/gridengine/qmaster/messages'
      11/19/2020 11:34:16| main|cnv18|W|local configuration cnv18 not defined - using global configuration
      11/19/2020 11:34:16| main|cnv18|E|global configuration not defined
      11/19/2020 11:34:16| main|cnv18|C|setup failed
      cnv18@cnv18:~$

      1. Hi Jose, apologies for the delayed response, things have been a bit hectic. In the end we abandoned SGE in favour of Slurm, which seems more actively supported, and supports everything we needed. I still intend to write up how we did it, but as I was less involved in the technical work, it will take some time.
