
Torque/PBS queue

The last ingredient of the manual is how to set up a server that can handle computational jobs to be run on an arbitrary machine of the cluster. For now we only show how to create a single queue for non-interactive jobs. With this information it should be fairly straightforward to set up other things, which may be added to the manual in the future.

Roughly, a cluster is composed of one server and one or more computation nodes. The server itself may be declared to be a computation node and be used to run jobs, but since we already have some machines with little power, I recommend using one of those machines as the node everybody logs into to send the jobs (that machine, provided it has enough disk, could also be the GlusterFS server). In our case the server is going to be 'deutsch' and the computation nodes are going to be 'kitaev' and 'toffoli'.

On the master node

We start by setting up the master node.
yum install 'torque*'
We then create a server with a basic queue in which jobs get a default wall time of four days and at most 12 jobs run at the same time. Feel free to customise the lines below to your needs:
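# This is the user who will manage the queues
# (placeholder value, replace it with your own account name)
THEUSER=youruser
# This is the host name of the master node
# (assumed here to be the machine you are running these commands on)
THEHOST=$(hostname -f)
# We create the configuration from scratch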
pbs_server -D -t create
# ... and press Ctrl-C to interrupt this fake server
# We start the server for real now
service pbs_server start
chkconfig pbs_server on
service trqauthd start
chkconfig trqauthd on
# configure manager/operator user
qmgr -c "set server operators += $THEUSER@$THEHOST"
qmgr -c "set server managers += $THEUSER@$THEHOST"
# scheduling options
qmgr -c 'set server scheduling = true'
qmgr -c 'set server keep_completed = 300'
# create the default queue called 'batch'
# by default each job gets a single node and
# a wall time of four days; at most 12 jobs
# are allowed to run at one time.
qmgr -c 'create queue batch'
qmgr -c 'set queue batch queue_type = execution'
qmgr -c 'set queue batch started = true'
qmgr -c 'set queue batch enabled = true'
qmgr -c 'set queue batch resources_default.walltime = 4:00:00:00'
qmgr -c 'set queue batch resources_default.nodes = 1'
qmgr -c 'set queue batch max_running = 12'
qmgr -c 'set server default_queue = batch'
echo "kitaev.iff.csic.es np=20" > /var/lib/torque/server_priv/nodes
echo "toffoli.iff.csic.es np=16" >> /var/lib/torque/server_priv/nodes
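Each line of the nodes file names a host and, through np, the number of job slots it offers. As a quick sanity check of that file, the sketch below adds up the np values. The function name total_slots is my own and not part of Torque, and skipping lines that start with '#' is a choice of this sketch.

```shell
# Sum the np= values of a Torque nodes file read from standard input.
# Empty lines and lines starting with '#' are ignored by this sketch.
total_slots() {
    awk '
        /^[ \t]*(#|$)/ { next }
        {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^np=[0-9]+$/) { sub(/^np=/, "", $i); total += $i }
        }
        END { print total + 0 }
    '
}

# Example with the two nodes declared above; prints 36
printf '%s\n' 'kitaev.iff.csic.es np=20' 'toffoli.iff.csic.es np=16' | total_slots

# On the master node one would run instead:
#   total_slots < /var/lib/torque/server_priv/nodes
```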

We now have to open the ports so that the server accepts connections from the slave nodes:
# iptables -D INPUT -j REJECT --reject-with icmp-host-prohibited
# iptables -A INPUT -p tcp -m tcp --dport 15000:15004 -j ACCEPT
# iptables -A INPUT -p udp -m udp --dport 15000:15004 -j ACCEPT
# iptables -A INPUT -j REJECT --reject-with icmp-host-prohibited
# service iptables save
# service iptables restart
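The order of the rules matters: the REJECT rule is deleted and appended again so that the two ACCEPT rules end up before it. The following sketch checks that order on the output of iptables -S INPUT. The function name torque_rules_ok is hypothetical, and the sample rules fed to it here are only an illustration of what that output looks like.

```shell
# Read `iptables -S INPUT` output on standard input and succeed only if
# ACCEPT rules for ports 15000:15004 exist for both tcp and udp and
# both come before the first REJECT rule (if there is one).
torque_rules_ok() {
    awk '
        / -j REJECT/ && !reject { reject = NR }
        / --dport 15000:15004 / && / -j ACCEPT/ {
            if (/ -p tcp / && !tcp) tcp = NR
            if (/ -p udp / && !udp) udp = NR
        }
        END {
            ok = tcp && udp && (!reject || (tcp < reject && udp < reject))
            exit (ok ? 0 : 1)
        }
    '
}

# Demonstration on sample rules; on a real machine, run as root:
#   iptables -S INPUT | torque_rules_ok && echo "rules in order"
printf '%s\n' \
    '-P INPUT ACCEPT' \
    '-A INPUT -p tcp -m tcp --dport 15000:15004 -j ACCEPT' \
    '-A INPUT -p udp -m udp --dport 15000:15004 -j ACCEPT' \
    '-A INPUT -j REJECT --reject-with icmp-host-prohibited' \
    | torque_rules_ok && echo "rules in order"
```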

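We finally have to set up the server and the scheduler to run at boot time:
chkconfig pbs_server on
chkconfig pbs_sched on
# start the scheduler now as well, so that queued jobs get dispatched
service pbs_sched start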

On the computation nodes

The following steps have to be followed on all nodes that will be slaves of the server and will run code from it. We first install the packages:
yum install 'torque*'

We then inform the system who is directing the operations. Here $THEMASTER stands for the host name of the master node:
echo $THEMASTER > /etc/torque/server_name
echo '$pbs_server = '$THEMASTER > /etc/torque/mom/config
echo '$pbs_server = '$THEMASTER > /var/lib/torque/mom_priv/config
touch /var/lib/torque/mom_priv/mom.layout

This is the right time to open the ports:
# iptables -D INPUT -j REJECT --reject-with icmp-host-prohibited
# iptables -A INPUT -p tcp -m tcp --dport 15000:15004 -j ACCEPT
# iptables -A INPUT -p udp -m udp --dport 15000:15004 -j ACCEPT
# iptables -A INPUT -j REJECT --reject-with icmp-host-prohibited
# service iptables save
# service iptables restart

And we finally launch the pbs_mom daemon:
chkconfig pbs_mom on
service pbs_mom start

On all nodes

Every user who wishes to submit jobs needs to be able to ssh in and out of the machines where the jobs are going to run. My preferred way of doing this consists of two steps. First of all, we need to inform every computer about the SSH host keys of all its peers:
# rm ~/.ssh/known_hosts
# for i in deutsch kitaev toffoli; do ssh $i echo done; done
# cp ~/.ssh/known_hosts /etc/ssh/ssh_known_hosts

After this, for each user that you have, you will need to do the following from their own account (i.e. you do not do this as root):
$ ssh-keygen -t dsa
$ cat .ssh/id_dsa.pub >> .ssh/authorized_keys
This enables the user to log back in from any machine into their home directory. Because home directories are shared among computers, you only need to do this once.
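To confirm that both steps worked, one can try a login to every machine that must not ask for anything. The sketch below is one way of doing it. The function check_ssh and the SSH_CMD variable are my own inventions for illustration (SSH_CMD only exists so that the function can be tried without a network), while BatchMode and ConnectTimeout are standard ssh options.

```shell
# Try a non-interactive login to every host given as an argument and
# report the result. BatchMode=yes makes ssh fail instead of prompting
# for a password or for confirmation of a host key.
# SSH_CMD is a hypothetical override, used only for dry runs.
check_ssh() {
    failed=0
    for host in "$@"; do
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            failed=1
        fi
    done
    return "$failed"
}

# Usage, from the account of the user who will submit jobs:
#   check_ssh deutsch kitaev toffoli
```

The function returns a non-zero status if any host fails, so it can also be used inside a script.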


There are several things to check. The first one is that all nodes are up and running:
# pbsnodes -a
This should list all nodes in our cluster, each with a state of "free" or "down", as follows:
     state = free
     np = 20
     ntype = cluster
     status = rectime=1406649812,varattr=,jobs=,state=free,netload=8594482128,gres=pbs_server:= qubit.iff.csic.es,loadave=0.00,ncpus=24,physmem=32856760kb,availmem=43005596kb,totmem=44579508kb,idletime=645,nusers=2,nsessions=5,sessions=1482 1485 1491 1575 1587,uname=Linux kitaev.iff.csic.es 3.11.10-301.fc20.x86_64 #1 SMP Thu Dec 5 14:01:17 UTC 2013 x86_64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003
     gpus = 0
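With more than a couple of nodes this listing becomes long. The sketch below condenses it to one line per node. It assumes that each node's block starts with the host name on an unindented line followed by indented attribute lines like the ones above. The function name node_states is hypothetical and the sample input is made up for the demonstration.

```shell
# Reduce `pbsnodes -a` output (read from standard input) to
# "hostname state" pairs, one per line.
node_states() {
    awk '
        /^[^ \t]/ { node = $1 }                      # unindented line: node name
        $1 == "state" && $2 == "=" { print node, $3 } # indented "state = ..." line
    '
}

# Demonstration on made-up sample output; on the master node run:
#   pbsnodes -a | node_states
printf '%s\n' \
    'kitaev.iff.csic.es' \
    '     state = free' \
    '     np = 20' \
    '' \
    'toffoli.iff.csic.es' \
    '     state = down' \
    '     np = 16' \
    | node_states
```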

If a node is listed as down, there are several possible causes:
  1. The pbs_mom daemon is not running. Verify by running service pbs_mom status on the node.
  2. The firewall prevents connections to and from the server. You can verify this by using netstat -a on each node to see which connections are open. Check also with iptables -S INPUT that the rules are properly installed on every computer.
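Once all nodes are listed as free, a further check is to send a small job to the 'batch' queue. The following is a minimal sketch. The file name testjob.sh, the job name and the requested resources are arbitrary choices of mine, and the lines starting with #PBS are the usual qsub directives.

```shell
# Write a small job script; qsub reads the lines starting with #PBS.
cat > testjob.sh <<'EOF'
#!/bin/sh
#PBS -N testjob
#PBS -q batch
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
# PBS_O_WORKDIR is set by Torque to the directory where qsub was run
cd "${PBS_O_WORKDIR:-.}" || exit 1
echo "running on $(uname -n)"
EOF

# Submit it only where qsub is available, then look at the queue
if command -v qsub >/dev/null 2>&1; then
    qsub testjob.sh
    qstat
else
    echo "qsub not found; run this on the master node as a normal user"
fi
```

If everything works, the output of the job should appear, after it finishes, in a file named after the job (something like testjob.o followed by the job number) in the directory from which it was submitted.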

More troubleshooting

Problem #1

In Fedora, the package does not always set permissions properly and sometimes directories are missing. If you get permission problems with pbs_mom or pbs_server (they do not start), try changing the permissions yourself, as suggested in https://bugs.launchpad.net/ubuntu/+source/torque/+bug/223649:

chmod 1777 /var/lib/torque/spool
chmod 1777 /var/lib/torque/undelivered

Note that you may even have to do this after every upgrade or reinstall of torque!

Problem #2

trqauthd does not start together with pbs_server; you may have to start it manually until the bug is fixed.

Problem #3

Upon upgrade, Fedora may decide to remove your files in /etc/torque, replacing them with others. If you run into problems, look there for files named *.rpmsave.