[torqueusers] torque stops by itself

Ricardo Román Brenes roman.ricardo at gmail.com
Fri Mar 21 11:41:29 MDT 2014

Hello everyone.

I am having a problem with Torque and I don't really know where to look.

I have a setup made of 10 nodes:
1 torque/maui server
9 compute nodes

This is my nodes file:
[root@meta server_priv]# cat nodes
cadejos-0.cnca np=8 tesla xeon
cadejos-1.cnca np=8 tesla xeon
cadejos-2.cnca np=8 tesla xeon
cadejos-3.cnca np=8 xeon
cadejos-4.cnca np=8 xeon
zarate-0.cnca np=2 ps3
zarate-1.cnca np=2 ps3
zarate-2.cnca np=2 ps3
zarate-3.cnca np=2 ps3

And these are my queues and server settings:
[root@meta server_priv]# qmgr -c 'p s'
create queue xeon
set queue xeon queue_type = Execution
set queue xeon resources_default.neednodes = xeon
set queue xeon resources_default.nodes = 1
set queue xeon resources_default.walltime = 01:00:00
set queue xeon enabled = True
set queue xeon started = True
create queue tesla
set queue tesla queue_type = Execution
set queue tesla resources_default.neednodes = tesla
set queue tesla resources_default.nodes = 1
set queue tesla resources_default.walltime = 01:00:00
set queue tesla enabled = True
set queue tesla started = True
create queue ps3
set queue ps3 queue_type = Execution
set queue ps3 resources_default.neednodes = ps3
set queue ps3 resources_default.nodes = 1
set queue ps3 resources_default.walltime = 01:00:00
set queue ps3 enabled = True
set queue ps3 started = True
set server acl_hosts = meta.cnca
set server acl_roots = root@localhost
set server acl_roots += root@meta.cnca
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 69
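
As a cross-check of the two listings: each queue's resources_default.neednodes names a property from the nodes file, so jobs in the ps3 queue should only be able to land on the zarate nodes. Below is a minimal sketch that tallies the np slots behind each property. It works on a scratch copy of the nodes file; the temp file and the function name slots_per_property are illustrative, not part of Torque.

```shell
#!/bin/sh
# Scratch copy of the nodes file shown above (the temp path is illustrative).
nodes_file=$(mktemp)
cat > "$nodes_file" <<'EOF'
cadejos-0.cnca np=8 tesla xeon
cadejos-1.cnca np=8 tesla xeon
cadejos-2.cnca np=8 tesla xeon
cadejos-3.cnca np=8 xeon
cadejos-4.cnca np=8 xeon
zarate-0.cnca np=2 ps3
zarate-1.cnca np=2 ps3
zarate-2.cnca np=2 ps3
zarate-3.cnca np=2 ps3
EOF

# Hypothetical helper: sum the np= value of every node carrying each property.
slots_per_property() {
    awk '{
        np = 0
        # First pass: find the np= field on this line.
        for (i = 2; i <= NF; i++)
            if ($i ~ /^np=/) np = substr($i, 4) + 0
        # Second pass: every field without "=" is treated as a property.
        for (i = 2; i <= NF; i++)
            if ($i !~ /=/) slots[$i] += np
    }
    END { for (p in slots) print p, slots[p] }' "$1" | sort
}

slots_per_property "$nodes_file"
```

For the file above this gives 8 slots for ps3, 24 for tesla and 40 for xeon, so the ps3 queue depends entirely on the four zarate nodes.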

Now, I'm pretty sure the Torque version on the zarate nodes is different
from the one on the cadejos nodes, because they run different operating
systems (Fedora 11 on PPC and CentOS 6 on x86, respectively).
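
One way the version difference could be confirmed is to run the same small check on one cadejos node and one zarate node and compare the output. This is only a sketch: it assumes the installed pbs_mom accepts --version, and mom_version is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the version reported by the local pbs_mom,
# or a note if the binary is not in PATH.
mom_version() {
    if command -v pbs_mom >/dev/null 2>&1; then
        # Assumption: this pbs_mom build supports --version.
        pbs_mom --version 2>&1
    else
        echo "pbs_mom not found in PATH"
    fi
}

# Prefix with the node name so outputs from several nodes can be compared.
echo "$(uname -n): $(mom_version)"
```

If the versions really do differ between the server and the zarate nodes, that would be worth mentioning alongside the rest of the configuration.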

The problem is that pbs_mom on the zarate nodes seems to stop suddenly,
without any warning or error message, even while the nodes are idle.

I can send jobs to the cadejos nodes just fine, and they run either
interactively or in batch, but on the zarate nodes nothing runs.

Does anyone have any idea about this?

