[torqueusers] torque stops by itself

Ricardo Román Brenes roman.ricardo at gmail.com
Fri Mar 21 11:41:29 MDT 2014


Hello everyone.

I am having a problem with Torque and I don't really know where to look for
help.

I have a configuration made up of 10 nodes:
1 torque/maui server
9 compute nodes

This is my nodes file:
[root@meta server_priv]# cat nodes
cadejos-0.cnca np=8 tesla xeon
cadejos-1.cnca np=8 tesla xeon
cadejos-2.cnca np=8 tesla xeon
cadejos-3.cnca np=8 xeon
cadejos-4.cnca np=8 xeon
zarate-0.cnca np=2 ps3
zarate-1.cnca np=2 ps3
zarate-2.cnca np=2 ps3
zarate-3.cnca np=2 ps3
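
As a reading aid for the nodes file above, here is a minimal Python sketch
that tallies how many nodes and job slots carry each property. It is not part
of the Torque setup: the helper name slots_by_property is hypothetical, the
file contents are pasted in as a string, and the fallback of np=1 when the
attribute is missing is an assumption about Torque's default.

```python
from collections import defaultdict

# Contents of server_priv/nodes, pasted from the listing above.
NODES_FILE = """\
cadejos-0.cnca np=8 tesla xeon
cadejos-1.cnca np=8 tesla xeon
cadejos-2.cnca np=8 tesla xeon
cadejos-3.cnca np=8 xeon
cadejos-4.cnca np=8 xeon
zarate-0.cnca np=2 ps3
zarate-1.cnca np=2 ps3
zarate-2.cnca np=2 ps3
zarate-3.cnca np=2 ps3
"""


def slots_by_property(text):
    """Return {property: (node_count, total_np)} for a nodes file.

    Hypothetical helper: every bare word after the hostname is treated
    as a node property, and np=N as the slot count of that node.
    """
    node_count = defaultdict(int)
    slot_count = defaultdict(int)
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comments
        np = 1  # assumed default when np= is not given
        props = []
        for field in fields[1:]:
            if field.startswith("np="):
                np = int(field[3:])
            elif "=" not in field:
                props.append(field)
        for prop in props:
            node_count[prop] += 1
            slot_count[prop] += np
    return {p: (node_count[p], slot_count[p]) for p in node_count}


summary = slots_by_property(NODES_FILE)
for prop, (count, total) in sorted(summary.items()):
    print(f"{prop}: {count} nodes, {total} slots")
```

The tally makes one detail of the file easy to see: the three tesla nodes
also carry the xeon property, so the xeon property covers all five cadejos
nodes, while ps3 covers only the four zarate nodes.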


And my queues:
[root@meta server_priv]# qmgr -c 'p s'
create queue xeon
set queue xeon queue_type = Execution
set queue xeon resources_default.neednodes = xeon
set queue xeon resources_default.nodes = 1
set queue xeon resources_default.walltime = 01:00:00
set queue xeon enabled = True
set queue xeon started = True
#
create queue tesla
set queue tesla queue_type = Execution
set queue tesla resources_default.neednodes = tesla
set queue tesla resources_default.nodes = 1
set queue tesla resources_default.walltime = 01:00:00
set queue tesla enabled = True
set queue tesla started = True
#
create queue ps3
set queue ps3 queue_type = Execution
set queue ps3 resources_default.neednodes = ps3
set queue ps3 resources_default.nodes = 1
set queue ps3 resources_default.walltime = 01:00:00
set queue ps3 enabled = True
set queue ps3 started = True
#
set server acl_hosts = meta.cnca
set server acl_roots = root@localhost
set server acl_roots += root@meta.cnca
set server log_events = 511
set server mail_from = adm
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server next_job_number = 69
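
Each queue in the output above steers its jobs to a node property through
resources_default.neednodes. The following Python sketch only illustrates that
mapping by parsing the pasted text; it does not call qmgr, the helper name
queue_defaults is hypothetical, and the parsing assumes the simple
"set queue NAME ATTR = VALUE" layout shown in this output.

```python
# Queue part of the `qmgr -c 'p s'` output, pasted from above.
QMGR_OUTPUT = """\
create queue xeon
set queue xeon queue_type = Execution
set queue xeon resources_default.neednodes = xeon
set queue xeon resources_default.nodes = 1
set queue xeon resources_default.walltime = 01:00:00
set queue xeon enabled = True
set queue xeon started = True
#
create queue tesla
set queue tesla queue_type = Execution
set queue tesla resources_default.neednodes = tesla
set queue tesla resources_default.nodes = 1
set queue tesla resources_default.walltime = 01:00:00
set queue tesla enabled = True
set queue tesla started = True
#
create queue ps3
set queue ps3 queue_type = Execution
set queue ps3 resources_default.neednodes = ps3
set queue ps3 resources_default.nodes = 1
set queue ps3 resources_default.walltime = 01:00:00
set queue ps3 enabled = True
set queue ps3 started = True
"""


def queue_defaults(text):
    """Return {queue_name: {attribute: value}} from pasted qmgr output.

    Hypothetical helper: handles only 'create queue' and
    'set queue NAME ATTR = VALUE' lines and ignores everything else.
    """
    queues = {}
    for line in text.splitlines():
        words = line.split()
        if words[:2] == ["create", "queue"] and len(words) >= 3:
            queues.setdefault(words[2], {})
        elif words[:2] == ["set", "queue"] and "=" in words:
            name, attr = words[2], words[3]
            value = " ".join(words[words.index("=") + 1:])
            queues.setdefault(name, {})[attr] = value
    return queues


queues = queue_defaults(QMGR_OUTPUT)
for name, attrs in queues.items():
    needed = attrs.get("resources_default.neednodes")
    print(f"queue {name} -> nodes with property {needed}")
```

Read this way, the ps3 queue is the only one whose jobs are directed at the
zarate nodes.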


Now, I'm pretty sure the Torque version on the zarate nodes is different from
the one on the cadejos nodes, because they run different operating systems
(Fedora 11 ppc on zarate and CentOS 6 x86 on cadejos).

The problem is that pbs_mom on the zarate nodes seems to stop suddenly,
without any warning or error message, even while the nodes are idle.

I can send jobs to the cadejos nodes just fine and they run, either
interactively or in batch, but on the zarate nodes nothing runs.

Does anyone have any idea what could be causing this?

Thanks.