[torqueusers] torque stops by itself

David Beer dbeer at adaptivecomputing.com
Mon Mar 24 14:50:39 MDT 2014


Ricardo,

I would try to track this down in two ways:

1. Make sure that ulimit -c is unlimited in the environment that launches
the mom. A lower limit can prevent core files from being written.
2. If that still produces no core file, you could attach to the mom's
process in gdb (or another debugger if preferred). Simply attach and let it
run; when the process stops responding, look at gdb to see whether it
crashed or why it is unresponsive. The gdb prompt can be kept in a screen
session or anything else that will stay open indefinitely.
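
The two steps above could be sketched roughly as follows. This is only a
sketch under assumptions: it assumes Linux with /proc mounted, that the
daemon's process name is pbs_mom, and that pgrep, screen and gdb are
installed on the node. The function names and the "momdebug" session name
are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes Linux (/proc) and a daemon process named pbs_mom.

# Step 1: print the soft core-file limit that a running process really has.
# /proc/<pid>/limits has a row "Max core file size  <soft>  <hard>  <units>".
core_limit_of() {
    awk '/^Max core file size/ { print $5 }' "/proc/$1/limits"
}

# Step 2: attach gdb to the oldest pbs_mom inside a detached screen session
# and let the process continue. "momdebug" is an arbitrary session name.
attach_mom_in_screen() {
    mom_pid="$(pgrep -o -x pbs_mom)" || {
        echo "pbs_mom is not running" >&2
        return 1
    }
    screen -dmS momdebug gdb -p "$mom_pid" -ex continue
}

# Example session (as root on an affected node):
#   ulimit -c unlimited && pbs_mom          # raise the limit, then start the mom
#   core_limit_of "$(pgrep -o -x pbs_mom)"  # check what the mom actually got
#   attach_mom_in_screen                    # later: screen -r momdebug, then "bt"
```

Reading /proc/<pid>/limits checks the limit the running mom actually
inherited, which may differ from what an interactive shell reports if the
mom is started from an init script.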


On Fri, Mar 21, 2014 at 11:41 AM, Ricardo Román Brenes <
roman.ricardo at gmail.com> wrote:

> Hello everyone.
>
> I am having a problem with torque and I don't really know where to look for
> help:
>
> I have a configuration made of 10 nodes:
> 1 torque/maui server
> 9 compute nodes
>
> this is my nodes file:
> [root@meta server_priv]# cat nodes
> cadejos-0.cnca np=8 tesla xeon
> cadejos-1.cnca np=8 tesla xeon
> cadejos-2.cnca np=8 tesla xeon
> cadejos-3.cnca np=8 xeon
> cadejos-4.cnca np=8 xeon
> zarate-0.cnca np=2 ps3
> zarate-1.cnca np=2 ps3
> zarate-2.cnca np=2 ps3
> zarate-3.cnca np=2 ps3
>
>
> and my queues:
> [root@meta server_priv]# qmgr -c 'p s'
> create queue xeon
> set queue xeon queue_type = Execution
> set queue xeon resources_default.neednodes = xeon
> set queue xeon resources_default.nodes = 1
> set queue xeon resources_default.walltime = 01:00:00
> set queue xeon enabled = True
> set queue xeon started = True
> #
> create queue tesla
> set queue tesla queue_type = Execution
> set queue tesla resources_default.neednodes = tesla
> set queue tesla resources_default.nodes = 1
> set queue tesla resources_default.walltime = 01:00:00
> set queue tesla enabled = True
> set queue tesla started = True
> #
> create queue ps3
> set queue ps3 queue_type = Execution
> set queue ps3 resources_default.neednodes = ps3
> set queue ps3 resources_default.nodes = 1
> set queue ps3 resources_default.walltime = 01:00:00
> set queue ps3 enabled = True
> set queue ps3 started = True
> #
> set server acl_hosts = meta.cnca
> set server acl_roots = root@localhost
> set server acl_roots += root@meta.cnca
> set server log_events = 511
> set server mail_from = adm
> set server scheduler_iteration = 600
> set server node_check_rate = 150
> set server tcp_timeout = 6
> set server next_job_number = 69
>
>
> Now, I'm pretty sure the torque version on the zarate nodes is different
> from the one on the cadejos nodes, because they run different OSes
> (fedora11-ppc and centos6-x86).
>
> The problem is that pbs_mom on the zarate nodes seems to stop suddenly,
> without any warning or error message, while the nodes are idle.
>
> I can send jobs to the cadejos nodes just fine, and they run either
> interactively or in batch, but on the zarate nodes nothing runs.
>
> Does anyone have any idea about this?
>
> Thanks.


-- 
David Beer | Senior Software Engineer
Adaptive Computing