[torqueusers] torque-2.4.1b1: PBS_Server; LOG_ERROR::Undefined attribute (15002) in send_job, child failed in previous commit request for job 3442.nfssrv.cluster.local

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Wed Oct 21 04:25:08 MDT 2009


Hi,
  I am having trouble getting my cluster running with torque-2.4.1b1. The PBS_Server log shows:

10/20/2009 23:59:25;0008;PBS_Server;Job;reply_send;Reply sent for request type ResourceQuery on socket 10
10/20/2009 23:59:25;0008;PBS_Server;Job;dispatch_request;dispatching request ModifyJob on sd=10
10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;attr comment modified
10/20/2009 23:59:25;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 3442.nfssrv.cluster.local state from QUEUED-QUEUED to QUEUED-QUEUED (1-10)
10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;Job Modified at request of Scheduler at nfssrv.cluster.local
10/20/2009 23:59:25;0008;PBS_Server;Job;reply_send;Reply sent for request type ModifyJob on socket 10
10/20/2009 23:59:25;0008;PBS_Server;Job;dispatch_request;dispatching request RunJob on sd=10
10/20/2009 23:59:25;0040;PBS_Server;Req;set_nodes;allocating nodes for job 3442.nfssrv.cluster.local with node expression '1:ppn=4'
10/20/2009 23:59:25;0040;PBS_Server;Req;node_spec;entered spec=1:ppn=4
10/20/2009 23:59:25;0040;PBS_Server;Req;node_spec;job allocation debug: 1 requested, 128 svr_clnodes, 32 svr_totnodes
10/20/2009 23:59:25;0040;PBS_Server;Req;node_spec;job allocation debug(2): 1 requested, 32 svr_numnodes
10/20/2009 23:59:25;0040;PBS_Server;Req;node_spec;job allocation debug(3): returning 1 requested
10/20/2009 23:59:25;0040;PBS_Server;Req;add_job_to_node;allocated node node004/0 to job 3442.nfssrv.cluster.local (nsnfree=4)
10/20/2009 23:59:25;0040;PBS_Server;Req;add_job_to_node;allocated node node004/1 to job 3442.nfssrv.cluster.local (nsnfree=3)
10/20/2009 23:59:25;0040;PBS_Server;Req;add_job_to_node;allocated node node004/2 to job 3442.nfssrv.cluster.local (nsnfree=2)
10/20/2009 23:59:25;0040;PBS_Server;Req;add_job_to_node;allocated node node004/3 to job 3442.nfssrv.cluster.local (nsnfree=1)
10/20/2009 23:59:25;0040;PBS_Server;Req;set_nodes;job 3442.nfssrv.cluster.local allocated 4 nodes (nodelist=node004/3+node004/2+node004/1+node004/0)
10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;Job Run at request of Scheduler at nfssrv.cluster.local
10/20/2009 23:59:25;0001;PBS_Server;Svr;PBS_Server;svr_setjobstate: setting job 3442.nfssrv.cluster.local state from QUEUED-QUEUED to RUNNING-PRERUN (4-40)
10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;forking in send_job
10/20/2009 23:59:25;0004;PBS_Server;Svr;svr_connect;attempting connect to host 192.168.10.4 port 15002
10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;send of job to node004 failed error = 15002
10/20/2009 23:59:25;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Undefined attribute  (15002) in send_job, child failed in previous commit request for job 3442.nfssrv.cluster.local
10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;entering post_sendmom
10/20/2009 23:59:25;0002;PBS_Server;Job;3442.nfssrv.cluster.local;child reported failure for job after 0 seconds (dest=node004), rc=1
10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;unable to run job, MOM rejected/rc=1
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;freeing nodes for job 3442.nfssrv.cluster.local
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;freeing node node004/0 from job 3442.nfssrv.cluster.local (nsnfree=0)
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;increased sub-node free count to 1 of 4
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;freeing node node004/1 from job 3442.nfssrv.cluster.local (nsnfree=1)
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;increased sub-node free count to 2 of 4
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;freeing node node004/2 from job 3442.nfssrv.cluster.local (nsnfree=2)
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;increased sub-node free count to 3 of 4
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;freeing node node004/3 from job 3442.nfssrv.cluster.local (nsnfree=3)
10/20/2009 23:59:25;0040;PBS_Server;Req;free_nodes;increased sub-node free count to 4 of 4

Does anybody have an idea where to look? Jobs stay in the 'Q' state, and I have disabled the setting that helps starving jobs.
One more thing: pbs_sched has died a few times, but I have not managed to find any core dumps.
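
In case it helps anyone reading the log above, here is a minimal sketch in Python of how the failure lines for one job could be picked out of such a server log. It assumes every record has six semicolon-separated fields, as in the excerpt; the field names (timestamp, event_code, daemon, obj_class, obj_name, message) and the keyword list are my own labels for illustration, not anything defined by TORQUE.

```python
from typing import Iterable, List, NamedTuple, Optional


class LogRecord(NamedTuple):
    # Field names are illustrative labels, not official TORQUE terminology.
    timestamp: str
    event_code: str
    daemon: str
    obj_class: str
    obj_name: str
    message: str


# Assumed keywords that mark a failure in the message text.
FAILURE_WORDS = ("failed", "failure", "rejected", "LOG_ERROR")


def parse_line(line: str) -> Optional[LogRecord]:
    """Split one log line into six fields; return None if it does not fit."""
    # maxsplit=5 keeps any further semicolons inside the message field.
    parts = line.rstrip("\n").split(";", 5)
    if len(parts) != 6:
        return None
    return LogRecord(*parts)


def failures_for_job(lines: Iterable[str], job_id: str) -> List[LogRecord]:
    """Return records that mention job_id and contain a failure keyword."""
    hits = []
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if job_id not in rec.obj_name and job_id not in rec.message:
            continue
        if any(word in rec.message for word in FAILURE_WORDS):
            hits.append(rec)
    return hits


# Lines copied from the log excerpt in this message.
SAMPLE = [
    "10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;forking in send_job",
    "10/20/2009 23:59:25;0004;PBS_Server;Svr;svr_connect;attempting connect to host 192.168.10.4 port 15002",
    "10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;send of job to node004 failed error = 15002",
    "10/20/2009 23:59:25;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::Undefined attribute  (15002) in send_job, child failed in previous commit request for job 3442.nfssrv.cluster.local",
    "10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;entering post_sendmom",
    "10/20/2009 23:59:25;0002;PBS_Server;Job;3442.nfssrv.cluster.local;child reported failure for job after 0 seconds (dest=node004), rc=1",
    "10/20/2009 23:59:25;0008;PBS_Server;Job;3442.nfssrv.cluster.local;unable to run job, MOM rejected/rc=1",
]

if __name__ == "__main__":
    for rec in failures_for_job(SAMPLE, "3442.nfssrv.cluster.local"):
        print(rec.timestamp, rec.event_code, rec.message)
```

Matching on both the object-name field and the message is deliberate: the LOG_ERROR record carries PBS_Server in the name field and mentions the job only in its message text.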

Thank you,
Martin

