[torqueusers] rejected request

Corey Hirschman corey at rentec.com
Wed Oct 20 12:53:37 MDT 2004


I have a strange problem that I am having trouble tracking down, and I am hoping that someone out there can help me find the cause.

Maui has started to put a lot of jobs in a deferred state, and errors such as this are being logged:

maui.log.1:10/20 12:45:25 PBSJobLoad(192275,192275.monstersq,J,TaskList,0)
maui.log.1:10/20 12:45:25 MReqCreate(192275,srcRQ,dstRQ)
maui.log.1:10/20 12:45:25 INFO:     Job[067] loaded '192275'   1      wwc research 4320000       Idle   0 1098290683   [NONE] [NONE] [NONE] >=      0 >=      0 [NONE] 1098290683
maui.log.1:10/20 12:45:25 [067]           192275   1:  2:  1(1) ALL 50:00:00:00( ????????)      wwc research       Idle DEFAULT  [workq 1] 1098290683   [NONE] [NONE] [NONE] >=      0 >=      0 [NONE]
maui.log.1:10/20 12:45:25 INFO:     236 feasible tasks found for job 192275:0 in  partition DEFAULT (1 Needed)
maui.log.1:10/20 12:45:25 INFO:     tasks located for job 192275:  1 of 1 required (13 feasible)
maui.log.1:10/20 12:45:25 MJobStart(192275)
maui.log.1:10/20 12:45:25 MRMJobStart(192275,Msg,SC)
maui.log.1:10/20 12:45:25 PBSJobStart(192275,base,Msg,SC)
maui.log.1:10/20 12:45:25 PBSJobModify(192275,Resource_List,Resource,monster620)
maui.log.1:10/20 12:45:31 ERROR:    job '192275' cannot be started: (rc: 15041 errmsg: 'Execution server rejected request'  hostlist: 'monster620')
maui.log.1:10/20 12:45:31 ERROR:    cannot start job '192275' in partition DEFAULT
maui.log.1:10/20 12:45:31 MJobPReserve(192275,DEFAULT,ResCount)
maui.log.1:10/20 12:45:31 INFO:     236 feasible tasks found for job 192275:0 in  partition DEFAULT (1 Needed)
maui.log.1:10/20 12:45:31 INFO:     236 feasible tasks found for job 192275:0 in  partition DEFAULT (1 Needed)
maui.log.1:10/20 12:45:31 INFO:     located resources for 1 tasks (13) in best partition DEFAULT for job 192275 at time 0:00:01
maui.log.1:10/20 12:45:31 INFO:     tasks located for job 192275:  1 of 1 required (13 feasible)
maui.log.1:10/20 12:45:31 INFO:     job '192275' reserved 1 tasks (partition DEFAULT) to start in 0:00:01 on Wed Oct 20 12:44:44

Everything looks normal at first: Maui sees the job, checks available resources, finds a node suitable to run the job on, and submits the job. Then the job gets rejected:

maui.log.1:10/20 12:45:31 ERROR:    job '192275' cannot be started: (rc: 15041 errmsg: 'Execution server rejected request'  hostlist: 'monster620')
maui.log.1:10/20 12:45:31 ERROR:    cannot start job '192275' in partition DEFAULT

I have looked on the node it tried to run the job on, monster620, and there is no record of the job ID in the MOM logs.  It does not appear that the job was ever actually submitted to the node, so I don't know how it was rejected.
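
One way to check whether the rejections cluster on particular nodes is to tally the failed starts by host list. The following is only a minimal sketch: the regular expression is modeled on the single ERROR line quoted above, so other Maui versions or messages may need a different pattern, and the function name rejected_hosts is hypothetical.

```python
import re
from collections import Counter

# Pattern modeled on the "cannot be started" ERROR line quoted in this post.
# Assumption: other start failures use the same layout.
START_FAILURE = re.compile(
    r"job '(?P<job>[^']+)' cannot be started: "
    r"\(rc: (?P<rc>\d+) errmsg: '(?P<errmsg>[^']*)'\s+"
    r"hostlist: '(?P<hosts>[^']*)'\)"
)


def rejected_hosts(lines):
    """Count start failures per host list across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = START_FAILURE.search(line)
        if match:
            # The host list is kept as one string; how multi-node jobs
            # are delimited inside it is not assumed here.
            counts[match.group("hosts")] += 1
    return counts
```

Passing an open log file (for example, the maui.log.1 file quoted above) to rejected_hosts and printing the result of most_common() would show whether a few nodes account for most of the rejections.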

In addition to this, there are a large number of errors such as this one:

10/18/2004 23:59:30;0004;PBS_Server;Svr;WARNING;!!! unable to contact node <monsterxyz>

These seem to increase in proportion to the number of jobs being submitted to the cluster: on days when a lot of jobs are submitted we get a lot of these errors, and on slow days we get far fewer.
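
To put numbers on that impression, the warnings can be counted per day and compared against daily job counts. This is a rough sketch that assumes every warning has the same semicolon-separated layout as the line quoted above, with the date as the first whitespace-separated token; the function name warnings_per_day is hypothetical.

```python
from collections import Counter

MARKER = "unable to contact node"


def warnings_per_day(lines):
    """Count 'unable to contact node' warnings per date string."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        # First ';'-separated field is assumed to be "MM/DD/YYYY HH:MM:SS".
        timestamp = line.split(";", 1)[0].split()
        if not timestamp:
            continue
        counts[timestamp[0]] += 1
    return counts
```

Feeding the server log file to warnings_per_day gives one count per date, which can then be set beside the number of jobs submitted on the same days.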

The network does not appear to be the problem, as we have other machines on these same subnets running PBSPro, and they are not experiencing such problems.

Does anyone have any ideas?

Thank you,

Corey Hirschman


