[Mauiusers] Maui and Slurm: Scheduling seems to fail from output from checkjob and log but job is running

Balle, Susanne susanne.balle at hp.com
Wed Dec 15 13:25:55 MST 2004


Hi

Hi,

I am running into a new problem with Maui and SLURM. I submit a job with
srun and the job complete but I get the following Alert and Error output
from checkjob 30 and in the maui.log I am told that it cannot locate
checksum and that cannot start job 30 but it looks like the job get
scheduled and run correctly.

The output from checkjob 30, of the running program and an extract of
maui.log is enclosed.

Does anybody have any suggestions on how I can get rid of these Alerts
as well as what I am doing wrong?

Thanks for your help,

Susanne

12/15 20:08:08 MSURecvPacket(8,Buffer,32,NULL,9000000)
12/15 20:08:08 ALERT:    cannot locate checksum 'TS=1103159288
AUTH=slurm DT=SC=1'
12/15 20:08:08 ERROR:    cannot receive data from server XC14N16:7321
12/15 20:08:08 MSUDisconnect(S)
12/15 20:08:08 ALERT:    cannot start job '30' on WIKI RM on 2 procs
(command failure)
12/15 20:08:08 ALERT:    cannot start job 30 (RM 'XC14N16' failed in
function 'jobstart')
12/15 20:08:08 WARNING:  cannot start job '30' through resource manager
12/15 20:08:08 ALERT:    job '30' deferred after 1 failed start attempts
(API failure on last attempt)
12/15 20:08:08 MJobSetHold(30,16,1:00:00,RMFailure,)
12/15 20:08:08 ALERT:    job '30' cannot run (deferring job for 3600
seconds)
12/15 20:08:08 MSysRegEvent(JOBDEFER:  defer hold placed on job '30'.
reason: 'RMFailure',0,0,1)
12/15 20:08:08 MSysLaunchAction(ASList,1)
12/15 20:08:08 ERROR:    cannot start job '30' in partition lsf
12/15 20:08:08 MJobPReserve(30,lsf,ResCount,ResCountRej)
12/15 20:08:08 MJobReserve(30,Priority)
12/15 20:08:08 INFO:     10 feasible tasks found for job 30:0 in
partition lsf (2 Needed)
12/15 20:08:08 INFO:     10 feasible tasks found for job 30:0 in
partition lsf (2 Needed)
12/15 20:08:08 INFO:     located resources for 2 tasks (10) in best
partition lsf for job 30 at time 00:00:01
12/15 20:08:08 INFO:     tasks located for job 30:  2 of 2 required (10
feasible)
12/15 20:08:08 MJobDistributeTasks(30,XC14N16,NodeList,TaskMap)
12/15 20:08:08 MResJCreate(30,MNodeList,00:00:01,Priority,Res)
12/15 20:08:08 INFO:     job '30' reserved 2 tasks (partition lsf) to
start in 00:00:01 on Wed Dec 15 20:08:09
 (WC: 3600)


[root at xc14n16 maui-3.2.6p9]# checkjob 30

checking job 30

State: Idle  EState: Deferred
Creds:  user:test  group:test  qos:DEFAULT
WallTime: 00:00:00 of 1:00:00
SubmitTime: Wed Dec 15 20:07:57
  (Time Queued  Total: 00:00:11  Eligible: 00:00:11)

StartDate: 00:00:01  Wed Dec 15 20:08:09
Total Tasks: 2

Req[0]  TaskCount: 2  Partition: ALL
Network: [NONE]  Memory >= 1M  Disk >= 1M  Swap >= 0
Opsys: [NONE]  Arch: [NONE]  Features: [NONE]
NodeCount: 1


IWD: [NONE]  Executable:  [NONE]
Bypass: 0  StartCount: 1
PartitionMask: [lsf]
job is deferred.  Reason:  RMFailure  ()
Holds:    Defer  (hold reason:  RMFailure)
PE:  2.00  StartPriority:  1
cannot select job 30 for partition lsf (job hold active)


[test at xc14n16 test]$ srun -n 2 -t 60 ./slurm.sh
srun: job 30 queued and waiting for resources
srun: job 30 has been allocated resources
xc14n13
sleeping 10
xc14n13
sleeping 10

finished
finished
[test at xc14n16 test]$

Extract from maui.log that has something to do with Job 30:

12/15 20:08:03 MUISProcessRequest(S,ErrMsg)
12/15 20:08:03 MUGetIndex(checkjob AUTH=root ARG=3 30 0 [NONE] [NONE]
,ValList,0)
12/15 20:08:03 INFO:   service 'checkjob A' not supported in
MUISProcessRequest
12/15 20:08:03 MUGetIndex(checkjob,ValList,0)
12/15 20:08:03 INFO:     client '[NONE]' read (0 bytes) initiating
service call for 'checkjob' (Auth: root)
12/15 20:08:03 UIJobShow(RBuffer,SBuffer,1,root,BufSize)
12/15 20:08:03 INFO:     cannot locate job '30' in UIJobShow()
12/15 20:08:03 MSUSendData(S,5000000,TRUE,TRUE)
12/15 20:08:03 INFO:     packet sent (117 bytes of 117)
12/15 20:08:03 MSUDisconnect(S)
12/15 20:08:04 INFO:     connect request from 172.20.0.16
12/15 20:08:04 INFO:     received service request from host 'xc14n16'
12/15 20:08:04 INFO:     client socket from 'xc14n16' accepted
12/15 20:08:04 UIProcessCommand(S)
12/15 20:08:04 MSURecvData(S,5000000,1)
12/15 20:08:04 MSURecvPacket(8,Buffer,9,NULL,5000000)
12/15 20:08:04 MSURecvPacket(8,Buffer,95,NULL,5000000)
12/15 20:08:04 MUISProcessRequest(S,ErrMsg)
12/15 20:08:04 MUGetIndex(checkjob AUTH=root ARG=3 30 0 [NONE] [NONE]
,ValList,0)
12/15 20:08:04 INFO:   service 'checkjob A' not supported in
MUISProcessRequest
12/15 20:08:04 MUGetIndex(checkjob,ValList,0)
12/15 20:08:04 INFO:     client '[NONE]' read (0 bytes) initiating
service call for 'checkjob' (Auth: root)
12/15 20:08:04 UIJobShow(RBuffer,SBuffer,1,root,BufSize)
12/15 20:08:04 INFO:     cannot locate job '30' in UIJobShow()
12/15 20:08:04 MSUSendData(S,5000000,TRUE,TRUE)
12/15 20:08:04 INFO:     packet sent (117 bytes of 117)
12/15 20:08:04 MSUDisconnect(S)
12/15 20:08:06 INFO:     connect request from 172.20.0.16
12/15 20:08:06 INFO:     received service request from host 'xc14n16'
12/15 20:08:06 INFO:     client socket from 'xc14n16' accepted
12/15 20:08:06 UIProcessCommand(S)
12/15 20:08:06 MSURecvData(S,5000000,1)
12/15 20:08:06 MSURecvPacket(8,Buffer,9,NULL,5000000)
12/15 20:08:06 MSURecvPacket(8,Buffer,79,NULL,5000000)
12/15 20:08:06 MUISProcessRequest(S,ErrMsg)
12/15 20:08:06 MUGetIndex(showq AUTH=root ARG=0 ALL 0
,ValList,0)
12/15 20:08:06 INFO:   service 'showq AUTH' not supported in
MUISProcessRequest
12/15 20:08:06 MUGetIndex(showq,ValList,0)
12/15 20:08:06 INFO:     client '[NONE]' read (0 bytes) initiating
service call for 'showq' (Auth: root)
12/15 20:08:06 UIQueueShow(RBuffer,Buffer,1,root,BufSize)
12/15 20:08:06 UIQueueShowAllJobs(SBuffer,SBufSize,ALL)
12/15 20:08:06 INFO:     UIQueueShowAllJobs buffer size: 60 bytes
12/15 20:08:06 MSUSendData(S,5000000,TRUE,TRUE)
12/15 20:08:06 INFO:     packet sent (146 bytes of 146)
12/15 20:08:06 MSUDisconnect(S)
12/15 20:08:08 ServerProcessRequests()
12/15 20:08:08 INFO:     not rolling logs (123902 < 10000000)
12/15 20:08:08 MResAdjust(NULL,0,0)
12/15 20:08:08 MStatInitializeActiveSysUsage()
12/15 20:08:08 MStatClearUsage([NONE],Active)
12/15 20:08:08 ServerUpdate()
12/15 20:08:08 MSysUpdateTime()
12/15 20:08:08 INFO:     starting iteration 222
12/15 20:08:08 MRMGetInfo()
12/15 20:08:08 MClusterClearUsage()
12/15 20:08:08 MRMClusterQuery()
12/15 20:08:08 MWikiClusterLoadInfo(XC14N16,RCount,EMsg,SC)
12/15 20:08:08 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=GETNODES
ARG=0:ALL,Data,DataSize,SC)
12/15 20:08:08 MSUSendData(S,9000000,TRUE,FALSE)
12/15 20:08:08 INFO:     packet sent (78 bytes of 78)
12/15 20:08:08 INFO:     command sent to server
12/15 20:08:08 INFO:     message sent: 'CMD=GETNODES ARG=0:ALL'
12/15 20:08:08 MSURecvData(S,9000000,1)
12/15 20:08:08 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/15 20:08:08 MSURecvPacket(8,Buffer,269,NULL,9000000)
12/15 20:08:08 MSUDisconnect(S)
12/15 20:08:08 INFO:     received node list through WIKI RM
12/15 20:08:08 INFO:     loading 4 node(s)
12/15 20:08:08 MWikiNodeUpdate(AList,xc14n13)
12/15 20:08:08 MWikiNodeUpdate(AList,xc14n14)
12/15 20:08:08 MWikiNodeUpdate(AList,xc14n15)
12/15 20:08:08 MWikiNodeUpdate(AList,xc14n16)
12/15 20:08:08 INFO:     0 WIKI resources detected on RM XC14N16
12/15 20:08:08 WARNING:  no resources detected
12/15 20:08:08 MRMWorkloadQuery()
12/15 20:08:08 MWikiWorkloadQuery(XC14N16,JCount,SC)
12/15 20:08:08 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=GETJOBS
ARG=0:ALL,Data,DataSize,SC)
12/15 20:08:08 MSUSendData(S,9000000,TRUE,FALSE)
12/15 20:08:08 INFO:     packet sent (77 bytes of 77)
12/15 20:08:08 INFO:     command sent to server
12/15 20:08:08 INFO:     message sent: 'CMD=GETJOBS ARG=0:ALL'
12/15 20:08:08 MSURecvData(S,9000000,1)
12/15 20:08:08 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/15 20:08:08 MSURecvPacket(8,Buffer,200,NULL,9000000)
12/15 20:08:08 MSUDisconnect(S)
12/15 20:08:08 INFO:     received job list through WIKI RM
12/15 20:08:08 INFO:     loading 1 job(s)
12/15 20:08:08
MWikiJobLoad(30,UPDATETIME=1103159277;STATE=Idle;WCLIMIT=3600;TASKS=2;QU
EUETIME=1103159277;UNAME=test;GNAME=test;PARTITIONMASK=lsf;NODES=1;RMEM=
1;RDISK=1;,J,TaskList,XC14N16)
12/15 20:08:08 MReqCreate(30,SrcRQ,DstRQ,DoCreate)
12/15 20:08:08 MUGetIndex(UPDATETIME=1103159277,ValList,0)
12/15 20:08:08 MUGetIndex(STATE=Idle,ValList,0)
12/15 20:08:08 MUGetIndex(WCLIMIT=3600,ValList,0)
12/15 20:08:08 MUGetIndex(TASKS=2,ValList,0)
12/15 20:08:08 MUGetIndex(QUEUETIME=1103159277,ValList,0)
12/15 20:08:08 MUGetIndex(UNAME=test,ValList,0)
12/15 20:08:08 MUGetIndex(GNAME=test,ValList,0)
12/15 20:08:08 MUGetIndex(PARTITIONMASK=lsf,ValList,0)
12/15 20:08:08 MUGetIndex(NODES=1,ValList,0)
12/15 20:08:08 MUGetIndex(RMEM=1,ValList,0)
12/15 20:08:08 MUGetIndex(RDISK=1,ValList,0)
12/15 20:08:08 MJobSetCreds(30,test,test,)
12/15 20:08:08 INFO:     default QOS for job 30 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
12/15 20:08:08 INFO:     default QOS for job 30 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
12/15 20:08:08 INFO:     job '30' loaded:   2     test     test   3600
Idle   0 1103159277   [NONE] [NONE] [NONE] >=      1 >=      1 [NONE]
1103159277
12/15 20:08:08 INFO:     1 WIKI jobs detected on RM XC14N16
12/15 20:08:08 INFO:     jobs detected: 1
12/15 20:08:08 MStatClearUsage(node,Active)
12/15 20:08:08 MClusterUpdateNodeState()
12/15 20:08:08 MQueueSelectAllJobs(Q,HARD,ALL,JIList,DP,Msg)
12/15 20:08:08 INFO:     job '30' Priority:        1
12/15 20:08:08 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
12/15 20:08:08 MStatClearUsage([NONE],Active)
12/15 20:08:08 INFO:     total jobs selected (ALL): 1/1 
12/15 20:08:08 MQueueSelectAllJobs(Q,SOFT,ALL,JIList,DP,Msg)
12/15 20:08:08 INFO:     job '30' Priority:        1
12/15 20:08:08 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)  Us:
0(00.0)
12/15 20:08:08 MStatClearUsage([NONE],Idle)
12/15 20:08:08 INFO:     total jobs selected (ALL): 1/1 
12/15 20:08:08
MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,FALSE
)
12/15 20:08:08 INFO:     total jobs selected in partition ALL: 1/1 
12/15 20:08:08 MQueueScheduleRJobs(Q)
12/15 20:08:08
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/15 20:08:08 INFO:     total jobs selected in partition ALL: 1/1 
12/15 20:08:08
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,ALL,FReason,TRUE)
12/15 20:08:08 INFO:     job 30 not considered for spanning
12/15 20:08:08 INFO:     total jobs selected in partition ALL: 0/1
[PartitionAccess: 1]
12/15 20:08:08
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,lsf,FReason,TRUE)
12/15 20:08:08 INFO:     total jobs selected in partition lsf: 1/1 
12/15 20:08:08 MQueueScheduleIJobs(Q,lsf)
12/15 20:08:08 INFO:     10 feasible tasks found for job 30:0 in
partition lsf (2 Needed)
12/15 20:08:08 INFO:     tasks located for job 30:  2 of 2 required (10
feasible)
12/15 20:08:08 MJobStart(30)
12/15 20:08:08 MJobDistributeTasks(30,XC14N16,NodeList,TaskMap)
12/15 20:08:08 MAMAllocJReserve(30,RIndex,ErrMsg)
12/15 20:08:08 MRMJobStart(30,Msg,SC)
12/15 20:08:08 MWikiJobStart(30,XC14N16,Msg,SC)
12/15 20:08:08 MWikiDoCommand(XC14N16,7321,9000000,CHECKSUM,CMD=STARTJOB
ARG=30 TASKLIST=xc14n13:xc14n13,Data,DataSize,SC)
12/15 20:08:08 MSUSendData(S,9000000,TRUE,FALSE)
12/15 20:08:08 INFO:     packet sent (100 bytes of 100)
12/15 20:08:08 INFO:     command sent to server
12/15 20:08:08 INFO:     message sent: 'CMD=STARTJOB ARG=30
TASKLIST=xc14n13:xc14n13'
12/15 20:08:08 MSURecvData(S,9000000,1)
12/15 20:08:08 MSURecvPacket(8,Buffer,9,NULL,9000000)
12/15 20:08:08 MSURecvPacket(8,Buffer,32,NULL,9000000)
12/15 20:08:08 ALERT:    cannot locate checksum 'TS=1103159288
AUTH=slurm DT=SC=1'
12/15 20:08:08 ERROR:    cannot receive data from server XC14N16:7321
12/15 20:08:08 MSUDisconnect(S)
12/15 20:08:08 ALERT:    cannot start job '30' on WIKI RM on 2 procs
(command failure)
12/15 20:08:08 ALERT:    cannot start job 30 (RM 'XC14N16' failed in
function 'jobstart')
12/15 20:08:08 WARNING:  cannot start job '30' through resource manager
12/15 20:08:08 ALERT:    job '30' deferred after 1 failed start attempts
(API failure on last attempt)
12/15 20:08:08 MJobSetHold(30,16,1:00:00,RMFailure,)
12/15 20:08:08 ALERT:    job '30' cannot run (deferring job for 3600
seconds)
12/15 20:08:08 MSysRegEvent(JOBDEFER:  defer hold placed on job '30'.
reason: 'RMFailure',0,0,1)
12/15 20:08:08 MSysLaunchAction(ASList,1)
12/15 20:08:08 ERROR:    cannot start job '30' in partition lsf
12/15 20:08:08 MJobPReserve(30,lsf,ResCount,ResCountRej)
12/15 20:08:08 MJobReserve(30,Priority)
12/15 20:08:08 INFO:     10 feasible tasks found for job 30:0 in
partition lsf (2 Needed)
12/15 20:08:08 INFO:     10 feasible tasks found for job 30:0 in
partition lsf (2 Needed)
12/15 20:08:08 INFO:     located resources for 2 tasks (10) in best
partition lsf for job 30 at time 00:00:01
12/15 20:08:08 INFO:     tasks located for job 30:  2 of 2 required (10
feasible)
12/15 20:08:08 MJobDistributeTasks(30,XC14N16,NodeList,TaskMap)
12/15 20:08:08 MResJCreate(30,MNodeList,00:00:01,Priority,Res)
12/15 20:08:08 INFO:     job '30' reserved 2 tasks (partition lsf) to
start in 00:00:01 on Wed Dec 15 20:08:09
 (WC: 3600)
Active Jobs------
------------------
12/15 20:08:08 INFO:     resources available after scheduling: N: 4  P:
10
12/15 20:08:08
MQueueSelectJobs(SrcQ,DstQ,HARD,5120,4096,2140000000,EVERY,FReason,TRUE)
12/15 20:08:08 MResDestroy(30)
12/15 20:08:08 MResChargeAllocation(30,2)
12/15 20:08:08 INFO:     total jobs selected in partition ALL: 0/1
[EState: 1]
12/15 20:08:08
MQueueSelectJobs(SrcQ,DstQ,SOFT,5120,4096,2140000000,EVERY,FReason,TRUE)
12/15 20:08:08 INFO:     total jobs selected in partition ALL: 0/1
[EState: 1]
12/15 20:08:08 MSchedUpdateStats()
12/15 20:08:08 INFO:     iteration:  222   scheduling time:  0.008
seconds
12/15 20:08:08 MResUpdateStats()
12/15 20:08:08 INFO:     current util[222]:  0/4 (0.00%)  PH: 0.00%
active jobs: 0 of 2 (completed: 0)
12/15 20:08:08 MQueueCheckStatus()
12/15 20:08:08 MNodeCheckStatus()
12/15 20:08:08 MUClearChild(PID)
12/15 20:08:08 INFO:     scheduling complete.  sleeping 20 seconds
12/15 20:08:09 INFO:     connect request from 172.20.0.16
12/15 20:08:09 INFO:     received service request from host 'xc14n16'
12/15 20:08:09 INFO:     client socket from 'xc14n16' accepted
12/15 20:08:09 UIProcessCommand(S)
12/15 20:08:09 MSURecvData(S,5000000,1)
12/15 20:08:09 MSURecvPacket(8,Buffer,9,NULL,5000000)
12/15 20:08:09 MSURecvPacket(8,Buffer,95,NULL,5000000)
12/15 20:08:09 MUISProcessRequest(S,ErrMsg)
12/15 20:08:09 MUGetIndex(checkjob AUTH=root ARG=3 30 0 [NONE] [NONE]
,ValList,0)
12/15 20:08:09 INFO:   service 'checkjob A' not supported in
MUISProcessRequest
12/15 20:08:09 MUGetIndex(checkjob,ValList,0)
12/15 20:08:09 INFO:     client '[NONE]' read (0 bytes) initiating
service call for 'checkjob' (Auth: root)
12/15 20:08:09 UIJobShow(RBuffer,SBuffer,1,root,BufSize)
12/15 20:08:09 job '30'  State:        Idle  EState:   Deferred
QueueTime:       Wed Dec 15 20:07:57
12/15 20:08:09 MQueueSelectAllJobs(Q,HARD,lsf,JIList,DP,Msg)
12/15 20:08:09 INFO:     total jobs selected (ALL): 0/1 [Hold: 1]
12/15 20:08:09 MSUSendData(S,5000000,TRUE,TRUE)
12/15 20:08:09 INFO:     packet sent (750 bytes of 750)
12/15 20:08:09 MSUDisconnect(S)



More information about the mauiusers mailing list