[torqueusers] qsub -I from compute node?
Aquarijen
aquarijen at gmail.com
Wed Oct 18 15:34:29 MDT 2006
Well, looking at the Maui log, Maui thinks the job completed successfully:
10/18 17:21:59 MPBSJobLoad(16227,16227.b08l02.oic.ornl.gov,J,TaskList,0)
10/18 17:21:59 MReqCreate(16227,SrcRQ,DstRQ,DoCreate)
10/18 17:21:59 INFO: adding requirement at slot 0
10/18 17:21:59 INFO: processing node request line '8'
10/18 17:21:59 MJobSetCreds(16227,2vt,2vt,)
10/18 17:21:59 INFO: default QOS for job 16227 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
10/18 17:21:59 MCPRestore(JOB,16227,Optr)
10/18 17:21:59 INFO: default QOS for job 16227 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
10/18 17:21:59 INFO: default QOS for job 16227 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
10/18 17:21:59 INFO: job '16227' loaded: 8 2vt 2vt
1800 Idle 0 1161206518 [NONE] [NONE] [NONE] >= 0 >= 0
[NONE] 1161206519
10/18 17:21:59 INFO: 33 PBS jobs detected on RM base
<snip>
10/18 17:21:59 INFO: job '16227' Priority: 1
10/18 17:21:59 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
<snip>
10/18 17:21:59 MPolicyAdjustUsage(NULL,16180,NULL,active,NULL,[ALL],1,NULL)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,PU,[ALL],1,NULL)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,NULL,[ALL],1,NULL)
10/18 17:21:59 INFO: total jobs selected (ALL): 7/33 [State: 26]
<snip>
10/18 17:21:59 INFO: job '16227' Priority: 1
10/18 17:21:59 INFO: Cred: 0(00.0) FS: 0(00.0) Attr:
0(00.0) Serv: 0(00.0) Targ: 0(00.0) Res: 0(00.0)
Us: 0(00.0)
<snip>
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,PU,[ALL],1,NULL)
<snip>
10/18 17:21:59 MJobPReserve(16226,DEFAULT,ResCount,ResCountRej)
10/18 17:21:59 MJobPReserve(16227,DEFAULT,ResCount,ResCountRej)
Active Jobs------
<snip>
10/18 17:21:59 INFO: attempting BF backfill with job '16227' (p: 8 t: 1800)
10/18 17:21:59 MJobSelectMNL(16227,DEFAULT,FNL,MNodeList,NodeMap,MaxSpeed,2)
10/18 17:21:59 MReqGetFNL(16227,0,DEFAULT,SrcNL,DstNL,NC,TC,2140000000,0)
10/18 17:21:59 INFO: 20 feasible tasks found for job 16227:0 in
partition DEFAULT (8 Needed)
10/18 17:21:59 MJobGetINL(16227,FNL,INL,DEFAULT,NodeCount,TaskCount)
10/18 17:21:59 INFO: idle resources (20 tasks/10 nodes) found with
feasible list specified
10/18 17:21:59 MNodeSelectIdleTasks(16227,0,SrcNL,IdleMNL,TC,NC,NMap,RCount,RejReason)
10/18 17:21:59 INFO: 20(0) tasks/10(0) nodes found for job 16227
in MJobSelectMNL
10/18 17:21:59 MJobNLDistribute(16227,SrcMNL,DstMNL)
10/18 17:21:59 INFO: resources found for job 16227 tasks: 20+0 of
8 nodes: 10+0 of 0
10/18 17:21:59 INFO: located job '16227' in MBFBestFit (size: 8
duration: 1800)
10/18 17:21:59 MJobSelectMNL(16227,DEFAULT,FNL,MNodeList,NodeMap,MaxSpeed,2)
10/18 17:21:59 MReqGetFNL(16227,0,DEFAULT,SrcNL,DstNL,NC,TC,2140000000,0)
10/18 17:21:59 INFO: 20 feasible tasks found for job 16227:0 in
partition DEFAULT (8 Needed)
10/18 17:21:59 MJobGetINL(16227,FNL,INL,DEFAULT,NodeCount,TaskCount)
10/18 17:21:59 INFO: idle resources (20 tasks/10 nodes) found with
feasible list specified
10/18 17:21:59 MNodeSelectIdleTasks(16227,0,SrcNL,IdleMNL,TC,NC,NMap,RCount,RejReason)
10/18 17:21:59 INFO: 20(0) tasks/10(0) nodes found for job 16227
in MJobSelectMNL
10/18 17:21:59 MJobNLDistribute(16227,SrcMNL,DstMNL)
10/18 17:21:59 INFO: resources found for job 16227 tasks: 20+0 of
8 nodes: 10+0 of 0
10/18 17:21:59 MJobAllocMNL(16227,MFeasibleList,NodeMap,NULL,LASTAVAILABLE,1161206519)
10/18 17:21:59 INFO: tasks located for job 16227: 8 of 8 required
(12 feasible)
10/18 17:21:59 INFO: allocated MNode[000]x2 'b03n034.oic.ornl.gov'
to 16227:0
10/18 17:21:59 INFO: allocated MNode[001]x2 'b03n033.oic.ornl.gov'
to 16227:0
10/18 17:21:59 INFO: allocated MNode[002]x2 'b03n032.oic.ornl.gov'
to 16227:0
10/18 17:21:59 INFO: allocated MNode[003]x2 'b03n031.oic.ornl.gov'
to 16227:0
10/18 17:21:59 MJobStart(16227)
10/18 17:21:59 MJobDistributeTasks(16227,base,NodeList,TaskMap)
10/18 17:21:59 INFO: 4 node(s)/8 task(s) added to 16227:0
10/18 17:21:59 INFO: MNode[000] 'b03n034.oic.ornl.gov'(x2) added
to job '16227'
[019] b03n034.oic.ornl.gov: (P:2,S:7914,M:3950,D:1)
[Idle][DEFAULT][linux]<0.000000> C:[interq 2:2][routeq 2:2][DEFAULT]
[all] [interq 2:2][routeq 2:2]
10/18 17:21:59 INFO: MNode[001] 'b03n033.oic.ornl.gov'(x2) added
to job '16227'
[018] b03n033.oic.ornl.gov: (P:2,S:7914,M:3950,D:1)
[Idle][DEFAULT][linux]<0.000000> C:[interq 2:2][routeq 2:2][DEFAULT]
[all] [interq 2:2][routeq 2:2]
10/18 17:21:59 INFO: MNode[002] 'b03n032.oic.ornl.gov'(x2) added
to job '16227'
[017] b03n032.oic.ornl.gov: (P:2,S:7919,M:3950,D:1)
[Idle][DEFAULT][linux]<0.000000> C:[interq 2:2][routeq 2:2][DEFAULT]
[interactive] [interq 2:2][routeq 2:2]
10/18 17:21:59 INFO: MNode[003] 'b03n031.oic.ornl.gov'(x2) added
to job '16227'
[016] b03n031.oic.ornl.gov: (P:2,S:7915,M:3950,D:1)
[Idle][DEFAULT][linux]<0.000000> C:[interq 2:2][routeq 2:2][DEFAULT]
[interactive] [interq 2:2][routeq 2:2]
10/18 17:21:59 INFO: end of list reached. 4 nodes found
10/18 17:21:59 INFO: tasks distributed: 8 (Round-Robin)
10/18 17:21:59 MAMAllocJReserve(16227,RIndex,ErrMsg)
10/18 17:21:59 MRMJobStart(16227,Msg,SC)
10/18 17:21:59 MPBSJobStart(16227,base,Msg,SC)
10/18 17:21:59 MPBSJobModify(16227,Resource_List,Resource,b03n034.oic.ornl.gov:ppn=2+b03n033.oic.ornl.gov:ppn=2+b03n032.oic.ornl.gov:ppn=2+b03n031.oic.ornl.gov:ppn=2)
10/18 17:21:59 MPBSJobModify(16227,Resource_List,Resource,1)
10/18 17:21:59 INFO: job '16227' successfully started
10/18 17:21:59 MQueueAddAJob(16227)
10/18 17:21:59 MStatUpdateActiveJobUsage(16227)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,active,NULL,[ALL],1,NULL)
10/18 17:21:59 MResJCreate(16227,MNodeList,00:00:00,ActiveJob,Res)
10/18 17:21:59 MResAdjustDRes(16227,FALSE)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,PU,[ALL],-1,NULL)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,NULL,[ALL],-1,NULL)
10/18 17:21:59 MParUpdate(DEFAULT)
10/18 17:21:59 INFO: P[DEFAULT]: Total 382:764 Up 371:742 Idle
24:50 Active 347:693
10/18 17:21:59 INFO: P[DEFAULT]: Total 382:764 Up 371:742 Idle
20:42 Active 351:701
10/18 17:21:59 MJobAddToNL(16227,NULL)
10/18 17:21:59 INFO: node b03n034.oic.ornl.gov added to job 16227.
PSlot: [interq 0:2][routeq 2:2]
10/18 17:21:59 INFO: node b03n033.oic.ornl.gov added to job 16227.
PSlot: [interq 0:2][routeq 2:2]
10/18 17:21:59 INFO: node b03n032.oic.ornl.gov added to job 16227.
PSlot: [interq 0:2][routeq 2:2]
10/18 17:21:59 INFO: node b03n031.oic.ornl.gov added to job 16227.
PSlot: [interq 0:2][routeq 2:2]
10/18 17:21:59 INFO: starting job '16227'
<snip>
10/18 17:22:10 INFO: active PBS job 16227 has been removed from
the queue. assuming successful completion
10/18 17:22:10 MJobProcessCompleted(16227)
10/18 17:22:10 MAMAllocJDebit(A,16227,SC,ErrMsg)
10/18 17:22:10 MJobWriteStats(16227)
10/18 17:22:10 MJobToTString(16227,230,Buf,65536)
10/18 17:22:10 INFO: job stats written for '16227'
10/18 17:22:10 INFO: node 'b03n034.oic.ornl.gov' returned to idle pool
10/18 17:22:10 INFO: node 'b03n033.oic.ornl.gov' returned to idle pool
10/18 17:22:10 INFO: node 'b03n032.oic.ornl.gov' returned to idle pool
10/18 17:22:10 INFO: node 'b03n031.oic.ornl.gov' returned to idle pool
10/18 17:22:10 INFO: job '16227' completed.
QueueTime: 1 RunTime: 11 Accuracy: 0.61 XFactor: 0.01
10/18 17:22:10 INFO: start: 1161206519 complete: 1161206530
SystemQueueTime: 1161206518
10/18 17:22:10 INFO: overall statistics. Accuracy: 0.06 XFactor: 5.47
10/18 17:22:10 INFO: updating statistics for Grid[time: 4][proc: 3]
10/18 17:22:10 INFO: job '16227' completed X: 0.006667 T: 11
PS: 88 A: 0.006111
10/18 17:22:10 MJobSendFB(16227)
10/18 17:22:10 MSysLaunchAction(ASList,2)
10/18 17:22:10 INFO: job usage sent for job '16227'
10/18 17:22:10 MJobRemove(16227)
10/18 17:22:10 MResChargeAllocation(16227,2)
10/18 17:22:10 MResAdjustDRes(16227,TRUE)
10/18 17:22:10 MJobRemoveHash(16227)
10/18 17:22:10 MJobDestroy(16227)
10/18 17:22:10 INFO: 32 PBS jobs detected on RM base
10/18 17:22:10 INFO: jobs detected: 32
------------------------
And that's it... So it doesn't really look like a Maui problem, does it?
Does anyone have this working successfully on their cluster?
Thanks so much for the help!!!
Jen
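
P.S. As a sanity check on the timing, here is a minimal sketch in Python (not part of Maui or TORQUE) that pulls the epoch stamps out of the "start: ... complete: ..." line in the log above and recomputes the queue wait and run time. The helper name job_times is made up for illustration, and the log line is the one from above with the mail wrapping rejoined.

```python
import re

# The start/complete line from the Maui log above, rejoined after mail wrapping.
LOG = (
    "10/18 17:22:10 INFO: start: 1161206519 complete: 1161206530 "
    "SystemQueueTime: 1161206518\n"
)


def job_times(log_text):
    """Return (queue_wait, run_time) in seconds from the epoch stamps.

    Hypothetical helper for illustration only; it assumes the three
    fields appear on one line in the order shown in the log above.
    """
    m = re.search(
        r"start:\s*(\d+)\s+complete:\s*(\d+)\s+SystemQueueTime:\s*(\d+)",
        log_text,
    )
    if m is None:
        raise ValueError("no start/complete line found")
    start, complete, queued = (int(g) for g in m.groups())
    # Queue wait is submit-to-start; run time is start-to-complete.
    return start - queued, complete - start


queue_wait, run_time = job_times(LOG)
print(queue_wait, run_time)
```

That gives a 1 second queue wait and an 11 second run time, which matches the "QueueTime: 1 RunTime: 11" line. So Maui did start the job and then saw it drop out of the queue 11 seconds later.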
On 10/18/06, Jerry Smith <jdsmit at sandia.gov> wrote:
> Jen,
>
> Are you by chance running Maui or Moab as well? We see this problem when
> ACLs are set up incorrectly.
>
> Jerry
>
>
>
> >
> > Hi All,
> >
> > I have users who like to debug using an interactive pbs job. We have
> > 8 nodes designated for job submission and these all work fine when
> > submitting batch jobs, but give an error when submitting an
> > interactive job... The only node that will not give an error when
> > submitting an interactive job is the node that runs the pbs_server.
> >
> > This is what my users (and I) get when submitting on a node that is
> > not running the pbs_server:
> >
> > [2vt at b07l01 hello-parallel-worlds]$ qsub -I -q interq
> > qsub: waiting for job 16223.b08l02.oic.ornl.gov to start
> > qsub: job 16223.b08l02.oic.ornl.gov apparently deleted
> > [2vt at b07l01 hello-parallel-worlds]$
> >
> > Doing a qstat, I briefly see this:
> >
> > [2vt at b08l02 ~]$ qstat
> > 16225.b08l02 STDIN 2vt 0 Q interq
> > [2vt at b08l02 ~]$ qstat
> > 16225.b08l02 STDIN 2vt 0 R interq
> >
> > But then it is gone and I never actually get a node. The job seems to
> > wait for another 30 seconds or so and then give the "apparently
> > deleted" message.
> >
> >
> > This is what is in the pbs_server log:
> >
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type AuthenticateUser request
> > received from 2vt at b07l01.oic.ornl.gov, sock=11
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type QueueJob request received from
> > 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type ReadyToCommit request
> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type Commit request received
> > from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Job;16225.b08l02.oic.ornl.gov;enqueuing
> > into interq, state 1 hop 1
> > 10/18/2006 16:27:11;0008;PBS_Server;Job;16225.b08l02.oic.ornl.gov;Job
> > Queued at request of 2vt at b07l01.oic.ornl.gov, owner =
> > 2vt at b07l01.oic.ornl.gov, job name = STDIN, queue = interq
> > <snip>
> > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type AuthenticateUser request
> > received from 2vt at b07l01.oic.ornl.gov, sock=11
> > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request
> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply
> > code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> > 2vt at b07l01.oic.ornl.gov
> >
> > Does anyone have any ideas for this? I'd really appreciate the help -
> > the users are getting restless. :P
> >
> > Thanks!!
> > -Jen
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
>
>
>
--
The more compassionate you are, the more generous you can be. The more
generous you are, the more loving-friendliness you cultivate to help
the world.
-Thich Nhat Hanh, "Buddhist Peacework"