[torqueusers] qsub -I from compute node?

Aquarijen aquarijen at gmail.com
Wed Oct 18 15:34:29 MDT 2006


Well, looking at the maui log, maui thinks the job completed successfully:

10/18 17:21:59 MPBSJobLoad(16227,16227.b08l02.oic.ornl.gov,J,TaskList,0)
10/18 17:21:59 MReqCreate(16227,SrcRQ,DstRQ,DoCreate)
10/18 17:21:59 INFO:     adding requirement at slot 0
10/18 17:21:59 INFO:     processing node request line '8'
10/18 17:21:59 MJobSetCreds(16227,2vt,2vt,)
10/18 17:21:59 INFO:     default QOS for job 16227 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
10/18 17:21:59 MCPRestore(JOB,16227,Optr)
10/18 17:21:59 INFO:     default QOS for job 16227 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
10/18 17:21:59 INFO:     default QOS for job 16227 set to DEFAULT(0)
(P:DEFAULT,U:[NONE],G:[NONE],A:[NONE],C:[NONE])
10/18 17:21:59 INFO:     job '16227' loaded:   8      2vt      2vt
1800 Idle   0 1161206518   [NONE] [NONE] [NONE] >=      0 >=      0
[NONE] 1161206519
10/18 17:21:59 INFO:     33 PBS jobs detected on RM base
<snip>
10/18 17:21:59 INFO:     job '16227' Priority:        1
10/18 17:21:59 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
   0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)
Us:      0(00.0)
<snip>
10/18 17:21:59 MPolicyAdjustUsage(NULL,16180,NULL,active,NULL,[ALL],1,NULL)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,PU,[ALL],1,NULL)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,NULL,[ALL],1,NULL)
10/18 17:21:59 INFO:     total jobs selected (ALL): 7/33 [State: 26]
<snip>
10/18 17:21:59 INFO:     job '16227' Priority:        1
10/18 17:21:59 INFO:     Cred:      0(00.0)  FS:      0(00.0)  Attr:
   0(00.0)  Serv:      0(00.0)  Targ:      0(00.0)  Res:      0(00.0)
Us:      0(00.0)
<snip>
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,PU,[ALL],1,NULL)
<snip>
10/18 17:21:59 MJobPReserve(16226,DEFAULT,ResCount,ResCountRej)
10/18 17:21:59 MJobPReserve(16227,DEFAULT,ResCount,ResCountRej)
Active Jobs------
<snip>
10/18 17:21:59 INFO:     attempting BF backfill with job '16227' (p:  8 t: 1800)
10/18 17:21:59 MJobSelectMNL(16227,DEFAULT,FNL,MNodeList,NodeMap,MaxSpeed,2)
10/18 17:21:59 MReqGetFNL(16227,0,DEFAULT,SrcNL,DstNL,NC,TC,2140000000,0)
10/18 17:21:59 INFO:     20 feasible tasks found for job 16227:0 in
partition DEFAULT (8 Needed)
10/18 17:21:59 MJobGetINL(16227,FNL,INL,DEFAULT,NodeCount,TaskCount)
10/18 17:21:59 INFO:     idle resources (20 tasks/10 nodes) found with
feasible list specified
10/18 17:21:59 MNodeSelectIdleTasks(16227,0,SrcNL,IdleMNL,TC,NC,NMap,RCount,RejReason)
10/18 17:21:59 INFO:     20(0) tasks/10(0) nodes found for job 16227
in MJobSelectMNL
10/18 17:21:59 MJobNLDistribute(16227,SrcMNL,DstMNL)
10/18 17:21:59 INFO:     resources found for job 16227 tasks: 20+0 of
8  nodes: 10+0 of 0
10/18 17:21:59 INFO:     located job '16227' in MBFBestFit (size: 8
duration: 1800)
10/18 17:21:59 MJobSelectMNL(16227,DEFAULT,FNL,MNodeList,NodeMap,MaxSpeed,2)
10/18 17:21:59 MReqGetFNL(16227,0,DEFAULT,SrcNL,DstNL,NC,TC,2140000000,0)
10/18 17:21:59 INFO:     20 feasible tasks found for job 16227:0 in
partition DEFAULT (8 Needed)
10/18 17:21:59 MJobGetINL(16227,FNL,INL,DEFAULT,NodeCount,TaskCount)
10/18 17:21:59 INFO:     idle resources (20 tasks/10 nodes) found with
feasible list specified
10/18 17:21:59 MNodeSelectIdleTasks(16227,0,SrcNL,IdleMNL,TC,NC,NMap,RCount,RejReason)
10/18 17:21:59 INFO:     20(0) tasks/10(0) nodes found for job 16227
in MJobSelectMNL
10/18 17:21:59 MJobNLDistribute(16227,SrcMNL,DstMNL)
10/18 17:21:59 INFO:     resources found for job 16227 tasks: 20+0 of
8  nodes: 10+0 of 0
10/18 17:21:59 MJobAllocMNL(16227,MFeasibleList,NodeMap,NULL,LASTAVAILABLE,1161206519)
10/18 17:21:59 INFO:     tasks located for job 16227:  8 of 8 required
(12 feasible)
10/18 17:21:59 INFO:     allocated MNode[000]x2 'b03n034.oic.ornl.gov'
to 16227:0
10/18 17:21:59 INFO:     allocated MNode[001]x2 'b03n033.oic.ornl.gov'
to 16227:0
10/18 17:21:59 INFO:     allocated MNode[002]x2 'b03n032.oic.ornl.gov'
to 16227:0
10/18 17:21:59 INFO:     allocated MNode[003]x2 'b03n031.oic.ornl.gov'
to 16227:0
10/18 17:21:59 MJobStart(16227)
10/18 17:21:59 MJobDistributeTasks(16227,base,NodeList,TaskMap)
10/18 17:21:59 INFO:     4 node(s)/8 task(s) added to 16227:0
10/18 17:21:59 INFO:     MNode[000] 'b03n034.oic.ornl.gov'(x2) added
to job '16227'
[019] b03n034.oic.ornl.gov: (P:2,S:7914,M:3950,D:1)
[Idle][DEFAULT][linux]<0.000000> C:[interq 2:2][routeq 2:2][DEFAULT]
[all] [interq 2:2][routeq 2:2]
10/18 17:21:59 INFO:     MNode[001] 'b03n033.oic.ornl.gov'(x2) added
to job '16227'
[018] b03n033.oic.ornl.gov: (P:2,S:7914,M:3950,D:1)
[Idle][DEFAULT][linux]<0.000000> C:[interq 2:2][routeq 2:2][DEFAULT]
[all] [interq 2:2][routeq 2:2]
10/18 17:21:59 INFO:     MNode[002] 'b03n032.oic.ornl.gov'(x2) added
to job '16227'
[017] b03n032.oic.ornl.gov: (P:2,S:7919,M:3950,D:1)
[Idle][DEFAULT][linux]<0.000000> C:[interq 2:2][routeq 2:2][DEFAULT]
[interactive] [interq 2:2][routeq 2:2]
10/18 17:21:59 INFO:     MNode[003] 'b03n031.oic.ornl.gov'(x2) added
to job '16227'
[016] b03n031.oic.ornl.gov: (P:2,S:7915,M:3950,D:1)
[Idle][DEFAULT][linux]<0.000000> C:[interq 2:2][routeq 2:2][DEFAULT]
[interactive] [interq 2:2][routeq 2:2]
10/18 17:21:59 INFO:     end of list reached.  4 nodes found
10/18 17:21:59 INFO:     tasks distributed: 8 (Round-Robin)
10/18 17:21:59 MAMAllocJReserve(16227,RIndex,ErrMsg)
10/18 17:21:59 MRMJobStart(16227,Msg,SC)
10/18 17:21:59 MPBSJobStart(16227,base,Msg,SC)
10/18 17:21:59 MPBSJobModify(16227,Resource_List,Resource,b03n034.oic.ornl.gov:ppn=2+b03n033.oic.ornl.gov:ppn=2+b03n032.oic.ornl.gov:ppn=2+b03n031.oic.ornl.gov:ppn=2)
10/18 17:21:59 MPBSJobModify(16227,Resource_List,Resource,1)
10/18 17:21:59 INFO:     job '16227' successfully started
10/18 17:21:59 MQueueAddAJob(16227)
10/18 17:21:59 MStatUpdateActiveJobUsage(16227)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,active,NULL,[ALL],1,NULL)
10/18 17:21:59 MResJCreate(16227,MNodeList,00:00:00,ActiveJob,Res)
10/18 17:21:59 MResAdjustDRes(16227,FALSE)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,PU,[ALL],-1,NULL)
10/18 17:21:59 MPolicyAdjustUsage(NULL,16227,NULL,idle,NULL,[ALL],-1,NULL)
10/18 17:21:59 MParUpdate(DEFAULT)
10/18 17:21:59 INFO:     P[DEFAULT]:  Total 382:764  Up 371:742  Idle
24:50  Active 347:693
10/18 17:21:59 INFO:     P[DEFAULT]:  Total 382:764  Up 371:742  Idle
20:42  Active 351:701
10/18 17:21:59 MJobAddToNL(16227,NULL)
10/18 17:21:59 INFO:     node b03n034.oic.ornl.gov added to job 16227.
 PSlot: [interq 0:2][routeq 2:2]
10/18 17:21:59 INFO:     node b03n033.oic.ornl.gov added to job 16227.
 PSlot: [interq 0:2][routeq 2:2]
10/18 17:21:59 INFO:     node b03n032.oic.ornl.gov added to job 16227.
 PSlot: [interq 0:2][routeq 2:2]
10/18 17:21:59 INFO:     node b03n031.oic.ornl.gov added to job 16227.
 PSlot: [interq 0:2][routeq 2:2]
10/18 17:21:59 INFO:     starting job '16227'
<snip>
10/18 17:22:10 INFO:     active PBS job 16227 has been removed from
the queue. assuming successful completion
10/18 17:22:10 MJobProcessCompleted(16227)
10/18 17:22:10 MAMAllocJDebit(A,16227,SC,ErrMsg)
10/18 17:22:10 MJobWriteStats(16227)
10/18 17:22:10 MJobToTString(16227,230,Buf,65536)
10/18 17:22:10 INFO:     job stats written for '16227'
10/18 17:22:10 INFO:     node 'b03n034.oic.ornl.gov' returned to idle pool
10/18 17:22:10 INFO:     node 'b03n033.oic.ornl.gov' returned to idle pool
10/18 17:22:10 INFO:     node 'b03n032.oic.ornl.gov' returned to idle pool
10/18 17:22:10 INFO:     node 'b03n031.oic.ornl.gov' returned to idle pool
10/18 17:22:10 INFO:     job '             16227' completed.
QueueTime:      1  RunTime:     11  Accuracy:  0.61  XFactor:  0.01
10/18 17:22:10 INFO:     start: 1161206519  complete: 1161206530
SystemQueueTime: 1161206518
10/18 17:22:10 INFO:     overall statistics.  Accuracy:  0.06  XFactor:  5.47
10/18 17:22:10 INFO:     updating statistics for Grid[time: 4][proc: 3]
10/18 17:22:10 INFO:     job '16227' completed  X: 0.006667  T: 11
PS: 88  A: 0.006111
10/18 17:22:10 MJobSendFB(16227)
10/18 17:22:10 MSysLaunchAction(ASList,2)
10/18 17:22:10 INFO:     job usage sent for job '16227'
10/18 17:22:10 MJobRemove(16227)
10/18 17:22:10 MResChargeAllocation(16227,2)
10/18 17:22:10 MResAdjustDRes(16227,TRUE)
10/18 17:22:10 MJobRemoveHash(16227)
10/18 17:22:10 MJobDestroy(16227)
10/18 17:22:10 INFO:     32 PBS jobs detected on RM base
10/18 17:22:10 INFO:     jobs detected: 32
------------------------

And that's it...  So, it doesn't really seem like a maui problem?
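
One detail in the log above is worth reading closely. Maui prints "assuming successful completion" because the job was removed from the resource manager's queue, not because it saw an exit status, and the start/complete timestamps show the job lived for only 11 seconds. Below is a minimal Python sketch for pulling those two facts out of a Maui log. The function names and the sample text are illustrative assumptions, not part of Maui or TORQUE, and the patterns are written only against the wording of the lines quoted above.

```python
import re

# Sample lines copied from the Maui log excerpt above, with the
# email-wrapped lines rejoined. Real logs may differ in wording.
SAMPLE_LOG = """\
10/18 17:21:59 INFO:     job '16227' successfully started
10/18 17:22:10 INFO:     active PBS job 16227 has been removed from the queue. assuming successful completion
10/18 17:22:10 INFO:     start: 1161206519  complete: 1161206530 SystemQueueTime: 1161206518
"""

# Matches the "start: <epoch>  complete: <epoch>" pair that Maui logs
# when it finishes processing a job.
START_COMPLETE = re.compile(r"start:\s*(\d+)\s+complete:\s*(\d+)")


def job_runtime_seconds(log_text):
    """Return (start, complete, runtime) from the first start/complete
    line in log_text, or None if no such line is present.

    Assumption: log_text has already been narrowed to a single job,
    since this line does not carry the job id itself.
    """
    match = START_COMPLETE.search(log_text)
    if match is None:
        return None
    start, complete = int(match.group(1)), int(match.group(2))
    return start, complete, complete - start


def completion_was_assumed(log_text, job_id):
    """True if Maui only inferred completion because the job vanished
    from the resource manager's queue."""
    pattern = (
        rf"active PBS job {re.escape(job_id)} has been removed from\s+"
        r"the queue\.\s+assuming successful completion"
    )
    return re.search(pattern, log_text) is not None


if __name__ == "__main__":
    print(job_runtime_seconds(SAMPLE_LOG))
    print(completion_was_assumed(SAMPLE_LOG, "16227"))
```

If the second function returns True for a job that ran only a few seconds, Maui's "success" says nothing about whether the interactive session was ever established; the job may simply have been deleted on the TORQUE side.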

Does anyone have this working successfully on their cluster?

Thanks so much for the help!!!
Jen




On 10/18/06, Jerry Smith <jdsmit at sandia.gov> wrote:
> Jen,
>
> Are you by chance running Maui or Moab as well?  We see this problem when
> there are incorrect ACLs set up.
>
> Jerry
>
>
>
> >
> > Hi All,
> >
> > I have users who like to debug using an interactive pbs job.  We have
> > 8 nodes designated for job submission and these all work fine when
> > submitting batch jobs, but give an error when submitting an
> > interactive job...  The only node that will not give an error when
> > submitting an interactive job is the node that runs the pbs_server.
> >
> > This is what my users (and I) get when submitting on a node that is
> > not running the pbs_server:
> >
> > [2vt at b07l01 hello-parallel-worlds]$ qsub -I -q interq
> > qsub: waiting for job 16223.b08l02.oic.ornl.gov to start
> > qsub: job 16223.b08l02.oic.ornl.gov apparently deleted
> > [2vt at b07l01 hello-parallel-worlds]$
> >
> > Doing a qstat, I briefly see this:
> >
> > [2vt at b08l02 ~]$ qstat
> > 16225.b08l02        STDIN            2vt                     0 Q interq
> > [2vt at b08l02 ~]$ qstat
> > 16225.b08l02        STDIN            2vt                     0 R interq
> >
> > But then it is gone and I never actually get a node.  The job seems to
> > wait for another 30 seconds or so and then give the "apparently
> > deleted" message.
> >
> >
> > This is what is in the pbs_server log:
> >
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type AuthenticateUser request
> > received from 2vt at b07l01.oic.ornl.gov, sock=11
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type QueueJob request received from
> > 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type ReadyToCommit request
> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Req;;Type Commit request received
> > from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:11;0100;PBS_Server;Job;16225.b08l02.oic.ornl.gov;enqueuing
> > into interq, state 1 hop 1
> > 10/18/2006 16:27:11;0008;PBS_Server;Job;16225.b08l02.oic.ornl.gov;Job
> > Queued at request of 2vt at b07l01.oic.ornl.gov, owner =
> > 2vt at b07l01.oic.ornl.gov, job name = STDIN, queue = interq
> > <snip>
> > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type AuthenticateUser request
> > received from 2vt at b07l01.oic.ornl.gov, sock=11
> > 10/18/2006 16:27:41;0100;PBS_Server;Req;;Type LocateJob request
> > received from 2vt at b07l01.oic.ornl.gov, sock=9
> > 10/18/2006 16:27:41;0080;PBS_Server;Req;req_reject;Reject reply
> > code=15001(Unknown Job Id), aux=0, type=LocateJob, from
> > 2vt at b07l01.oic.ornl.gov
> >
> > Does anyone have any ideas for this?  I'd really appreciate the help -
> > the users are getting restless. :P
> >
> > Thanks!!
> > -Jen
> > _______________________________________________
> > torqueusers mailing list
> > torqueusers at supercluster.org
> > http://www.supercluster.org/mailman/listinfo/torqueusers
> >
>
>
>


-- 
The more compassionate you are, the more generous you can be. The more
generous you are, the more loving-friendliness you cultivate to help
the world.

-Thich Nhat Hanh, "Buddhist Peacework"

