[torqueusers] Node is not responding!

Gustavo Correa gus at ldeo.columbia.edu
Sat Oct 15 08:50:32 MDT 2011


Hi Hakeem

Look for them in ${TORQUE}/spool or ${TORQUE}/undelivered on the 'mother superior' node,
i.e., the first node Torque gave to your job, where ${TORQUE} is wherever you installed Torque.
This applies to jobs that fail ungracefully.
You can find out which node it was with 'qstat -n', or by looking up the job number in the Torque pbs_server logs.
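
A minimal sketch of that lookup, in case it helps. The TORQUE_HOME variable, its /var/spool/torque default, and the two function names are assumptions made for illustration, so adjust them to your install. It also assumes leftover files are named after the job ID (for example <jobid>.OU and <jobid>.ER) and that the server logs sit in a server_logs directory under the same tree.

```shell
#!/bin/bash
# Assumption: TORQUE_HOME is the Torque spool tree (${TORQUE} above).
# /var/spool/torque is only a common default; change it for your install.
TORQUE_HOME="${TORQUE_HOME:-/var/spool/torque}"

# List leftover stdout/stderr files for one job.
# Run this on the mother superior node.
find_job_output() {
    local jobid="$1" dir f
    for dir in "$TORQUE_HOME/spool" "$TORQUE_HOME/undelivered"; do
        for f in "$dir/$jobid".*; do
            # An unmatched glob stays literal, so keep only real files.
            [ -f "$f" ] && printf '%s\n' "$f"
        done
    done
    return 0
}

# Print the pbs_server log lines that mention the job.
# Run this on the host where pbs_server runs.
grep_server_log() {
    local jobid="$1"
    grep -h -F -- "$jobid" "$TORQUE_HOME"/server_logs/* 2>/dev/null
}
```

For example, 'find_job_output 16.suse-halmabr' prints any matching files it finds, and 'grep_server_log 16.suse-halmabr' prints the server log lines for that job.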

My two cents,
Gus Correa

On Oct 14, 2011, at 5:36 PM, Hakeem Almabrazi wrote:

> Steve,
> 
> Another easy question: where should the output of my script be stored?  For example, looking at the script below, I could not find the test.out or test.err files on the node I submitted the job from.
> 
> Thanks
> Hak 
> 
> -----Original Message-----
> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-bounces at supercluster.org] On Behalf Of Steve Crusan
> Sent: Thursday, October 13, 2011 5:47 PM
> To: Torque Users Mailing List
> Subject: Re: [torqueusers] Node is not responding!
> 
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> On Oct 13, 2011, at 3:34 PM, Hakeem Almabrazi wrote:
> 
>> Dear All,
>> 
>> First, I am a newbie to Torque and this is my first message to this group.  I hope I will not waste anyone's time by asking such a stupid question. I have tried to look for answers in the archived listinfo, but since there are no search capabilities built in, it is hard to find what I need.
>> 
>> Here is where I am so far:
>> 
>> I installed the Torque 3.0 package on my Linux box (SUSE 11.2).  I also configured a node on a different VM that is running SUSE as well.  Things seem to be installed and configured correctly (I think).
>> 
>> When I run pbsnodes I get
>> 
>> suse-ptpd-16
>>    state = free
>>    np = 1
>>    ntype = cluster
>>    status = rectime=1318533469,varattr=,jobs=,state=free,netload=116214003,gres=,loadave=0.00,ncpus=1,physmem=1017908kb,availmem=3012532kb,totmem=3115056kb,idletime=76,nusers=2,nsessions=7,sessions=1753 1767 1770 1889 1894 1997 3017,uname=Linux suse-ptpd-16 2.6.34-12-desktop #1 SMP PREEMPT 2010-06-29 02:39:08 +0200 i686,opsys=linux
>>    mom_service_port = 15002
>>    mom_manager_port = 15003
>>    gpus = 0
>> 
>> When I shut down the node, its state changes to "down".  This tells me everything is okay.
>> 
>> Then I tried to send my first job to the node, using this example found online:
>> 
>>> test.job
>> 
>> #!/bin/bash
>> # --- send the output to the test.out file
>> #     the default is .o<jobid>
>> #PBS -o test.out
>> # --- send the error output to the test.err file
>> #     the default is .e<jobid>
>> #PBS -e test.err
>> 
>> echo "Print out the hostname and date"
>> /bin/hostname
>> /bin/date
>> exit 0
>> 
>> And then I ran it from the head node (not as root)
>> 
>>> qsub test.job
>> 
>> Looking at the submitted jobs (I submitted the job twice):
>> 
>>> qstat
>> Job id                    Name             User            Time Use S Queue
>> ------------------------- ---------------- --------------- -------- - -----
>> 16.suse-halmabr            test.job         torqueuser             0 Q batch
>> 17.suse-halmabr            test.job         torqueuser             0 Q batch
>> 
>> 
>> However, nothing seems to happen after that.
>> 
>> Can anybody tell me what I am doing wrong or what I am missing here?  Also, if someone can direct me to a good site with examples of how to use the server, that will be highly appreciated.
> 
> 
> What scheduler are you using on top of TORQUE (PBS_SCHED, MAUI, etc.)? If there is nothing that tells TORQUE to run the job, it won't run (unless you force it to run with qrun <jobid>).
> 
> If everything is set up right with a scheduler and such, you should be able to submit an interactive job to run some basic tests, a la:    qsub -I -q batch, etc.
> 
> If you are using Maui/Moab as a scheduler, run checkjob -v <jobid>, which will give you some information on why the scheduler isn't starting the job.
> 
> Here is a link to integrating TORQUE+MAUI [open source]:
> http://www.adaptivecomputing.com/resources/docs/maui/pbsintegration.php
> 
> You should be able to use the base Maui setup with TORQUE.
> 
> Hope that helps.
> 
> 
> 
> ~Steve
> 
>> 
>> Regards,
>> 
>> ~Hak
>> 
>> 
>> 
>> 
>> _______________________________________________
>> torqueusers mailing list
>> torqueusers at supercluster.org
>> http://www.supercluster.org/mailman/listinfo/torqueusers
> 
> ----------------------
> Steve Crusan
> System Administrator
> Center for Research Computing
> University of Rochester
> https://www.crc.rochester.edu/
> 
> 
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG/MacGPG2 v2.0.17 (Darwin)
> Comment: GPGTools - http://gpgtools.org
> 
> iQEcBAEBAgAGBQJOl2qHAAoJENS19LGOpgqK/CUH/0xmHHHhXYq0AHpM7FqP7d5L
> xQRtNpbxlVejU68XFfxM7lHA/pp6lzb/niFBZG4ujHMofv84qKnAq7vFBskgIhKm
> 9AtTU+W3PkkjdHWS7lhYkSt0Wun+k0te5TMu/QfBfKpLTioEU0SqlHG+RhGgaWcn
> ow5/+TXN/tQCTayaZko7m/VbOcFv258B1lqEQFwczf7KgUcUDKsYc27lZlxG3IGj
> 20CGHGytSueGwIv1YdD9QRo7zFXMy49keN7z8nDaasDuBCMtsBpJrX2Xt77zvfcY
> ncyWknmd9+hgiAenLptFjm3T6Mt7Q+BDEWCT58buUQtgODHMiRMIHEwhTW1WE3w=
> =uEFl
> -----END PGP SIGNATURE-----


