[torqueusers] Empty output/error log file

Coyle, James J [ITACD] jjc at iastate.edu
Thu Mar 24 09:32:23 MDT 2011


Francois,

   I don't have an answer for the behavior you are seeing,
but I'd venture that it is a misconfiguration somewhere
on the system.  We have sometimes had thousands of serial
jobs in the queue and have not had these kinds of problems.
However, I'd like to suggest another way to accomplish what
you want to do, which will likely get your jobs through faster.

   One of the user groups I support is a bioinformatics group
which was running many serial jobs.  The cluster was set up
to favor parallel jobs, so the group did not get much throughput:
the number of jobs per group was limited, and they were getting
only one processor per job.

   The nodes had 16 processors each, and by using pbsdsh
the group was able to write a job which starts 16 processes
in a single job, so that the node is dedicated.  Their
throughput went up by nearly 16x.

   Another way you could do this on your cluster, which has
8 processors per node, is with a job containing commands like

#PBS -l nodes=1:ppn=8,....

cd $PBS_O_WORKDIR
foreach DIR ( D1 D2 D3 D4 D5 D6 D7 D8 )
  cd $DIR
  ./program options < input >& output   &
  cd ..
end
wait

  This starts 8 processes in the background on the single dedicated node
and uses the csh wait command to wait for all of them to complete.
The wait is crucial, as (on my cluster at least) your programs would
otherwise be killed on job exit to clean the node for the next user.
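
  For what it's worth, here is a minimal sketch of the same
background-and-wait idea in Bourne shell (sh) rather than csh, assuming
your site accepts sh job scripts. In sh, "wait PID" returns the exit
status of that one process, so the job can report how many of the 8 runs
failed instead of ending silently. The function name run_all is made up
for this illustration; D1 to D8, program, options and input are the
placeholders from the script above.

```shell
#!/bin/sh
#PBS -l nodes=1:ppn=8

# run_all CMD DIR...   (hypothetical helper, not part of Torque)
# Runs CMD in the background inside each DIR, writing stdout and stderr
# to DIR/output, then waits for every process. Returns the failure count.
run_all() {
    cmd=$1; shift
    pids=""
    for dir in "$@"; do
        ( cd "$dir" && sh -c "$cmd" > output 2>&1 ) &
        pids="$pids $!"          # remember the PID of this background run
    done
    failed=0
    for pid in $pids; do
        # wait PID yields that process's exit status
        wait "$pid" || failed=$((failed + 1))
    done
    return "$failed"
}

# Only run the real work inside a Torque job, where PBS_O_WORKDIR is set.
if [ -n "${PBS_O_WORKDIR:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    rc=0
    run_all './program options < input' D1 D2 D3 D4 D5 D6 D7 D8 || rc=$?
    echo "$rc of 8 processes failed"
fi
```

  The echo at the end lands in the job's output log, so even a job that
finishes in a few seconds leaves a trace of what happened.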

  The background-wait approach is limited to a single node. pbsdsh is more
general and can use many nodes, but (in my opinion) its script is a little
more complicated to write.
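
  As a rough sketch of what the pbsdsh variant might look like (this is
not the bioinformatics group's actual script): pbsdsh starts one copy of
a command on each processor allocated to the job, and each copy can read
its 0-based index from PBS_VNODENUM. The file name worker.sh, the helper
task_dir and the directory names D1, D2, ... are assumptions for
illustration, and I am assuming the spawned tasks see PBS_O_WORKDIR, as
they do on the Torque setups I know of; check on your own cluster.

```shell
#!/bin/sh
# worker.sh (hypothetical name): the per-processor script that pbsdsh runs.
# program, options, input and output are placeholders, as in the csh script.

# Map the 0-based pbsdsh task index to a directory name: 0 -> D1, 7 -> D8.
task_dir() {
    echo "D$(( $1 + 1 ))"
}

# Do the work only when pbsdsh has set PBS_VNODENUM for this copy.
if [ -n "${PBS_VNODENUM:-}" ]; then
    cd "$PBS_O_WORKDIR/$(task_dir "$PBS_VNODENUM")" || exit 1
    ./program options < input > output 2>&1
fi
```

  The job script itself then only needs the resource request and one
command, for example "#PBS -l nodes=2:ppn=8" followed by
"pbsdsh $PBS_O_WORKDIR/worker.sh", with worker.sh made executable.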

  Good luck,

 James Coyle, PhD
 High Performance Computing Group     
 115 Durham Center            
 Iowa State Univ.           phone: (515)-294-2099
 Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc


 
>-----Original Message-----
>From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>bounces at supercluster.org] On Behalf Of FyD
>Sent: Wednesday, March 23, 2011 3:47 PM
>To: Torque Users Mailing List
>Subject: Re: [torqueusers] Empty output/error log file
>
>Dear Michael,
>
>Thanks for your answer.
>
>> I am not sure I understand your problem correctly.
>
>OK I am going to try to better explain...
>
>I run - let's say - 20 _identical_ PBS jobs from 20 sub-directories.
>
>foreach DIR (Directory*)
>    cd $DIR
>    qsub PBS-script.csh
>    cd ..
>    sleep 2
>end
>
>all the jobs are executed on the _same_ cluster node
>each job requests a _single_ core and we have 8 cores per node.
>
>So all the jobs go in the queue; the first 8 run immediately
>while the others are queued. The first 8 jobs finish normally and
>the expected data files/results are generated.
>
>Then, the problem that shows up is more or less random. In some
>cases a PBS job simply leaves the queue (i.e. no data file is
>generated) and generates only empty log & error files.
>
>[---- at master0 Test]$ ls -l Directory-9
>
>-rw-r--r-- 1 xxxx xxxxx 8286 fév 28 15:26 A-All-MM-SLZ1.job
>    --> this is my script
>-rw------- 1 xxxx xxxxx    0 mar 12 07:53 IdoA-IdoA2S.o116
>    --> this is the empty error/log file generated
>
>>> In some circumstances, a queued script is executed (the queuing
>>> system does execute the PBS script, but nothing is done), an empty
>>> output log file and an empty error log file are generated and the
>>> job disappears from the queue.
>
>> If your jobs are executed and produce no error/output then this has
>> nothing to do with torque. As long as you get the (empty) files back
>> everything looks good. How do you realize that "nothing is done" ?
>
>Because:
>
>-1 each PBS job normally runs for at least 24 hours with many
>accesses to the hard drive/scratch directory (no NFS for this
>hard drive). When "nothing is done" the queue becomes empty in a few
>seconds.
>
>-2 no data files are written to the scratch hard drive/partition on
>the node.
>
>> Can you manually start the script on one of the nodes?
>
>The problem is random. If I re-run the jobs that previously failed
>(using the .csh script or manually) it will work for some of them
>and others will fail again.
>
>> Maybe you are missing some dependencies on the nodes? But in that
>> case your program/job should complain and produce error messages.
>
>_Nothing_ is done/generated! There is obviously no problem of space;
>we have plenty of room in the scratch directory.
>
>>> It looks like this type of problem happens when the machine on
>>> which the PBS script is executed is busy (due to other running
>>> PBS jobs for instance).
>>
>> If you have a number of jobs running on a machine and there is
>> still room (i.e. free processors) for new jobs then those will get
>> scheduled and start to run. If there is no room for new jobs (aka
>> "busy") then nothing will be started.
>
>Yes I know that ;-)
>
>My feeling is that when a core is freed because a PBS script ends
>well, its hard drive might remain busy because of other jobs
>still running. Thus the hard drive might not be available for the
>next jobs....
>
>Does it make sense?
>
>If yes, how can I avoid this problem?
>
>>> - Is this problem known?
>>> - How to avoid this type of problem?
>>> Is it possible to request an additional 'delay' when a PBS job is
>>> executed?
>>
>> Well you can change your jobs in a way that they are waiting by
>> themselves. (Put a sleep/wait/pause whatever at the beginning of
>> your job) But for what purpose?
>
>Yes 'sleep XX', but as you said for what purpose?
>
>> Please state your problem more precisely, at least I can't even
>> guess the nature of your issue.
>
>I hope I was more precise in this new email.
>
>regards, Francois