[torqueusers] Empty output/error log file

FyD fyd at q4md-forcefieldtools.org
Fri Mar 25 00:03:13 MDT 2011


Quoting "Coyle, James J [ITACD]" <jjc at iastate.edu>:

Thanks, James, for your answer.

A detail I forgot to mention is that the 8 scripts executed on the
same node are very demanding on the hard disk: up to 1,000,000 files
are generated by each script...

Thus, instead of using:
#PBS -l nodes=1:ppn=1
I will try using
#PBS -l nodes=1:ppn=2
so that each job reserves 2 cores and at most 4 jobs run concurrently
per node.

I will also try your idea with 'wait'.
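
To make the 'wait' idea concrete, here is a minimal sketch of what I
have in mind for my own case (a sketch only: the batch size of 8
matches our cores per node, the Directory* names come from my
submission loop below, and 'program' stands for my real executable):

#PBS -l nodes=1:ppn=8

cd $PBS_O_WORKDIR
set N = 0
foreach DIR (Directory*)
  cd $DIR
  # start each run in the background
  ./program < input >& output &
  cd ..
  @ N++
  # after every 8 background runs, wait for them before starting more
  if ($N % 8 == 0) wait
end
wait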

regards, Francois


>    I don't have an answer for the behavior you are seeing,
> but I'd venture that it is a misconfiguration somewhere
> on the system.  We have sometimes had thousands of serial
> jobs in the queue and not had these kinds of problems.
> However, I'd like to suggest another way to accomplish what
> you want to do, which will likely get your jobs through faster.
>
>    One of the user groups I support is a bioinformatics group
> which was running many serial jobs.  The cluster was set up
> to favor parallel jobs, so their group did not get much throughput
> because the number of jobs per group was limited, and they were getting
> only one processor per job.
>
>    The nodes had 16 processors each, and by using pbsdsh
> the groups were able to write a job which starts 16 processes
> in a single job so that the node is dedicated.  Their
> throughput went up by nearly 16x.
>
>    The other way you could do this on your cluster, which has
> 8 processors per node, is with a job containing commands like
>
> #PBS -l nodes=1:ppn=8,....
>
> cd $PBS_O_WORKDIR
> foreach DIR ( D1 D2 D3 D4 D5 D6 D7 D8 )
>   cd $DIR
>   # launch each run in the background so all 8 start at once
>   ./program options < input >& output   &
>   cd ..
> end
> # wait for all 8 background runs to finish before the job exits
> wait
>
>   This starts 8 processes in the background on the single dedicated node
> and uses the csh wait command to wait for all of them to complete.
> The wait is crucial, as (on my cluster at least) your programs would
> be killed on job exit to clean the node for the next user.
>
>   The background-wait approach is limited to a single node. pbsdsh
> is more general and can use many nodes, but (in my opinion) the
> script is a little more complicated to write.
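>
>   For illustration, a pbsdsh job might look roughly like the sketch
> below. This is only a sketch: 'run-one.csh' is a hypothetical helper
> script, the D0..D15 directory names are made up, and it assumes the
> job's environment (e.g. PBS_O_WORKDIR) reaches the spawned tasks.
> pbsdsh starts one copy of the command per allocated execution slot,
> with PBS_VNODENUM set so each copy can pick its own directory.
>
> #PBS -l nodes=2:ppn=8
>
> cd $PBS_O_WORKDIR
> pbsdsh $PBS_O_WORKDIR/run-one.csh
>
> and in run-one.csh:
>
> #!/bin/csh
> # PBS_VNODENUM runs from 0 to 15 here, one value per spawned task
> cd $PBS_O_WORKDIR/D$PBS_VNODENUM
> ./program options < input >& output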
>
>   Good luck,
>
>  James Coyle, PhD
>  High Performance Computing Group
>  115 Durham Center
>  Iowa State Univ.           phone: (515)-294-2099
>  Ames, Iowa 50011           web: http://www.public.iastate.edu/~jjc
>
>
>
>> -----Original Message-----
>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>> bounces at supercluster.org] On Behalf Of FyD
>> Sent: Wednesday, March 23, 2011 3:47 PM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] Empty output/error log file
>>
>> Dear Michael,
>>
>> Thanks for your answer.
>>
>>> I am not sure I understand your problem correctly.
>>
>> OK, I will try to explain this better...
>>
>> I run - let's say - 20 _identical_ PBS jobs from 20 sub-directories.
>>
>> foreach DIR (Directory*)
>>    cd $DIR
>>    qsub PBS-script.csh
>>    cd ..
>>    sleep 2
>> end
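>>
>> (As a side note, to trace which job ran in which directory I could
>> also record the job id that qsub prints -- a minimal sketch of the
>> same loop; 'submitted.log' is just an illustrative file name:)
>>
>> foreach DIR (Directory*)
>>    cd $DIR
>>    # qsub prints the new job id on stdout; keep it for later tracing
>>    set JOBID = `qsub PBS-script.csh`
>>    echo "$DIR $JOBID" >> ../submitted.log
>>    cd ..
>>    sleep 2
>> end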
>>
>> All the jobs are executed on the _same_ cluster node:
>> each job requests a _single_ core, and we have 8 cores per node.
>>
>> So all the jobs go into the queue; the first 8 run immediately
>> while the others wait. These first 8 jobs end well, and the
>> expected data files/results are generated.
>>
>> The problem that then shows up is more or less random. In some
>> cases a PBS job simply drops out of the queue (i.e. no data file is
>> generated) and produces only empty output and error log files.
>>
>> [---- at master0 Test]$ ls -l Directory-9
>>
>> -rw-r--r-- 1 xxxx xxxxx 8286 fév 28 15:26 A-All-MM-SLZ1.job
>>    --> this is my script
>> -rw------- 1 xxxx xxxxx    0 mar 12 07:53 IdoA-IdoA2S.o116
>>    --> this is the empty error/log file generated
>>
>>>> In some circumstances, a queued script is executed (the queuing system
>>>> does execute the PBS script, but nothing is done), an empty output log
>>>> file and an empty error log file are generated and the job disappears
>>>> from the queue.
>>
>>> If your jobs are executed and produce no error/output then this has
>>> nothing to do with torque. As long as you get the (empty) files back
>>> everything looks good. How do you realize that "nothing is done" ?
>>
>> Because:
>>
>> -1 Each PBS job normally runs for at least 24 hours, with many
>> accesses to the hard drive/scratch directory (this drive is not on
>> NFS). When "nothing is done", the queue becomes empty within a few
>> seconds.
>>
>> -2 No data files are written to the scratch drive/partition on
>> the node.
>>
>>> Can
>>> you manually start the script on one of the nodes?
>>
>> The problem is random. If I re-run the jobs that previously failed
>> (using the .csh script or manually), some of them will work and
>> others will fail again.
>>
>>> Maybe you are missing some dependencies on the nodes? But in that
>>> case your program/job should complain and produce error messages.
>>
>> _Nothing_ is done/generated! There is obviously no problem with
>> disk space; we have plenty of room in the scratch directory.
>>
>>>> It looks like this type of problem happens when the machine on which
>>>> the PBS script is executed is busy (due to other running PBS jobs, for
>>>> instance).
>>>
>>> If you have a number of jobs running on a machine and there is still
>>> room (i.e. free processors) for new jobs then those will get scheduled
>>> and start to run. If there is no room for new jobs (aka "busy") then
>>> nothing will be started.
>>
>> Yes, I know that ;-)
>>
>> My feeling is that when a core is freed because a PBS script ends
>> well, the node's hard drive might remain busy because of the other
>> jobs still running. Thus the hard drive might not be available for
>> the next incoming jobs...
>>
>> Does it make sense?
>>
>> If yes, how can I avoid this problem?
>>
>>>> - Is this problem known?
>>>> - How to avoid this type of problem?
>>>> Is it possible to request an additional 'delay' when a PBS job is
>>>> executed?
>>>
>>> Well, you can change your jobs in a way that they wait by
>>> themselves (put a sleep/wait/pause or whatever at the beginning of
>>> your job). But for what purpose?
>>
>> Yes 'sleep XX', but as you said for what purpose?
>>
>>> Please state your problem more precisely, at least I can't even guess
>>> the nature of your issue.
>>
>> I hope I was more precise in this new email.
>>
>> regards, Francois


