[torqueusers] Empty output/error log file
fyd at q4md-forcefieldtools.org
Fri Mar 25 00:03:13 MDT 2011
Quoting "Coyle, James J [ITACD]" <jjc at iastate.edu>:
Thanks, James, for your answer.
One detail I forgot to mention is that the 8 scripts executed on
the same node are very disk-intensive; up to 1 000 000 files
are generated by each script...
Thus, instead of using:
I will try using
to limit the number of jobs to 4.
I will also try your idea with 'wait'.
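A minimal sketch of how the 'wait' idea could be combined with a cap of 4 simultaneous processes inside one job. It is written in POSIX sh rather than the csh used elsewhere in this thread; the function name run_batched is hypothetical, and ./program, options, input and output are the placeholders from James's example, not real files. The nodes=1:ppn=8 request is taken from James's example so that the node is dedicated to the job.

```sh
#!/bin/sh
#PBS -l nodes=1:ppn=8

# run_batched MAX DIR...   (hypothetical helper, not part of Torque)
# Starts ./program in each DIR in the background, at most MAX at a time,
# and waits for each batch to finish before starting the next one.
run_batched() {
    max=$1
    shift
    running=0
    for dir in "$@"; do
        # The subshell keeps the cd local to this one background process.
        ( cd "$dir" && ./program options < input > output 2>&1 ) &
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait            # block until the whole batch has exited
            running=0
        fi
    done
    wait                    # final wait, so the job does not exit early
}

# Only run when started by Torque (PBS_O_WORKDIR is set by qsub).
if [ -n "${PBS_O_WORKDIR:-}" ]; then
    cd "$PBS_O_WORKDIR" || exit 1
    run_batched 4 D1 D2 D3 D4 D5 D6 D7 D8
fi
```

With this batching, a slow process holds up its whole batch, so the disk sees at most 4 writers at once but some cores may sit idle near the end of each batch.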
> I don't have an answer for the behavior you are seeing,
> but I'd venture that it is a misconfiguration somewhere
> on the system. We have sometimes had thousands of serial
> jobs in the queue, and not had these kinds of problems.
> However, I'd like to suggest another way to accomplish what
> you want to do which will likely get your jobs through faster.
> One of the user groups I support is a bioinformatics group
> which was running many serial jobs. The cluster was set up
> to favor parallel jobs, so their group did not get much throughput
> because the number of jobs per group was limited, and they were getting
> only one processor per job.
> The nodes had 16 processors each, and by using pbsdsh
> the groups were able to write a job which starts 16 processes
> in a single job so that the node is dedicated. Their
> throughput went up by nearly 16x.
> The other way you could do this for your cluster which has
> 8 processors on a node, is in a job with commands like
> #PBS -l nodes=1:ppn=8,....
> cd $PBS_O_WORKDIR
> foreach DIR ( D1 D2 D3 D4 D5 D6 D7 D8 )
>   cd $DIR
>   ./program options < input >& output &
>   cd ..
> end
> wait
> This starts 8 processes in the background on the single dedicated node
> and uses the csh wait command to wait for all of them to complete.
> The wait is crucial, as (on my cluster at least) your programs would
> be killed on job exit to clean the node for the next user.
> The background-wait approach is limited to a single node. pbsdsh
> is more general and can use many nodes, but is (in my opinion) a
> little more complicated to write the
> Good luck,
> James Coyle, PhD
> High Performance Computing Group
> 115 Durham Center
> Iowa State Univ. phone: (515)-294-2099
> Ames, Iowa 50011 web: http://www.public.iastate.edu/~jjc
>> -----Original Message-----
>> From: torqueusers-bounces at supercluster.org [mailto:torqueusers-
>> bounces at supercluster.org] On Behalf Of FyD
>> Sent: Wednesday, March 23, 2011 3:47 PM
>> To: Torque Users Mailing List
>> Subject: Re: [torqueusers] Empty output/error log file
>> Dear Michael,
>> Thanks for your answer.
>>> I am not sure I understand your problem correctly.
>> OK I am going to try to better explain...
>> I run - let's say - 20 _identical_ PBS jobs from 20 sub-directories.
>> foreach DIR (Directory*)
>>   cd $DIR
>>   qsub PBS-script.csh
>>   cd ..
>>   sleep 2
>> end
>> All the jobs are executed on the _same_ cluster node;
>> each job requests a _single_ core and we have 8 cores per node.
>> So all the jobs go into the queue; the first 8 run immediately
>> while the others are queued. The first 8 jobs finish normally and the
>> expected data files/results are generated.
>> Then, the problem that shows up is more or less random. In some
>> cases a PBS job simply leaves the queue (i.e. no data file is
>> generated) and generates only empty log & error files.
>> [---- at master0 Test]$ ls -l Directory-9
>> -rw-r--r-- 1 xxxx xxxxx 8286 fév 28 15:26 A-All-MM-SLZ1.job
>> --> this is my script
>> -rw------- 1 xxxx xxxxx 0 mar 12 07:53 IdoA-IdoA2S.o116
>> --> this is the empty error/log file generated
>>>> In some circumstances, a queued script is executed (the queuing
>>>> system does execute the PBS script, but nothing is done), an empty
>>>> output log file and an empty error log file are generated and the
>>>> job disappears from the queue.
>>> If your jobs are executed and produce no error/output then this has
>>> nothing to do with torque. As long as you get the (empty) files
>>> everything looks good. How do you realize that "nothing is done" ?
>> -1 each PBS job normally runs for at least 24 hours with many
>> accesses to the hard drive/scratch directory (no NFS system for this
>> hard drive). When "nothing is done" the queue becomes empty in a few
>> -2 no data files are written to the scratch hard drive/partition on
>> the node.
>>> Can you manually start the script on one of the nodes?
>> The problem is random. If I re-run the jobs that previously failed
>> (using the .csh script or manually) it will work for some of them while
>> others will fail again.
>>> Maybe you are missing
>>> some dependencies on the nodes? But in that case your program/job should
>>> complain and produce error messages.
>> _Nothing_ is done/generated! There is obviously no problem of space;
>> we have plenty of room in the scratch directory.
>>>> It looks like this type of problem happens when the machine on which
>>>> the PBS script is executed is busy (due to other running PBS jobs).
>>> If you have a number of jobs running on a machine and there is
>>> room (i.e. free processors) for new jobs then those will get
>>> and start to run. If there is no room for new jobs (aka "busy")
>>> nothing will be started.
>> Yes I know that ;-)
>> My feeling is that when a core is freed because a PBS script ends
>> well, its hard drive might remain busy because of other jobs.
>> Thus the hard drive might not be available for the next incoming job.
>> Does it make sense?
>> If yes, how can this problem be avoided?
>>>> - Is this problem known?
>>>> - How to avoid this type of problem?
>>>> Is it possible to request an additional 'delay' when a PBS job is
>>> Well, you can change your jobs so that they wait by
>>> themselves. (Put a sleep/wait/pause or whatever at the beginning of
>>> your job.)
>>> But for what purpose?
>> Yes 'sleep XX', but as you said for what purpose?
>>> Please state your problem more precisely, at least I can't even
>>> understand the nature of your issue.
>> I hope I was more precise in this new email.
>> regards, Francois