[torqueusers] Empty output/error log file

FyD fyd at q4md-forcefieldtools.org
Wed Mar 23 14:46:58 MDT 2011


Dear Michael,

Thanks for your answer.

> I am not sure I understand your problem correctly.

OK, I will try to explain it better.

I run - let's say - 20 _identical_ PBS jobs from 20 sub-directories.

# csh: submit one job per sub-directory, 2 seconds apart
foreach DIR (Directory*)
    cd $DIR
    qsub PBS-script.csh
    cd ..
    sleep 2
end

All the jobs are executed on the _same_ cluster node.
Each job requests a _single_ core, and we have 8 cores per node.

So all the jobs go into the queue; the first 8 start running
immediately while the others wait. The first 8 jobs finish normally
and the expected data files/results are generated.

The problem that then shows up is more or less random. In some cases
a PBS job simply leaves the queue without doing anything (i.e. no
data file is generated) and produces only empty log & error files.

[---- at master0 Test]$ ls -l Directory-9

-rw-r--r-- 1 xxxx xxxxx 8286 fév 28 15:26 A-All-MM-SLZ1.job
    --> this is my script
-rw------- 1 xxxx xxxxx    0 mar 12 07:53 IdoA-IdoA2S.o116
    --> this is the empty error/log file generated
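
To see which jobs were affected, something like the following could be
used. It is only a sketch: it is written in POSIX sh rather than csh,
the function name list_empty_logs is made up for illustration, and it
assumes the output files follow the <name>.o<jobid> pattern shown in
the listing above.

```shell
#!/bin/sh
# list_empty_logs DIR
# Print every zero-length Torque output file (<name>.o<jobid>) found
# below DIR. The function name is hypothetical; the file pattern is
# taken from the listing above.
list_empty_logs() {
    # -size 0 matches only empty files; the pattern matches names
    # such as IdoA-IdoA2S.o116
    find "$1" -type f -name '*.o[0-9]*' -size 0 -print
}
```

Calling "list_empty_logs ." from the directory that holds the
Directory* sub-directories would print one path per job that came back
with an empty output file.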

>> In some circumstances, a queued script is executed (the queuing system
>> does execute the PBS script, but nothing is done), an empty output log
>> file and an empty error log file are generated and the job disappears
>> from the queue.

> If your jobs are executed and produce no error/output then this has
> nothing to do with torque. As long as you get the (empty) files back
> everything looks good. How do you realize that "nothing is done" ?

Because:

-1 each PBS job normally runs for at least 24 hours, with many
accesses to the hard drive/scratch directory (this hard drive is
local, not NFS-mounted). When "nothing is done", the queue becomes
empty within a few seconds.

-2 no data files are written to the scratch hard drive/partition on
the node.

> Can
> you manually start the script on one of the nodes?

The problem is random. If I re-run the jobs that previously failed
(using the .csh script or manually), some of them work and others
fail again.

> Maybe you are missing
> some dependencies on the nodes? But in that case your program/job should
> complain and produce error messages.

_Nothing_ is done/generated! There is obviously no problem of space;  
we have plenty of room in the scratch directory.
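
One way to narrow this down might be a small preflight step at the
very top of the job script, so that the output file is never empty if
the script really starts. What follows is a sketch under assumptions:
it is POSIX sh rather than csh (so it would have to be translated or
called as a separate helper), the function name preflight is made up,
and the scratch path has to be passed in because I do not give it
here.

```shell
#!/bin/sh
# preflight SCRATCH_DIR
# Hypothetical helper: record where and when the job started, then
# check that the scratch directory exists and accepts a new file.
preflight() {
    scratch=$1
    # Goes to stdout, so it should end up in the job's .o file.
    echo "host=$(uname -n) start=$(date '+%Y-%m-%dT%H:%M:%S')"
    if [ ! -d "$scratch" ]; then
        echo "preflight: $scratch is not a directory" >&2
        return 1
    fi
    # Free space on the file system holding the scratch directory.
    df -P "$scratch" | tail -n 1
    # Try to create and remove a small probe file.
    probe="$scratch/.preflight.$$"
    if ! touch "$probe" 2>/dev/null; then
        echo "preflight: cannot write to $scratch" >&2
        return 1
    fi
    rm -f "$probe"
    echo "preflight: $scratch is writable"
}
```

If the output file is still empty with such a step in place, the
script presumably never reached its first line; if the host and date
are there, the failure happens later, inside the job itself.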

>> It looks like this type of problem appends when the machine on which
>> the PBS script is executed is busy (due to other running PBS jobs for
>> instance).
>
> If you have a number of jobs running on a machine and there is still
> room (i.e. free processors) for new jobs then those will get scheduled
> and start to run. If there is no room for new jobs (aka "busy") then
> nothing will be started.

Yes I know that ;-)

My feeling is that when a core is freed because a PBS script ends
normally, the node's hard drive might remain busy because of the
other jobs still running. The hard drive might therefore not be
available to the next job that starts.

Does it make sense?

If yes, how can I avoid this problem?

>> - Is this problem known?
>> - How to avoid this type of problem?
>> Is it possible to request an additional 'delay' when a PBS job is executed?
>
> Well you can change your jobs in a way that they are waiting by
> themselves. (Put a sleep/wait/pause whatever at the beginning of you job)
> But for what purpose?

Yes, 'sleep XX', but as you said, for what purpose?
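
If the busy-disk idea is right, a fixed sleep would only help by
chance. A bounded retry that waits until the scratch directory accepts
a write could test the idea more directly. This is a sketch only, in
POSIX sh rather than csh; the function name wait_for_writable and the
default values (5 tries, 10 seconds) are arbitrary choices of mine.

```shell
#!/bin/sh
# wait_for_writable DIR [TRIES] [DELAY_SECONDS]
# Hypothetical helper: return 0 as soon as a probe file can be created
# in DIR, or 1 after TRIES failed attempts.
wait_for_writable() {
    dir=$1
    tries=${2:-5}
    delay=${3:-10}
    i=1
    while [ "$i" -le "$tries" ]; do
        probe="$dir/.probe.$$"
        if touch "$probe" 2>/dev/null; then
            rm -f "$probe"
            return 0
        fi
        # Messages go to stderr, so they show up in the job's error file.
        echo "attempt $i/$tries: $dir not writable" >&2
        # Sleep only if another attempt follows.
        if [ "$i" -lt "$tries" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

If the retry messages never appear in the error file of a job that
failed, the disk is probably not the cause and the delay would serve
no purpose.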

> Please state your problem more precisely, at least I can't even guess
> the nature of your issue.

I hope I was more precise in this new email.

regards, Francois



