[torqueusers] torque tmpdir on Lustre filesystem

Lukasz Flis l.flis at cyf-kr.edu.pl
Fri Feb 24 03:24:30 MST 2012


Hello Christopher, Hi *

>
> We don't use Lustre (we have Panasas and GPFS), but just wondering
> does this happen all the time, or only occasionally ?

It happens occasionally. But as I said, this looks like a bug in the Lustre
FS and has nothing to do with the Torque code. Torque happens to use a
sequence of stat/mkdir calls that exposes the Lustre misbehaviour.

> If occasionally then if the job fails once, will it always fail, or
> will it work if you try again?

Another call to the mkdirtree() function should succeed after a few
seconds of sleep.
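
To illustrate the retry idea, here is a minimal sketch in C. It is not
Torque's mkdirtree() code; the helper name mkdir_retry() and its
attempts/delay_sec parameters are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of Torque: call mkdir() and, if it fails
 * with EPERM, sleep and try again up to `attempts` times in total.
 * EEXIST is treated as success, on the assumption that the caller only
 * needs the path to exist. Any other error is returned immediately. */
int mkdir_retry(const char *path, mode_t mode, int attempts, unsigned delay_sec)
{
    assert(path != NULL);

    for (int i = 0; i < attempts; i++) {
        if (mkdir(path, mode) == 0 || errno == EEXIST)
            return 0;
        if (errno != EPERM)
            return -1;              /* a different error: do not retry */
        if (i + 1 < attempts)
            sleep(delay_sec);       /* wait before the next attempt */
    }
    errno = EPERM;                  /* still failing after all attempts */
    return -1;
}
```

A caller could use something like mkdir_retry(path, 0700, 3, 2); the
number of attempts and the delay are guesses and would need tuning on a
real system.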

I believe this behaviour in the Lustre client appeared in the 1.8.x line and
remains in 2.1.x. HP SFS, IIRC, is based on 1.4 and 1.6, so it's not affected.

We observed the bug on Lustre 1.8.(4,5,6) infrastructure. Then we
moved to the 2.1 line, replacing all the components (servers, arrays,
fabric), and the bug remained.

The problem with Lustre is that a mkdir() call on an EXISTING directory
once in a while returns EPERM instead of EEXIST, usually when
stat() is called before mkdir().

I believe doing mkdir on an existing path is not a very common practice, and
that's the reason the bug went unnoticed for a long time.

Cheers,
--
Lukasz Flis




