[torqueusers] Empty output/error log file
fyd at q4md-forcefieldtools.org
Fri Mar 25 06:21:37 MDT 2011
>> /dev/sdb1 ext3 917G 414G 457G 48% /scratch
>> as you can see the /scratch partition is not full...
> still, those 1M files might be the problem.
I can request 256 000 files (smaller grid) instead of 1 000 000 and I
get the same problem.
> Are those files temporary or do they belong to the result set?
> Will they be copied back to your head node and deleted afterwards?
> I guess you use /home with nfs to get your results back?
No; otherwise (because of NFS) it would take forever...
> Please check "df -i", too. A good way to exclude inode problems is to
> run 8 jobs and issue "df -i" during computation.
[xxxx at node2 ~]$ df -i
Filesystem             Inodes    IUsed      IFree IUse% Mounted on
/dev/sda3              767232    93248     673984   13% /
/dev/sda5            59211776       15   59211761    1% /tmp
/dev/sda1               26104       41      26063    1% /boot
tmpfs                 1537806        1    1537805    1% /dev/shm
/dev/sdb1           122109952 80024695   42085257   66% /scratch
master0:/home      5859342208  1326970 5858015238    1% /home
master0:/usr/local    7285856   166312    7119544    3% /usr/local
master0:/opt          3840192    27564    3812628    1% /opt
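To take the same inode reading repeatedly while the 8 jobs are running, a
small helper along these lines could be used. This is only a sketch: the
function name inode_usage is made up for illustration, and the percentage
is computed from the raw statvfs counters, so it may differ slightly from
the IUse% column that df prints.

```python
import os


def inode_usage(path):
    """Return (total, used, percent_used) inodes for the filesystem holding path.

    Hypothetical helper, not part of TORQUE or df.
    """
    st = os.statvfs(path)
    total = st.f_files              # total inodes on the filesystem
    used = total - st.f_ffree       # inodes currently in use
    # Some filesystems report 0 total inodes; avoid dividing by zero.
    percent = 100.0 * used / total if total else 0.0
    return total, used, percent
```

Calling inode_usage("/scratch") every few seconds on node2 during a run
would show whether the inode count gets close to the limit at the moment
the 9th job is started.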
Here is what we guess:
When the first 8 jobs are started, all goes well. Then one of these 8
jobs finishes first while the other 7 are still writing to the shared
hard drive (/scratch partition), keeping the drive very busy. A core
is thus freed and the 9th job can be run. However, for a reason we do
not understand, nothing is done for this 9th job and an empty error
log file is generated.
We suspect that our hard drive (/scratch partition) is busy and the
'system' does not 'answer' when PBS sends the 9th job.
Does it make sense to you?
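One way to check this guess would be to time a tiny synchronous write on
the partition while the other jobs are running. The sketch below is an
assumption about how that could be measured, not something we have run on
the cluster; the name probe_write_latency and the 4 KiB payload are
illustrative choices.

```python
import os
import tempfile
import time


def probe_write_latency(directory, payload=b"x" * 4096):
    """Time the creation, fsync and removal of one small file in directory.

    Hypothetical probe: returns the elapsed wall-clock time in seconds.
    """
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory, prefix="probe_")
    try:
        os.write(fd, payload)
        os.fsync(fd)        # force the data to disk so a busy drive shows up
    finally:
        os.close(fd)
        os.unlink(path)     # leave no probe file behind
    return time.monotonic() - start
```

If probe_write_latency("/scratch") returned in milliseconds on an idle
node but took many seconds while 7 jobs are writing, that would support
the idea that the drive is too busy to respond when the 9th job starts.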