[torqueusers] A strange problem with jobs getting killed during quotacheck

Prakash Velayutham Prakash.Velayutham at cchmc.org
Mon Nov 27 20:29:54 MST 2006


Hello,

I have a strange problem. One of my users is running a bunch of
Perl-based serial jobs on a cluster running Torque 2.1.6. His jobs
typically run for more than a day.

Earlier, we had noticed that his jobs generally get killed at around
3 am, though not consistently. Today I realized that my home-directory
NFS server runs its quotacheck at 3 am on Mon, Wed, and Fri, and that
is exactly when his jobs stop and get killed.

To check this, I changed the quotacheck cron job to run at 8 pm
tonight, and 6-7 of his jobs were killed right around that time.
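
For reference, the crontab entries involved would look something like
this (the path and options are my approximation, not the exact entries):

# regular schedule: 3 am on Mon, Wed and Fri
0 3 * * 1,3,5   /sbin/quotacheck -avug
# temporary test run at 8 pm
0 20 * * *      /sbin/quotacheck -avug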

Here is an excerpt from the server_priv/accounting file.

11/27/2006 20:00:01;E;56054.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1B_12 queue=users ctime=1164641156 qtime=1164641156
etime=1164641156 start=1164641156 exec_host=valine/1
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=21734 end=1164675601 Exit_status=30
resources_used.cput=05:57:38 resources_used.mem=578564kb
resources_used.vmem=594316kb resources_used.walltime=09:34:05
11/27/2006 20:00:02;E;56064.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1L_22 queue=users ctime=1164641187 qtime=1164641187
etime=1164641187 start=1164641187 exec_host=histidine/1
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=5308 end=1164675602 Exit_status=30
resources_used.cput=06:44:36 resources_used.mem=146788kb
resources_used.vmem=162548kb resources_used.walltime=09:33:35
11/27/2006 20:00:03;E;56060.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1H_18 queue=users ctime=1164641176 qtime=1164641176
etime=1164641176 start=1164641177 exec_host=tyrosine/1
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=29537 end=1164675603 Exit_status=30
resources_used.cput=07:08:31 resources_used.mem=417064kb
resources_used.vmem=430168kb resources_used.walltime=09:33:46
11/27/2006 20:00:03;E;56063.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1K_21 queue=users ctime=1164641184 qtime=1164641184
etime=1164641184 start=1164641185 exec_host=histidine/0
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=5150 end=1164675603 Exit_status=30
resources_used.cput=08:14:48 resources_used.mem=578588kb
resources_used.vmem=594316kb resources_used.walltime=09:33:39
11/27/2006 20:00:04;E;56056.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1D_14 queue=users ctime=1164641162 qtime=1164641162
etime=1164641162 start=1164641162 exec_host=lysine/1
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=7358 end=1164675604 Exit_status=30
resources_used.cput=07:08:04 resources_used.mem=578568kb
resources_used.vmem=594316kb resources_used.walltime=09:34:02
11/27/2006 20:00:05;E;56065.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1M_23 queue=users ctime=1164641190 qtime=1164641190
etime=1164641190 start=1164641190 exec_host=glutamine/0
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=28876 end=1164675605 Exit_status=30
resources_used.cput=08:59:34 resources_used.mem=578576kb
resources_used.vmem=594316kb resources_used.walltime=09:33:35
11/27/2006 20:00:06;E;56066.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1N_24 queue=users ctime=1164641194 qtime=1164641194
etime=1164641194 start=1164641194 exec_host=glutamine/1
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=29033 end=1164675606 Exit_status=30
resources_used.cput=07:31:38 resources_used.mem=191632kb
resources_used.vmem=207340kb resources_used.walltime=09:33:32
11/27/2006 20:00:08;E;56068.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1P_26 queue=users ctime=1164641199 qtime=1164641199
etime=1164641199 start=1164641199 exec_host=cysteine/1
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=6263 end=1164675608 Exit_status=30
resources_used.cput=08:36:37 resources_used.mem=591740kb
resources_used.vmem=608492kb resources_used.walltime=09:33:29
11/27/2006 20:00:10;E;56059.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1G_17 queue=users ctime=1164641174 qtime=1164641174
etime=1164641174 start=1164641174 exec_host=tyrosine/0
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=29379 end=1164675610 Exit_status=30
resources_used.cput=07:55:55 resources_used.mem=40748kb
resources_used.vmem=63012kb resources_used.walltime=09:33:56
11/27/2006 20:00:39;E;56067.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1O_25 queue=users ctime=1164641196 qtime=1164641196
etime=1164641196 start=1164641197 exec_host=cysteine/0
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=6106 end=1164675639 Exit_status=13
resources_used.cput=08:55:50 resources_used.mem=62824kb
resources_used.vmem=79552kb resources_used.walltime=09:34:03
11/27/2006 20:00:40;E;56061.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1I_19 queue=users ctime=1164641179 qtime=1164641179
etime=1164641179 start=1164641179 exec_host=serine/0
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=19205 end=1164675640 Exit_status=13
resources_used.cput=07:30:03 resources_used.mem=578980kb
resources_used.vmem=594700kb resources_used.walltime=09:34:21
11/27/2006 20:03:41;E;56053.ribosome.cchmc.org;user=mkordos group=users
jobname=Atlas1A_11 queue=users ctime=1164641146 qtime=1164641146
etime=1164641146 start=1164641146 exec_host=valine/0
Resource_List.neednodes=1:ppn=1 Resource_List.nodect=1
Resource_List.nodes=1:ppn=1 session=21557 end=1164675821 Exit_status=13
resources_used.cput=09:36:16 resources_used.mem=29476kb
resources_used.vmem=65196kb resources_used.walltime=09:37:55

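To line those end times up with the quotacheck window, a quick sketch
along these lines (untested, and the accounting file path is a guess
for our layout) converts the epoch end= stamps to local time and pulls
out the exit codes:

import re, time

# Path is a guess -- adjust for the local torque spool location.
ACCT = "/var/spool/torque/server_priv/accounting/20061127"

for record in open(ACCT):
    parts = record.rstrip("\n").split(";", 3)
    if len(parts) < 4 or parts[1] != "E":   # keep only job-end records
        continue
    jobid, rest = parts[2], parts[3]
    fields = dict(re.findall(r"(\S+)=(\S+)", rest))
    if fields.get("user") != "mkordos":
        continue
    end = int(fields["end"])                # epoch seconds, as in end= above
    print(jobid,
          time.strftime("%H:%M:%S", time.localtime(end)),
          "Exit_status=" + fields.get("Exit_status", "?"))
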
Some of them have an exit status of 30 and the rest have 13, and I do
not know what these values mean. Any help?
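
A long shot, in case it means anything to someone: if these happen to
be plain errno values coming out of the perl scripts, they can be
looked up like this (purely a guess on my part):

import errno

# Pure guess: treat the exit statuses as errno values and look up
# their symbolic names (on Linux, 30 -> EROFS and 13 -> EACCES).
for code in (30, 13):
    print(code, errno.errorcode.get(code, "unknown"))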

Thanks,
Prakash

