[torqueusers] 100+ job launch failures - 15009 errors.

Amitoj G Singh amitoj at cs.uh.edu
Fri Nov 9 10:44:38 MST 2007


OK, so we have a band-aid solution for the problem for now. To recap, for
large jobs (128+ nodes) the MOM superior and the MOM sisters have a race
condition: at the start of a job, within a second, the MOM superior
attempts to start the job even before the MOM sisters have joined, causing
the job to fail. The job goes from Q->R->Q several times and never runs.

In src/server/process_request.c, in function dispatch_request (line 512),
we have added a 5-second sleep, as shown in the following code snippet:

    case PBS_BATCH_Commit:

      req_commit(request);

/* Fermilab hack - add sleep after req_commit - djholm at FNAL.GOV */
/* give the MOM sisters time to join before the superior starts the job */
      sleep(5);

      net_add_close_func(sfds,(void (*)())0);

      break;
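
The 5 seconds is hard-coded. For anyone who wants to experiment with the
delay without rebuilding, something along the following lines would make it
tunable at run time. This is only an untested sketch, and PBS_COMMIT_DELAY
is a made-up name, not an environment variable Torque itself knows about:

    #include <stdlib.h>
    #include <unistd.h>

    /* Sketch only: post-commit delay in seconds.  PBS_COMMIT_DELAY is a
     * hypothetical environment variable for tuning the hack above; if it
     * is unset or not a positive number, fall back to the 5 seconds we
     * hard-coded. */

    static unsigned int commit_delay(void)

      {
      const char *val = getenv("PBS_COMMIT_DELAY");

      if ((val != NULL) && (atoi(val) > 0))
        return((unsigned int)atoi(val));

      return(5);
      }  /* END commit_delay() */

    /* and in dispatch_request(), the hard-coded call would become:
     *
     *   sleep(commit_delay());
     */

Either way this is still just a band-aid; the delay only has to be long
enough for all the MOM sisters to join before the superior starts the job.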

Garrick, we will be in Reno next week and if you have time we can share
thoughts on this problem.

- Amitoj.

> On Tue, Nov 06, 2007 at 02:52:27PM -0600, Amitoj G Singh alleged:
>> Introduction
>> ============
>> o Torque version: 2.1.6
>> o Maui version: 3.2.6p18
>> o Server kernel version: 2.6.9-55.0.2.ELsmp
>> o Worker kernel version: 2.6.21
>> o 600-node AMD dual-core dual-socket cluster, 4GB memory per worker
>>   node, fast Ethernet.
>> o 520-node Intel P4 single-core cluster, 1GB memory per worker node,
>>   fast Ethernet.
>>
>> Problem
>> ========
>>
>> The server serves both the 600-node and the 520-node cluster; each
>> cluster has its own queue. When submitting a 100+ node job to either
>> queue, the job goes from Q -> R -> Q and the following messages appear
>> on the job's worker-mom-head-node and the mom-worker-nodes ..
>>
>> pbs_mom;Svr;pbs_mom;Job with requested ID already exists (15009) in
>> im_request, KILL/ABORT request for job 202902 returned error
>
> This means pbs_mom is timing out and requests are being resent.
>
> There is a latency issue somewhere.  It could be in the OS, the
> filesystems, the network, or blocking in pbs_mom.
>
> Can we see MOM's config and the server config from qmgr?
>
>
>> Running "qrun <jobid>" several times finally kicks the job off. We
>> configured Torque 2.1.6 with the following:
>>
>> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server
>> --enable-mom --disable-gui --with-server-home=/var/spool/PBS
>> --with-default-server=foobar --with-rcp=/usr/bin/rcp
>>
>> What started this
>> =================
>> Everything was working fine until we upgraded the worker kernels from
>> 2.6.12.6 to 2.6.21.
>
> So go back to 2.6.12.6?
>
>
>> Our diagnosis attempts
>> ======================
>> Building torque 2.1.9 with ...
>>
>> ./configure --prefix=/usr/local/pbs --enable-docs --enable-server
>> --enable-mom --disable-gui --with-server-home=/var/spool/PBS
>> --with-default-server=foobar --with-rcp=/usr/bin/rcp --disable-rpp
>
> Don't use --disable-rpp.  It has no effect on your reported problem and
> is just a bad idea.  I don't know where people keep getting this from.
>
>
>> We used our old 2.1.6 server settings. With torque 2.1.9 as configured
>> above, we would get only 1 node per job: whether we requested 1 node or
>> 300 nodes, torque would only assign 1 node per job.
>>
>> Any help in this regard would be much appreciated.
>
> So you've got 2 problems?  The latency issues that came when you
> upgraded the kernel, and the scheduling problems.  The latter is
> probably because of a resources_default.ncpus setting.
>
> Note that if you are using Maui, then torque isn't assigning nodes.  You
> can use 'checkjob' to see what Maui is assigning.
>
>
>
> _______________________________________________
> torqueusers mailing list
> torqueusers at supercluster.org
> http://www.supercluster.org/mailman/listinfo/torqueusers
>
