[gold-users] Integration with LSF, HPC MPI job

Scott Jackson scottmo at adaptivecomputing.com
Wed Mar 16 17:49:59 MDT 2011


The only integration I've seen calls greserve once for the whole job. I don't think it will work well to greserve per host or per core. As a side note, Machine in Gold means a cluster name, not a host name. The design assumed there would normally be only one reservation per job, even for parallel jobs. That said, I could see a case for multiple reservations per job if your jobs have multiple independent tasks in task groups with different resource or time characteristics. To keep it simple, though: in the average case the general rule is one job, one reservation.
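
As a rough illustration of the one-job, one-reservation rule, here is a minimal Python sketch of how an eexec-style wrapper might sum the slot counts in LSB_MCPU_HOSTS and build a single greserve command line. It only uses the greserve options that appear in the log quoted below (-J, -p, -u, -m, -P, -t). The helper names are made up for this example, and passing "amd64dcore" as the cluster name is an assumption; use whatever machine name is actually defined in your Gold setup.

```python
import shlex


def parse_mcpu_hosts(value):
    """Parse an LSB_MCPU_HOSTS string ("hostA 1 hostB 1") into (host, slots) pairs."""
    fields = value.split()
    if len(fields) % 2 != 0:
        raise ValueError("expected alternating host/slot fields: %r" % value)
    return [(fields[i], int(fields[i + 1])) for i in range(0, len(fields), 2)]


def build_reserve_command(job_id, project, user, cluster, mcpu_hosts,
                          wall_seconds, greserve="/opt/gold/bin/greserve"):
    """Build ONE greserve argument list for the whole job (hypothetical helper).

    The processor count is the sum of the slots on all execution hosts, and
    -m carries the cluster name rather than any individual host name.
    """
    total_procs = sum(slots for _, slots in parse_mcpu_hosts(mcpu_hosts))
    return [
        greserve,
        "-J", str(job_id),
        "-p", project,
        "-u", user,
        "-m", cluster,            # assumption: cluster name as known to Gold
        "-P", str(total_procs),
        "-t", str(wall_seconds),
    ]


if __name__ == "__main__":
    cmd = build_reserve_command(
        job_id=2414,
        project="lsf_p1",
        user="weilin",
        cluster="amd64dcore",
        mcpu_hosts="amd64dcore 1 pprh3.asia.corp.platform.com 1",
        wall_seconds=1200,
    )
    # Print the command for inspection; a real wrapper would execute it.
    print(shlex.join(cmd))
```

With the job from the example below, this would produce a single reservation for 2 processors and 1200 seconds, in line with the gquote call made at submission time, instead of two 1-processor reservations keyed by host name.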

----- Original Message -----
> From: "Wei Lin" <weilin at platform.com>
> To: "Gold Users Mailing List" <gold-users at supercluster.org>
> Sent: Wednesday, March 16, 2011 3:12:21 PM
> Subject: [gold-users] Integration with LSF, HPC MPI job
> Hi, Scott
> 
> An MPI job can run on multiple hosts. Would customers want to "greserve"
> per host, or just "greserve" once for the whole job?
> See example:
> Thanks
> Wei Lin
> 
> --------------------------------------------------------
> EXAMPLE:
> (1) submit an MPI job:
> [weilin at amd64dcore conf]$ bsub -P lsf_p1 -q normal -W 20 -n2 -m "amd64dcore! pprh3" -a lammpi -R"span[ptile=1]" -L /bin/tcsh mpirun.lsf /home/weilin/shell/cpi_mpi
> Quote command: /opt/gold/bin/gquote -u weilin -p "lsf_p1" -m amd64dcore -P 2 -t 1200 --verbose --quiet
> -----------------------------------------------------------
> Quote response: 2400
> -----------------------------------------------------------
> Balance response: 99727
> Balance available response: 99727
> Job <2414> is submitted to queue <normal>.
> 
> (2) the job runs on 2 hosts, 1 slot per host
> 
> [weilin at amd64dcore conf]$ bjobs 2414
> JOBID USER   STAT QUEUE  FROM_HOST  EXEC_HOST                    JOB_NAME   SUBMIT_TIME
> 2414  weilin RUN  normal amd64dcore amd64dcore                   *l/cpi_mpi Mar 15 12:42
>                                     pprh3.asia.corp.platform.com
> 
> (3) the log of eexec: reserve twice
> [weilin at amd64dcore conf]$ vi ../log/eexec.log
> LSB_MCPU_HOSTS: amd64dcore 1 pprh3.asia.corp.platform.com 1
> Reserve command: /opt/gold/bin/greserve -J 2414 -p lsf_p1 -u weilin -m amd64dcore -P 1 -t 1200 --verbose --quiet
> -----------------------------------------------------------
> Reserve response: 83 671200 <========================= first reservation
> -----------------------------------------------------------
> Reserve command: /opt/gold/bin/greserve -J 2414 -p lsf_p1 -u weilin -m pprh3.asia.corp.platform.com -P 1 -t 1200 --verbose --quiet
> -----------------------------------------------------------
> Reserve response: 84 681200 <========================= second reservation
> -----------------------------------------------------------
> local_machine = amd64dcore
> 
> (4) display the reservation on Gold, two items:
> [weilin at amd64dcore conf]$ glsres
> Id Name Amount StartTime           EndTime             Job User   Project Machine                      Accounts Description
> -- ---- ------ ------------------- ------------------- --- ------ ------- ---------------------------- -------- -----------
> 67 2414 1200   2011-03-15 12:42:32 2011-03-15 13:12:32 83  weilin lsf_p1  amd64dcore                   4
> 68 2414 1200   2011-03-15 12:42:32 2011-03-15 13:12:32 84  weilin lsf_p1  pprh3.asia.corp.platform.com 4
> [weilin at amd64dcore conf]$
> 

