From: George Marselis
Subject: Re: parallel + blast + LSF
Date: Wed, 15 Apr 2015 23:27:26 +0300
Hi George!

I am not sure who you are talking with, Martin or me? Just a reminder: the original topic is about using blast under parallel with LSF. Martin's problem sounds like something off-topic.

You have both sysadmin and bioinformatics experience, so I would really appreciate your help! I am working on a cluster, so I must use LSF to get slots, and I would also prefer to use parallel, since it splits the input automatically with --recstart (which is quite nice :D otherwise I would have to use another script for that). I see I could do better with the chunk size (I have 1 record at a time in my example), but that is a secondary problem for now. First I have the "lsb_launch(): Failed while waiting for tasks to finish." issue to solve.
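To be concrete, this is roughly what I mean by doing better with the chunk size; it is only a sketch with placeholder file and database names, not my actual command:

# sketch only, placeholder paths: group many FASTA records per blastp call
# (via --block) instead of one record at a time (-N 1), so each blastp
# invocation does more work per startup
cat sequences.fasta | parallel --pipe --recstart '>' --block 1M \
    blastp -evalue 1e-05 -outfmt 6 -db my_db -query - -out result_{#}
# {#} is the job sequence number, so every chunk gets its own output file;
# -N 100 would pass exactly 100 records per job instead of a size-based block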
cheers,

g

On Wed, Apr 15, 2015 at 7:44 PM, George Marselis <george@marsel.is> wrote:

By the way, LSF and GNU parallel do almost the same thing, so using one of the two defeats the purpose of using the other. In the same way, you could have used LSF to submit your jobs to LSF:

bsub < script.sh

where script.sh was:

bsub -J amoeba -q smalljobs qfasta file1
bsub -J amoeba -q smalljobs qfasta file2
...
bsub -J amoeba -q smalljobs qfasta file2000

On Wed, Apr 15, 2015 at 8:39 PM, George Marselis <george@marsel.is> wrote:

Hi. LSF/OpenLava sysadmin in bioinformatics and parallel user here.

I have seen this a couple more times: you are trying to use GNU parallel to submit the jobs to all nodes. That's not the way to do things: you should not submit jobs on *all* your nodes. Please don't do that, as bsub was not designed to read large chunks of jobs. bsub writes the jobs to your home directory, so if your storage is not designed for a lot of writes, you are going to blow the cluster out of the water.

What you want to do is look up either:

1. bsub scripts: https://rc.fas.harvard.edu/resources/documentation/legacy-lsf/lsf-submit-an-lsf-job/
2. job arrays: https://rc.fas.harvard.edu/resources/documentation/legacy-lsf/lsf-submitting-lots-of-short-jobs-job-arrays/

Both bsub scripts and job arrays are useful to you. A bsub script can be generated as part of a pipeline: you program your pipeline to write out the bsub script and then submit it to bsub. So, instead of submitting your job 2000 times, as in

bsub job0
bsub job1
...
bsub job1999

you just submit "bsub < scriptname", where scriptname contains the 2000 lines that describe your jobs, and you are done. The rest is handled by bsub/LSF.
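As a rough sketch of the bsub-script route (assuming your inputs really are named file1 ... file2000 and qfasta is your wrapper, as in the example above):

#!/bin/bash
# sketch only: have the pipeline write the submission script ...
for i in $(seq 1 2000); do
    echo "bsub -J amoeba -q smalljobs qfasta file${i}"
done > script.sh

# ... then hand the whole thing to LSF in one go
bsub < script.sh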
Now, if your jobs are similar in a way that you just increment a counter (as in most bioinformatics jobs), use job arrays:

bsub -J JOBNAME[0-1999]

where JOBNAME is a string you would like to name your job, e.g. "fasta files alignment".

These techniques are useful because you can submit all 2000 jobs in less than a second, you can do it from a single node, and you will not have to deal with a grumpy sysadmin or grumpy colleagues who cannot use the cluster. Just make sure you use the appropriate queue.

Let me know if you have any questions.

Best Regards,
George Marselis
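A sketch of what that could look like for your 2000 fasta files (the job name and the qfasta wrapper are placeholders carried over from the example above; LSF sets $LSB_JOBINDEX inside each array element):

# sketch only, placeholder names: one array element per input file.
# The single quotes keep the submitting shell from expanding $LSB_JOBINDEX;
# it is expanded at run time on the execution host instead.
bsub -J "fasta_align[1-2000]" -q smalljobs 'qfasta file$LSB_JOBINDEX'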
On Wed, Apr 15, 2015 at 6:48 PM, Martin d'Anjou <martin.danjou14@gmail.com> wrote:

Hi,

Thanks for clarifying. I want to use GNU Parallel to bsub jobs. This way I can use GNU Parallel to throttle the number of jobs that are submitted to LSF, and it is easier than writing a loop.
parallel -j 100 my_script [bsub options] ::: {1..2000}
my_script (pseudo-code):
#!/bin/bash
...
bsub [bsub options] command ...   # submit the real work to LSF
post-process data                 # placeholder for the post-processing step
This way I can submit jobs, say 100 at a time. When I submit all 2000 jobs at once, things get problematic and I start hitting limits on file descriptors, etc.
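For what it is worth, a fleshed-out sketch of what my_script could look like (the queue, command and file names are placeholders; the sketch assumes bsub's -K option, which makes bsub wait for the job to finish, so that parallel's -j limit actually caps the number of jobs in LSF at any one time and the post-processing only runs after the job completes):

#!/bin/bash
# sketch only: $1 is the index passed in by parallel ( ::: {1..2000} )
idx="$1"

# -K makes bsub block until the LSF job completes (assumption: acceptable here),
# so "parallel -j 100" keeps at most 100 jobs in LSF at a time
bsub -K -q myqueue -oo "job_${idx}.out" my_command "input_${idx}"

# post-processing placeholder: only reached once the LSF job has finished
echo "job ${idx} done" >> progress.log

It would still be invoked the same way: parallel -j 100 my_script ::: {1..2000}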
Thanks for sharing,
Martin
On 15-04-15 11:35 AM, Giuseppe Aprea wrote:
Hi Martin,
I am not sure I understand. As far as I can see, things work exactly the opposite way: you have an LSF script which launches GNU Parallel on some hosts provided by LSF. Something like:
------------------------------------------------------------------------------------------
#!/bin/bash
#BSUB -J gnuParallel_blast_test   # Name of the job.
#BSUB -o %J.out                   # Appends std output to file %J.out. (%J is the Job ID)
#BSUB -e %J.err                   # Appends std error to file %J.err.
#BSUB -q large                    # Queue name.
#BSUB -n 30                       # Number of CPUs.

module load 4.8.3/ncbi/12.0.0
module load 4.8.3/parallel/20150122

SLOTS=`cat ${LSB_DJOB_HOSTFILE} | wc -l`

SERVER=""

for i in `cat ${LSB_DJOB_HOSTFILE} | sort`
do
    echo "/afs/enea.it/software/bin/blaunch.sh ${i}" >> servers
done

cat absolute_path_to_sequences.fasta | parallel --no-notice -vv -j ${SLOTS} --slf servers --plain --recstart '>' -N 1 --pipe blastp -evalue 1e-05 -outfmt 6 -db absolute_path_to_db_file -query - -out absolute_path_to_result_file_{%}
------------------------------------------------------------------------------------------
LSF is the one that gives you the execution hosts, so if you are launching bsub from GNU Parallel, how do you know how to set the --slf option?
g
On Wed, Apr 15, 2015 at 4:24 PM, Martin d'Anjou <martin.danjou14@gmail.com> wrote:
On 15-04-15 09:34 AM, Giuseppe Aprea wrote:
Hi all,
I would like to ask you, please, for some help in using parallel with the blast alignment software.
I am trying to use GNU parallel v. 20150122 with blast for a very large sequence alignment. I am using Parallel on a cluster which uses LSF as its queue system.
Hello Giuseppe,
I am an avid LSF user, and I want to use GNU Parallel to dispatch jobs to LSF. Could you please explain a little bit to me how GNU Parallel works with LSF? I do not see it in the on-line tutorials. For example, I would like to understand how to pass "bsub" options like -oo, -q queue_name, etc. to LSF from GNU Parallel.
Thanks,
Martin