Hi Ole,
sorry for the late reply, but our cluster had to undergo maintenance.
I have a few notes and questions, please.
Remote nodes. LSF just reserves slots on several remote servers and launches your command line on one of those remote servers, which we can call the master node. The nodes LSF reserved are written to a file whose path is in the LSF environment variable LSB_DJOB_HOSTFILE. As an example, if LSF gives you 2 slots on server_1 and 3 slots on server_2, this file looks like:
server_1
server_1
server_2
server_2
server_2
LSF slots should correspond to server cores. That doesn't mean LSF is able to enforce the number of program instances; that must be done by the users, who may be given several slots on the same server. Following LSF syntax, which is also similar to MPI hostfile syntax, I repeated the server names, but you are saying that's useless. My question is: (Q1) How do I specify the maximum job number per host? Is it something like this (following the previous example)?
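If I understand the --sshloginfile format correctly, the slot counts could perhaps go directly into the file, one "count/host" line per server (just a guess on my part, continuing the 2-slot/3-slot example above):

```
2/server_1
3/server_2
```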
Empty result files. I guess I retrieved empty result files for different reasons: one was, as you noticed, the wrong replacement string ({%} instead of {#}), but I also had the wrong temporary directory (which must be on a shared filesystem in my case). Now I think I have reached a good point with the following script:
#!/bin/bash
#BSUB -J gnuParallel_blast_test # Name of the job.
#BSUB -o %J.out # Appends std output to file %J.out. (%J is the Job ID)
#BSUB -e %J.err # Appends std error to file %J.err.
#BSUB -q cresco3_h144 # Queue name.
#BSUB -n 70 # Number of CPUs.
module load 4.8.3/ncbi/12.0.0
module load 4.8.3/parallel/20150122
SLOTS=$(wc -l < "${LSB_DJOB_HOSTFILE}")
sort "${LSB_DJOB_HOSTFILE}" > servers
cat /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/goodProteins.fasta | \
parallel --no-notice -vv -j ${SLOTS} \
  --tmpdir /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/tmp \
  --wait --slf servers \
  --block 200k --recstart '>' --pipe \
  blastp -evalue 1e-05 -outfmt 6 \
    -db /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/goodProteins \
    -query - \
    -out /gporq1_1M/usr/aprea/bio/solanum_melongena/analysis/orthomcl_00/resultd_{#}
wait
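As an aside, if the repeated hostnames are indeed redundant, I suppose the runtime server file could be collapsed to one "slots/host" line per server. A sketch of what I mean, with a sample file standing in for the one ${LSB_DJOB_HOSTFILE} points to:

```shell
# Toy stand-in for the file named by ${LSB_DJOB_HOSTFILE}:
printf 'server_1\nserver_1\nserver_2\nserver_2\nserver_2\n' > hostfile.sample

# Collapse to one "count/host" line per server:
sort hostfile.sample | uniq -c | awk '{print $1 "/" $2}' > servers
cat servers
# 2/server_1
# 3/server_2
```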
The server file generated at runtime was:
(I had not read your message about repeated hostnames when I launched.)
This time the stderr did not seem too bad (just a few warnings):
Using only -1 connections to avoid race conditions.
Using only -1 connections to avoid race conditions.
Using only -1 connections to avoid race conditions.
(Q2) Do you have any comments on that?
I retrieved 348 result files (all of them non-empty) and cat-ed them into a single file. The problem now is that for this test I ran an all-vs-all BLAST, so I expect at least 1 hit for each sequence in the input (each sequence vs. itself). Unfortunately that is not the case:
awk '{print $1}' resultd_all | sort | uniq | wc -l
175610
egrep "^>" goodProteins.fasta |wc -l
175625
As you can see, I have 15 sequence IDs missing. I am still investigating, but I would like to ask you (Q3) whether those IDs could have been lost during data chunk creation (I used "--block 200k --recstart '>' --pipe") and, if so, how I could avoid that?
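To narrow this down, I am listing the missing IDs with something like the following (shown here with toy stand-in files in place of the real goodProteins.fasta and the concatenated resultd_all):

```shell
# Toy stand-ins for the real input FASTA and the concatenated BLAST output:
printf '>tom|A\nMFVP\n>tom|B\nMYVI\n>tom|C\nMKLL\n' > goodProteins.sample.fasta
printf 'tom|A\ttom|A\t100\ntom|B\ttom|B\t100\n' > resultd_all.sample

# IDs present in the input FASTA but absent from the BLAST output:
grep '^>' goodProteins.sample.fasta | sed 's/^>//' | sort -u > ids.fasta
awk '{print $1}' resultd_all.sample | sort -u > ids.hits
comm -23 ids.fasta ids.hits
# prints tom|C
```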
This is the input file structure:
head -n 12 goodProteins.fasta
>tom|Solyc00g005000.2.1
MFVPSIFLVFIMSCIISASVSYESKSTSGHAISFPTHEHLDVNQAIKEIIQPPETVHDNI
NNIVDDDDDNSRWKLKLLHRDKLPFSHFTDHPHSFQARMKRDLKRVHTLTNTTTNDNNKV
IKEEELGFGFGSEVISGMEQGSGEYFVRIGVGSPVRQQYMVIDAGSDIVWVQCQPCTHCY
HQSDPVFDPSLSASFTGVPCSSSLCNRIDNSGCHAGRCKYQVMYGDGSYTKGTMALETLT
FGRTVIRDVAIGCGHSNHGMFIGAAGGAFSYCLVSRGTNTGSTGSLEFGREVLPAGAAWV
PLIRNPRAPSFYYIGMLGLGVGGVRVPIPEDAFRLTEEGDGGVVMDTGTAVTRLPHEAYV
AFRDAFVAQTSSLPRAPAMSIFDTCYDLNGFVTVRVPTISFFLMGGPILTLPARNFLIPV
DTKGTFCFAFAPSPSRLSIIGNIQQEGIQISIDGANGFVGFGPNIC*
>tom|Solyc00g005020.1.1
MYVICKCICIDILIYMLLKVVEEKPQKDKKRRASDRGVLAQSHENVTNTEMAQERNVNER
LSRGRGITQHSQTSSEANCSGGVLGRGKRPAEHEDTSEGQTRPFKWPRMVGVGIYQAEDG
.....
Many thanks,
giuseppe