From: Jay Hacker
Subject: Re: starvation with "--load" on multicore (was: parallel stops working for no obvious reason)
Date: Mon, 2 Apr 2012 10:37:52 -0400

I have had similar issues with --load for a long time, which is a
shame because it's a great feature I'd really like to use. I can
reproduce it every time on 4-core and 16-core 64-bit RedHat 5 boxes
(parallel version 20120322):
PARALLEL=--load 100% --verbose
cores on XXX: 16
creating 4 files:
touch 1
touch 2
touch 3
touch 4
deleting 4 files:
rm 1
rm 2
rm 3
rm 4
...hang.
Unfortunately if I try to strace the offending parallel, it works
fine. :) If I add --debug this is what I get:
deleting 4 files:
1 128
128 0
2 2048
2048 0
3 32768
32768 0
4 524288
524288 -1
Maxlen: 32768,524288,278528
5 278528
278528 -1
Maxlen: 32768,278528,155648
6 155648
155648 -1
Maxlen: 32768,155648,94208
7 94208
94208 0
Maxlen: 94208,155648,124928
8 124928
124928 0
Maxlen: 124928,155648,140288
9 140288
140288 -1
Maxlen: 124928,140288,132608
10 132608
132608 -1
Maxlen: 124928,132608,128768
11 128768
128768 0
Maxlen: 128768,132608,130688
12 130688
130688 0
Maxlen: 130688,132608,131648
13 131648
131648 -1
Maxlen: 130688,131648,131168
14 131168
131168 -1
Maxlen: 130688,131168,130928
15 130928
130928 0
Maxlen: 130928,131168,131048
16 131048
131048 0
Maxlen: 131048,131168,131108
17 131108
131108 -1
Maxlen: 131048,131108,131078
18 131078
131078 -1
Maxlen: 131048,131078,131063
19 131063
131063 0
Maxlen: 131063,131078,131070
20 131070
131070 0
Maxlen: 131070,131078,131074
21 131074
131074 -1
Maxlen: 131070,131074,131072
22 131072
131072 -1
Maxlen: 131070,131072,131071
23 131071
131071 0
Wanted procs: 16
MultifileQueue->empty
RecordQueue->empty
read 1
Time to fork 1 procs: 0 (processes so far: 1)
MultifileQueue->empty
RecordQueue->empty
read 2
Time to fork 2 procs: 0 (processes so far: 2)
MultifileQueue->empty
RecordQueue->empty
read 3
Time to fork 3 procs: 0 (processes so far: 3)
MultifileQueue->empty
RecordQueue->empty
read 4
Time to fork 4 procs: 0 (processes so far: 4)
MultifileQueue->empty 1
RecordQueue->empty 1
MultifileQueue->empty 1
RecordQueue->empty 1
CommandLineQueue->empty 1
JobQueue->empty 1
MultifileQueue->empty 1
RecordQueue->empty 1
CommandLineQueue->empty 1
JobQueue->empty 1
RecordQueue-unget 'ARRAY(0x8db4d30) ARRAY(0x8db9510) ARRAY(0x8db9740) ARRAY(0x8db95d0)'
Limited to procs: 4
Running jobs before on :: 0
No loadavg file: /home/XXX/.parallel/tmp/loadavg-11743-:Updating
loadavg file/home/XXX/.parallel/tmp/loadavg-11743-:Reaper called 1
Reaper exit 1
Start draining
RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
Running jobs before on :: 0
New loadavg: 0.01Last update: 1333376942max_loadavg: : 16RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
: has 0 out of 4 jobs running. Start another.
RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
RecordQueue->empty
RecordQueue->empty
RecordQueue->empty
RecordQueue->empty
MultifileQueue->empty 1
RecordQueue->empty 1
MultifileQueue->empty 1
RecordQueue->empty 1
RecordQueue-unget 'ARRAY(0x8db4d30) ARRAY(0x8db9510) ARRAY(0x8db9740) ARRAY(0x8db95d0)'
cmd_line->number_of_args 1
Command to run on 'SSHLogin=HASH(0x8a528d0)': 'rm 1'
rm 1
1 processes. Starting (1): rm 1
Started as seq 1
Job started on :
RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
: has 1 out of 4 jobs running. Start another.
RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
RecordQueue->empty
cmd_line->number_of_args 1
Command to run on 'SSHLogin=HASH(0x8a528d0)': 'rm 2'
rm 2
2 processes. Starting (2): rm 2
Reaper called 1 died (0): 1>>joboutput rm 1
ERR:
OUT:
<<joboutput rm 1
Running jobs before on :: 0
New loadavg: 0.01Last update: 1333376942max_loadavg: : 16RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
: has 0 out of 4 jobs running. Start another.
RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
RecordQueue->empty
cmd_line->number_of_args 1
Command to run on 'SSHLogin=HASH(0x8a528d0)': 'rm 3'
rm 3
2 processes. Starting (3): rm 3
Started as seq 3
Job started on :
RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
: has 1 out of 4 jobs running. Start another.
RecordQueue->empty
CommandLineQueue->empty
JobQueue->empty
RecordQueue->empty
cmd_line->number_of_args 1
Command to run on 'SSHLogin=HASH(0x8a528d0)': 'rm 4'
rm 4
3 processes. Starting (4): rm 4
Started as seq 4
Job started on :
MultifileQueue->empty 1
RecordQueue->empty 1
CommandLineQueue->empty 1
JobQueue->empty 1
Running jobs after on :: 2 of 4
died (0): 3>>joboutput rm 3
ERR:
OUT:
<<joboutput rm 3
Running jobs before on :: 1
New loadavg: 0.01Last update: 1333376942max_loadavg: : 16MultifileQueue->empty 1
RecordQueue->empty 1
CommandLineQueue->empty 1
JobQueue->empty 1
Running jobs after on :: 1 of 4
Reaper exit 1
Reaper called 1 Reaper exit 1
Started as seq 2
Job started on :
MultifileQueue->empty 1
RecordQueue->empty 1
CommandLineQueue->empty 1
JobQueue->empty 1
Running jobs after on :: 2 of 4
Sleeping 0.22 millisecs
jobs running: 2==2 slots: 4 Memory usage:98131968 Sleeping 0.242 millisecs
Reaper called 1 died (0): 4>>joboutput rm 4
ERR:
OUT:
<<joboutput rm 4
Running jobs before on :: 1
New loadavg: 0.01Last update: 1333376942max_loadavg: : 16MultifileQueue->empty 1
RecordQueue->empty 1
CommandLineQueue->empty 1
JobQueue->empty 1
Running jobs after on :: 1 of 4
Reaper exit 1
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.2662 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.29282 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.322102 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.3543122 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.38974342 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.428717762 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.4715895382 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.51874849202 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.570623341222 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.6276856753442 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.69045424287862 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.759499667166483 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.835449633883131 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 0.918994597271444 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.01089405699859 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.11198346269845 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.22318180896829 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.34549998986512 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.48004998885163 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.6280549877368 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.79086048651048 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 1.96994653516153 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 2.16694118867768 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 2.38363530754545 millisecs
Reaper called 1 Reaper exit 1
jobs running: 1==1 slots: 4 Memory usage:98131968 Sleeping 2.62199883829999 millisecs
...and this just goes on forever.
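
For anyone who wants to check a --debug log like this for jobs that were started but never reaped, here is a minimal Python sketch. It only pattern-matches the "Starting (N):" and "died (0): N>>joboutput" lines as they appear in the output above; the function name and the regular expressions are my own assumptions and not part of parallel, and this is a reading of the log rather than a diagnosis of parallel's internals.

```python
import re

# Excerpt of the --debug lines above that mention job starts and reaps,
# copied in the order they appear in the log.
DEBUG_EXCERPT = """\
1 processes. Starting (1): rm 1
2 processes. Starting (2): rm 2
Reaper called 1 died (0): 1>>joboutput rm 1
2 processes. Starting (3): rm 3
3 processes. Starting (4): rm 4
died (0): 3>>joboutput rm 3
Reaper called 1 died (0): 4>>joboutput rm 4
"""

# Assumed line shapes, taken from the log above (not from parallel's source).
STARTED = re.compile(r"Starting \((\d+)\):")
DIED = re.compile(r"died \(\d+\): (\d+)>>joboutput")


def unreaped_jobs(log_text):
    """Return the sequence numbers that were started but never reported as died."""
    started = {int(n) for n in STARTED.findall(log_text)}
    died = {int(n) for n in DIED.findall(log_text)}
    return sorted(started - died)


if __name__ == "__main__":
    print(unreaped_jobs(DEBUG_EXCERPT))
```

On the excerpt this prints [2]: seq 2 ("rm 2") is started but never shows up in a "died" line, which would be consistent with the "jobs running: 1==1" lines that repeat until the end of the log.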
On Mon, Apr 2, 2012 at 4:59 AM, Thomas Sattler
<sattler@med.uni-frankfurt.de> wrote:
>>> As you probably can imagine that is hard to reproduce. See if
>>> you can make smaller example fail - preferably something that
>>> can run on smaller machines.
>>
>> I wrote a small script that shows the problem. It completes
>> in less than 10 seconds on my desktop (two cores), but hangs
>> (read: "does not complete within hours") on two other
>> machines (8/32 cores).
>
> I left the script running and it did not complete within 3 days!
> A modified version of the trigger is attached. Having a look at
> the temporary directory, 'parallel' hangs _after_ all files
> have been created (or removed).
>
> I just tested the new script on all machines again: "2core" and
> "8core" successfully completed 10 consecutive runs, but "32core"
> still hangs _every time_ a script is run.
>
> Could someone with 8-32 (or even more?) cores please try to
> reproduce the issue?
>
> Thomas