Re: Revision of GNU Parallel's processing of SIGTERM


From:	Martin d'Anjou
Subject:	Re: Revision of GNU Parallel's processing of SIGTERM
Date:	Sun, 12 Apr 2015 19:53:03 -0400
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.6.0

On 15-04-12 07:14 AM, Ole Tange wrote:

On Sat, Apr 11, 2015 at 12:56 AM, Martin d'Anjou
<martin.danjou14@gmail.com> wrote:

Hello Ole,

I worked on the SIGTERM propagation feature today. I have questions; they
are also in the code in the form of comments, if you prefer to
read them there (search for "Question"):
https://github.com/martinda/gnu-parallel/compare/sigterm-1?expand=1#diff-5379ba718ef5b0a2feb45981e768a9fd

Q1:
Inside sub wait_and_exit, $job->kill("TERM") is called twice. As I am trying
to update the documentation, I find this complex to explain.
Do you know why the call is made twice?
Should I write my own "wait_and_exit" for the SIGTERM propagation feature?

I think it is a leftover from when $job->kill() did not send 2 TERMs.

The idea is to handle programs like GNU Parallel (which need 2 TERMs
to exit) that are started from GNU Parallel.

I understand now. Very clear. Another special program is emacs: I have read that SIGINT does not kill it! I have one other program like this, a 3rd party binary unfortunately.

Q2:
I have added a [--wait-for-children [GRACE_PERIOD]] option for the user to
extend the grace period of $sleepsum in case the user is dealing with
processes that take a long time to "put to rest".
My question: should this option be available in general, or just for the
propagation feature?

Do we really need an option for this? I would like to see at least 2
real life scenarios, where this makes sense and for which a hard coded
value will not work.

I really do not like the current --wait-for-children solution that I proposed. After much thinking it is a bit too specific, and it does not fit well.

I have prepared the documentation for a different approach. I will send another email to keep things separate. This discussion is getting to be a lot of text.


In terms of a real life scenario, I can offer an overview of my workflow.

Some processes take a long time to terminate from the point of view of GNU Parallel, because between the time GNU Parallel issues the TERM signal and the time GNU Parallel hears back from the processes, there could be an amount of time longer than 200ms. For example, the current chain of command with SIGTERM in my workflow is: Jenkins, script, script, GNU Make, GNU Parallel, script, grid engine submission host, grid engine master, grid engine execution host, script, program. The last program is CPU/RAM/IO intensive; the layers above are for build management. When users hit the "kill the running job" button, SIGTERM has to make its way down to the low level program, the low level program does some work to properly terminate the process (could be a few seconds), and then it goes back up the chain. At each level, a little processing needs to happen to close that level properly. Each level along the way works better when its child process terminates in an orderly fashion. The delay between sending SIGTERM and hearing back from the child-most process can be more than 200ms.

I hope this demonstrates that in some cases, extending the grace period beyond 200ms benefits the user.

Q3:
Still in the wait_and_exit subroutine, the grace period is "ANDed" with the
family_pids[0].
Why just the 0'th element? Why not the entire array?

You mean in sub Job::kill():

             # Wait up to 200 ms between TERMs - but only if any pids are alive
             my $sleep = 1;
             for (my $sleepsum = 0; kill 0, $family_pids[0] and $sleepsum < 200;
                  $sleepsum += $sleep) {
                 $sleep = ::reap_usleep($sleep);
             }

'kill 0, pid' returns true if the process is still running.
$family_pids[0] is the immediate child (i.e. the parent of any
(grand)*children).
There is no need to see if any (grand)*children are running: it is the
job of $family_pids[0] to kill those.


Ok, I understand now. Yes this makes sense. I agree.

The for loop runs up to 200 ms, but if the pid dies earlier, then the
loop exits.
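
The bounded wait described above can be sketched as follows. This is a minimal Python illustration, not GNU Parallel's Perl code. It assumes a POSIX-like system and that the job is a direct child held as a subprocess.Popen object. The names wait_while_alive and limit_ms are hypothetical. It uses Popen.poll() instead of 'kill 0, pid', because poll() also reaps the child, so an exited but unreaped child is not mistaken for a live one.

```python
import subprocess
import sys
import time


def wait_while_alive(proc, limit_ms=200):
    """Poll proc until it exits or limit_ms has elapsed.

    Returns True if the process exited within the limit, False otherwise.
    The limit is an upper bound: the function returns as soon as the
    process is gone.
    """
    deadline = time.monotonic() + limit_ms / 1000.0
    sleep_s = 0.001  # start at 1 ms and back off gradually
    while proc.poll() is None:  # poll() reaps the child if it has exited
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(sleep_s, remaining))
        sleep_s = min(sleep_s * 1.5, 0.05)
    return True


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "pass"])
    print("exited within limit:", wait_while_alive(child, limit_ms=5000))
```

The growing sleep interval is only loosely modelled on what a helper such as ::reap_usleep might do; its actual back-off policy is not shown in this thread.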

But maybe this should be revised:

When a job times out (--timeout) we want to kill it. It is OK to give
it 200 - 1000 ms to clean up, so 'kill TERM', wait, 'kill TERM', wait,
'kill KILL'.
When GNU Parallel receives 2 TERMs, it should for all jobs 'kill
TERM', wait, 'kill TERM', wait, 'kill KILL'.
The wait should always be an upper limit: Do not wait a full second,
if the job finishes faster.

I am not sure whether GNU Parallel should also kill the
(grand*)children, and if so how that should be done to work well for
most cases. Maybe:

'kill TERM', wait, 'kill TERM', wait, 'kill KILL',
'kill KILL @grandchildren_pid'

This way the parent is given a chance to clean up, but if it does not
manage, then GNU Parallel does the cleaning. It would be good to have
test cases for this kind of scenario.
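
The proposed sequence could look roughly like the Python sketch below. It is an illustration under assumptions, not the GNU Parallel implementation. It assumes a POSIX-like system, the job is a direct child (subprocess.Popen), and the caller already knows the pids of any (grand)children. The names escalate, grace_ms and grandchildren are hypothetical. Each wait is an upper limit, and the grandchildren are killed only if the parent had to be killed with KILL, which is one possible reading of the proposal.

```python
import os
import signal
import subprocess
import sys
import time


def _wait_up_to(proc, limit_ms):
    """Return True as soon as proc has exited, False after limit_ms."""
    deadline = time.monotonic() + limit_ms / 1000.0
    while proc.poll() is None:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(0.005, remaining))
    return True


def escalate(proc, grace_ms=200, grandchildren=()):
    """Send TERM, wait, TERM, wait, then KILL; return the signals sent."""
    sent = []
    for _ in range(2):
        if proc.poll() is not None:
            break  # already gone, nothing more to send
        proc.send_signal(signal.SIGTERM)
        sent.append("TERM")
        if _wait_up_to(proc, grace_ms):
            break  # exited before the grace period ran out
    if proc.poll() is None:
        proc.kill()  # SIGKILL cannot be caught or ignored
        sent.append("KILL")
        proc.wait()
        # The parent could not clean up, so do it for the parent.
        for pid in grandchildren:
            try:
                os.kill(pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # that process is already gone
    return sent


if __name__ == "__main__":
    job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(escalate(job, grace_ms=1000))
```

A test case for this scenario would start a child that ignores TERM and check that the sequence ends with KILL.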


The new tests I wrote are very close to this. They are on github for now:
https://github.com/martinda/gnu-parallel/blob/sigterm-1/testsuite/tests-to-run/parallel-local-signals.sh

I should be able to write one for this scenario if needed.

Thank you very much for your explanations, they help a lot.

Martin
