Re: Broken Pipe
From: Ole Tange
Subject: Re: Broken Pipe
Date: Wed, 26 Sep 2012 18:48:52 +0200
On Tue, Sep 25, 2012 at 3:47 AM, Joseph White <hybris246@gmail.com> wrote:
:
> cat url-list | parallel --eta --progress --joblog jobnew.log -j0 ./linkcheck
> {} >> errors.log
:
> Here is an example of what the url-list contains:
>
> http://www.hairforsale.com
> http://www.rdhjobs.com
> http://www.gdha.org
> http://www.hotdogsafari.com
>
> Would using the pipe command speed up the process significantly?
The --pipe option completely changes the way GNU Parallel works, just as
'xargs' and 'cat' are not the same tool and cannot be used to solve the
same problems: parallel (without --pipe) is similar to xargs, while
parallel --pipe is similar to cat. But using parallel in --pipe mode to
start parallel without --pipe might be the solution; see below.
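To make the distinction concrete, here is a minimal sketch (printf and wc
are stand-in commands for illustration, not from the thread):

```shell
# Without --pipe each input line becomes a command-line argument,
# so the command runs once per line (xargs-style):
printf 'a\nb\nc\n' | parallel echo line: {}

# With --pipe stdin is cut into blocks and fed to each job's stdin
# (cat-style); here one block holds all three lines, so wc -l sees 3:
printf 'a\nb\nc\n' | parallel --pipe wc -l
```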
> Also would the -u option speed up the process?
-u will speed up parallel, but it comes at a price: the output from
different jobs can be mixed together and is thus no good for further
processing. Since you save the output, I advise against it.
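As a small illustration of the mixing risk (the sleep and echo commands
are made up for the example, not from the thread):

```shell
# Grouped (default): each job's output is buffered and printed whole,
# so 'slow' comes out as a complete line even though 'fast' finished first.
parallel -j2 ::: 'sleep 0.2; echo slow' 'echo fast'

# Ungrouped (-u): output is passed through as it arrives; 'fast' will
# usually print first, and partial lines from concurrent jobs can mix.
parallel -j2 -u ::: 'sleep 0.2; echo slow' 'echo fast'
```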
I assume that you have measured and that GNU Parallel really is the
limiting factor in your process (i.e. it is using 100% CPU). In that
case what limits you is GNU Parallel's ability to spawn jobs quickly
enough. What you can do is split the url-list into chunks and run
multiple GNU Parallels in parallel. And you can of course use GNU
Parallel to do just that:
cat urllist | parallel -j10 --pipe parallel -j0 ./linkcheck >> errors.log
This will read urllist in chunks of 1 MB (the default block size) and
pass each chunk to a second parallel, which will spawn linkcheck for
each of the lines in that chunk.
If -j0 normally spawns 500 jobs, then the above will spawn 5000 jobs.
You can adjust -j10, but be warned: using -j0 instead is likely to
kill your machine, as that will try to spawn 250000 jobs.
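The chunking itself can be observed in a self-contained way (seq and
wc -l stand in for urllist and linkcheck; the small --block value is an
assumption just to force several chunks):

```shell
# The outer parallel cuts 1000 input lines into roughly 1 kB chunks;
# each job counts the lines in its chunk, and the counts sum to 1000.
seq 1000 | parallel --pipe --block 1k -j4 wc -l
```

Adjusting --block up or down trades fewer, larger chunks against more,
smaller ones, the same knob you would tune alongside -j10 above.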
/Ole