Re: [lmi] Using auto-vectorization
From: Vadim Zeitlin
Subject: Re: [lmi] Using auto-vectorization
Date: Mon, 23 Jan 2017 22:26:43 +0100
On Sat, 21 Jan 2017 11:21:12 +0000 Greg Chicares <address@hidden> wrote:
GC> On 2017-01-21 01:20, Vadim Zeitlin wrote:
GC> > On Fri, 20 Jan 2017 23:47:14 +0000 Greg Chicares <address@hidden> wrote:
GC> >
GC> > GC> Would you like to propose a patch to 'expression_template_0_test.cpp'
GC>
GC> I should also mention 'vector_test.cpp', which implements a tiny ET class
GC> and compares its performance to std::valarray and hand-written C.
Thanks, I forgot about this one. Running it gives slightly different
results: with -O2 we get
Time (seconds) for array0 = array1 + array2 by various methods
length          C         et   et/C         va   va/C
     1  5.043e-08  4.850e-08  0.962  4.571e-08  0.906
    10  5.372e-08  5.187e-08  0.966  5.142e-08  0.957
    20  5.547e-08  5.448e-08  0.982  5.261e-08  0.949
    50  8.757e-08  7.518e-08  0.859  7.305e-08  0.834
   100  1.177e-07  1.058e-07  0.898  1.036e-07  0.880
  1000  7.356e-07  7.067e-07  0.961  6.884e-07  0.936
 10000  7.202e-06  7.016e-06  0.974  6.688e-06  0.929
100000  8.970e-05  8.790e-05  0.980  8.588e-05  0.957
i.e. both "et" and "va" are faster than C, somehow. And with -O3 (which
includes auto-vectorization):
Time (seconds) for array0 = array1 + array2 by various methods
length          C         et   et/C         va   va/C
     1  5.218e-08  5.057e-08  0.969  4.714e-08  0.903
    10  5.515e-08  5.191e-08  0.941  4.702e-08  0.853
    20  5.105e-08  5.043e-08  0.988  4.645e-08  0.910
    50  6.275e-08  6.856e-08  1.093  5.653e-08  0.901
   100  7.027e-08  7.475e-08  1.064  7.224e-08  1.028
  1000  3.537e-07  3.505e-07  0.991  3.404e-07  0.962
 10000  4.932e-06  4.824e-06  0.978  4.878e-06  0.989
100000  6.623e-05  6.552e-05  0.989  6.634e-05  1.002
The precision of the output doesn't reflect the actual confidence interval
at all: the numbers vary a lot between runs, and for the absolute timings
only the first digit after the decimal point can be trusted. But it still
seems safe to say that:
- Auto-vectorization brings noticeable benefits starting from N=50.
- Both "et" and "va" are roughly as fast as manual C code.
Looking at the generated assembly seems to show, additionally, that:
- The generated code is not optimal because, if I understand it correctly,
the compiler doesn't assume that the arrays are aligned on a 32-byte
boundary. I thought using an explicit __attribute__((aligned(32))) would
help with this, but somehow it doesn't change the generated code at all
and I don't know why. I could only convince the compiler to generate code
without misalignment fixups by explicitly using __builtin_assume_aligned(),
which is ugly:
---------------------------------- >8 --------------------------------------
diff --git a/vector_test.cpp b/vector_test.cpp
index cc0ddcd..44d22fd 100644
--- a/vector_test.cpp
+++ b/vector_test.cpp
@@ -287,9 +287,12 @@ void demo1()
 void mete_c()
 {
+    double* const w = (double*)__builtin_assume_aligned (g_w.begin(), 32);
+    double* const u = (double*)__builtin_assume_aligned (g_u.begin(), 32);
+    double* const v = (double*)__builtin_assume_aligned (g_v.begin(), 32);
     for(int i = 0; i < g_array_length; ++i)
         {
-        g_w[i] = g_u[i] + g_v[i];
+        w[i] = u[i] + v[i];
         }
 }
---------------------------------- >8 --------------------------------------
and doesn't seem to change the results all that much. (A small wrapper,
sketched just after the AVX timings below, would at least hide the casts.)
- The code of mete_{c,et,va} is not exactly identical: for some reason each
one does the same thing using slightly different instructions, and again I
don't understand why. What I do see, however, is that the inner loop is
exactly the same in all 3 cases and looks like this:
.L:
movupd (%rbx,%rax), %xmm0
addl $1, %r8d
addpd (%r12,%rax), %xmm0
movups %xmm0, (%rdx,%rax)
addq $16, %rax
cmpl %r11d, %r8d
jb .L
i.e. it operates on 128 bits, i.e. two doubles, at once. Enabling the use
of AVX (present in all Intel CPUs since Sandy Bridge) replaces it with the
following (don't pay attention to the register renaming, I don't think it's
significant):
.L:
vmovapd (%r8,%r9), %ymm0
addl $1, %r10d
vaddpd (%rdi,%r9), %ymm0, %ymm0
vmovapd %ymm0, (%rsi,%r9)
addq $32, %r9
cmpl %r10d, %edx
ja .L
which now operates on 256 bits, i.e. four doubles, at once. While this
doesn't make the loop twice as fast, it still results in noticeable gains
starting from N as small as 20, but is significantly slower for N >= 10000:
Time (seconds) for array0 = array1 + array2 by various methods
length          C         et   et/C         va   va/C
     1  5.399e-08  5.068e-08  0.939  4.935e-08  0.914
    10  4.972e-08  4.837e-08  0.973  5.022e-08  1.010
    20  4.504e-08  4.430e-08  0.984  4.949e-08  1.099
    50  4.780e-08  4.667e-08  0.976  5.324e-08  1.114
   100  5.332e-08  5.256e-08  0.986  6.280e-08  1.178
  1000  2.152e-07  2.076e-07  0.965  2.845e-07  1.322
 10000  5.658e-06  5.525e-06  0.976  5.509e-06  0.974
100000  7.167e-05  6.636e-05  0.926  6.669e-05  0.930
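Coming back to the __builtin_assume_aligned() ugliness above: a small
wrapper would at least hide the casts. This is only a sketch, assuming gcc
or clang, and lmi_assume_aligned is a name I just made up:
#include <cstddef>

// Hypothetical helper: promises the optimizer that p is aligned on an
// N-byte boundary and returns a typed pointer, so the cast doesn't have
// to be spelled out at every use site.
template<std::size_t N, typename T>
inline T* lmi_assume_aligned(T* p)
{
    return static_cast<T*>(__builtin_assume_aligned(p, N));
}
With it, mete_c() could start with e.g.
"double* const w = lmi_assume_aligned<32>(&g_w[0]);", which of course only
helps if &g_w[0] really is 32-byte aligned -- which is exactly what the
default allocator doesn't promise.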
I am very puzzled by the results of these benchmarks, and I wonder whether
there is a problem in the benchmarking methodology itself, because I just
don't see how else to explain this. I am going to try two things in order
to clear it up, unless you think it's a needless waste of time: first, I'd
like to change TimeAnAliquot() to also calculate the standard deviation of
the times, and maybe use it to run the test more times in order to lower
it. And second, I'm going to rerun these tests on another machine, to check
whether the results are reproducible.
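To be concrete about the first point, the calculation I have in mind is
just the sample standard deviation of the individual trial timings, which
TimeAnAliquot() would need to collect in addition to their mean. A minimal
sketch (names invented):
#include <cmath>
#include <numeric>
#include <vector>

// Sample standard deviation, in seconds, of the per-trial timings.
double sample_stddev(std::vector<double> const& times)
{
    if(times.size() < 2)
        {
        return 0.0;
        }
    double const n    = static_cast<double>(times.size());
    double const mean = std::accumulate(times.begin(), times.end(), 0.0) / n;
    double sum_sq = 0.0;
    for(double t : times)
        {
        sum_sq += (t - mean) * (t - mean);
        }
    return std::sqrt(sum_sq / (n - 1.0));
}
Reporting mean +/- stddev, and perhaps repeating the test until the stddev
falls below some threshold, should make the tables above less noisy.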
GC> > GC> so that we can measure what you'd prefer
GC>
GC> I'm guessing that you'll prefer lambdas. If you like, you could just give
GC> me your implementation parallel to mete_c() in each of those unit tests,
GC> and I'll add it. Or would your preferred implementation simply replace
GC> the classic for-loops with ranged ones? and, if so, could you show me how
GC> you'd write mete_cxx11_typical() to parallel the other mete_*_typical(),
GC> if that's a better test of what you have in mind?
The test results seem to invalidate what I had in mind, namely just writing
the loops manually, as the expression template approaches have the same
performance as manual loops even with auto-vectorization. So now I'm not
really sure what, if anything, needs to be done.
FWIW I think lambdas _combined_ with a range library such as Boost.Range
or https://github.com/ericniebler/range-v3 could be a reasonable approach,
but if we already have an expression template library, I don't see any good
reason not to use it.
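For reference, the shape I had in mind was roughly the following -- a
sketch only, with an invented name, and with plain C++11 iterators where
Boost.Range or range-v3 would let the containers themselves be passed:
#include <algorithm>
#include <vector>

// Sketch: the iteration is delegated to a standard algorithm and the
// per-element operation is a lambda. Requires w.size() >= u.size().
void mete_cxx11_sketch
    (std::vector<double> const& u
    ,std::vector<double> const& v
    ,std::vector<double>&       w
    )
{
    std::transform
        (u.begin()
        ,u.end()
        ,v.begin()
        ,w.begin()
        ,[](double ui, double vi) {return ui + vi;}
        );
}
The lambda body can contain an arbitrarily complex per-element expression,
so the "typical" test cases would follow the same pattern with a longer
lambda.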
GC> BTW, due to problem-domain considerations, typical lmi
GC> vectors are of size fifty or so, and probably always in [10, 100].
Thanks, this is by far the most important point in this discussion.
Knowing this, I now wonder whether it really matters what we do here at
all. Unless we perform thousands of operations on arrays of this size, we
could be shelling out to a Python interpreter to perform these operations
and the effect on performance would still be unnoticeable.
So, after spending quite some time on all this, I'm led to wondering what
exactly we are trying to achieve here. Sorry for only asking this question
now, but better late than never, I guess...
VZ