Discussion:
[phobos] Ndslice speed
Matthias Redies via phobos
2016-08-27 19:43:03 UTC
Permalink
Hello,

I've come across the library experimental.ndslice, which is supposed to
mimic NumPy. In order to test it I wrote a very crude matrix multiplication:

http://pastebin.com/Ew4u2iVz

and for comparison I also implemented it in Fortran90:

http://pastebin.com/6afnVyZF
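(The pastebin links may no longer resolve. For reference, a minimal sketch of
the kind of crude triple-loop multiplication described above, using
std.experimental.ndslice, might look like the following; the matrix size and
fill values are assumptions, not the original pastebin code.)

```d
import std.experimental.ndslice : sliced;
import std.stdio : writeln;

void main()
{
    enum n = 600;
    // View three flat arrays as n x n matrices.
    auto a = new double[n * n].sliced(n, n);
    auto b = new double[n * n].sliced(n, n);
    auto c = new double[n * n].sliced(n, n);

    foreach (i; 0 .. n)
        foreach (j; 0 .. n)
        {
            a[i, j] = 1.0;
            b[i, j] = 1.0;
            c[i, j] = 0.0;
        }

    // Crude triple-loop matrix multiplication: c = a * b.
    foreach (i; 0 .. n)
        foreach (j; 0 .. n)
        {
            double s = 0.0;
            foreach (k; 0 .. n)
                s += a[i, k] * b[k, j];
            c[i, j] = s;
        }

    writeln(c[0, 0]); // each entry is 600 for all-ones inputs
}
```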

Then I used linux's "time" command to time them each:

ifort test.f90 && time ./a.out
600

real 0m0.154s
user 0m0.148s
sys 0m0.004s


dmd test.d && time ./test
1.16681e+08
600 600

real 0m6.770s
user 0m6.772s
sys 0m0.004s


I understand that dmd is not optimized for speed, but in the end both do
basically the same thing. Both implement 2D arrays, and both array types
include the size of the array (unlike C). Given that both are compiled
languages, the difference seems to be unreasonably large.

If I turn on boundschecking for Fortran I get:

ifort -check all test.f90 && time ./a.out
600

real 0m6.049s
user 0m6.044s
sys 0m0.004s


which is roughly the speed difference I'd expect. Using the
-boundscheck=off option with dmd, however, doesn't help. Am I using
ndslice correctly? Why is the speed difference so large? How do I speed
it up?

Kind regards

Matthias
Daniel Murphy via phobos
2016-08-31 07:18:59 UTC
Permalink
It may help to turn dmd's optimizer and inliner on - "dmd -inline
-release -O -boundscheck=off".

On Sun, Aug 28, 2016 at 5:43 AM, Matthias Redies via phobos
_______________________________________________
phobos mailing list
http://lists.puremagic.com/mailman/listinfo/phobos
Martin Nowak via phobos
2016-12-29 18:13:34 UTC
Permalink
Post by Matthias Redies via phobos
Given that both are compiled languages the difference seems to be
unreasonably large.
The difference is likely so huge because one is using vectorized ops
(SSE), which dmd doesn't generate.
Use an optimizing compiler that supports auto-vectorization (ldc or gdc)
and get in touch with the ndslice authors.
You should be able to get very similar numbers.

-Martin
Martin Nowak via phobos
2016-12-30 01:34:47 UTC
Permalink
Post by Matthias Redies via phobos
Posting a reply from Ilya here:

Hi Matthias,

It is not correct to compare the same code for ndslice and Fortran, because:

1. The current ndslice models NumPy-like vectors: matrices always carry
both row and column strides.
2. m[i, j] cannot be vectorized even for non-strided vectors, because of
a D language constraint: D has no macro engine, and operator overloading
for [i, j] defeats vectorization in LDC and GDC.

You can achieve the same speed as Fortran if you use
mir.ndslice.algorithm [1]. It is available at [3] (together with
mir.ndslice). A blog post about mir.ndslice.algorithm can be found at
[2]. An LDC compiler should be used (DMD is supported, but it is too
slow).

We are working on a new version of ndslice, which will include classic
BLAS-like matrices and will simplify the mir.ndslice.algorithm logic [4]
(it cannot be used yet; it will be released within a month). With the
new ndslice, m[i, j] will still be slow, but indexing as m[i][j] will be
as fast as Fortran.

In general, forward access (front/popFront) is friendlier to
vectorization than random access (indexing like [i, j]).

Please use mir.ndslice.algorithm for now.
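(To illustrate the forward-access point above, here is a hedged sketch in
plain D, assuming no particular mir API: if b is transposed up front so that
its columns become contiguous rows, the hot loop walks two contiguous
double[] ranges, which LDC's and GDC's auto-vectorizers handle far better
than repeated strided b[k, j] indexing.)

```d
import std.stdio : writeln;

// Inner product over two contiguous ranges: a simple forward-access
// loop that auto-vectorizing backends (LDC/GDC) can turn into SIMD code.
double dot(const(double)[] x, const(double)[] y)
{
    double s = 0.0;
    foreach (i; 0 .. x.length)
        s += x[i] * y[i];
    return s;
}

void main()
{
    enum n = 4;
    // Row i of a, and row j of the transposed b (i.e. column j of b),
    // are both contiguous, so c[i][j] = dot(aRow, btRow) stays
    // vector-friendly instead of striding through b.
    auto aRow = new double[n];
    auto btRow = new double[n];
    aRow[] = 2.0;
    btRow[] = 3.0;
    writeln(dot(aRow, btRow)); // 4 * 2.0 * 3.0 = 24
}
```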

Best regards,
Ilya

[1] http://docs.mir.dlang.io/latest/mir_ndslice_algorithm.html
[2]
http://blog.mir.dlang.io/ndslice/algorithm/optimization/2016/12/12/writing-efficient-numerical-code.html
[3] https://github.com/libmir/mir
[4] https://github.com/libmir/mir-algorithm
