I built this after watching 7 of 8 CPU cores sit idle during a Monte Carlo sim. `multiprocessing` added 189 ms of serialization overhead to a 9 ms computation.
ironkernel lets you write element-wise expressions with a Python decorator, compiles them to a Rust expression tree at definition time, and executes via rayon on all cores. ~2k lines of Rust, ~500 lines of Python.
The win is expression fusion: NumPy evaluates `where(x > 0, sqrt(abs(x)) + sin(x), 0)` as 5 passes with 4 temporaries. ironkernel fuses it into 1 pass with zero temporaries, and skips dead branches (no NaN from sqrt of negatives). That makes it 2.25x faster than NumPy on compound expressions at 10M elements. For BLAS ops like SAXPY, NumPy is faster, because ironkernel doesn't call BLAS.
Early stage: f64 only, 1-D only, expression subset only (intentional: the restriction is what backs the parallel safety guarantee). Warm Numba is 3.2x faster than ironkernel (LLVM JIT vs. interpreter).
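To make the fusion claim concrete, here is a rough pure-Python sketch of the two evaluation strategies for the expression above. It is not ironkernel or NumPy code; `multipass` and `fused` are hypothetical names, and plain lists stand in for arrays. The first function materializes an intermediate list per operation, the second does everything in one loop and never evaluates the `sqrt`/`sin` branch for elements where `x > 0` is false.

```python
import math

def multipass(xs):
    """One full pass (and one intermediate list) per operation."""
    mask = [x > 0 for x in xs]
    absx = [abs(x) for x in xs]
    root = [math.sqrt(a) for a in absx]
    sine = [math.sin(x) for x in xs]
    total = [r + s for r, s in zip(root, sine)]
    return [t if m else 0.0 for m, t in zip(mask, total)]

def fused(xs):
    """Single pass, no intermediate lists; the untaken branch is skipped."""
    out = []
    for x in xs:
        if x > 0:
            out.append(math.sqrt(abs(x)) + math.sin(x))
        else:
            out.append(0.0)
    return out

data = [-4.0, 0.0, 1.0, 9.0]
assert multipass(data) == fused(data)
```

Both produce the same values; the difference is how many times the data is streamed through memory and how much scratch space is allocated along the way.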
The expression fusion win is huge for cache locality. Since you're using Rayon for the multicore side, I'm curious if the generated Rust expression tree is 'flat' enough for LLVM to trigger auto-vectorization (SIMD) on the individual cores or if the tree traversal adds enough branching to break that?
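For readers wondering what "tree traversal" means here, below is a minimal sketch of a tree-walking evaluator, written in Python rather than Rust and not taken from ironkernel. Every name in it (`Var`, `Const`, `BinOp`, `kernel`, `evaluate`) is hypothetical. The point is only that an interpreter of this shape dispatches on node type for every element, which is the kind of branching the question is about.

```python
import inspect
from dataclasses import dataclass

class Expr:
    # Operator overloads record the operation instead of computing it.
    def __add__(self, other):
        return BinOp("add", self, wrap(other))
    def __mul__(self, other):
        return BinOp("mul", self, wrap(other))

@dataclass
class Var(Expr):
    name: str

@dataclass
class Const(Expr):
    value: float

@dataclass
class BinOp(Expr):
    op: str
    left: Expr
    right: Expr

def wrap(value):
    """Turn plain numbers into Const nodes; leave Expr nodes alone."""
    return value if isinstance(value, Expr) else Const(float(value))

def evaluate(node, env):
    """Walk the tree for one element; note the per-node type dispatch."""
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, Const):
        return node.value
    left = evaluate(node.left, env)
    right = evaluate(node.right, env)
    return left + right if node.op == "add" else left * right

def kernel(fn):
    """Hypothetical decorator: trace fn once, then run the tree per element."""
    names = list(inspect.signature(fn).parameters)
    tree = fn(*(Var(n) for n in names))  # built at definition time
    def run(*columns):
        return [evaluate(tree, dict(zip(names, vals))) for vals in zip(*columns)]
    run.tree = tree
    return run

@kernel
def example(x, y):
    return x * y + 1

result = example([1.0, 2.0], [3.0, 4.0])
```

A JIT such as Numba instead emits straight-line machine code for the whole expression, so the inner loop has no per-node dispatch for the compiler to see through.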
Do you have benchmarks? Naively, I would compare this to Numba, but maybe I am way off the mark here.
For the love of god, don't use these AI-generated infographics/diagrams.
If that's your bar for quality, I'll think less of your code. I can't help it.
Also, your SAXPY example seems to be DAXPY: the s and d are short for single and double precision.
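A small illustration of the naming point, using only Python's standard `array` module (typecode `"f"` is a C float, `"d"` is a C double). This is not BLAS and not ironkernel; `axpy` is a hypothetical helper that computes `a*x + y` and stores the result at the precision of `y`.

```python
from array import array

def axpy(a, x, y):
    """Return a*x + y element-wise, stored with y's typecode."""
    return array(y.typecode, (a * xi + yi for xi, yi in zip(x, y)))

# "s" flavour: single-precision storage
s = axpy(2.0, array("f", [0.1, 0.2]), array("f", [1.0, 1.0]))

# "d" flavour: double-precision storage (what an f64-only library computes)
d = axpy(2.0, array("d", [0.1, 0.2]), array("d", [1.0, 1.0]))

# Same formula, different precision, so the stored values differ.
differs = s[0] != d[0]
```

Since the project is described as f64 only, the operation it benchmarks corresponds to the `"d"` case.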
As a specific example: the generated diagram showing the expression tree under "build in python" is simply wrong. It doesn't correspond to the expression x * 2 + 1, which should have only one child node on the right. The "GIL Released - Released" is just confusing. The dataflow omits the fact that the results end up back in Python; there should be a return arrow. Etc., etc.
If you use diagrams like this, at least ensure they accurately convey the right understanding.
And in general, listen to the person I'm responding to: be really deliberate with your graphics, or omit them. Most AI-generated diagrams are crap.
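The expected tree shape is easy to check with Python's own parser. The sketch below uses the standard `ast` module, so it shows how Python parses `x * 2 + 1`, not ironkernel's internal representation; `describe` is a hypothetical helper that flattens the tree into nested tuples.

```python
import ast

def describe(node):
    """Render an expression AST as (op, left, right) tuples."""
    if isinstance(node, ast.BinOp):
        return (type(node.op).__name__, describe(node.left), describe(node.right))
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return node.value
    raise TypeError(f"unsupported node: {type(node).__name__}")

root = ast.parse("x * 2 + 1", mode="eval").body
shape = describe(root)
# shape == ("Add", ("Mult", "x", 2), 1)
```

The root is the addition; its left child is the multiplication subtree and its right child is the single leaf `1`.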