Waiting for garbage collection can kill parallelism?
I am reposting my mail from Haskell-cafe, since I got no replies in over a week and I think it is an interesting case. I was reading “Parallel Performance Tuning for Haskell” by Jones, Marlow and Singh and wanted to replicate the results for their first case study. The code goes like this:
module Main where

import Control.Parallel

main :: IO ()
main = print . parSumFibEuler 38 $ 5300

fib :: Int -> Int
fib 0 = 0
fib 1 = 1
fib n = fib (n - 1) + fib (n - 2)

mkList :: Int -> [Int]
mkList n = [1..n-1]

relprime :: Int -> Int -> Bool
relprime x y = gcd x y == 1

euler :: Int -> Int
euler n = length (filter (relprime n) (mkList n))

sumEuler :: Int -> Int
sumEuler = sum . (map euler) . mkList

sumFibEuler :: Int -> Int -> Int
sumFibEuler a b = fib a + sumEuler b

parSumFibEuler :: Int -> Int -> Int
parSumFibEuler a b = f `par` (e `pseq` (e + f))
  where f = fib a
        e = sumEuler b
In the paper the authors show that this code computes fib and sumEuler in parallel and that a good speed-up is achieved:
To make the parallelism more robust, we need to be explicit about the evaluation order we intend. The way to do this is to use pseq in combination with par, the idea being to ensure that the main thread works on sumEuler while the sparked thread works on fib. (…) This version does not make any assumptions about the evaluation order of +, but relies only on the evaluation order of pseq, which is guaranteed to be stable.
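To illustrate the quoted remark, here is a sketch (my reconstruction, not verbatim from the paper) contrasting a version that leaves the evaluation order to + with the pseq version used in parSumFibEuler above. It assumes the fib and sumEuler definitions and the Control.Parallel import from the listing above; the function names are only illustrative:

-- Fragile: only sparks f and leaves the rest to (+). This parallelises only
-- if (+) happens to evaluate e before f; if f is demanded first, the main
-- thread computes it itself and the spark fizzles.
parSumFibEulerFragile :: Int -> Int -> Int
parSumFibEulerFragile a b = f `par` (e + f)
  where f = fib a
        e = sumEuler b

-- Robust: pseq guarantees the main thread evaluates e (sumEuler b) to WHNF
-- before performing the addition, so the sparked fib a can run on the other
-- HEC in the meantime.
parSumFibEulerRobust :: Int -> Int -> Int
parSumFibEulerRobust a b = f `par` (e `pseq` (e + f))
  where f = fib a
        e = sumEuler b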
These results were obtained on an older GHC version ((The paper does not mention which version exactly. I believe it was 6.10, since “Runtime Support for Multicore Haskell” by the same authors, released at the same time, uses GHC 6.10.)). However, compiling the program with:
$ ghc -O2 -Wall -threaded -rtsopts -fforce-recomp -eventlog parallel.hs
and then running on GHC 7.4.1 using:
$ ./parallel +RTS -qa -g1 -s -ls -N2
yields a completely different result. These are statistics for a parallel run on two cores:
SPARKS: 1 (1 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.52s ( 2.51s elapsed)
GC time 0.03s ( 0.05s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.55s ( 2.56s elapsed)
Running the same code on one core results in:
SPARKS: 1 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 0 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.51s ( 2.53s elapsed)
GC time 0.03s ( 0.05s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.55s ( 2.58s elapsed)
Looking at MUT (mutator time), there seems to be no speed-up at all: the two-core run takes 2.56s elapsed versus 2.58s on one core, a speed-up of roughly 1.01×.
Investigating the eventlog with ThreadScope sheds some light on the execution of the program:
Both threads start the computation, but HEC 1 soon blocks and only resumes when HEC 0 finishes its computation. Zooming in, it looks like HEC 1 stops because it requests a garbage collection, but HEC 0 does not respond to that request, so GC begins only when HEC 0 is done with its computation:
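If the stall really is HEC 1 waiting for a garbage collection that HEC 0 will not join, one crude experiment (my own guess, not something from the paper) would be to enlarge the allocation area so that GC is triggered far less often during the run, for example:

$ ./parallel +RTS -qa -g1 -s -ls -N2 -A64m

If the gap in the ThreadScope profile then shrinks or disappears, that would point at the GC synchronisation rather than the sparking strategy.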
Why does this happen? I am no expert on GHC’s garbage collection; my only knowledge of it comes from section 6 of “Runtime Support for Multicore Haskell”. If I understood it correctly, this should not happen, and it certainly didn’t happen when the paper was published. Do we have a regression here, or am I misunderstanding something?