This sumdiff_n is much slower on Haswell (2.6 c/l) than on Skylake (2 c/l), but it still provides a ~3% speedup for a 1M-limb FFT compared to having no sumdiff_n at all.
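
For context, here is a minimal portable C sketch of what a fused sum/difference routine computes. It is not the assembly being measured above. The name `ref_sumdiff_n`, the `limb_t` typedef, and the return convention (2*carry + borrow) are assumptions made for illustration. The point is that each input limb is loaded once and feeds both the sum and the difference, which is presumably where the gain over separate add and subtract passes comes from.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* hypothetical 64-bit limb type */

/* Hypothetical reference version: s = x + y and d = x - y over n limbs,
   computed in a single pass. Returns 2*carry + borrow (assumed convention).
   Both inputs are read before either output is written, so s == x and
   d == y are allowed. */
static unsigned ref_sumdiff_n(limb_t *s, limb_t *d,
                              const limb_t *x, const limb_t *y, size_t n)
{
    unsigned carry = 0, borrow = 0;

    for (size_t i = 0; i < n; i++) {
        limb_t a = x[i], b = y[i];

        /* Sum with incoming carry; at most one of c1, c2 can be set. */
        limb_t t = a + b;
        unsigned c1 = t < a;
        limb_t sum = t + carry;
        unsigned c2 = sum < t;

        /* Difference with incoming borrow; at most one of b1, b2 can be set. */
        limb_t u = a - b;
        unsigned b1 = a < b;
        limb_t diff = u - borrow;
        unsigned b2 = u < borrow;

        s[i] = sum;
        d[i] = diff;
        carry = c1 | c2;
        borrow = b1 | b2;
    }
    return 2 * carry + borrow;
}
```

A compiler will not usually turn this loop into something approaching 2 c/l, since the two independent carry chains compete for the single carry flag; that is why such routines tend to be hand-written in assembly.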