Here is a very peculiar piece of C++ code. For some strange reason sorting the data miraculously causes the code to run 3 times faster.
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the following loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
- Without std::sort(data, data + arraySize); the code runs in 12.41 seconds.
- With the data sorted, it runs in 3.82 seconds.
At first I thought it might be a language or compiler anomaly, so I tried it in Java:
import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the following loop runs faster.
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}
This gives a similar but less extreme result: 10.7 seconds unsorted and 6.2 seconds sorted.
My first thought was that sorting brings the data into the cache, but then I realized that the array had just been generated, so it would already be in the cache before sorting.
- What's going on?
- Why is a sorted array faster to process than an unsorted one?
- The code sums independent terms, so the order should not matter: either way, all the elements are traversed and the same ones are always added, giving the same result whether the data is sorted or not.
This is a question translated from the English original, Why is processing a sorted array faster than an unsorted array? by GManNickG, and adapted to the results on my computer.
You are a victim of branch prediction failure.
What is branch prediction?
Consider a railroad junction:
Image by Mecanismo, courtesy of Wikimedia Commons. Used under the CC-BY-SA 3.0 license.
Suppose we are back in the 19th century - long before long-distance or radio communication.
You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go, so you stop the train and ask the driver which direction he is headed, and then you set the switch accordingly.
Trains are heavy and have a lot of inertia, so it takes them a long time to stop.
Is there a better way? You could guess which direction the train will go!
If you guess right every time, the train never has to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.
Consider an if statement: at the processor level, it is a conditional branch instruction.
You are the processor and you see a conditional branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions have finished. Then you continue down the correct path.
Modern processors are complicated and deeply pipelined, so it takes them a long time to _start executing_ and to _stop executing_.
Is there a better way? You guess which way the branch will go!
If you guess right every time, execution never has to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.
This is branch prediction. I admit it is not the best possible analogy, since the train driver could just signal the direction with a flag. But in computers, the processor does not know whether a branch will be taken until the last moment.
So what prediction strategy should you use to minimize the number of times the train has to back up and go down the other path? You look at the past history! If the train goes left 99% of the time, then you guess left. If it alternates, then you alternate your guesses. If it goes one way every three times, you guess the same...
In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.
Most applications have well-behaved branches, so modern branch predictors typically achieve hit rates above 90%. But when faced with unpredictable branches with no recognizable pattern, branch predictors are virtually useless.
Further reading: the "Branch predictor" article on Wikipedia.
The above gives us a clue as to where the problem lies: the if statement.
Note that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if statement; after that, they will always enter the if statement.
This is very good for the branch predictor, since the branch goes the same way many times in a row.
Even a simple saturating counter will predict the branch correctly, except for a few iterations right after it switches direction.
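For illustration, here is a minimal C++ sketch of a 2-bit saturating counter, a simplified model of the kind of predictor mentioned above (not the actual hardware implementation):

#include <cstdint>

// Hypothetical, minimal 2-bit saturating counter: states 0-1 predict
// "not taken", states 2-3 predict "taken". Each observed outcome nudges
// the state one step toward that outcome, saturating at 0 and 3.
struct SaturatingCounter
{
    std::uint8_t state = 2;

    bool predict() const { return state >= 2; }   // true = predict "taken"

    void update(bool taken)
    {
        if (taken && state < 3) ++state;
        else if (!taken && state > 0) --state;
    }
};

Fed the sorted data, such a counter mispredicts only around the point where the values cross 128 and is correct everywhere else.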
Quick view:
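A sketch of the branch outcomes over the sorted array (T = taken, i.e. data[c] >= 128; N = not taken):

data[] = 0, 1, 2, 3, ... 126, 127, 128, 129, ... 254, 255
branch = N  N  N  N  ...   N    N    T    T  ...   T    T

       = NNNNNNNNNN ... NNNTTTTTTT ... TTTTTTTT   (easy to predict)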
However, when the data is completely random, the branch predictor is useless because it cannot predict random data.
There will probably be around 50% misprediction, which is no better than random guessing.
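With unsorted data, by contrast, the sequence of outcomes looks like noise (the values below are made up for illustration):

data[] = 226, 185, 38, 97, 145, 63, 205, 41, ...
branch =   T    T   N   N    T   N    T   N  ...   (impossible to predict)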
What can be done?
If the compiler cannot optimize the branch into a conditional move, there are some tricks you can try, if you are willing to sacrifice readability for performance.
Replace:
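That is, the data-dependent branch in the inner loop of the code above:

if (data[c] >= 128)
    sum += data[c];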
With:
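One branch-free formulation consistent with the description below is the following sign-bit trick (a sketch; the exact snippet from the original answer may differ slightly):

// (data[c] - 128) is negative exactly when data[c] < 128. An arithmetic
// right shift by 31 turns that into a mask: 0 when data[c] >= 128,
// all ones (-1) otherwise.
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];   // adds data[c] only when data[c] >= 128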
This removes the conditional jump and replaces it with some bitwise operations.
(Note that this trick is not strictly equivalent to the original if statement, but in this case it is valid for all the values of data[].)

Benchmarks: Core i7 920 @ 3.5 GHz
C++ - Visual Studio 2010 - x64 Release
Java - Netbeans 7.1.1 JDK 7 - x64
Observations:
With the branch: There is a big difference between sorted and unsorted data.
With the trick: There is no difference between sorted and unsorted data.
As a general rule of thumb, data-dependent conditional jumps should be avoided in critical loops (like the one in this example).
Update:

GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between sorted and unsorted data - both are fast.

VC++ 2010 is unable to generate conditional moves for this branch, even under /Ox.

Intel Compiler 11 does something miraculous: it interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop. Not only is it immune to mispredictions, it is also twice as fast as whatever VC++ and GCC can generate. In other words, ICC took advantage of the test loop to defeat the benchmark...

If you give the Intel compiler the branchless code, it just outright vectorizes it... and it is exactly as fast as the branching version (with the loop interchange).
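As an illustration of what a conditional-move-friendly form of this loop body could look like (a sketch, not part of the original answer; whether a cmov is actually emitted depends on the compiler and flags):

// Written as a conditional expression, the update gives the optimizer a
// natural opportunity to use a conditional move instead of a branch.
sum += (data[c] >= 128) ? data[c] : 0;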
This shows that even in modern and mature compilers there can be wild differences in their ability to optimize code...
This answer is a translation of the original English answer by Mysticial.