Independent searches on GPU — how to synchronize their finish?

Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p=0.000001.

I’m trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferably of multiple GPUs) until at least 100 trues are found. The estimate for p is then 100/n, where n is the total number of times generateRandomNumbersAndTestThem() was executed.
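On the CPU, the estimator itself is just a counting loop. Here is a minimal C sketch; `generate_and_test()` is a hypothetical stand-in for generateRandomNumbersAndTestThem() (an LCG tuned so p ≈ 0.001, chosen small only so the demo finishes quickly):

```c
#include <assert.h>

/* Hypothetical stand-in for generateRandomNumbersAndTestThem():
 * a 64-bit LCG whose top bits come out "true" roughly once per
 * 1000 calls, i.e. p ~= 0.001. */
static unsigned long rng_state = 12345;
static int generate_and_test(void) {
    rng_state = rng_state * 6364136223846793005UL + 1442695040888963407UL;
    return (rng_state >> 33) % 1000 == 0;
}

/* Run trials until `target` successes are seen; return the total
 * number of attempts n, so the estimate of p is target / (double)n. */
unsigned long attempts_until(unsigned long target) {
    unsigned long successes = 0, n = 0;
    while (successes < target) {
        n++;
        if (generate_and_test())
            successes++;
    }
    return n;
}
```

With target = 100 and the stand-in above, 100.0 / attempts_until(100) should land near 0.001; the whole GPU question below is about distributing exactly this loop across work-items.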

For p = 0.0000001, this means roughly 10^9 independent attempts, which should make it obvious why I’m looking to do this on GPUs. But I’m struggling a bit with how to implement the stop condition properly. My idea was to have something along these lines as the kernel:

__kernel void sampleKernel(all_the_input, __global ulong *totAttempts) {
    int gid = get_global_id(0);
    // here code that localizes all_the_input for faster access
    while (lessThan100truesFound) {  // placeholder: the stop condition in question
        totAttempts[gid]++;
        if (generateRandomNumbersAndTestThem())
            reportTrue();            // placeholder: should count/announce the success
    }
}

How should I implement this without severe performance loss, given that

  • the “if” will trigger very rarely, so it is not a problem if all threads have to wait while reportTrue() executes
  • lessThan100truesFound has to be modified only once (from true to false) when reportTrue() is called for the 100th time (so I don’t even know if a boolean is the right way)
  • the plan is to buy brand-new GPU hardware for this, so you can assume a recent GPU, e.g. multiple ATI Radeon HD7970s. But it would be nice if I could test it on my current HD5450.

I assume something similar to Java’s “synchronized” modifier exists, but I can’t find the exact way to do it. What is the “right” way to do this, i.e. any way that works without severe performance loss?
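For what it’s worth, the usual pattern here is not a lock but an atomic success counter plus a stop flag that every work-item polls each iteration: the rare success path pays for an atomic read-modify-write, the hot path only does a cheap load. (In OpenCL C the counter would be an `atomic_inc` on a `__global` variable, since work-groups cannot otherwise synchronize with each other; across multiple GPUs the host would have to poll.) Below is a CPU model of that pattern using C11 atomics and POSIX threads; the RNG, thread count, and function names are illustrative assumptions, not JOCL code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

#define TARGET   100   /* stop after this many "trues" */
#define NTHREADS 4

static atomic_ulong total_attempts = 0;  /* n, folded in per thread */
static atomic_uint  trues_found    = 0;
static atomic_int   stop           = 0;

/* Hypothetical stand-in for the real test; "true" ~once per 1000 calls.
 * Each thread owns its RNG state, so the hot loop touches no shared data. */
static int test_once(unsigned long *state) {
    *state = *state * 6364136223846793005UL + 1442695040888963407UL;
    return (*state >> 33) % 1000 == 0;
}

static void *worker(void *arg) {
    unsigned long rng = (unsigned long)(size_t)arg * 0x9e3779b9UL + 1;
    unsigned long local_n = 0;
    /* Poll the stop flag with a relaxed load: cheap, and a few extra
     * attempts after the 100th "true" barely perturb the estimate. */
    while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
        local_n++;
        if (test_once(&rng)) {
            /* Only the rare success path pays for an atomic RMW. */
            if (atomic_fetch_add(&trues_found, 1) + 1 >= TARGET)
                atomic_store(&stop, 1);
        }
    }
    /* Fold the private attempt counter in once, at the end. */
    atomic_fetch_add(&total_attempts, local_n);
    return NULL;
}

/* Returns total attempts n; the estimate is trues_found / (double)n,
 * using the actual success count since it may slightly overshoot TARGET. */
unsigned long run_estimate(void) {
    pthread_t t[NTHREADS];
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&total_attempts);
}
```

Note the design choice: each thread keeps its attempt count private (like your per-work-item totAttempts[gid]) and the only shared writes are the rare success counter and the one-time flip of the flag, so the loop stays essentially contention-free.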