Independent searches on GPU — how to synchronize their finish?

Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p=0.000001.

I’m trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferably of multiple GPUs) until at least 100 trues are found. The estimate for p is then 100/n, where n is the total number of times generateRandomNumbersAndTestThem() was executed.
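On the CPU, the estimator itself is just a counting loop. Here is a minimal C sketch; `generate_and_test()` is a hypothetical stand-in for generateRandomNumbersAndTestThem() (an LCG tuned so p ≈ 0.001, chosen small only so the demo finishes quickly):

```c
#include <assert.h>

/* Hypothetical stand-in for generateRandomNumbersAndTestThem():
 * a 64-bit LCG whose top bits come out "true" roughly once per
 * 1000 calls, i.e. p ~= 0.001. */
static unsigned long rng_state = 12345;
static int generate_and_test(void) {
    rng_state = rng_state * 6364136223846793005UL + 1442695040888963407UL;
    return (rng_state >> 33) % 1000 == 0;
}

/* Run trials until `target` successes are seen; return the total
 * number of attempts n, so the estimate of p is target / (double)n. */
unsigned long attempts_until(unsigned long target) {
    unsigned long successes = 0, n = 0;
    while (successes < target) {
        n++;
        if (generate_and_test())
            successes++;
    }
    return n;
}
```

With target = 100 and the stand-in above, 100.0 / attempts_until(100) should land near 0.001; the whole GPU question below is about distributing exactly this loop across work-items.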

For p = 0.0000001, this means roughly 10^9 independent attempts, which should make it obvious why I’m looking to do this on GPUs. But I’m struggling a bit with how to implement the stop condition properly. My idea was to have something along these lines as the kernel:

__kernel void sampleKernel(all_the_input, __global ulong *totAttempts) {
    int gid = get_global_id(0);
    // here code that localizes all_the_input for faster access
    while (lessThan100truesFound) {  // placeholder: the stop condition in question
        totAttempts[gid]++;
        if (generateRandomNumbersAndTestThem())
            reportTrue();            // placeholder: should count/announce the success
    }
}

How should I implement this without severe performance loss, given that

  • the “if” will trigger very rarely, so it is not a problem if all threads have to wait while reportTrue() executes
  • lessThan100truesFound has to be modified only once (from true to false) when reportTrue() is called for the 100th time (so I don’t even know if a boolean is the right way)
  • the plan is to buy brand-new GPU hardware for this, so you can assume a recent GPU, e.g. multiple ATI Radeon HD7970s. But it would be nice if I could test it on my current HD5450.

I assume something similar to Java’s “synchronized” modifier exists, but I can’t find the exact way to do it. What is the “right” way to do this, i.e. any way that works without severe performance loss?
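For what it’s worth, the usual pattern here is not a lock but an atomic success counter plus a stop flag that every work-item polls each iteration: the rare success path pays for an atomic read-modify-write, the hot path only does a cheap load. (In OpenCL C the counter would be an `atomic_inc` on a `__global` variable, since work-groups cannot otherwise synchronize with each other; across multiple GPUs the host would have to poll.) Below is a CPU model of that pattern using C11 atomics and POSIX threads; the RNG, thread count, and function names are illustrative assumptions, not JOCL code:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <assert.h>

#define TARGET   100   /* stop after this many "trues" */
#define NTHREADS 4

static atomic_ulong total_attempts = 0;  /* n, folded in per thread */
static atomic_uint  trues_found    = 0;
static atomic_int   stop           = 0;

/* Hypothetical stand-in for the real test; "true" ~once per 1000 calls.
 * Each thread owns its RNG state, so the hot loop touches no shared data. */
static int test_once(unsigned long *state) {
    *state = *state * 6364136223846793005UL + 1442695040888963407UL;
    return (*state >> 33) % 1000 == 0;
}

static void *worker(void *arg) {
    unsigned long rng = (unsigned long)(size_t)arg * 0x9e3779b9UL + 1;
    unsigned long local_n = 0;
    /* Poll the stop flag with a relaxed load: cheap, and a few extra
     * attempts after the 100th "true" barely perturb the estimate. */
    while (!atomic_load_explicit(&stop, memory_order_relaxed)) {
        local_n++;
        if (test_once(&rng)) {
            /* Only the rare success path pays for an atomic RMW. */
            if (atomic_fetch_add(&trues_found, 1) + 1 >= TARGET)
                atomic_store(&stop, 1);
        }
    }
    /* Fold the private attempt counter in once, at the end. */
    atomic_fetch_add(&total_attempts, local_n);
    return NULL;
}

/* Returns total attempts n; the estimate is trues_found / (double)n,
 * using the actual success count since it may slightly overshoot TARGET. */
unsigned long run_estimate(void) {
    pthread_t t[NTHREADS];
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (size_t i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return atomic_load(&total_attempts);
}
```

Note the design choice: each thread keeps its attempt count private (like your per-work-item totAttempts[gid]) and the only shared writes are the rare success counter and the one-time flip of the flag, so the loop stays essentially contention-free.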