cuda-gdb: Meaning/occurrence of “CUDA_EXCEPTION_9: Warp Hardware Stack Overflow”
A stack overflow on a Fermi GPU is no different from a stack overflow on any other device. Each thread gets a fixed-size stack, and device-side malloc draws from a fixed-size heap; both are reserved from global memory at launch. If you exhaust the stack via excessive recursion, allocate more than the available heap memory, or operate out of bounds on any variable stored in heap memory, a protection fault is generated, and you will get a stack overflow error reported. From your question, I would guess that you are exhausting the available heap space via device-side malloc calls.
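One way to test that guess is to check the pointer returned by every device-side malloc, since it returns NULL when the heap cannot satisfy the request. The following is only a minimal sketch: the kernel name allocPerThread, the helper countFailedAllocations, and the launch and allocation sizes are all made up for illustration and are not taken from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread requests `bytes` from the device heap
// and records a failure instead of dereferencing a NULL pointer.
__global__ void allocPerThread(size_t bytes, int *failures)
{
    char *buf = static_cast<char *>(malloc(bytes));
    if (buf == nullptr) {
        atomicAdd(failures, 1);   // heap could not satisfy this request
        return;
    }
    buf[0] = 1;                   // touch the allocation, then release it
    free(buf);
}

// Hypothetical host helper: launches the kernel and returns how many
// threads saw malloc fail, or -1 if a CUDA call reported an error.
static int countFailedAllocations(int blocks, int threads, size_t bytes)
{
    int *dFailures = nullptr;
    if (cudaMalloc(&dFailures, sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(dFailures, 0, sizeof(int));

    allocPerThread<<<blocks, threads>>>(bytes, dFailures);
    cudaError_t err = cudaDeviceSynchronize();

    int hFailures = -1;
    if (err == cudaSuccess) {
        cudaMemcpy(&hFailures, dFailures, sizeof(int), cudaMemcpyDeviceToHost);
    } else {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    cudaFree(dFailures);
    return hFailures;
}

int main()
{
    // Illustrative sizes only: 64 blocks x 256 threads x 64 KiB per thread.
    int failed = countFailedAllocations(64, 256, 64 * 1024);
    printf("threads with failed malloc: %d\n", failed);
    return 0;
}
```

If the failure count is nonzero, the kernel is asking for more heap than is available, and any code that skips the NULL check would go on to write through an invalid pointer.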
The CUDA runtime API includes functions for managing stack and heap memory: cudaDeviceSetLimit and cudaDeviceGetLimit. With these you can check how much stack, heap, and printf FIFO space the runtime has reserved, and try increasing the heap and stack sizes to see if your problem goes away.
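As a rough sketch of that check, the program below reads the three limits, raises the stack and heap sizes, and reads them back. The helper names getLimit and printLimits and the new sizes (4096 bytes of stack, 64 MiB of heap) are arbitrary choices for illustration; pick values that suit your kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the current value of a limit, or 0 on failure.
static size_t getLimit(cudaLimit which)
{
    size_t value = 0;
    cudaError_t err = cudaDeviceGetLimit(&value, which);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 0;
    }
    return value;
}

// Hypothetical helper: prints the stack, heap, and printf FIFO limits.
static void printLimits(const char *label)
{
    printf("%s: stack = %zu bytes per thread, malloc heap = %zu bytes, "
           "printf FIFO = %zu bytes\n",
           label,
           getLimit(cudaLimitStackSize),
           getLimit(cudaLimitMallocHeapSize),
           getLimit(cudaLimitPrintfFifoSize));
}

int main()
{
    printLimits("before");

    // Set the limits before launching any kernel that relies on them.
    // The sizes here are illustrative assumptions, not recommendations.
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 4096);
    if (err != cudaSuccess)
        fprintf(stderr, "stack limit: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t(64) * 1024 * 1024);
    if (err != cudaSuccess)
        fprintf(stderr, "heap limit: %s\n", cudaGetErrorString(err));

    printLimits("after");
    return 0;
}
```

If the exception disappears once the limits are raised, the original sizes were too small for the kernel. If it persists, an out-of-bounds access is the more likely cause.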