Problems about gpu-programming - Collection of common programming errors
vinodhrajagopal
cuda nvidia gpu-programming
I have a very simple CUDA program. The program when compiled with -arch=sm_11 option, works correctly as expected. However, when compiled with -arch=sm_12, the results are unexpected. Here is the kernel code : __global__ void dev_test(int *test) { *test = 100; }I invoke the kernel code as below :int *dev_int, val; val = 0; cudaMalloc((void **)&dev_int, sizeof(int)); cudaMemset((void *)dev_int, 0, sizeof(int)); cudaMemcpy(dev_int, &val, sizeof(int), cudaMemcpyHostToDevice); dev_test <&
einpoklum
cuda timeout gpgpu gpu-programming
I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they fail and exit. I realize it's not ideal for a CUDA application to run that long, but assuming that CUDA is the correct choice and that, due to the amount of sequential work per thread, it must run that long, is there any way to extend this time limit or to get around it?
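This 5-15 second ceiling is the display driver's watchdog timer, which kills long-running kernels on GPUs that drive a display. A minimal sketch (standard CUDA runtime API) that checks whether the watchdog applies to each device; kernelExecTimeoutEnabled is nonzero when the OS will abort long kernels:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // 1 means a watchdog will abort kernels that run too long
            std::printf("device %d (%s): watchdog %s\n", i, prop.name,
                        prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
        }
        return 0;
    }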
Dyps
c++ opencv cuda gpu gpu-programming
I want to speed up my OpenCV-based software for real-time operation using OpenCV's GPU support library. My computer does not have a built-in GPU supported by OpenCV, so here go my questions: Does anybody know if OpenCV will work with a GPU on a PCI card? EDIT: Will OpenCV work with an NVIDIA GPU I have added through a PCI slot? If yes, can you please recommend a good CUDA-supported NVIDIA GPU I can use? Lastly, would you advise I learn CUDA to make my software truly parallel…
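As a hedged illustration (assuming OpenCV 2.x built with the CUDA-enabled gpu module), this is the runtime check for whether OpenCV can see a usable NVIDIA GPU, regardless of whether the card is onboard or added through a PCI/PCIe slot:

    #include <cstdio>
    #include <opencv2/gpu/gpu.hpp>

    int main() {
        // Returns 0 when OpenCV was built without CUDA support
        // or when no NVIDIA device is present
        int devices = cv::gpu::getCudaEnabledDeviceCount();
        std::printf("CUDA-enabled devices visible to OpenCV: %d\n", devices);
        return 0;
    }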
Sean Lynch
c++ thrust gpu-programming
With just a cursory understanding of these libraries, they look to be very similar. I know that VexCL and Boost.Compute use OpenCL as a backend (although the v1.0 release of VexCL also supports CUDA as a backend) and Thrust uses CUDA. Aside from the different backends, what is the difference between them? Specifically, what problem space do they address, and why would I want to use one over the other? Also, the Thrust FAQ states that "The primary barrier to OpenCL support is the lack of an OpenCL compiler and runtime with support for C++ templates."
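For a sense of the shared problem space, here is a minimal sketch of the container-plus-algorithm style all three libraries offer, written with Thrust (the VexCL and Boost.Compute equivalents are close to line-for-line the same):

    #include <thrust/device_vector.h>
    #include <thrust/transform.h>
    #include <thrust/functional.h>

    int main() {
        thrust::device_vector<float> a(1 << 20, 1.0f);
        thrust::device_vector<float> b(1 << 20, 2.0f);
        thrust::device_vector<float> c(1 << 20);
        // Element-wise c = a + b, executed on the device
        thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                          thrust::plus<float>());
        return 0;
    }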
TallGuy
c# gpu gpu-programming
I have no knowledge of GPU programming concepts and APIs. I have a few questions: Is it possible to write a piece of managed C# code and compile/translate it to some kind of module which can be executed on the GPU? Or am I doomed to have two implementations, one managed for the CPU and one for the GPU (I understand that there will be restrictions on what can be executed on the GPU)? Does there exist a decent and mature API to program independently against various GPU hardware vendors (i.e. a…
Arkapravo
cuda gpu gpgpu gpu-programming
I am a newbie to GPGPU and GPU programming. I have a laptop with an NVIDIA GeForce GT 640 card. I am faced with two dilemmas; suggestions are most welcome. If I go for CUDA: Ubuntu or Windows? Clearly CUDA is more suitable for Windows, while it can be a severe issue to install on Ubuntu. I have seen some blog posts which claim to have installed CUDA 5 on Ubuntu 11.10 and Ubuntu 12.04. However, I have not been able to get them to work. Also, standard CUDA textbooks prefer to work in the Windows domain and…
Pablo
cuda cluster-computing gpgpu gpu-programming hpc
I'm working on a cluster with a lot of nodes, and each node has two GPUs. On the cluster, I can't launch "nvidia-smi" to check which device is busy. My code selects the best device (with cudaChooseDevice) in terms of capability, but when the cluster assigns me the same node for two different jobs, I end up with two tasks running on the same GPU. My question is: Is there a way to check at runtime whether the device is busy or not? Thanks
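One hedged possibility (assuming the node exposes NVML, the library behind nvidia-smi) is to query utilization directly and pick an idle device. This is a sketch, not a guaranteed-correct scheduler; two jobs can still race between the query and the cudaSetDevice call:

    #include <cstdio>
    #include <nvml.h>

    int main() {
        if (nvmlInit() != NVML_SUCCESS) return 1;
        unsigned int count = 0;
        nvmlDeviceGetCount(&count);
        for (unsigned int i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlUtilization_t util;
            // util.gpu is the percentage of recent time the GPU was busy
            if (nvmlDeviceGetUtilizationRates(dev, &util) == NVML_SUCCESS)
                std::printf("GPU %u: %u%% busy\n", i, util.gpu);
        }
        nvmlShutdown();
        return 0;
    }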
Breakthrough
c++ unix ubuntu cuda gpu-programming
I'm trying to get my CUDA SDK samples running, but I get the following error:

    ./bandwidthTest: error while loading shared libraries: libcudart.so.4:
    cannot open shared object file: No such file or directory

Why can I compile the example successfully, but not run it? Is there a way to specify the path to the CUDA runtime library manually?
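The usual cause is that the dynamic linker cannot find the CUDA runtime at load time, even though the compiler found it at build time. A typical fix on Linux (assuming the default install location /usr/local/cuda) is:

    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH   # use lib instead of lib64 on 32-bit
    ./bandwidthTest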
Nike
opencl gpu gpu-programming
I was trying to understand how exactly CL_MEM_USE_HOST_PTR and CL_MEM_COPY_HOST_PTR work. Basically, when using CL_MEM_USE_HOST_PTR, say in creating a 2D image, nothing is copied to the device; instead the GPU refers to the mapped memory (clEnqueueMapBuffer maps it) on the host, does the processing, and we can write the results to some other location. On the other hand, if I use CL_MEM_COPY_HOST_PTR, it will create a copy of the data pointed to by the host pointer on the device (I guess it will create…
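A small sketch of the two flags side by side (ctx and host_data are placeholder names for an existing context and host array); the semantic difference is who owns the backing storage:

    cl_int err;

    // Device may use host_data itself as backing store (zero-copy is allowed
    // but not guaranteed); host_data must stay valid for the buffer's lifetime.
    cl_mem use_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                                    bytes, host_data, &err);

    // Runtime makes its own device-side copy at creation time;
    // host_data can be freed or reused immediately afterwards.
    cl_mem copy_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     bytes, host_data, &err);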
einpoklum
cuda shared-memory gpu-programming
Suppose that we have an array int *data; each thread will access one element of this array. Since this array will be shared among all threads, it will be saved in global memory. Let's create a test kernel:

    __global__ void test(int *data, int a, int b, int c) { ... }

I know for sure that the data array will be in global memory, because I allocated memory for this array using cudaMalloc. Now, as for the other variables, I've seen some examples that pass an integer without allocating memory, immediately…
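For reference, a hedged sketch of what such a kernel does with its by-value arguments (the body here is a placeholder): scalar parameters like a, b and c travel with the launch in the kernel parameter space (constant memory on most architectures) and need no allocation, while data is just a pointer into memory that was allocated with cudaMalloc:

    __global__ void test(int *data, int a, int b, int c)
    {
        // data points at global memory allocated with cudaMalloc;
        // a, b and c arrived by value with the launch.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = a * data[i] + b + c;
    }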
ATG
c++ cuda nvidia gpu-programming
I have tried to set up my Visual Studio environment for programming with CUDA, but I am still getting errors such as "cudaMemcpy is unable to resolve". Could anyone please help me set up the environment? I am coding in C++. Thanks in advance.
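One common cause (an assumption, since the exact error text is not shown) is a missing include or a missing link against the CUDA runtime. A minimal host file that should compile once the project's include path and cudart.lib linkage are configured:

    #include <cuda_runtime.h>   // declares cudaMalloc, cudaMemcpy, cudaFree, ...

    int main() {
        int *d = 0;
        cudaMalloc((void **)&d, sizeof(int));
        int h = 42;
        cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);
        cudaFree(d);
        return 0;
    }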
Framester
gpu opencl gpu-programming
Our workgroup is slowly trying a little bit of OpenCL in a side project. So far "everybody" is working on an NVIDIA Quadro FX 580. Now we are planning to buy new computers for new colleagues, and instead of the FX 580 we could buy an ATI FirePro V4800, which costs only 15 EUR more and gives us 1 GB instead of 512 MB of RAM, which will be beneficial for our data-intensive tasks. So, how much trouble is it to develop OpenCL code at the same time on NVIDIA and ATI? I read the following SO question, Running…
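Code-wise, portability mostly comes down to never hard-coding a platform. A hedged sketch of vendor-neutral platform enumeration using the standard OpenCL 1.x host API, which works unchanged on NVIDIA and AMD drivers:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_uint nplat = 0;
        clGetPlatformIDs(0, NULL, &nplat);        // how many platforms (NVIDIA, AMD, ...)
        cl_platform_id plats[8];
        clGetPlatformIDs(nplat > 8 ? 8 : nplat, plats, NULL);
        for (cl_uint i = 0; i < nplat && i < 8; ++i) {
            char name[256];
            clGetPlatformInfo(plats[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
            printf("platform %u: %s\n", i, name);
        }
        return 0;
    }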
Bo Persson
c++ gpgpu gpu-programming cuda
I am trying to implement the quickhull algorithm (for convex hull) in parallel in CUDA. It works correctly for input_size <= 1 million. When I try 10 million points, the program crashes. My graphics card has 1982 MB of memory, and all my data structures in the algorithm collectively require not more than 600 MB for this input size, which is less than 50% of the available space. By commenting out lines of my kernels, I found out that the crash occurs when I try to access an array element and the index of the…
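Crashes like this are usually out-of-bounds accesses rather than exhausted memory. A hedged sketch of the launch-time error checking that narrows them down (myKernel, d_points, n, grid and block are placeholder names; the API calls are standard CUDA runtime):

    // Check both the launch itself and the asynchronous execution;
    // an out-of-bounds access typically surfaces here as
    // "unspecified launch failure".
    myKernel<<<grid, block>>>(d_points, n);
    cudaError_t err = cudaGetLastError();          // launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();             // errors during execution
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));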
user1393349
cuda gpgpu gpu-programming
I am trying to implement Sauvola binarization in CUDA. For this I have read the image into a 2D array on the host, and I am allocating memory for the 2D array on the device using a pitch. After allocating the memory I am trying to copy the host 2D array to the device 2D array using cudaMemcpy2D; it compiles fine but crashes at runtime. I am unable to understand where I am going wrong; kindly suggest something. The code which I have written is as follows:

    #include "BinMain.h"
    #include "Binarization.h"
    #include <stdlib.h>
    …
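A frequent pitfall with cudaMemcpy2D is mixing up the pitch (bytes per allocated device row) with the logical image width. A minimal sketch of the pitched allocate-and-copy pattern (h_img, width and height are placeholder names; an 8-bit grayscale image is assumed):

    unsigned char *d_img = NULL;
    size_t pitch = 0;
    // pitch receives the padded row size chosen by the driver (>= width bytes)
    cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(unsigned char), height);
    // dst pitch is the device pitch, src pitch is the host row size;
    // the width argument is in BYTES, not elements
    cudaMemcpy2D(d_img, pitch,
                 h_img, width * sizeof(unsigned char),
                 width * sizeof(unsigned char), height,
                 cudaMemcpyHostToDevice);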
user997836
opencl gpu gpgpu gpu-programming
What other OpenCL functions should be called when enqueueNDRangeKernel is called repeatedly? I have not been able to find a tutorial that shows the use of enqueueNDRangeKernel in this fashion, and my coding attempts have unfortunately resulted in an unhandled exception error. A similar question has been asked before, but the responses don't seem to apply to my situation. I currently have a loop in which I call the OpenCL functions in the following sequence: setArg, enqueueNDRangeKernel, enqueueMapBuffer…
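A hedged sketch of one loop structure that works with the C++ bindings (cl.hpp); kernel, queue, buf, n and iterations are placeholder names. The key points are re-setting arguments only when they change and unmapping every mapped buffer before the next iteration reuses it:

    for (int iter = 0; iter < iterations; ++iter) {
        kernel.setArg(0, buf);            // only needed if the argument changed
        queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                                   cl::NDRange(n), cl::NullRange);
        void *p = queue.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ,
                                         0, n * sizeof(float));
        // ... inspect results through p ...
        queue.enqueueUnmapMemObject(buf, p);  // unmap before reusing the buffer
    }
    queue.finish();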
Asterisk
visual-studio-2010 cuda windows-7-x64 gpu-programming
I am trying to get started with CUDA programming on Windows, using Visual Studio 2010 Express on 64-bit Windows 7. It took me a while to set up the environment, and I just wrote my first program, helloWorld.cu :) Currently I am working with the following program:

    #include <stdio.h>

    __global__ void add(int a, int b, int *c)
    {
        *c = a + b;
    }

    int main(void)
    {
        int c;
        int *dev_c;
        HANDLE_ERROR( cudaMalloc( (void**)&dev_c, sizeof(int) ) );
        add<<<1,1>>>(2, 7, dev_c);
        HANDLE_ERROR( cuda…
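Note that HANDLE_ERROR is not part of CUDA itself; it is a helper from the "CUDA by Example" book's headers. A sketch of an equivalent definition, in case the book's header is not on the include path:

    #include <stdio.h>
    #include <stdlib.h>

    // Minimal stand-in for the book's HANDLE_ERROR macro
    static void HandleError(cudaError_t err, const char *file, int line) {
        if (err != cudaSuccess) {
            printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
            exit(EXIT_FAILURE);
        }
    }
    #define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))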
user1111929
multithreading synchronization opencl gpgpu gpu-programming
Assume I have some algorithm generateRandomNumbersAndTestThem() which returns true with probability p and false with probability 1-p. Typically p is very small, e.g. p = 0.000001. I'm trying to build a program in JOCL that estimates p as follows: generateRandomNumbersAndTestThem() is executed in parallel on all available shader cores (preferably of multiple GPUs) until at least 100 trues are found. Then the estimate for p is 100/n, where n is the total number of times that generateRandomNumbersAndTestThem() was executed…
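A hedged sketch of the device side in OpenCL C (callable from JOCL like any other kernel). The RNG and the acceptance test below are placeholders for the real generateRandomNumbersAndTestThem() logic, and the unsynchronized read of *hits is only a stopping hint, so the final estimate must be computed from the atomic counters:

    __kernel void estimate_p(volatile __global uint *hits,
                             __global uint *trials,
                             const uint max_hits,
                             const uint seed)
    {
        uint rng = seed ^ (uint)get_global_id(0);  // placeholder per-thread state
        uint local_trials = 0;
        while (*hits < max_hits) {                 // racy read: stopping hint only
            rng = rng * 1664525u + 1013904223u;    // LCG step (stand-in RNG)
            ++local_trials;
            if (rng < 4295u)                       // "true" with probability ~1e-6
                atomic_inc(hits);
        }
        atomic_add(trials, local_trials);          // n = sum of all local_trials
    }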
Bernardo
performance cuda bandwidth gpu-programming
Since I didn't get a response on the CUDA forum, I'll try it here: After writing a few programs in CUDA, I've now started to measure their effective bandwidth. However, I have some strange results; for example, in the following code, where I sum all the elements in a vector (regardless of dimension), the bandwidth of the unrolled code and the "normal" code seems to have the same median result (around 3000 Gb/s). I don't know if I'm doing something wrong (AFAIK the program works fine), but from what I've read…
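A figure like 3000 Gb/s is far above any GPU's memory bandwidth, which usually means the timer stopped before the asynchronous kernel finished. A hedged sketch of event-based timing that avoids that pitfall (sumKernel, d_in, d_out, n, grid and block are placeholder names):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    sumKernel<<<grid, block>>>(d_in, d_out, n);    // placeholder kernel
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                    // wait until the kernel is done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // effective bandwidth in GB/s = bytes moved / (ms * 1e6)
    double gbs = (double)n * sizeof(float) / (ms * 1.0e6);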
Kun Huang
gpgpu gpu-programming cuda
I want to block some blocks until one variable is set to a particular value. So I wrote this code to test whether a simple do-while loop will work:

    __device__ int tag = 0;

    __global__ void kernel()
    {
        if (threadIdx.x == 0) {
            volatile int v;
            do {
                v = tag;
            } while (v == 0);
        }
        __syncthreads();
        return;
    }

However, it doesn't work (no endless loop occurs, which is very strange). I want to ask whether any other method is able to block some blocks until some condition is satisfied, or whether some change to this code will make it work.
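One hedged guess at the fix: in the code above, volatile qualifies the local variable v, not the load of tag, so the compiler is free to hoist the read out of the loop. Marking the flag itself volatile forces every iteration to re-read memory. This is a sketch, and inter-block spinning still risks deadlock if whatever sets tag never gets scheduled:

    __device__ volatile int tag = 0;   // volatile on the flag: each read hits memory

    __global__ void waiter()
    {
        if (threadIdx.x == 0) {
            while (tag == 0)
                ;  // spin until some other thread or kernel sets tag
        }
        __syncthreads();   // release the rest of the block afterwards
    }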
Giovanni Azua
cuda gpu-programming cublas
I'm playing with the matrixMulCUBLAS sample code and tried changing the default matrix sizes to something slightly more fun, rows = 5k x cols = 2.5k, and then the example fails with the error Failed to synchronize on the stop event (error code unknown error)! at line #377, when all the computation is done and it is apparently cleaning up cublas. What does this mean, and how can it be fixed? I've got CUDA 5.0 installed with an EVGA FTW NVIDIA GeForce GTX 670 with 2 GB of memory. The driver version is 314.22, the latest…
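"Unknown error" at a later synchronization point usually means an earlier asynchronous failure, often inconsistent dimensions or leading dimensions passed to the GEMM call. A hedged sketch of checking right after the call instead (m, n, k, dA, dB, dC are placeholder names; cuBLAS expects column-major leading dimensions):

    // C (m x n) = alpha * A (m x k) * B (k x n) + beta * C
    cublasStatus_t st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                                    m, n, k, &alpha,
                                    dA, m,    // lda = rows of A
                                    dB, k,    // ldb = rows of B
                                    &beta, dC, m);
    if (st != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "cublasSgemm failed: %d\n", (int)st);
    cudaError_t err = cudaDeviceSynchronize();  // surface async errors here, not at cleanup
    if (err != cudaSuccess)
        fprintf(stderr, "async failure: %s\n", cudaGetErrorString(err));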