Forums - Questions about global memory reads, using host pointer buffers, and private memory

Jonny D
Join Date: 4 Nov 15
Posts: 7
Posted: Fri, 2016-05-27 08:55
I am trying to determine the read/write speed between processing elements and global memory on an Adreno 330, as I don't believe this information is available to developers (?). I'm launching a single work item that does 1,000,000 float reads in kernel A and 1,000,000 float writes in kernel B (therefore 4 MB each way).
 
Device:
Snapdragon 801 - Adreno 330 GPU
HTC One M8 Android 5.0.1
 
**HOST**
    
    // Create arrays on host (CPU/GPU unified memory)
    int size = 1000000;
    float *writeArray = new float[size];
    float *readArray = new float[size];
    for (int i = 0; i<size; ++i){
        readArray[i] = i;
        writeArray[i] = i;
    }
 
    // Initial value = 0.0
    LOGD("Before read : %f", *readArray);
    LOGD("Before write : %f", *writeArray);
 
    // Create device buffers
    cl_mem readBuffer = clCreateBuffer(
            openCLObjects.context,
            CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
            size * sizeof(cl_float),
            readArray,
            &err );
    cl_mem writeBuffer = clCreateBuffer(
            openCLObjects.context,
            CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
            size * sizeof(cl_float),
            writeArray,
            &err );
    
    // Set kernel arguments
    size_t globalSize[3] = {1,1,1};
    err = clSetKernelArg(openCLObjects.readTest, 0, sizeof(cl_mem), &readBuffer);
    err = clSetKernelArg(openCLObjects.writeTest, 0, sizeof(cl_mem), &writeBuffer);
 
    // Launch kernels
    err = clEnqueueNDRangeKernel(openCLObjects.queue, openCLObjects.readTest, 3, NULL, globalSize, NULL, 0, NULL, NULL);
    clFinish(openCLObjects.queue);
    err = clEnqueueNDRangeKernel(openCLObjects.queue, openCLObjects.writeTest, 3, NULL, globalSize, NULL, 0, NULL, NULL);
    clFinish(openCLObjects.queue);
   
    // Expected result = 7.11
    clReleaseMemObject(readBuffer);
    clReleaseMemObject(writeBuffer);
    LOGD("After read: %f", *readArray); // After read: 0.0 (??)
    LOGD("After write: %f", *writeArray);
******************
 
**KERNELS**
    
    kernel void readTest(__global float* array)
    {
        float privateArray[1000000];
        for (int i = 0; i < 1000000; ++i)
        {
            privateArray[i] = array[i];
        }
    }
 
    kernel void writeTest(__global float* array)
    {
        for (int i = 0; i < 1000000; ++i){
            array[i] = 7.11;
        }
    }
******************
 
**Results via AdrenoProfiler**: 
readTest:
    Global loads: 0 bytes
    Global stores: 0 bytes
    Runtime: 0.010 ms
 
writeTest:
    Global loads: 0 bytes
    Global stores: 4000000 bytes
    Runtime: 65 ms
 
 
**My questions**:
1. Why doesn't readTest do any memory loads? If I change it to array[i] = array[i]+1, then it does 4M reads and 4M writes (120 ms), which makes sense. If memory is loaded but nothing is ever written back, does the compiler skip the load?
 
2. Why am I not reading the updated values of the arrays after the process completes? If I call clEnqueueMapBuffer just before printing the results, I see the correct values. I understand why this would be necessary for pinned memory, but I thought the point of CL_MEM_USE_HOST_PTR was that the work items modify the actual arrays allocated on the host.
 
3. To my understanding, if I declare a private variable within a kernel, it will be stored in private memory (registers?). There are no available specs, and I have not found a way to measure the amount of private memory available to a processing element. Any suggestions on how? I'm sure 4 MB is much too large, so what is happening with the memory in the readTest kernel? Is privateArray just being stored in global memory (unified DRAM)? Are private values stored in local memory if they don't fit in registers, and in global memory if they don't fit in local? (8 KB local in my case.) I can't seem to find a thorough explanation of private memory.
 
Sorry for the lengthy post, I really appreciate any information anyone could provide.
 
Jonny D
Join Date: 4 Nov 15
Posts: 7
Posted: Fri, 2016-05-27 11:36

I have learnt the answer to question #2. After the kernel finishes, I must call clEnqueueMapBuffer / clEnqueueUnmapMemObject to ensure that the GPU's global cache is written back to the actual global memory. Will I have to do this between every clEnqueueNDRangeKernel call to ensure data consistency?
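For reference, the map/unmap pattern described above might look like this minimal sketch, reusing the buffer and object names from the first post (a non-authoritative illustration; error handling omitted):

```c
// After clFinish(queue): map the buffer so GPU caches are flushed
// and the kernel's results become visible to the host.
float *mapped = (float *)clEnqueueMapBuffer(
        openCLObjects.queue, writeBuffer,
        CL_TRUE,                      // blocking map
        CL_MAP_READ,
        0, size * sizeof(cl_float),
        0, NULL, NULL, &err);

LOGD("After write: %f", *mapped);     // now shows the updated value

// Unmap before handing the buffer back to the GPU.
err = clEnqueueUnmapMemObject(openCLObjects.queue, writeBuffer,
                              mapped, 0, NULL, NULL);
```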

Carlos Dominguez Moderator
Join Date: 27 Jul 15
Location: San Diego
Posts: 110
Posted: Tue, 2016-05-31 08:39

Hi Jonny,

Thank you for posting your question and sorry for the delayed response:

1. If the compiler detects that the data will never be written out, it will optimize away the load.

2. As per the spec, clEnqueueMapBuffer is necessary for the results of a GPU operation to become visible to the host CPU.

3. On Adreno GPUs, some private memory is promoted to registers and some to local memory; the rest is spilled to global memory. In general, registers and local memory are faster than global memory. The compiler controls the promotion of private variables. The maximum private memory per work item is typically capped at 4 KB.
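One way to get a driver-reported number for a kernel's private memory use (spills included, where the driver reports them) is the standard clGetKernelWorkGroupInfo query. A sketch, assuming a built readTest kernel and a valid device id; some drivers may simply report 0 here:

```c
// Query the per-work-item private memory consumed by this compiled
// kernel (CL_KERNEL_PRIVATE_MEM_SIZE is standard since OpenCL 1.1).
cl_ulong private_mem = 0;
clGetKernelWorkGroupInfo(openCLObjects.readTest, device,
                         CL_KERNEL_PRIVATE_MEM_SIZE,
                         sizeof(private_mem), &private_mem, NULL);
LOGD("Private mem per work item: %llu bytes",
     (unsigned long long)private_mem);
```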

       Looking beyond the posted questions, using a single work item to measure memory bandwidth is not a recommended strategy. To get the best performance out of the GPU for I/O bandwidth measurement, you should have large workgroups with multiple work items. A work group size of 128 or more is preferable, and adjacent work items should read/write adjacent blocks of memory.

You should also use multiple workgroups (at least 4) so that all the functional units on the GPU are activated.
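A bandwidth-test kernel restructured along these lines might look like the following sketch: each work item handles one element, so adjacent work items touch adjacent addresses (coalesced access). The kernel name and launch sizes are illustrative:

```c
// Launch with e.g. globalSize = 1,000,000 and localSize = 128 so
// adjacent work items read/write adjacent floats.
kernel void bandwidthTest(__global const float* in, __global float* out)
{
    int i = get_global_id(0);
    out[i] = in[i];
}
```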

Jonny D
Join Date: 4 Nov 15
Posts: 7
Posted: Tue, 2016-05-31 10:39

Carlos,

Thank you for the informative response.

1. Very good. Disabling optimizations showed the memory reads.

2. I see that clEnqueueMapBuffer is necessary for the data modification to be visible to the host. I just want to be clear on why. CL_MEM_USE_HOST_PTR should have the GPU modify the actual data on the host with zero copy. Is the clEnqueueMapBuffer call necessary afterward to ensure that everything in the GPU's global cache is flushed back to global memory?

3. Is there any way for me to determine the size of register memory? Or are there any technical specs I can request beyond what is available online?

Just for clarification, I understand that using a single work item is impractical. I was asked to find the bandwidth between global memory and a single processing element, rather than the total bandwidth of the GPU, which is why I timed the execution of a single work item. With proper workgroup design, latency hiding should improve performance significantly. This was to provide a base metric for my professor.

Thanks again.

Carlos Dominguez Moderator
Join Date: 27 Jul 15
Location: San Diego
Posts: 110
Posted: Tue, 2016-06-07 08:36

Hi Jonny,

2. The recommended practice is to use CL_MEM_ALLOC_HOST_PTR and map the buffer or image for host access. This allows the driver to do a zero-copy operation for the map. CL_MEM_USE_HOST_PTR actually breaks zero copy because it forces the driver to copy data back to the original host pointer that was specified during buffer/image creation. The map call is essential even in zero-copy operations so that caches can be appropriately flushed and invalidated.
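The recommended CL_MEM_ALLOC_HOST_PTR pattern might look like this sketch, reusing the context/queue names from the original post (an illustration, not a definitive implementation; error handling omitted):

```c
// Let the driver allocate host-accessible memory (enables zero copy).
cl_mem buf = clCreateBuffer(openCLObjects.context,
                            CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            size * sizeof(cl_float), NULL, &err);

// Map with CL_MAP_WRITE to fill the buffer with input data on the host.
float *p = (float *)clEnqueueMapBuffer(openCLObjects.queue, buf, CL_TRUE,
                                       CL_MAP_WRITE,
                                       0, size * sizeof(cl_float),
                                       0, NULL, NULL, &err);
for (int i = 0; i < size; ++i) p[i] = (float)i;
clEnqueueUnmapMemObject(openCLObjects.queue, buf, p, 0, NULL, NULL);

// ... run kernels ..., then map again with CL_MAP_READ to see results.
```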

3. In general, the compiler's handling of variables is opaque to end users. That said, by capturing your app with the Snapdragon Profiler and inspecting it, you should be able to get the instruction storage size, which as far as I know is the total size of registers used plus stack.

Snapdragon Profiler: https://developer.qualcomm.com/software/snapdragon-profiler

Capture OpenCL applications with Snapdragon Profiler video: https://www.youtube.com/watch?v=mevcqGF-jhc

