Exploring OpenCL to Accelerate Processes on the Backend Side

April 28, 2023 - Neirth

I was interested in learning about the effective parallelism of certain heavy operations, and how these could be better leveraged in a fairly demanding environment such as a data center. The result was learning more about OpenCL and how it can be exploited, on servers and beyond, to speed up processes.

Understanding the problem behind it

The main issue we are going to address is how to speed up certain processes that would otherwise be quite costly when run on the wrong device.

We must understand that, on most servers, all computing power ends up funneled to the processor. The CPU is usually shared out by a Type 1 hypervisor: cores are time-shared with the cores of other guest operating systems, and on top of that run the services that let the kernel interact with the real world, in other words, the auxiliary services our application needs in order to run.

Put plainly, the processor is busy with a thousand tasks, and we must manage them very well for our application to attend to requests quickly. If our service also has to query external services, we end up wasting a great deal of the machine's capacity.

This problem has been studied in academia for years, with ongoing work on better algorithms to improve effective CPU utilization. A little more than a decade ago, however, experimentation began on introducing an additional player into this segment of computation. I am talking, of course, about the GPGPU concept.

The academic world showed a lot of interest in tapping the potential of these devices, which was otherwise being wasted, not without criticism of course.

It was realized that GPUs were highly specialized in one type of computation: vector and matrix operations. This capability, built largely to render frames at an acceptable rate in real time, could be exploited for scientific applications and other technologies requiring a fairly high computational capacity.

Since these devices were relieved of the responsibility of running an operating system, or in the worst case a Type 1 hypervisor juggling several operating systems, they became a very interesting option for launching workloads and getting an answer back quickly.

How OpenCL works and how it coordinates with the CPU

Before introducing OpenCL, it is worth mentioning that it can target GPUs, FPGAs, NPUs and even CPUs [4]. This is known as a heterogeneous computing framework; although it is mostly exploited on servers, it is not formally tied to the server world. (Your mobile device probably has OpenCL drivers and you didn't even realize it.)
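
As a quick way to see this for yourself, the ocl crate (the Rust bindings we will use later in this article) can enumerate every OpenCL platform and device visible on a machine. A minimal sketch:

use ocl::{Device, Platform};

fn main() -> ocl::Result<()> {
    // Walk every OpenCL platform the installed drivers expose
    for platform in Platform::list() {
        println!("Platform: {}", platform.name()?);

        // List every device (CPU, GPU, accelerator...) behind that platform
        for device in Device::list_all(&platform)? {
            println!("  Device: {}", device.name()?);
        }
    }

    Ok(())
}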

OpenCL works with a command-queue system that lets the CPU effectively delegate the heavy workload to the intensive processing unit.

This has a very clear advantage: the CPU is freed up to take care of other tasks, without having to process our request full time itself. Bye bye CPU context switches (for now).
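
To make that concrete, here is a minimal, self-contained sketch of the pattern with the ocl crate. The scale kernel is just an invented placeholder; the point is that enqueueing returns immediately, and the host only blocks when it explicitly synchronizes.

use ocl::ProQue;

// A trivial, invented kernel that scales a buffer, just to show the queueing model
const SRC: &str = "__kernel void scale(__global float *data, float factor) { data[get_global_id(0)] *= factor; }";

fn main() -> ocl::Result<()> {
    // Build the program and an implicit command queue for the default device
    let pro_que = ProQue::builder().src(SRC).dims(1024).build()?;

    // A device buffer sized from the dims above
    let buffer = pro_que.create_buffer::<f32>()?;

    // Bind the kernel to its arguments
    let kernel = pro_que.kernel_builder("scale").arg(&buffer).arg(2.0f32).build()?;

    // Enqueueing returns as soon as the command is queued, not when it finishes:
    // the device works through the queue while this thread stays free
    unsafe { kernel.enq()?; }

    // ...the CPU can attend to other tasks here...

    // Block only when we genuinely need the device to be done
    pro_que.queue().finish()?;

    Ok(())
}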

It also lets us manage, from our application, the memory regions the processing unit will have available. Entire memory arrays can be transferred to the device, or the host's own memory can be exploited directly. The latter has notable advantages, such as random access to the data, letting us avoid copying large blocks of memory between transactions. Whether that is appropriate depends, of course, on the kind of application we want to develop.
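
A sketch of the two strategies with the ocl crate: copy_host_slice copies the host data into a device-side allocation, while use_host_slice asks the driver to work over the host allocation itself (CL_MEM_USE_HOST_PTR; whether it is truly zero-copy depends on the driver). Take the exact builder calls as assumptions about the crate's API rather than a recipe:

use ocl::ProQue;

fn main() -> ocl::Result<()> {
    // ProQue needs some program source; a no-op kernel is enough for this demo
    let pro_que = ProQue::builder().src("__kernel void noop() {}").dims(4096).build()?;

    let host_data = vec![1.0f32; 4096];

    // Strategy 1: copy the host slice into a fresh device-side allocation.
    // Simple and safe, but pays for the transfer on every round trip.
    let copied = pro_que.buffer_builder::<f32>()
                        .len(host_data.len())
                        .copy_host_slice(&host_data)
                        .build()?;

    // Strategy 2: ask the driver to use the host allocation itself
    // (CL_MEM_USE_HOST_PTR). Unsafe in ocl because the slice must stay
    // alive and unmoved for as long as the buffer exists.
    let zero_copy = unsafe {
        pro_que.buffer_builder::<f32>()
               .len(host_data.len())
               .use_host_slice(&host_data)
               .build()?
    };

    println!("copied: {} elements, zero-copy: {} elements", copied.len(), zero_copy.len());

    Ok(())
}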

Another point that makes OpenCL interesting is that its programming language is based on one already known by many: it is syntactically based on C (on the C99 standard, judging by some aspects I have been seeing). Although it is based on C, it is important to take into account that we will not have access to headers such as #include <string.h> or #include <stdio.h>, nor to external libraries such as Boost; from the start we only have the OpenCL functions defined within the standard.

It must be taken into account that this limitation exists so that our code stays compatible with the largest possible amount of hardware with certified drivers.

In OpenCL driver implementations there is a trend towards the SPIR-V binary format, which is also used by Vulkan shaders. This trend aims to simplify graphics drivers so that vendors can focus on efficient Vulkan driver development. The Portable Computing Language (PoCL), for example, allows OpenCL to be used on devices that only have Vulkan drivers, such as the Raspberry Pi 4 [5]. Intel uses SPIR-V as the binary format for its OpenCL drivers [6]. NVIDIA, however, uses its own PTX format [3], which is not portable across platforms, an important aspect to consider.

Implementing Matrix Transpose as a Hello World Example

In this example we are going to look at how we can speed up the transposition of a matrix. Outside the parallel paradigm, this would be done with a succession of iterations that walk the entire matrix and copy each element into a result matrix.
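
As a reference point, this is what that sequential version looks like in Rust: a pair of nested loops walking the whole matrix, using the same flat row-major layout the kernel below assumes.

/// Sequential transpose of a `height` x `width` matrix stored row-major
/// in a flat slice: element (y, x) of the input lands at (x, y) in the output
fn transpose_sequential(width: usize, height: usize, input: &[f32]) -> Vec<f32> {
    let mut output = vec![0.0f32; input.len()];

    for y in 0..height {
        for x in 0..width {
            output[x * height + y] = input[y * width + x];
        }
    }

    output
}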

In OpenCL we have to think about how to divide the problem into smaller problems that can be solved in parallel. In this case, we divide the matrix by rows and columns and assign each (row, column) pair to a work-item. This gives us a fairly simple solution to the problem.

__kernel void transpose(__global float *odata, __global float *idata, int width, int height)
{
    // Get this work-item's coordinates in the matrix (axis 0 = column, axis 1 = row)
    int x = get_global_id(0);
    int y = get_global_id(1);

    // Calculate the index in the original matrix and the index in the transposed matrix
    int index_in = y * width + x;
    int index_out = x * height + y;

    // Copy the value from the original matrix to the transposed matrix
    odata[index_out] = idata[index_in];
}

OpenCL incorporates the concept of a multi-dimensional index space for its work-items. Each work-item gets a separate ID for every dimension it is instantiated in, which is very useful for further speeding up certain types of matrix operations.

Finally, if we come from other C-like programming languages, we will have noticed that the function is marked __kernel and returns void. OpenCL requires kernels to be structured this way, just as it needs to know which parameters we expect to pass to the program we are going to call.

An OpenCL kernel is the sum of a program, its respective arguments, and a queue definition so that requests to the device can be queued.

The result will be sent back to the host program through a result buffer; this is a common practice in OpenCL programs.

For the host program, we will write a Rust program that retrieves the result from the OpenCL kernel. In this case, we will use the ocl crate.

/// Transpose a matrix using OpenCL
///
/// # Arguments
///
/// * `width` - Width of the matrix
/// * `height` - Height of the matrix
/// * `matrix` - Matrix to transpose
///
/// # Returns
///
/// * `matrix_output` - Transposed matrix
///
fn transpose_matrix(width: usize, height: usize, matrix: &[f32]) -> ocl::Result<Vec<f32>> {
    // Build the program into the device driver
    // (ACCEL_SRC is a &str holding the kernel source shown above,
    // e.g. loaded with include_str!("transpose.cl"))
    let program = ocl::ProQue::builder().src(ACCEL_SRC).dims(ocl::SpatialDims::Two(width, height)).build()?;

    // Create the input buffer between the hardware accelerator and main RAM
    let matrix_buff = program.buffer_builder::<f32>().len(matrix.len()).copy_host_slice(matrix).build()?;

    // Create the output buffer between the hardware accelerator and main RAM
    let result_buff = program.create_buffer::<f32>()?;

    // Prepare the program with its arguments to build the kernel
    let kernel = program.kernel_builder("transpose")
                        .arg(&result_buff)
                        .arg(&matrix_buff)
                        .arg(width as i32)
                        .arg(height as i32)
                        .build()?;

    // Run the kernel inside the device and wait for the result.
    unsafe { kernel.enq()?; }

    // Prepare the output matrix
    let mut matrix_output = vec![0.0f32; matrix.len()];

    // Transfer the result back into main memory
    result_buff.read(&mut matrix_output).enq()?;

    // Return the transposed matrix
    Ok(matrix_output)
}

In this way, we have generated a wrapper that abstracts the OpenCL call behind a method that is a black box for whoever wants to use it.
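
For example, a hypothetical caller only needs the dimensions and the flat, row-major matrix; everything OpenCL-related stays hidden (this assumes the &[f32] signature shown above):

fn main() -> ocl::Result<()> {
    // A 2 x 3 matrix (2 rows, 3 columns), stored row-major
    let matrix = vec![1.0, 2.0, 3.0,
                      4.0, 5.0, 6.0];

    // width = 3, height = 2; the wrapper hides every OpenCL detail
    let transposed = transpose_matrix(3, 2, &matrix)?;

    // Expected output: [1.0, 4.0, 2.0, 5.0, 3.0, 6.0]
    println!("{:?}", transposed);

    Ok(())
}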

Implementing Shortest Path Algorithm as a Complete Computation Example

In this case I was working on a small project that needed operations to obtain the shortest possible path. I took Dijkstra's algorithm [2] for the shortest path as inspiration, and from that point I had to rework it so that it was parallelizable.
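
Before looking at the kernels, it helps to fix the data layout: the graph travels to the device as a flattened adjacency matrix, where entry u * vertex_count + v holds the weight of the edge from u to v, and missing edges are marked with 0.0 (the kernels below skip weights of 0.0f and FLT_MAX). A small, purely illustrative sketch of building one:

/// Build a flattened `n` x `n` adjacency matrix from a list of
/// weighted edges `(from, to, weight)`
fn adjacency_matrix(n: usize, edges: &[(usize, usize, f32)]) -> Vec<f32> {
    // 0.0 means "no edge" for the kernels below
    let mut matrix = vec![0.0f32; n * n];

    for &(from, to, weight) in edges {
        matrix[from * n + to] = weight;
    }

    matrix
}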

__kernel void initialize_algorithm_buffers(__global float *result, __global float *distance, __global int *visited, __global float *vertex, __global float *vertex_temp) {
    // Get the global ID: one work-item per vertex
    int gid = get_global_id(0);

    // Initialize the buffers in parallel; vertex 0 is the source
    if (gid == 0) {
        visited[gid] = 1;
        result[gid] = 0;
    } else {
        visited[gid] = 0;
        result[gid] = FLT_MAX;
    }

    distance[gid] = 0;
    vertex[gid] = 0;
    vertex_temp[gid] = 0;
}

__kernel void shortest_path_algorithm(__global float *result, __global float *matrix, __global float *distance, __global int *visited, __global float *vertex_temp, int vertex_count) {
    // Get the global ID: one work-item per vertex
    int gid = get_global_id(0);

    // Only process vertices that have not been visited yet
    if (visited[gid] != 1) {
        // Mark the vertex as visited
        visited[gid] = 1;

        // Iterate over every candidate edge of this vertex
        for(int edge = 0; edge < vertex_count; edge++) {
            // Get the edge weight from the adjacency matrix
            float weight = matrix[gid * vertex_count + edge];

            // Skip missing edges (marked as 0 or FLT_MAX)
            if (weight != 0.0f && weight != FLT_MAX) {
                // Candidate distance when arriving through this edge
                float dist = result[edge] + weight;

                // Keep the candidate if it improves on the current best
                if (distance[gid] == 0.0f || result[gid] > dist) {
                    distance[gid] = dist;
                    vertex_temp[gid] = edge;
                }
            }
        }
    }
}

__kernel void merge_sortest_path(__global float *result, __global float *distance, __global int *visited, __global float *vertex, __global float *vertex_temp) {
    // Get the global ID: one work-item per vertex
    int gid = get_global_id(0);

    // Commit this pass's distance if it improves on the stored result
    if (result[gid] > distance[gid]) {
        result[gid] = distance[gid];
        vertex[gid] = vertex_temp[gid];
    }

    // Reset the visited flag for the next pass (vertex 0 stays visited)
    if (gid != 0) {
        visited[gid] = 0;
    }
}

In this case I had to divide the algorithm into three different kernels, which is where it diverges from the original algorithm. The initialize_algorithm_buffers kernel exists so that we can take advantage of the device's own time to initialize memory.

The next step is shortest_path_algorithm, which holds all the comparison logic that carries out the algorithm as it was originally established.

Finally, we have the merge_sortest_path kernel, where the changes between the temporary buffers and the result buffers are evaluated and applied safely. This is necessary because, while a device such as an NVIDIA GPU [3] does not usually show these concurrent-access problems, when an OpenCL program runs on a CPU, context switches can prevent a thread from executing at the same time as the rest of the workers, which can leave inconsistent states in the result. And it is important to remember: your OpenCL program does not know where it will end up running, so these cases must be considered to avoid disasters.

This time I will omit some of the technical details of how the host program was structured, but for those who are interested, you can take a look at the repo where I published everything I was learning about the capabilities of this technology. Link to repo: Path Walker - GitHub Repo.
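
Still, the overall shape of the host loop is worth sketching: one initialization pass, then alternating compute and merge passes until the distances settle (here, naively, once per vertex). Everything below, from the function name to the iteration count, is a simplified assumption; the real structure lives in the repo.

use ocl::{Buffer, ProQue};

/// Sketch of the host loop driving the three kernels. Assumes `pro_que`
/// was built from the kernel sources above with dims equal to the vertex
/// count, and that the buffers were created as in the transpose example.
fn run_shortest_path(
    pro_que: &ProQue,
    result: &Buffer<f32>,
    matrix: &Buffer<f32>,
    distance: &Buffer<f32>,
    visited: &Buffer<i32>,
    vertex: &Buffer<f32>,
    vertex_temp: &Buffer<f32>,
    vertex_count: usize,
) -> ocl::Result<Vec<f32>> {
    // Bind each kernel to its buffers, mirroring the signatures above
    let init = pro_que.kernel_builder("initialize_algorithm_buffers")
        .arg(result).arg(distance).arg(visited).arg(vertex).arg(vertex_temp)
        .build()?;

    let step = pro_que.kernel_builder("shortest_path_algorithm")
        .arg(result).arg(matrix).arg(distance).arg(visited).arg(vertex_temp)
        .arg(vertex_count as i32)
        .build()?;

    let merge = pro_que.kernel_builder("merge_sortest_path")
        .arg(result).arg(distance).arg(visited).arg(vertex).arg(vertex_temp)
        .build()?;

    // Initialize the device buffers once, on the device itself
    unsafe { init.enq()?; }

    // Alternate compute and merge passes
    for _ in 0..vertex_count {
        unsafe { step.enq()?; }
        unsafe { merge.enq()?; }
    }

    // Read the final distances back into main memory
    let mut distances = vec![0.0f32; vertex_count];
    result.read(&mut distances).enq()?;

    Ok(distances)
}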

Conclusion

Throughout this article we have been able to evaluate the real capabilities of this technology. Besides discovering that it is based on a language familiar to developers, it also lets us think in terms of getting the maximum use out of the hardware we can equip a server with.

From the outside, it may be hard to see where this technology can shine. But in the real world, matrices and graphs are used for everything from finding the shortest way to drive your car to understanding what is in front of you through computer vision. Together with CUDA, this technology is also enjoying a second life in the world of Artificial Intelligence through, for example, TensorFlow.

It is very interesting to see that OpenCL is backed by the Khronos Group, the group behind OpenGL and Vulkan, and that it is kept in good health by the big driver vendors.

References

[1] Midjourney. “Header image of a man walking on a trail”. https://midjourney.com.

[2] T. G. Mattson, D. Ginsburg, B. Gaster, and A. Munshi, “OpenCL Programming Guide,” Pearson Education, 2011.

[3] NVIDIA, “Parallel Thread Execution ISA Version 8.1”, https://docs.nvidia.com/cuda/parallel-thread-execution/

[4] Khronos Group, “OpenCL Overview”, https://www.khronos.org/opencl/

[5] PoCL Developers, “PoCL - Portable Computing Language”, https://github.com/pocl/pocl

[6] Intel Corporation, “SPIR-V*: Default Interface to Intel® Graphics Compiler for OpenCL™ Workloads”, https://www.intel.com/content/www/us/en/developer/articles/case-study/spir-v-default-interface-to-intel-graphics-compiler-for-opencl-workloads.html