CML Inference Engine
April 26, 2023
In a previous post, we saw what deep neural networks, or machine learning (ML) agents, consist of, how they are trained, and that they play an indispensable role in Artemis. Once the ML agents have been designed and trained, they are used in the game to generate realistic, planet-scale maps in real time. This execution stage of a model, i.e. obtaining outputs for given inputs, is called inferencing.
Neural Network Exchange Format (NNEF)
While popular ML frameworks such as PyTorch and TensorFlow are widely used in the design and training stages of ML models, the native file formats of those frameworks are not optimal for inferencing. Analogous situations arise in other fields, such as graphic design. Although Adobe’s Photoshop file format (.PSD) might be suitable in a professional workflow, the final image is generally exported into a viewing format such as JPEG. Exporting to JPEG makes the file smaller and viewable in any web browser, regardless of the image editing software that was used to create the image.
The Neural Network Exchange Format (NNEF) by the Khronos Group is an open format for inferencing of trained neural network models. Using the NNEF tools, an ML model developed in a major ML framework can be converted into the NNEF format and vice versa, allowing interoperability between various ML ecosystems. All our ML agents are currently in the NNEF format.
Shown in the diagram below is a very simple neural network named 'model_abc'. This model accepts two input tensors (input1 and input2) which go through a small number of operations before outputting two tensors (output1 and output2).
A model in the NNEF format is self-contained in its own folder (model_abc.nnef), which contains a text description (graph.nnef) of the network architecture and binary '.dat' files that store the trained weights and biases of the model. In 'model_abc', there are only two binaries, 'weight1.dat' and 'bias1.dat', both for the same convolution operation, but there could be dozens or hundreds of binary files in larger models.
Below is the content of 'graph.nnef', in which each line describes an operation in 'model_abc'. The text description corresponds directly, operation by operation, to the graphical representation above.
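As a rough illustration of the folder layout described above, the following Python sketch checks that a model folder contains 'graph.nnef' and lists its '.dat' binaries. The function name 'describe_nnef_folder' is hypothetical and is not part of CML or the NNEF tools; it only assumes the layout stated in the text.

```python
from pathlib import Path


def describe_nnef_folder(folder):
    """Return (graph_path, sorted .dat file names) for a model folder.

    Assumes the layout described in the text: one 'graph.nnef' text file
    plus binary '.dat' files holding the trained weights and biases.
    """
    folder = Path(folder)
    graph_path = folder / "graph.nnef"
    if not graph_path.is_file():
        raise FileNotFoundError(f"{graph_path} is missing")
    # Search recursively in case binaries are nested in subfolders.
    dat_files = sorted(
        p.relative_to(folder).as_posix() for p in folder.rglob("*.dat")
    )
    return graph_path, dat_files
```

For 'model_abc', a call such as describe_nnef_folder("model_abc.nnef") would be expected to list 'bias1.dat' and 'weight1.dat'.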
CML Inference Engine
With a trained model converted into the NNEF format, an inference engine that can interpret and inference the model is required. Initially, we inferenced our ML agents (in ONNX format) using Nvidia’s TensorRT. While TensorRT was immensely useful at the development stage of Prologue, it was always a temporary solution, as TensorRT requires Nvidia GPUs. As there were no other viable options that met all of our requirements for Prologue at the time, we opted to create CML, our own proprietary engine for inferencing NNEF models. Some of the design objectives of CML were:
Real-time inferencing of ML agents while maintaining ~60 fps
Support for all modern GPUs (Nvidia, AMD, Intel) and platforms (Windows, game consoles, Linux)
Validation and performance profiling tools for development and testing purposes
Versatility in the use of custom and compound operators
Seamless integration with the Melba Engine via unified scheduling and memory management
Since 2021, CML has replaced TensorRT in Prologue for inferencing our ML agents (the upres, texture, and population agents), all in real time. Currently, nearly one hundred neural network operations are supported. Each operation is written in HLSL for GPU inferencing, but CPU-based inferencing is also available.
What CML Does
Before an NNEF model is inferenced, the following one-time procedure is performed to load the model into the GPU:
Parsing of the text description of the neural network
Determination of the output tensor shapes of each operation within the model
Allocation of an appropriate amount of GPU memory to store the trained weights, input tensors, and output tensors for each operation
Reading of the weights and biases from the binary files in the NNEF model folder and transferring them into the GPU memory
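The shape-determination and memory-allocation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not CML's implementation: the already-parsed 'GRAPH' structure, the function names, and the [batch, channels, height, width] layout are all hypothetical, and biases are left out for brevity. The convolution output size uses the standard formula.

```python
from math import prod

# Hypothetical, already-parsed graph: one dict per operation.
# Tensor layout is assumed to be [batch, channels, height, width].
GRAPH = [
    {"op": "conv", "input": "input1", "output": "conv1",
     "weight_shape": [16, 3, 3, 3], "stride": 1, "padding": 1, "dilation": 1},
    {"op": "relu", "input": "conv1", "output": "relu1"},
]


def conv_output_shape(in_shape, weight_shape, stride, padding, dilation):
    """Standard convolution output-size formula, applied per spatial axis."""
    n, _, h, w = in_shape
    out_channels, _, kh, kw = weight_shape

    def out_dim(size, kernel):
        return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

    return [n, out_channels, out_dim(h, kh), out_dim(w, kw)]


def infer_shapes(graph, input_shapes):
    """Walk the graph once and record the output shape of every operation."""
    shapes = dict(input_shapes)
    for node in graph:
        in_shape = shapes[node["input"]]
        if node["op"] == "conv":
            shapes[node["output"]] = conv_output_shape(
                in_shape, node["weight_shape"],
                node["stride"], node["padding"], node["dilation"])
        elif node["op"] == "relu":
            # Element-wise operations keep the input shape.
            shapes[node["output"]] = list(in_shape)
        else:
            raise NotImplementedError(node["op"])
    return shapes


def bytes_needed(shapes, graph, bytes_per_element=4):
    """Memory for all tensors plus all weights, assuming float32 elements."""
    tensor_elems = sum(prod(s) for s in shapes.values())
    weight_elems = sum(prod(n["weight_shape"]) for n in graph
                       if "weight_shape" in n)
    return (tensor_elems + weight_elems) * bytes_per_element
```

Because shapes depend only on the graph and the input shapes, this work can be done once at load time rather than on every inference.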
Once these preparatory steps have been completed, the CML engine can inference the model as many times as needed. CML accepts one or more input tensors and computes each operation in the NNEF model in turn, until all operations have been computed and the final output tensors are obtained.
CML is designed to be used as a library, but it can also be used as a stand-alone command-line application to inference a model. The app reads the model’s input tensors from NumPy files, inferences the model on the GPU, then writes the output tensors in the NumPy format.
Below are two images generated by inferencing the StyleGAN2-ADA model (pretrained on the MetFaces dataset), which is one of the pioneering AI models for 2D image generation. StyleGAN2-ADA takes a 512-element random float32 array as an input, and generates a 1024x1024 RGB image. Using CML version 1.0, inferencing StyleGAN2-ADA took 3.7 s on a GeForce RTX 2080 Ti.
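The inference loop described above can be pictured with a minimal Python sketch. It is an assumption-laden toy, not CML itself: the 'OPS' table, the 'GRAPH' tuples, and the 'run' function are hypothetical, tensors are flat Python lists, and each operation is a plain function where a GPU engine would dispatch a shader.

```python
# Hypothetical operator table: op name -> Python function.
OPS = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "mul": lambda a, b: [x * y for x, y in zip(a, b)],
    "relu": lambda a: [max(0.0, x) for x in a],
}

# Hypothetical graph in execution order: (op, input names, output name).
GRAPH = [
    ("add", ("input1", "input2"), "sum1"),
    ("relu", ("sum1",), "output1"),
    ("mul", ("sum1", "input2"), "output2"),
]


def run(graph, inputs, output_names):
    """Compute every operation in order and return the requested outputs."""
    tensors = dict(inputs)  # tensor name -> values
    for op, in_names, out_name in graph:
        args = [tensors[name] for name in in_names]
        tensors[out_name] = OPS[op](*args)
    return [tensors[name] for name in output_names]


output1, output2 = run(
    GRAPH,
    {"input1": [1.0, -3.0], "input2": [2.0, 1.0]},
    ["output1", "output2"],
)
```

The same loaded graph can be run repeatedly with different inputs, which is why the one-time loading work is kept separate from the loop.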
In deep learning, there are many standard operations, such as convolution, addition, and ReLU, that are supported in all ML frameworks. At times, however, there is a need to compute quantities that are not part of the standard list of operations. Attempting to construct such a function by compositing standard operations can easily become messy or impractical. A typical example occurred when we needed to compute cross products of three-dimensional vectors: a cross product c = a × b at each point in a 32 × 32 grid. This can be done in a single line in PyTorch ('torch.cross'), but there was no easy way to translate it for TensorRT other than to build it as a combination of several operations. In the end, the cross product had to be constructed by combining 12 smaller operations (two splits, six multiplications, three subtractions, and one concatenation), quickly making this simple operation look very complicated.
In CML, introducing a custom operation is easy. This enabled us to encapsulate the cross product as one operation by writing a custom shader, rather than compositing 12 smaller operations as in Figure 4. This not only simplifies the computational graph but also eliminates the overhead of forward-feeding the output tensors to subsequent operations. Finally, a custom operation becomes even more useful when there is a need for advanced logic, as there is no restriction on conditional branching, which can be cumbersome to achieve using just standard neural network operations.
We have created a number of tools to aid the development of the HLSL shaders for neural network operations. Among those, the rank operations feature was particularly useful for optimization. It executes an NNEF model line by line to profile each step, then reports the time taken by each type of operation. As certain operations often dominated the inference time of our models, the rank operations feature enabled us to identify the performance bottlenecks and improve the speed of those operations.
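To make the operation count concrete, here is a Python sketch of both approaches: the cross product assembled from 12 elementary steps (two splits, six multiplications, three subtractions, one concatenation) and the same result as a single fused function. The function names are hypothetical, and the fused version only stands in for what a custom shader would do; it is not CML's shader code.

```python
def split(v):
    """One 'split' op: break a 3-vector into its components."""
    return v[0], v[1], v[2]


def cross_composite(a, b):
    """Cross product built from 12 elementary ops."""
    ax, ay, az = split(a)   # split 1
    bx, by, bz = split(b)   # split 2
    m1 = ay * bz            # mul 1
    m2 = az * by            # mul 2
    m3 = az * bx            # mul 3
    m4 = ax * bz            # mul 4
    m5 = ax * by            # mul 5
    m6 = ay * bx            # mul 6
    cx = m1 - m2            # sub 1
    cy = m3 - m4            # sub 2
    cz = m5 - m6            # sub 3
    return [cx, cy, cz]     # concat


def cross_fused(a, b):
    """The same result as one operation, as a custom shader would compute it."""
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def cross_grid(a_grid, b_grid, cross=cross_fused):
    """Apply the cross product at every point of a 2D grid of 3-vectors."""
    return [
        [cross(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(a_grid, b_grid)
    ]
```

In the composite version every intermediate value (m1 to m6, cx to cz) would be a separate tensor passed between operations, which is the overhead the fused version avoids.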
Shown below is an example of the performance report for a certain network. We see that more than half of the inference time for this particular model (about 57%) is spent on the 'conv' and the 'add' operations. From this report, we can get an idea of which operations need improvement if we wish to reduce the inference time of this particular model.
Inference time GPU: 0.032153697527218618 seconds (32.153696000000011 millisec)
Total operation count: 596
Operation ranking:
name               percentage   time         count   avg time
conv               41.87 %      13.462 ms    99      0.136 ms
add                14.67 %      4.716 ms     102     0.046 ms
transpose          8.96 %       2.882 ms     5       0.576 ms
mul                8.21 %       2.640 ms     50      0.053 ms
relu               7.46 %       2.400 ms     92      0.026 ms
linear             5.35 %       1.722 ms     104     0.017 ms
reshape            4.43 %       1.423 ms     61      0.023 ms
leaky_relu         3.72 %       1.195 ms     13      0.092 ms
concat             1.30 %       0.419 ms     7       0.060 ms
deconv             0.57 %       0.184 ms     2       0.092 ms
sub                0.38 %       0.123 ms     10      0.012 ms
div                0.38 %       0.123 ms     7       0.018 ms
pad                0.37 %       0.118 ms     4       0.029 ms
nearest_upsample   0.35 %       0.112 ms     6       0.019 ms
...
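A report of this kind can be produced by grouping per-operation timings by name. The Python sketch below shows one way to do the aggregation; the function names and the input format (a list of (name, milliseconds) pairs) are assumptions for illustration and do not describe how CML collects or stores its timings.

```python
from collections import defaultdict


def rank_operations(timings):
    """Aggregate (name, milliseconds) samples into ranked rows.

    Returns a list of (name, percentage, total_ms, count, avg_ms),
    sorted by total time, largest first.
    """
    total_ms = defaultdict(float)
    count = defaultdict(int)
    for name, ms in timings:
        total_ms[name] += ms
        count[name] += 1
    grand_total = sum(total_ms.values())
    if grand_total == 0:
        return []
    rows = [
        (name, 100.0 * t / grand_total, t, count[name], t / count[name])
        for name, t in total_ms.items()
    ]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


def format_report(rows):
    """Render the rows as aligned plain-text columns."""
    lines = [f"{'name':<18}{'percentage':>12}{'time':>12}{'count':>8}{'avg time':>12}"]
    for name, pct, t, n, avg in rows:
        lines.append(f"{name:<18}{pct:>10.2f} %{t:>9.3f} ms{n:>8}{avg:>9.3f} ms")
    return "\n".join(lines)


rows = rank_operations([("conv", 3.0), ("add", 1.0), ("conv", 1.0)])
print(format_report(rows))
```

Sorting by total time rather than average time puts operations that are cheap individually but called very often near the top, alongside the expensive ones.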
CML v2.0 and Outlook
As we have developed our own graphics abstraction layer for Melba, we recently ported the original CML v1.0 (using DX11) to v2.0 (using our graphics abstraction layer). Our graphics abstraction layer currently uses DX12 as its backend, but it can support other graphics APIs as a backend if desired. With the use of DX12 or another modern graphics API, we can potentially get much faster performance and more capabilities in CML.
Finally, the CML inference engine is still a work in progress. While many of the popular operations are supported, more operations could be added. Greater generality in the implemented operations is also desirable, as the implementations of certain operations support only a limited set of attributes. In addition, some operations have room for sizable performance improvements through various optimization techniques.