Faster Inference for NLP Pipelines Using Hugging Face Transformers and ONNX Runtime

Yashu Gupta
3 min read · Jan 3, 2021


Transformers have taken the NLP world by storm because they are remarkably good at modeling context. With Transformers we can now achieve state-of-the-art results on problems such as question answering, machine translation, summarization, sentence classification, and text generation. That state of the art is dominated by large Transformer models, which poses production challenges: because of their size, it can be very difficult to deploy these models and still get fast inference.

BERT-base-uncased has ~110 million parameters, RoBERTa-base has ~125 million parameters, and GPT-2 has ~117 million parameters. Each parameter is a floating-point number that requires 32 bits (FP32). This means the file sizes of these models are huge as is the memory they consume. Not to mention all the computation that needs to happen on all these bits.
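To put that in perspective, a quick back-of-the-envelope calculation (using the ~110 million parameter figure above) gives the raw FP32 weight size:

```python
# Rough estimate of the memory needed just to hold BERT-base weights in FP32
params = 110_000_000         # ~110M parameters (BERT-base-uncased)
bytes_per_param = 4          # FP32 = 32 bits = 4 bytes
size_mb = params * bytes_per_param / (1024 ** 2)
print(f"~{size_mb:.0f} MB")  # roughly 420 MB, before activations and framework overhead
```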

Because of these challenges, model optimization is now a prime focus for NLP and deep learning engineers, so that faster inference can be achieved when deploying these large models for clients. With ONNX (Open Neural Network Exchange) Runtime and its deep integration with Hugging Face Transformers, inference speed-ups of up to 6x can be achieved for the models mentioned above.

Faster Inference: Optimizing a Transformer Model with Hugging Face and ONNX Runtime

We will download a pretrained BERT model and convert it to ONNX format so that the model size is reduced and it can be loaded easily; faster inference is then achieved by converting the floating-point model parameters to INT8 using quantization. So what the heck is quantization?

Quantization: Quantization approximates floating-point numbers with lower-bit-width numbers, dramatically reducing the memory footprint and accelerating performance. Quantization can introduce accuracy loss, since fewer bits limit the precision and range of values. However, researchers have extensively demonstrated that weights and activations can be represented using 8-bit integers (INT8) without incurring a significant loss in accuracy.
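To make the idea concrete, here is a minimal sketch of symmetric linear INT8 quantization of a weight matrix (illustrative only; ONNX Runtime uses a more sophisticated, operator-aware quantization scheme):

```python
import numpy as np

def quantize_int8(weights):
    """Map FP32 weights to INT8 values plus a single FP32 scale factor."""
    scale = np.abs(weights).max() / 127.0                    # largest magnitude -> 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate reconstruction of the original FP32 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(768, 768).astype(np.float32)             # a BERT-sized weight matrix
q, scale = quantize_int8(w)
print(w.nbytes, q.nbytes)                                     # 4x smaller in INT8
print(np.abs(w - dequantize(q, scale)).max())                 # small quantization error
```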

Let's export a Transformer model to ONNX format using Hugging Face and Python. Let's start by installing all the essential libraries.
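The exact package versions will depend on your environment; something along the following lines installs everything used below (PyTorch, Transformers, ONNX, and ONNX Runtime):

```bash
pip install torch transformers onnx onnxruntime
```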

Performing the optimization is pretty straightforward. Hugging Face and ONNX have command line tools for accessing pre-trained models and optimizing them. We can do it all in a single command:
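As a sketch (the model name and output path here are illustrative, and the flags assume the convert_graph_to_onnx helper that ships with recent Transformers releases), the export plus quantization looks roughly like this:

```bash
python -m transformers.convert_graph_to_onnx \
  --framework pt \
  --model bert-base-uncased \
  --quantize \
  onnx/bert-base-uncased.onnx
```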

With that one command, we have downloaded a pre-trained BERT, converted it to ONNX, quantized it, and optimized it for inference.

Hugging Face has already integrated ONNX Runtime into its NLP pipelines, so we can use it directly.
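One way to run the exported, quantized model directly with ONNX Runtime looks roughly like the sketch below; the file path assumes the export command above, so adjust it to wherever your quantized graph was written:

```python
import onnxruntime as ort
from transformers import AutoTokenizer

# The tokenizer must match the model that was exported
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Create an inference session over the quantized ONNX graph
session = ort.InferenceSession("onnx/bert-base-uncased-quantized.onnx")

# Tokenize to NumPy arrays and feed them as named inputs to the graph
inputs = tokenizer("ONNX Runtime makes BERT inference faster.", return_tensors="np")
outputs = session.run(None, dict(inputs))

print(outputs[0].shape)  # e.g. (1, sequence_length, hidden_size)
```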

Performance Results

Latencies below are measured in milliseconds. PyTorch refers to PyTorch 1.6 with TorchScript. PyTorch + ONNX Runtime refers to the PyTorch versions of the Hugging Face models exported to ONNX and run with ONNX Runtime 1.4.

Below is the benchmark comparing the BERT model deployed with plain PyTorch against PyTorch + ONNX Runtime.

Below is the link to the complete end-to-end implementation from Hugging Face.
