On Thursday, an AWS blog post announced that the company has shifted most of the cloud processing for its Alexa personal assistant from Nvidia GPUs to its own application-specific integrated circuit (ASIC). Amazon developer advocate Sebastien Stormacq describes the hardware design of Inferentia as follows:
AWS Inferentia is a custom chip, built by AWS, to accelerate machine learning inference workloads and optimize their cost. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic-array matrix-multiply engine, which massively speeds up typical deep-learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps cut down on external memory accesses, dramatically reducing latency and increasing throughput.
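Inferentia's internals are not public beyond that description, but the systolic-array idea itself is well understood: operands flow through a grid of multiply-accumulate cells so that each cell reuses data streaming past it instead of re-fetching it from memory. The toy Python simulation below is purely illustrative – the grid layout, cycle count, and function name are assumptions for exposition, not Amazon's actual design:

```python
import numpy as np

def systolic_matmul(A, B):
    """Toy, cycle-level simulation of an output-stationary systolic array.

    Each processing element (PE) at grid position (i, j) accumulates C[i, j].
    Values of A flow left-to-right, values of B flow top-to-bottom, and the
    edge feeds are skewed so that A[i, k] and B[k, j] meet in PE (i, j).
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"

    C = np.zeros((n, m))        # running partial sums held in each PE
    a_reg = np.zeros((n, m))    # A value currently sitting in each PE
    b_reg = np.zeros((n, m))    # B value currently sitting in each PE

    for t in range(n + m + k - 2):          # cycles needed for all operands to drain
        # Shift: A values move one PE to the right, B values one PE down.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Feed the skewed edges of the array (zeros outside the valid window).
        for i in range(n):
            step = t - i
            a_reg[i, 0] = A[i, step] if 0 <= step < k else 0.0
        for j in range(m):
            step = t - j
            b_reg[0, j] = B[step, j] if 0 <= step < k else 0.0
        # Every PE multiplies the pair it currently holds and accumulates.
        C += a_reg * b_reg
    return C

# Sanity check against an ordinary matrix product.
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The point of the layout is data reuse: each operand is fetched once and then handed from cell to cell, which is why a large on-chip cache and few external memory accesses matter so much for this kind of engine.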
When an Amazon customer – usually someone who owns an Echo or Echo Dot – uses the Alexa assistant, very little of the processing is done on the device itself. The workload for a typical Alexa request looks something like this (a toy code sketch of the cloud-side stages follows the list):
- A person asks an Amazon Echo, “Alexa, what is the special ingredient in Earl Grey tea?”
- The Echo detects the wake word – Alexa – using its own built-in processing
- The Echo streams the request to Amazon data centers
- Within the Amazon data center, the voice stream is converted into phonemes (inference AI workload)
- The phonemes are converted into words (inference AI workload)
- The words are assembled into phrases (inference AI workload)
- The phrase is distilled into an intent (inference AI workload)
- The intent is forwarded to a suitable fulfillment service, which returns a response as a JSON document
- The JSON document is parsed, including the text for Alexa’s response
- The text of Alexa’s reply is converted into natural-sounding speech (inference AI workload)
- Natural language audio is streamed back to the Echo device for playback – “It’s bergamot orange oil.”
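To make that flow concrete, here is a minimal, purely illustrative Python sketch of the cloud-side pipeline. Every function is a hard-coded stand-in – the real Alexa components are proprietary models served from Inferentia-backed hardware – so only the shape of the pipeline (and the JSON hand-off to the fulfillment service) is meant to carry over:

```python
import json

# Every function below is a toy stand-in; the real services are proprietary
# inference models, not hard-coded return values.

def speech_to_phonemes(audio):          # inference AI workload
    return ["ER", "L", "G", "R", "EY"]  # toy phoneme stream

def phonemes_to_words(phonemes):        # inference AI workload
    return ["what", "is", "the", "special", "ingredient", "in", "earl", "grey", "tea"]

def words_to_phrase(words):             # inference AI workload
    return " ".join(words)

def phrase_to_intent(phrase):           # inference AI workload
    return {"intent": "IngredientQuery", "slots": {"item": "earl grey tea"}}

def fulfill(intent):
    # The fulfillment service returns its answer as a JSON document.
    return json.dumps({"speech_text": "It's bergamot orange oil."})

def text_to_speech(text):               # inference AI workload
    return text.encode("utf-8")         # toy "audio" bytes

def handle_alexa_request(audio):
    """Cloud-side pipeline: everything after wake-word detection on the Echo."""
    phonemes = speech_to_phonemes(audio)
    words = phonemes_to_words(phonemes)
    phrase = words_to_phrase(words)
    intent = phrase_to_intent(phrase)
    reply = json.loads(fulfill(intent))          # parse the JSON response
    return text_to_speech(reply["speech_text"])  # streamed back to the device

print(handle_alexa_request(b"\x00\x01"))  # stands in for the audio the Echo uploads
```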
As you can see, almost all of the actual work in fulfilling an Alexa request takes place in the cloud – not in an Echo or Echo Dot device. And the vast majority of that cloud work is done not through traditional if-then logic, but rather through inference – which is the answer-providing side of neural network processing.
According to Stormacq, moving that inference workload from Nvidia GPU hardware to Amazon's Inferentia chip cut costs by 30 percent and improved end-to-end latency by 25 percent for Alexa's text-to-speech workloads. And Amazon isn't the only organization using the Inferentia processor – the chip powers Amazon AWS Inf1 instances, which are available to the general public and compete with Amazon's GPU-based G4 instances.
With Amazon’s AWS Neuron software development kit, machine learning developers can use Inferentia as a target for popular frameworks like TensorFlow, PyTorch, and MXNet.
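As a rough sketch of that workflow, the snippet below compiles a stock PyTorch model for Inferentia using the torch-neuron package from the AWS Neuron SDK. The call names follow AWS's published examples as best they can be reconstructed here, so treat the exact API, package availability, and file names as assumptions:

```python
import torch
import torch_neuron  # AWS Neuron SDK plug-in for PyTorch (assumed installed via pip)
from torchvision import models

# Load a stock image-classification model and put it in inference mode.
model = models.resnet50(pretrained=True)
model.eval()

# Compile (trace) the model for Inferentia using a representative input shape.
example = torch.zeros([1, 3, 224, 224], dtype=torch.float32)
model_neuron = torch.neuron.trace(model, example_inputs=[example])

# Save the compiled artifact; it can then be loaded and served on an Inf1 instance.
model_neuron.save("resnet50_neuron.pt")
```

The compilation step happens ahead of time on any build machine; only the saved, compiled model needs to run on Inferentia hardware.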