Intel Launches Movidius Neural Compute Stick: Deep Learning and AI on a $79 USB Stick
by Nate Oh on July 20, 2017 11:00 AM EST
Today Intel subsidiary Movidius is launching their Neural Compute Stick (NCS), a version of which was showcased earlier this year at CES 2017. The Movidius NCS adds to Intel’s deep learning and AI development portfolio, building off of Movidius’ April 2016 launch of the Fathom NCS and Intel’s later acquisition of Movidius itself in September 2016. As Intel states, the Movidius NCS is “the world’s first self-contained AI accelerator in a USB format,” and is designed to allow host devices to process deep neural networks natively – or in other words, at the edge. In turn, this provides developers and researchers with a low power and low cost method to develop and optimize various offline AI applications.
Movidius's NCS is powered by their Myriad 2 vision processing unit (VPU), and, according to the company, can reach over 100 GFLOPs of performance within a nominal 1W of power consumption. Under the hood, the Movidius NCS works by translating a standard, trained Caffe-based convolutional neural network (CNN) into an embedded neural network that then runs on the VPU. In production workloads, the NCS can be used as a discrete accelerator for speeding up or offloading neural network tasks. For development workloads, the company offers several developer-centric features, including layer-by-layer neural network metrics that let developers analyze and optimize performance and power, and validation scripts that let developers compare the output of the NCS against the original PC model to verify the accuracy of the NCS's model.
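To make the validation idea concrete, here is a minimal Python sketch of the kind of check such a script might perform: comparing the class scores produced by the original model on the host with those produced by the accelerator for the same input. This is an illustration only. The function names, the tolerance, and the sample scores are all hypothetical and are not taken from the Movidius SDK.

```python
# Hypothetical sketch: compare a reference model's output with an accelerator's
# output for the same input. Names and tolerances are illustrative assumptions
# and are not part of the Movidius SDK.

def top_k(scores, k):
    """Return the indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def compare_outputs(reference, candidate, k=5, tolerance=1e-2):
    """Summarize how closely two class-score vectors agree."""
    if len(reference) != len(candidate):
        raise ValueError("output vectors must have the same length")
    # Largest element-wise deviation between the two outputs.
    max_abs_diff = max(abs(r - c) for r, c in zip(reference, candidate))
    return {
        "max_abs_diff": max_abs_diff,
        "within_tolerance": max_abs_diff <= tolerance,
        "top1_match": top_k(reference, 1) == top_k(candidate, 1),
        "topk_match": top_k(reference, k) == top_k(candidate, k),
    }


if __name__ == "__main__":
    # Made-up scores standing in for host (FP32) and device (FP16) outputs.
    host_scores = [0.10, 0.70, 0.15, 0.05]
    device_scores = [0.101, 0.698, 0.151, 0.050]
    print(compare_outputs(host_scores, device_scores, k=3))
```

Checking both the raw numeric deviation and the top-k ranking is a deliberate choice in this sketch: reduced-precision inference can shift scores slightly without changing the predicted class, so the two checks answer different questions.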
The 2017 Movidius NCS vs. 2016 Fathom NCS
According to Gary Brown, VP of Marketing at Movidius, this ‘Acceleration mode’ is one of several features that differentiate the Movidius NCS from the Fathom NCS. The Movidius NCS also comes with a new "Multi-Stick mode" that allows multiple sticks in one host to work in conjunction in offloading work from the CPU. For multiple stick configurations, Movidius claims that they have confirmed linear performance increases up to 4 sticks in lab tests, and are currently validating 6 and 8 stick configurations. Importantly, the company believes that there is no theoretical maximum, and they expect that they can achieve similar linear behavior for more devices. Ultimately, though, scalability will depend at least somewhat on the neural network itself, and developers trying to use the feature will want to experiment with it to determine how well they can reasonably scale.
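For developers who want to check how close their own network comes to that linear behavior, one simple approach is to divide measured throughput by the ideal of N times the single-stick rate. The Python sketch below does this. The function name is hypothetical, and the throughput figures in the usage section are placeholders rather than measurements from Movidius or anyone else.

```python
# Hypothetical sketch: express multi-stick throughput as a fraction of ideal
# linear scaling. All numbers in the usage section are placeholders.

def scaling_efficiency(single_stick_rate, measured_rates):
    """Map stick count -> measured / ideal throughput.

    single_stick_rate: inferences per second measured with one stick.
    measured_rates: dict of {stick_count: inferences per second}.
    A value of 1.0 means perfectly linear scaling for that stick count.
    """
    if single_stick_rate <= 0:
        raise ValueError("single_stick_rate must be positive")
    if any(sticks <= 0 for sticks in measured_rates):
        raise ValueError("stick counts must be positive")
    return {
        sticks: rate / (sticks * single_stick_rate)
        for sticks, rate in sorted(measured_rates.items())
    }


if __name__ == "__main__":
    # Placeholder rates for 1, 2, and 4 sticks; substitute your own results.
    placeholder_rates = {1: 10.0, 2: 19.5, 4: 38.0}
    for sticks, efficiency in scaling_efficiency(10.0, placeholder_rates).items():
        print(f"{sticks} stick(s): {efficiency:.0%} of linear")
```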
Meanwhile, the on-chip memory has increased from 1 Gb on the Fathom NCS to 4 Gb LPDDR3 on the Movidius NCS, in order to facilitate larger and denser neural networks. And to cap it all off, Movidius has been able to reduce the MSRP to $79 – citing Intel’s “manufacturing and design expertise” – lowering the cost of entry even more.
Like other players in the edge inference market, Movidius is looking to promote and capitalize on the need for low-power but capable inference processors for stand-alone devices. That means targeting use cases where the latency of going to a server would be too great, a high-performance CPU too power hungry, or where privacy is a greater concern. For those cases, the NCS and the underlying Myriad 2 VPU are Intel's primary products for device manufacturers and software developers.
Movidius Neural Compute Stick Products
| | Movidius Neural Compute Stick | Fathom Neural Compute Stick |
| Interface | USB 3.0 Type A | USB 3 |
| On-chip Memory | 4Gb LPDDR3 | 1Gb/512Mb LPDDR3 |
| Deep Learning Framework Support | Caffe, TensorFlow (as of NC SDK v1.09.00) | Caffe, TensorFlow |
| Native Precision Support | FP16 | FP16, 8bit |
| Nominal Power Envelope | 1W | 1W |
| SoC | Myriad 2 VPU | Myriad 2 VPU (MA2450) |
As for the older Fathom NCS, the company notes that the Fathom NCS was only ever released in a private beta (which was free of charge). So the Movidius NCS is the de facto production version. For customers who did grab a Fathom NCS, Movidius says that Fathom developers will be able to retain their current hardware and software builds, but the company will be encouraging developers to switch over to the production-ready Movidius NCS.
Stepping back, it’s clear that the Movidius NCS offers stronger and more versatile features beyond the functions described in the original Fathom announcement. As it stands, the Movidius NCS offers native FP16 precision, with over 10 inferences per second at FP16 precision on GoogLeNet in single-inference mode, putting it in the same range as the 15 nominal inferences per second of the Fathom. While the Fathom NCS was backwards compatible with USB 1.1 and USB 2, it was noted that the decreased bandwidth reduced performance; presumably, this applies to the Movidius NCS as well.
SoC-wise, while the older Fathom NCS had a Myriad 2 MA2450 variant, a specific Myriad 2 model was not described for the Movidius NCS. A pre-acquisition 2016 VPU product brief outlines 4 Myriad 2 family SoCs to be built on a 28nm HPC process, with the MA2450 supporting 4Gb LPDDR3 while the MA2455 supports 4Gb LPDDR3 and secure boot. Intel’s own Myriad 2 VPU Fact Sheet confirms the 28nm HPC process, implying that the VPU remains fabbed with TSMC. Given that the 2014 Myriad 2 platform specified a TSMC 28nm HPM process, as well as a smaller 5mm x 5mm package configuration, it’s possible that a different, more refined 28nm VPU powers the Movidius NCS. In any case, it was mentioned that the 1W power envelope applies to the Myriad 2 VPU, and that in certain complex cases, the NCS may operate within a 2.5W power envelope.
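As a back-of-the-envelope illustration of what those two power envelopes mean for efficiency, the short Python sketch below divides the company's quoted peak throughput by each figure. It treats "over 100 GFLOPs" as exactly 100, and it assumes that peak throughput can be paired with the 2.5W envelope, which the company has not stated. The results are therefore rough bounds rather than measured efficiency.

```python
# Back-of-the-envelope sketch: peak GFLOPS per watt under each quoted envelope.
# Assumption: peak throughput is exactly 100 GFLOPS and is reachable in both
# envelopes; neither is confirmed by the company.

def gflops_per_watt(gflops, watts):
    """Return throughput per watt; watts must be positive."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return gflops / watts


PEAK_GFLOPS = 100.0  # "over 100 GFLOPs", rounded down for simplicity

ENVELOPES_W = {
    "nominal Myriad 2 VPU envelope": 1.0,
    "complex-case NCS envelope": 2.5,
}

if __name__ == "__main__":
    for label, watts in ENVELOPES_W.items():
        print(f"{label}: {gflops_per_watt(PEAK_GFLOPS, watts):.0f} GFLOPS/W")
```

Under these assumptions the figures come out to 100 GFLOPS/W at 1W and 40 GFLOPS/W at 2.5W. As with any peak-throughput calculation, real workloads will land somewhere below these numbers.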
Ecosystem Transition: From Google’s Project Tango to Movidius, an Intel Company
Close followers of Movidius and the Myriad SoC family may recall Movidius’ previous close ties with Google: the two announced a partnership around the Myriad 1 in 2014, culminating in the Myriad 1’s appearance in Project Tango. Further agreements in January 2016 saw Google sourcing Myriad processors and Movidius’ entire software development environment in return for Google contributions to Movidius’ neural network technology roadmap. In the same vein, the original Fathom NCS also supported Google’s TensorFlow, in contrast to the Movidius NCS, which is only launching with Caffe support.
Update (10/11/17): Movidius has announced and released a new version of their Neural Compute Software Development Kit (v1.09.00) that brings TensorFlow 1.3 support to the NCS.
As an Intel subsidiary, Movidius has unsurprisingly shifted into Intel’s greater deep learning and AI ecosystem. On that matter, Intel’s acquisition announcement explicitly linked Movidius with Intel RealSense (which also found its way into Project Tango) and computer vision endeavors, though explicit Movidius integration with RealSense has yet to be seen or, if in the works, made public. In the official Movidius NCS news brief, Intel does describe Movidius fitting into Intel’s portfolio as an inference device, while training and optimizing neural networks falls to the Nervana cloud and Intel's new Xeon Scalable processors respectively. To be clear, this doesn’t preclude Movidius NCS compatibility with other devices, and to that effect Mr. Brown commented: “If the network has been described in Caffe with the supported layer types, then we expect compatibility, but we also want to make clear that NCS is agnostic to how and where the network was trained.”
On a more concrete note, Movidius has a working demonstration of an end-to-end Xeon/Nervana/Caffe/NCS workflow, in which a Xeon-based training scheme generates a Caffe network optimized by Nervana’s Intel Caffe format, which is then deployed via the NCS. Movidius plans to debut this demo at the Computer Vision and Pattern Recognition (CVPR) conference in Honolulu, Hawaii later this week. In general, Movidius and Intel promise to have plenty to talk about in the future, with Mr. Brown commenting: “We will have more to share about technical integrations later on, but we are actively pursuing the best end-to-end experience for training through to deployment of deep neural networks.”
Upcoming News and NCS Demos at CVPR
Alongside the Xeon/Caffe/Nervana/NCS workflow demo, Movidius has a slew of other things to showcase at CVPR 2017. Interestingly, Intel has described their presentations and demos as two separate Movidius and RealSense affairs, implying that the aforementioned Movidius/RealSense unification is still in the works.
For Movidius, Intel describes three demonstrations: “SDK Tools in Action,” “Multi-Stick Neural Network Scaling,” and “Multi-Stage Multi-Task Convolutional Neural Network (MTCNN).” The first revolves around the Movidius Neural Compute SDK and the platform API. The multi-stick demo showcases four Movidius NCS units accelerating object recognition. Finally, the third demo showcases Movidius NCS support for MTCNN, “a complex multi-stage neural network for facial recognition.” Meanwhile, Intel is introducing the RealSense D400 series, a depth-sensing camera family.
The multi-stick demo is presumably the one that the company said has been validated on three different host platforms: desktop CPU, laptop CPU, and a low-end SoC. The company also has a separate acceleration demo, where the Movidius NCS accelerates a Euclid developer module and offloads the CPUs, “freeing up the CPU for other tasks such as route planning or running application-level tasks.” The result is around double the framerate and a two-thirds power reduction.
All in all, Intel outright states that they consider the Movidius NCS to be a means towards democratizing deep learning application development. As recently as this week, we’ve seen a similar approach as Intel’s recent 15.46 integrated graphics driver brought support for CV and AI workload acceleration on Intel integrated GPUs, tying in with Intel’s open source Compute Library for Deep Neural Networks (clDNN) and associated Computer Vision SDK and Deep Learning Deployment Toolkits. On a wider scale, Intel has already publicly positioned itself for deep learning in edge devices by way of their ubiquitous iGPUs, and Intel’s ambitions are highlighted by its recent history of machine learning and autonomous automotive oriented acquisitions: MobilEye, Movidius, Nervana, Yogitech, and Saffron.
As Intel pushes forward with machine learning development by way of edge devices, it will be very interesting to see how their burgeoning ecosystem coalesces. Like the original Fathom, the Movidius NCS is aimed at lowering the barriers to entry and, as the Fathom launch video supposes, at enabling a future where drones, surveillance cameras, robots, and any other device can be made smart by “adding a visual cortex” in the form of the NCS.
With that said, however, technology is only half the challenge for Intel. Neural network inference at the edge is a popular subject for a number of tech companies, all of whom are jockeying for the lead position in what they consider a rapidly growing market. So while Intel has a strong hand with their technology, success here will mean that they need to be able to break into this new market in a convincing way, which is something they've struggled with in past SoC/mobile efforts. The fact that they already have a product stack via acquisitions may very well be the key factor here, since being late to the market has frequently been Intel's Achilles' heel in the past.
Wrapping things up, the Movidius NCS is now available for purchase for an MSRP of $79 through select distributors, as well as at CVPR.
CajunArson - Thursday, July 20, 2017
Give him the stick.
DON'T give him the stick!
Jhlot - Thursday, July 20, 2017
As a lay person someone explain how a 1W maybe 2.5W tiny USB device contributes anything to AI acceleration. This is slower than using a similarly priced discrete GPU right? If it is similar in speed then Intel is dumb, they should have a PCIe card with a whole bunch of these to make an actual useful product.
bcronce - Thursday, July 20, 2017
"As a lay person someone explain how a 1W maybe 2.5W tiny USB device contributes anything to AI acceleration"
This 1watt $80 stick is about 33% of the performance of a $5500 GPU and consumes 1/250th the power. Slight apples to oranges, but ballpark close.
FriendlyUser - Thursday, July 20, 2017
Given the extreme efficiency of modern GPUs and the considerable know-how of nVidia, which has been building specific libraries for years, I have a hard time believing that this tiny stick can really be that competitive. Probably in a very, very specific scenario involving very, very specific benchmarks. But general AI use? Hard to believe.
saratoga4 - Friday, July 21, 2017
These devices are a low power mobile processor and a big vector multiply unit with some SRAM cache. They use less power than a GPU because they lack 99.9% of the hardware in a GPU. Aside from matrix multiplication operations, they have the processing power of a low end smartphone.
Yojimbo - Thursday, July 20, 2017
At 100 GFLOPS of FP16, how do you figure it has 1/3 the performance of a $5500 GPU? It's a useless comparison, anyway. A discrete GPU has a much different intended workload than this thing.
It makes more sense to compare this to NVIDIA's Jetson line, even though the form factor and use case is still different. The Jetson is meant for others to embed into their devices or to be used as a development platform. The Jetson can be used with a lot more than just Caffe, and can handle a lot more tasks than just accelerating CNNs. It can handle full CUDA and graphics workloads. It comes with wifi, video encode/decode blocks, and I think an ISP. It can't just be plugged into a USB port, though. But it's a much better comparison than to a discrete GPU because it is a device for computing at the edge. The Tegra X2 in the Jetson TX2 module has 874 GFLOPS of FP16 at 7.5 W. That's 8.7 times the performance at 7.5 times the power draw. The Jetson TX2 has more and faster memory, but costs 5 times as much.
NVIDIA is open sourcing a deep learning accelerator though, and I wouldn't be surprised if someone came out with a product using it that is meant to compete with this Movidius stick.
ddriver - Thursday, July 20, 2017
So tx2 has 8.7x the perf at 7.5x the power draw at 8x the memory at 5x the price. Gee, intel offerings are really slipping down in value.
Yojimbo - Thursday, July 20, 2017
That 7.5 W for the TX2 is for the module, I believe, not the SoC. Maybe it's better to compare the 7.5 W with the 2.5 W number of the stick. But, the two products are probably not really competitors to each other.
I only made the comparison because someone tried to make a comparison to some unnamed "$5500 GPU". I figure that must be the P40, with its 250 W TDP. But again, this Intel stick has a 2.5 W power envelope, not a 1 W power envelope. So it would only be 1/100th the power draw, not 1/250th like he claimed. The P40 has 10000 GFLOPS, whereas this stick has 100 GFLOPS, so I have no idea where he got the 33% performance number. The stick has 1/100th the performance at 1/100th the power draw of the P40 using peak throughput and power envelopes. But really, we need actual benchmarks to make such comparisons, not peak throughput and TDP numbers. The P40 is designed to inference with a batch size in the hundreds, however, and will only be efficient when doing so. It has 24 GB of memory. It's a silly comparison.
Yojimbo - Thursday, July 20, 2017
It is possible, however, that the SoC on this Movidius stick is capable of inferencing a non-batched workload faster than the Tegra X2. GPUs need batching in order to take advantage of their parallelism and perform well with inferencing. The Tegra X2, with only 256 CUDA cores, needs a much smaller batch size than a discrete GPU, though.
ddriver - Thursday, July 20, 2017
*pulls arbitrary numbers out of ass*
Intel is scattering to generate as much revenue as possible via cheap silly projects like hypetane cache and compute sticks. Probably to avoid disappointing shareholders on the next quarter results.