At Movial, we are often asked to help customers achieve their dream user interface and multimedia experience with some selected hardware and/or software platform. Usually we are happy to provide our expertise, but all too often those of us at the coalface issue a collective groan when we are told what is expected of us.
The task often proves difficult due to mismatches between the hardware’s capabilities or design, the software toolkit, and the expectations of what can be achieved with them. This article will explain some of the difficulties we encounter and how customers and managers – that’s you – can help to maximise the likelihood of success.
The first and most important thing to remember is that embedded or mobile devices do not have the same capabilities as the PCs that you may be used to on your desk or at home.
Typical PCs have enormously powerful CPUs and graphics processors, linked by high-bandwidth buses, and can therefore get away with a lot of brute-force-and-ignorance in the software department. This ability comes at the cost of extremely high power consumption, which can easily reach into the hundreds of watts.
But this is unacceptable in the mobile domain, where a pocket-sized battery is often required to last for days. Hardware manufacturers therefore work hard to produce highly efficient processors inside a one-watt power envelope. The latest OMAP4 and Tegra 2 chips are roughly as fast as a good netbook. Obviously, older chips – which may be cheaper – will have even lower performance.
This all means that for a good user experience, the available hardware must be used to maximum efficiency, and special features of the hardware must match up to what is required. When this is not the case, your device will be slow.
Perhaps the most obvious problem is the use of high-level interpreted scripting languages, such as JavaScript, for core functionality. These languages simply cannot use the ARM CPU efficiently. Even Java is usually preferable, since many ARM CPUs have special support for Java bytecode, and modern JVMs are often well optimised for the architecture. While often harder to use, a compiled language will usually be a lot faster than an interpreted one, all other things being equal.
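To make the difference concrete, consider the kind of per-pixel inner loop that UI and media code spends most of its time in – an illustrative sketch, not taken from any particular product:

```c
#include <stddef.h>
#include <stdint.h>

/* Scale every byte of an image by factor/256 (e.g. to fade it towards
 * black).  A C compiler reduces the loop body to a few ARM instructions
 * per pixel, and can often vectorise it with NEON; an interpreter
 * dispatches many operations per pixel for the same logic. */
void fade(uint8_t *pixels, size_t count, uint8_t factor)
{
    for (size_t i = 0; i < count; i++)
        pixels[i] = (uint8_t)((pixels[i] * factor) >> 8);
}
```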
Most software toolkits, such as Adobe Flash, Qt, GStreamer, and X11, offer a very rich array of capabilities to applications. They practically guarantee that if you ask them to do something, they will do it. But what they do not offer is any indication of whether your command will be carried out quickly or smoothly. What’s worse, most toolkits provide no way for you to determine in advance what can and cannot be done efficiently – a capability known as introspection.
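In practice, about the only recourse is empirical: time a suspect operation and compare the per-frame cost against your budget. A minimal sketch, assuming a hypothetical caller-supplied render_once() callback that performs – and finishes, e.g. via glFinish() or XSync() – one instance of the operation:

```c
#include <time.h>

/* Crude fallback detector: average the wall-clock cost of an operation
 * over several repetitions.  If a "hardware-accelerated" blend takes
 * milliseconds rather than microseconds, you are probably watching a
 * software fallback at work. */
double ms_per_frame(void (*render_once)(void), int reps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        render_once();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3
              + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    return ms / reps;
}
```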
If the toolkit doesn’t know how to make the hardware do something efficiently, it will do it in an inefficient way – and without telling the application that it is doing so. Usually this means retrieving all the necessary image data from the GPU (an extremely slow operation in itself), doing the job using generic routines on the CPU, and then pushing the completed image back to the GPU. Sometimes the software fallback can be run directly on the graphics memory, but since this is an uncached area, it will still be much slower than expected – the CPU cannot use its extensive latency-hiding techniques to optimise loads from uncached memory.
This problem is mostly hidden on desktop hardware – not only are the drivers for desktop GPUs very well-featured, but the connections between modern CPUs and GPUs are very fast, which allows software fallbacks to run relatively efficiently. These advantages are not available on typical ARM hardware.
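The cost of uncached memory is easy to demonstrate for yourself. The sketch below, for Linux, compares read bandwidth from a framebuffer mapping (uncached on most embedded boards) against ordinary heap memory; on typical ARM hardware the gap is dramatic:

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define LEN (1024 * 1024)  /* assumes the framebuffer is at least 1 MB */

static double read_mb_per_s(const uint8_t *buf)
{
    struct timespec t0, t1;
    volatile uint32_t sum = 0;  /* stop the compiler eliding the loop */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < LEN; i += 4)
        sum += *(const uint32_t *)(buf + i);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (LEN / (1024.0 * 1024.0)) / s;
}

int main(void)
{
    int fd = open("/dev/fb0", O_RDONLY);
    if (fd < 0) { perror("open /dev/fb0"); return 1; }
    const uint8_t *fb = mmap(NULL, LEN, PROT_READ, MAP_SHARED, fd, 0);
    if (fb == MAP_FAILED) { perror("mmap"); return 1; }
    uint8_t *heap = malloc(LEN);
    if (!heap) return 1;
    memset(heap, 0, LEN);  /* fault the pages in */
    printf("uncached framebuffer: %8.1f MB/s\n", read_mb_per_s(fb));
    printf("cached heap memory:   %8.1f MB/s\n", read_mb_per_s(heap));
    return 0;
}
```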
As a concrete example, we have several times been asked to investigate why some simple blitting hardware was not accelerating alpha-blending properly, when it was being used successfully for fills and copies. We usually found that the hardware could only accelerate non-premultiplied alpha-blending, whereas the graphics framework (e.g. XRender or Qt) required premultiplied alpha-blending. The workarounds varied from coaxing some more capable part of the hardware into life, to completely replacing the hardware platform with a more capable one.
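For reference, the two conventions differ only in where the multiplication by alpha happens. With straight (non-premultiplied) alpha the blend is dst = src×a + dst×(1−a); with premultiplied alpha the colour channels are stored already multiplied by a, so the blend simplifies to dst = src + dst×(1−a). Converting a straight-alpha channel into the premultiplied form that XRender and Qt expect is simple per-channel arithmetic, sketched here with the usual exact divide-by-255 trick:

```c
#include <stdint.h>

/* Premultiply one 8-bit colour channel by an 8-bit alpha, i.e. compute
 * round(channel * alpha / 255) without an actual division. */
static inline uint8_t premultiply(uint8_t channel, uint8_t alpha)
{
    uint16_t t = (uint16_t)channel * alpha + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}
```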
We have also occasionally discovered that features mentioned in the hardware documentation simply did not work. This not only reduces the capability of the hardware, it also completely throws off our effort estimates as we must scrape around for alternatives and workarounds which do work correctly.
Another typical problem area is integrating video decode acceleration into a rich graphics framework. Video is one of the most demanding tasks asked of a typical mobile device: many videos are now at 720p or 1080p resolution – at 30 frames per second, roughly 28 to 62 megapixels per second – and often require rescaling (without blockiness!) to fit the device’s screen. In a one-watt power envelope, this level of capability requires dedicated hardware acceleration.
Unfortunately, the video decoder’s output buffers usually come in a variety of formats which are often not directly interpretable by the main graphics APIs, and the buffers (being uncached) cannot be read efficiently by the CPU either. After a while, writing yet another convert-YUV-to-RGB routine to run on uncached source memory, and watching it eat up nearly all of the CPU because of the memory-access inefficiency, gets a bit tiring. Mind you, copying RGB data from those same uncached buffers would be even more taxing, because RGB data is larger than 4:2:0 YUV data – typically four bytes per pixel rather than 1.5.
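For the curious, here is roughly what “yet another” such routine looks like – a sketch assuming planar 4:2:0 input, full-range BT.601 coefficients, and 32-bit XRGB output. Every read below lands in uncached memory; on some systems it helps to memcpy() each source row into a small cached staging buffer first, since sequential burst reads are the least painful way to touch uncached memory:

```c
#include <stdint.h>

static inline uint8_t clamp8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Planar 4:2:0 YUV to 32-bit XRGB.  One Y byte per pixel; U and V are at
 * quarter resolution (half in each dimension).  Fixed-point BT.601
 * coefficients with 8 fractional bits; full-range YUV assumed. */
void yuv420_to_xrgb(const uint8_t *y_plane, const uint8_t *u_plane,
                    const uint8_t *v_plane, uint32_t *dst,
                    int width, int height)
{
    for (int j = 0; j < height; j++) {
        const uint8_t *yrow = y_plane + j * width;
        const uint8_t *urow = u_plane + (j / 2) * (width / 2);
        const uint8_t *vrow = v_plane + (j / 2) * (width / 2);
        for (int i = 0; i < width; i++) {
            int y = yrow[i];
            int u = urow[i / 2] - 128;
            int v = vrow[i / 2] - 128;
            int r = y + ((359 * v) >> 8);            /* 1.402 * V */
            int g = y - ((88 * u + 183 * v) >> 8);   /* 0.344*U + 0.714*V */
            int b = y + ((454 * u) >> 8);            /* 1.772 * U */
            dst[j * width + i] = (uint32_t)clamp8(r) << 16
                               | (uint32_t)clamp8(g) << 8
                               |  clamp8(b);
        }
    }
}
```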
Another potentially showstopping pitfall is where the display hardware can read the video buffers, but only to display them directly on the framebuffer or on some overlay. This is acceptable for many simple applications, but Adobe Flash requires that video frames be sent through the whole graphics pipeline, so retrieving them from the framebuffer or an overlay is not an option. If your idea of “the full desktop/Web experience” includes Flash – or even just YouTube – you will need hardware designed to accommodate it, which in turn requires extremely tight integration of the video and graphics accelerators.
The above paragraph certainly goes a long way towards explaining Steve Jobs’ attitude towards Flash on the iPhone and iPad.
In a separate article to follow, I will outline how to set up your project for success, by avoiding the above traps.