Improving app performance with benchmarking (Google I/O’19)

[MUSIC PLAYING] CHRIS CRAIK: Hey, everybody. We're here to talk to you
about benchmarking today. I'm Chris. DUSTIN LAM: And I'm Dustin. And we're engineers who work
on Android Toolkit stuff. So first, I'd like to
start off with a note. We're not here to talk
about benchmarking devices. I know a lot of people out there
are probably really excited to compare phone A to
phone B, but that's not what this talk is about. This talk is about benchmarking
Android app code, specifically, Kotlin and Java-based app code,
and a little bit of NDK code, too.

So what is benchmarking? There's probably a couple
of definitions out there. But for all intents and
purposes of this talk, we're talking about
measuring code performance. Right? We want to know how
much time it takes to run some chunk of code. And we want to
measure this in a way where we can see our
improvements reflected in those measurements. And of course, unlike
profiling, we're not running the full
application in here. We're looking for a tool
for rapid feedback loops. We want to iterate quickly. So you publish an app. Very exciting, right? We're going to be
rich and famous. And then your users
start writing reviews. The first is five out of five stars. We're doing great,
hearts in the eyes. Was this review helpful? Yes, for my self-esteem. Five out of five again. Awesome. Little hesitation
space, exclamation mark. We're doing great. And then the inevitable
one star review, "laggg". Oh. You're so upset you can't
even spell it right. Well, we've been there too. And we'd like to share
what we've learned.

So let's go on a benchmarking
adventure, very exciting. We'll start off with a test. We'll call it, myFirstBenchmark. And here's an
object that we might be interested in benchmarking. If you haven't seen this before,
this is the Jetpack WorkManager library. And it's basically an
abstraction over async tasks. And here's a synchronous test
API we can use to measure. So what's the first
thing we might do? Well, we could get
the system time, and then get the system time
after we've done the work, subtract that out, output
that as a result, and great, we have our first
benchmark, right? We're done. Well, not quite. We'll find that as we run
this over and over again, we'll get vastly
varying results.
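Roughly, that naive first attempt looks like this sketch (doWork() is just a stand-in for the WorkManager-backed work on the slide):

```kotlin
// A naive first benchmark: time a single run with System.nanoTime()
@Test
fun myFirstBenchmark() {
    val start = System.nanoTime()
    doWork() // hypothetical: enqueue the WorkManager request and await its result
    val durationNs = System.nanoTime() - start
    Log.d("MyFirstBenchmark", "took $durationNs ns") // varies wildly run to run
}
```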

And that's no good
for measurements. But what if we ran it in a loop? And maybe we'll run
this loop five times, and then we'll
report the average. So great, problem solved, right? There's still a few issues here. Well, first, five is kind
of a random magic number. And ideally we'd like
to dynamically compute this number. We want to report
the result when we're ready and confident
that this number won't change too much. We're also measuring
immediately. And the issue here is
that the first couple runs of some piece
of code might not be indicative of what the
real user will experience. And this can be due
to code paging into memory or just-in-time compilation optimizations. And we're also
including outliers. So everything on device
is a shared resource. And this can cause
complications, background
interference that might cause certain runs to
be much, much slower than they should be. And if we include those
results, our measurements will be completely off. And make sure we're not running on an emulator, because while it's very tempting to run tests on an emulator, emulators don't emulate real-world performance.

So the lesson here is that
benchmarking is tricky. It's often inaccurate. We're measuring the wrong
things at the wrong time, right? And it's unstable. There's tons of variance. How many loops should we run? And it's hard to
set up correctly. That's the worst of all, because benchmarking is a tool to help us focus developer effort. So it's hard to say how much time we're really saving here. So I'd like to introduce the
Jetpack Benchmark library. It's out now, and
it's in alpha one. This is a previously internal tool we've been working on for years, and we've been working hard to get it available to the public now.

So we hope you
guys will enjoy it. First, I'm going to go over a
little bit about what it looks like and how you
might use it in your code, and then we'll talk
about the internals and how we solve a
lot of these issues. So Jetpack Benchmark–
it's a tool for measuring code
performance, of course. Thank you.

We like to prevent common
measuring mistakes. There's a lot of common pitfalls
that we've seen in the past, and we want to pass
those lessons on to you. And best of all, it's
integrated with Android Studio. Who doesn't love Android
Studio integration, right? Let's jump back to
our previous example. Here's our previous benchmark. It's verbose, and
it's full of errors. But this is really
only the code that we care about– the highlighted
code here, about three lines. So let's fix that. First, we apply the BenchmarkRule, and then all of this code becomes one simple API call, measureRepeated, and we can focus on the code that we really care about.
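With the library, the same benchmark shrinks to something like this sketch (package names can differ by library version; doWork() is still the hypothetical work from before):

```kotlin
import androidx.benchmark.junit4.BenchmarkRule
import androidx.benchmark.junit4.measureRepeated
import org.junit.Rule
import org.junit.Test

class MyFirstBenchmark {
    @get:Rule
    val benchmarkRule = BenchmarkRule()

    @Test
    fun myFirstBenchmark() {
        benchmarkRule.measureRepeated {
            // only the code we care about goes here; warmup, loop counts,
            // outlier handling, and reporting are handled by the library
            doWork() // hypothetical WorkManager-backed work from the earlier slide
        }
    }
}
```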
Let's check out another example: a database benchmark. So first, let's initialize
our database with Room. And if you haven't
seen Room before, it's another Jetpack library. It's just an abstraction
over databases. All that matters
really here is we're initializing some
object that allows us to make database queries.

So first let's
clear all the tables and install some test data,
create our measuring loop, and then we can measure the
code that we care about. In this case, some
kind of complex query, not sure what it is. Doesn't really matter. But there's an issue here. Depending on your
database implementation, your query could
be cached, right? If we know that the query
results won't be changed, shouldn't we just
save that result and use that in
the future instead of having to run the complex query all over again? So what we're going to have to do is take this
clearAndInsertTestData method and do it inside
the loop, right? Every single time,
bust the cache, so we can measure what
we actually care about. But there's another
problem here. We're now measuring more
than we actually care. So what are we going to do? Well, we could do
the previous thing where we take the system
time before and after, maybe save that to some variable,
output the result, then subtract that
from our final result.

And then we'll have
our actual result that we care about, right? Well, this is,
again, very verbose. And fortunately, this
is a common use case that we run into too. So we created an API for this as well: the runWithTimingDisabled block.
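Here's a sketch of that pattern; db, clearAndInsertTestData(), and complexQuery() are stand-ins for whatever Room database and DAO you're benchmarking, inside a test class like the one above:

```kotlin
@Test
fun complexQueryBenchmark() {
    benchmarkRule.measureRepeated {
        runWithTimingDisabled {
            // bust any caches so every iteration measures real query work
            clearAndInsertTestData(db)
        }
        db.userDao().complexQuery() // the code we actually care about
    }
}
```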
So I know a lot of Java developers out there are probably wondering, well, what about us? Well, of course, we
have Java APIs as well. And it's slightly different. So I'd like to go through
this line by line. A lot of code has changed here. So we have to create this BenchmarkState variable, and this is how the library is going to communicate with your code when it's done.

We create a while loop and call state.keepRunning. And inside that block, we can run all the code that we want to measure. If we ever need to do any setup or initialization, we can just call the pauseTiming method, do our initialization, and then resume when we're ready to measure again. So that's great.
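A minimal sketch of that Java pattern, reusing the earlier database example (package names may vary by library version; the setup helper and DAO are stand-ins):

```java
@Rule
public BenchmarkRule benchmarkRule = new BenchmarkRule();

@Test
public void complexQueryBenchmark() {
    final BenchmarkState state = benchmarkRule.getState();
    while (state.keepRunning()) {
        state.pauseTiming();
        clearAndInsertTestData(db); // setup we don't want to measure
        state.resumeTiming();

        db.userDao().complexQuery(); // the code we actually care about
    }
}
```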
I've got one more example for you guys, and it's a UI example. A very compelling case for benchmarking might be UI. And we've designed this library
to integrate on top of existing test infrastructure. And that's because we
want you to be able to use all your favorite tools. For example, the
ActivityTestRule, which is an abstraction
that'll help you with activity lifecycle
and all the setup you need in order
to make a UI test. So we simply mark this
test as @UiThreadTest to run it on the main thread. And then we get a reference to a RecyclerView. Then, as before, we just
create a measureRepeated loop and measure the code
that we care about.

In this case, we're scrolling by the height of one item.
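Put together, a sketch of that UI benchmark might look like this, inside a test class like the earlier ones (MyActivity and R.id.list are placeholders for your own activity and list):

```kotlin
@get:Rule
val benchmarkRule = BenchmarkRule()

@get:Rule
val activityRule = ActivityTestRule(MyActivity::class.java)

@UiThreadTest
@Test
fun scrollOneItem() {
    val recyclerView = activityRule.activity.findViewById<RecyclerView>(R.id.list)
    benchmarkRule.measureRepeated {
        // scroll by the height of one item, on the main thread
        recyclerView.scrollBy(0, recyclerView.getChildAt(0).height)
    }
}
```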
Let's talk about Studio integration. With Studio 3.5, we're releasing
a new benchmark Module Template to help you get up and
running with your benchmarks. So let's walk through what
adding a benchmark module looks like. So here's a typical
Android project. We've got an app module
and a library module, because we love modularization
and we're great developers. We right click the
project, click New, Module. And we'll get this Create
New Module wizard that you've probably seen before. And if you scroll all
the way to the bottom, you'll see this new
Benchmark Module icon. Click Next, and it gives you a little configuration dialog you can use to
choose your module name, change the package name.

And we've got templates in both
Kotlin and Java-based code. So click Finish, come
back to the project. Now we have our
benchmark module. Let's take a look inside. We've got a benchmark here
that you can just run to get up and running. Similar to before: a BenchmarkRule, a measureRepeated loop, and
it works out of the box. You can just run
it, and your results will appear directly
in Studio in case you're developing locally and
you want to iterate quickly.

We've also got JSON
and XML output data. We'll pull it from connected instrumented tests on the device onto your host machine. And we've done this
with the intention of being ingested by
continuous integration tools. So if you want to
look for regressions and look for improvements
you've made over time, this is a great way to do that. As with any module, there's
also a build.gradle file. And there's a couple of things
here I'd like to point out. So first, there's a benchmark plugin that we're going to be shipping along with the library. There's a custom runner that extends the AndroidJUnitRunner. And it's also open
to be extended, in case you need to do that.

We've also bundled some pre-built proguard rules that'll work out of the box with
the Jetpack Benchmark library. And of course, the
library itself– can't forget that,
very important. Let's talk about
the Gradle plugin. So that's this line you saw
before, the apply plugin line. It's going to help you pull
your benchmark reports when you run gradlew connectedAndroidTest, or connectedCheck for CI.

And we've also got some gradle
tasks for CPU clock stability in there as well. The test runner, that's
the AndroidBenchmarkRunner. It's kind of an
in-depth topic, so we'll talk more about that later. But suffice to say it's got
a lot of baked in tricks to stabilize your benchmarks. We've also got proguard rules. And that's in this file here. Our template supplies
pre-configured proguard rules. And that's important because
proguard optimizes your code. So you want to make sure
you're running your benchmark in a configuration that
represents real user performance, right? So we want to do this
in a release production environment if possible.

So that's why we bundle
these rules for you. And we realize that
tests generally don't use proguard
or R8, but this is important for benchmarks. And it's probably why we set
it up as a separate module. So we've also included
an Android manifest here that's going to run
with your Android tests. And if you notice here, we've
set debuggable to false. Debuggable is normally enabled by default for tests, which is great because it lets you do things like connect a debugger and use all of your favorite debugging tools.

That's great when you're testing for correctness, but it's not so great for benchmarks. We've seen before that debuggable runs have
been between 0% to 80% slower. And 80% is not the number
that should be worrying you. It's the hyphen. It's the range, the
variability here. It's hard to account for. Let's take a look at an example. Here are some benchmarks
that we have in AOSP that we use in the Jetpack team. And along the x-axis, you'll
see several different types of benchmarks that we do. And along the y-axis is
normalized duration, how long the benchmark took to run.

Blue is with debuggable false
and red is with debuggable on. And we've normalized
each benchmark, benchmark by benchmark,
not across the board. So you'll see here in the
deserialization example that there's hardly
any difference. If you look really closely,
it's like one pixel off. 1% difference, right? But over here in the
inflateSimple benchmark, there's a huge difference here. And again, the hard part
here is the variability. It's really hard to account for. So we want to make sure that
the optimizations and the code change we're making
are actually going to have an impact on our users. We need to make sure
debuggable is off. So that leads me to
benchmark configuration. And here I'd just like to
give some tips that we also bundled with the template. But you should
definitely be setting up your benchmarks this way. So first, as before, we'd like
to turn off debuggability. We also want to make sure
code coverage is false. If you're using
something like JaCoCo, this actually modifies
the dex in order to support what it needs to do. And that's great if you're
trying to get code coverage, but when you're running a
benchmark, that's not so great.
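Pulling those recommendations together, a rough sketch of the benchmark module's Gradle setup might look like this in Kotlin DSL (the actual template is Groovy, and property names can differ by Android Gradle plugin version; the proguard file name is illustrative):

```kotlin
// build.gradle.kts for the benchmark module – a hedged sketch, not the exact template
plugins {
    id("com.android.library")
    id("androidx.benchmark") // the benchmark Gradle plugin mentioned above
}

android {
    // run androidTest (the benchmarks) against a release-like build
    testBuildType = "release"

    buildTypes {
        getByName("release") {
            // coverage tooling (e.g. JaCoCo) rewrites the dex, so keep it off
            isTestCoverageEnabled = false
            // let R8/proguard optimize the code, as it would be in production
            isMinifyEnabled = true
            proguardFiles("benchmark-proguard-rules.pro")
        }
    }
}
// debuggable=false itself comes from the androidTest manifest, as described earlier
```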

Of course, as before, you'd like to enable proguard. And we currently support library modules for alpha one. We're really pushing developers to modularize their apps this year. And I'd like to do
a little shout out. Please check out the How to
Create a Modular Android App Architecture talk. That was on Tuesday,
so you should be able to find it on YouTube. It's by Florina and Yigit. But what if you forget? As a library, we can do a lot,
but we can't do everything. We can't just print
out a device and have you run on that, right? But that would be great. Fortunately, we've got warnings. So we're going to
corrupt your output and make it very visible to you. And hopefully this
is something that you can catch in your continuous
integration tests. So here's an example,
a couple of warnings, debuggable as before.

If you're running
on an emulator, that's also pretty bad. If you're missing the
runner, then you're not using any of our tricks to
help stabilize your benchmarks. And if you're low on battery– now, a surprising thing about being low on battery is that, as you might expect, the device will throttle itself to try and save power.

However, we've found
that on many devices it still does this even
while it's charging. So this is definitely something
you want to watch out for. So that's a bit about
what the library looks like and how you would use it. And now Chris is going to
talk about how it works. CHRIS CRAIK: All right, so
that was a lot of information about what it looks
like to use the library, but there's a lot of
implementation behind it to implement all
of these behaviors that Dustin was talking about. So first of all, let's
talk about CPU clock, specifically
frequency and how it's kind of the enemy of stability.

Because when clocks go up and down massively, they change your results, and make it very
hard to discover regressions and improvements. So from the perspective
of benchmarking, there are really two
problems that CPU clocks introduce to us. And the first is ramping. Clocks generally start
out low when the device isn't doing anything. And then once work
is scheduled, they will ramp slowly
over time in order to get to a
high-performance mode. On the other side
of this, though, when the device gets
hot because it's been running for a long time,
the clocks will dive quickly, and you'll get
terrible performance.

So both of these are problems. Let's talk about
them one at a time. So first, ramping– clocks will
generally increase under load. And what we've seen is that it
takes about 100 milliseconds. So here we have a clip of
a little systrace here. And the only thing
that's important to note is the clock frequency
at the bottom versus when the work started on the top. At the very beginning
of the trace, you see time equals
0 milliseconds. The frequency is 300
megahertz– incredibly slow. A frequency that most of your
app code is never going to see. However, after about
75 milliseconds, you see that we go all the
way up to 2.56 gigahertz. If we were measuring in
between these two times, that would be bad. However, the solution for this
is actually pretty simple. Benchmark just runs warmup
in order to account for this. So we spin the measurement
loop for 250 milliseconds before we ever start measuring. And then we only start measuring
once timing stabilizes. This also has the
nice side effect of handling the
Android runtime's just-in-time compilation. By the time that
your performance numbers are stabilizing,
JIT has stabilized as well.

You're seeing performance
that's corresponding to what your user would see in a
frequently hot code path. So that was ramping, but diving
is a much bigger problem. When the device gets hot,
the clocks dive quickly. And this is called
thermal throttling. Generally, the CPU will
lower its frequency once it gets to
a very high level because it wants to avoid
overheating and damaging the chip. This can happen
unexpectedly, though, and massively affects
performance while we're running our benchmarks. So take a look at this
sample where I'm just taking a pretty simple benchmark. It's just doing a tiny bit
of matrix math in a loop. And at the beginning, I'm
getting a relatively stable performance. Over the first minute or so,
it looks pretty stable at 2 milliseconds, all is well. But then performance becomes terrible. Look at that. Less than two minutes in, we are up to three and a half times the duration that we expect.

However, the device runs at this
low clock for a little while. It cools down. And it's back down. OK. Well, we're good now, right? No. We're still doing the work. So over the course of
this five minute test, we have thermal throttling
taking us up and down and making these results
pretty terrible, right? We can't extract a
whole lot of information from these because we
are dynamically going between 2 and 7 milliseconds. We can't trust
these measurements. So our solution to
throttling, we found, is different per
device because we have different tools
that we can use in different configurations. So the first solution to thermal
throttling is the simplest– lock clocks. What if we could just
set the clock frequency? Well, this is ideal, but unfortunately it requires root, so it's not a great solution for the average person. We can set the clocks to a medium frequency, and this will keep the device from thermal throttling because it handles a medium frequency just fine.

We do provide a gradle
plug-in for you, though, if you do
have a device that is rooted– simply
gradlew lockClocks and your device is locked. But requiring root, though, is
really not a general solution. We don't recommend it. So what else do we have? Well, there is this API added
in Android N, Window.setSustainedPerformanceMode. And this was originally
designed for VR and for games, but it's incredibly
appealing for benchmarks.

This is because it lowers
the max clocks specifically to solve this problem,
prevent throttling. Perfect, right? And it also works on
GPUs as well as CPUs, so it's even useful
if you're wanting to do rendering benchmarking,
say if you're a game. However, it comes along
with a lot of difficulties. It's designed for VR, not
for headless benchmarks. So first of all, it
requires an activity running with a flag set. It also has two
separate modes if it thinks that you're
single-threaded versus multi-threaded.

And it's also only supported
on some N+ devices. So let's talk through
each of these. Maybe we can solve some of this. So the first is
the activity flag. And in order to
solve this problem, we actually just launch
an activity for you around any headless test. We inject an activity that
launches any time that we might need to use the
setSustainedPerformanceMode so that it's up at all times. We set this flag, also, on any
activities that you launch. So if you're doing a UI test,
like Dustin was showing before with the RecyclerView, that gets this property as well. And it works together with ActivityTestRule and ActivityScenario, so you don't have
to worry about adapting to this new model.
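For reference, the underlying call is a one-liner you could also make in an activity of your own, guarded by an SDK check since it's N+ only (a sketch, not something the library requires you to do):

```kotlin
// e.g. in your Activity's onCreate(), after setContentView()
if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.N) {
    window.setSustainedPerformanceMode(true)
}
```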

And in addition, it also calls
it out in the UI of the test while it's running. It pops up there
and says, hey, I'm in setSustainedPerformanceMode. Now, OK, so we've
got a way to solve the problem of needing
the flag, but how do we handle the
two different modes? So first, let's
describe what these are. The setSustainedPerformanceMode
can either operate in a single or
a multi-threaded mode.

Either you have a single-threaded application– so it can probably use one core at max frequency– or you're multi-threaded,
in which case it will set all of the
cores to a lower frequency to prevent throttling. That's fine for
a game, but we're trying to run potentially
different benchmarks with different threading
models, and switching between these modes will
create inconsistency. So what do we do
about this problem? We'd really like to just pick the multi-threaded mode. That's the lower of the two.

That sounds good, but
how do we force that? Well, the way that we do this
is in our Android benchmark runner. When sustained performance
mode is in use, we create a new thread. And this thread spins. And this is a really
strange way to do this, but it turns out that this is
actually a pretty efficient way to force us into a
multi-threaded mode. And it gives us that incredibly
sustained performance that we were looking for. We also set a
thread name for it. So if you see this in a trace,
you understand what's going on. And we do the best we can. We set thread priority to
be the lowest possible so that it interferes with
your test minimally.
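Conceptually, the trick looks something like this sketch (the runner's real implementation details differ; the thread name and cleanup here are illustrative):

```kotlin
// A low-priority thread that just spins, nudging the platform into the
// multi-threaded (lower but sustainable) sustained-performance clocks.
val spinThread = Thread {
    while (!Thread.currentThread().isInterrupted) {
        // spin
    }
}.apply {
    name = "BenchSpinThread"       // illustrative name, so it shows up clearly in traces
    priority = Thread.MIN_PRIORITY // interfere with the benchmark as little as possible
    start()
}
// ...run benchmarks...
// spinThread.interrupt() // stop spinning once the benchmarks are done
```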

But here's what
that looks like in systrace. So on the top there, you
can see clock frequency. And over the time of
the benchmark running, when the test starts, the
activity launches– bam, we're suddenly locked
to half clocks. Perfect. We're not going to thermal
throttle in that configuration. Once the test is done,
the activity finishes, and then the clocks are free to
ramp back up slowly over time. So we talked about
how we solve the issue with having an activity
running with a flag set.

We talked about the
two different modes. But we still have
this last issue. This is only on some N+ devices. So you can check whether this is
supported on a specific device by calling PowerManager.isS
ustainedPerforma nceModeSupported. By the way, this is
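As a sketch, that check looks like this from an instrumented test (the Context lookup is just one way to get a PowerManager):

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager
import androidx.test.platform.app.InstrumentationRegistry

fun sustainedPerformanceSupported(): Boolean {
    val context = InstrumentationRegistry.getInstrumentation().targetContext
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    return Build.VERSION.SDK_INT >= Build.VERSION_CODES.N &&
        pm.isSustainedPerformanceModeSupported
}
```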
By the way, this is supported on anything that is VR certified, so you have that. However, in the Firebase
Test Lab, for example, 11 out of the 17 Nougat+ OS
device combos have support for it. So it's not terribly hard
to find something that you can use like this for CI.

But again, this is still
not a general solution. This requires
platform support that isn't available on every phone. So our final solution to
this problem at the very end here is Thread.sleep. The simplest solution–
many devices aren't rooted. Many devices don't have
setSustainedPerformanceMode, so we use Thread.sleep. So what we do here is
we detect a slowdown in between every benchmark
by running a little tiny mini benchmark to see if the device
has started thermal throttling. If it does, we throw away
the current benchmark data, and we sleep to let
the device cool down.
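The idea, as a hedged sketch (the workload, threshold, and sleep duration here are made up for illustration; the library's real logic differs):

```kotlin
import kotlin.system.measureNanoTime

// Run a tiny fixed workload between benchmarks; if it's much slower than its
// recorded baseline, assume thermal throttling, tell the caller to discard the
// current data, and sleep to let the device cool down.
fun sleptBecauseOfThrottling(baselineNs: Long): Boolean {
    val nowNs = measureNanoTime { tinyFixedWorkload() } // hypothetical mini-benchmark
    if (nowNs > baselineNs * 3 / 2) { // 1.5x threshold is purely illustrative
        Thread.sleep(90_000)          // cool-down duration is purely illustrative
        return true
    }
    return false
}
```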

So we saw in this
previous slide how performance was oscillating
all over the place and we couldn't get stable
numbers out of this. Well, in this particular
case, our device doesn't have root, so we
can't use lock clocks. It can't use
setSustainedPerformanceMode. Not available, OK. Can't use that. So we have to fall
back on Thread.sleep. So let's see how that actually
performs in this graph. That is a lot better. It is just completely flat. And if you see here, we have
standard deviation for the two different approaches. For the default, we've
got 2.25 milliseconds. And for Thread.sleep,
you've got to look closely, 0.02 milliseconds– massively
more stable, much more consistent performance. But there was a
sacrifice with this.

So because we were
sleeping every time that we detected
thermal throttling, it takes longer to run. This one takes
about eight minutes. The original takes
about four and a half. So in summary, our solution
for thermal throttling does have three
different steps, and we use the best solution
for your device that's available for
your particular phone. So clocks are one
thing, but what about background interference? What about other things
running on your device? So first of all, you have to
ask, am I in the foreground? So if you're not,
something else is. And this is important to
think about in the context of performance. Tests generally run,
if they don't have UI, on top of the launcher,
because there's nothing to launch, nothing to display. However, this means that the OS
thinks the launcher, right now, is the important app. When you're running
in the background, say for instance,
behind the launcher, that means you get
a lot of sources of potential performance
interference.

You might have a live
wallpaper rendering. You might have home
screen widgets updating. You could have
the launcher doing hotword detection or
other miscellaneous work that you don't know about. The status bar is probably
repainting every now and then with a notification coming up. The clock's changing,
Wi-Fi, whatever. And starting at Nougat,
it's possible for devices to have a foreground-exclusive
core, a core that is only available to be used
by the foreground application.

You just can't touch that
if you're in the background. So we want to come
to the foreground. And we actually have a solution
for this already, right? We have our old activity. Remember this guy? So this also solves
this particular problem. And that's why we use this in
all benchmark configurations, regardless of whatever
your clocks are. The benchmark
keeps this activity in the foreground at all times,
unless you have your own, in order to guarantee
that you can use all cores with minimum
rendering interference from whatever is going
on underneath you.

But there's another
important source of background interference. And that is contention. So everything on the
device from CPU to disk is a shared resource. And if your benchmark
is being, let's say, kicked off of a CPU
because of some conflict with another task, if you're
accessing disk and system services at the same time
that someone else is, or if somebody has random
background work happening, for instance, the
system doing something, an app background job,
something along those lines. Those can create contention. So for example, if we
were running our benchmark and warmup just finished,
so we're just right about to start
taking measurements. However, we have this
other process here that's about to start doing some work.

Our first few loops are fine,
but then the other process kicks in and starts
running as well. All of a sudden, we see,
oh, well, actually some of these loops are just giving
me flat out bad numbers. Not very useful numbers that
overlap with this background work. Some of these runs are
still totally fine, but some of them aren't.

So that's why we have this idea
of measure twice, report once. And by twice I, of
course, mean many times. We will measure several loops. And understanding
that most contention is for a short
duration, we can ignore those loops that are most
likely to have hit contention. So what we do, in fact, is
that we report and track the minimum number
observed, not the average, because the average is
susceptible to contention.
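As a tiny sketch of why the minimum is the robust summary here:

```kotlin
// Contention only ever makes a loop slower, never faster, so with enough
// samples the minimum reflects the uncontended cost of the code under test.
fun reportNs(loopTimesNs: List<Long>): Long =
    loopTimesNs.minOrNull() ?: error("no measurements")
```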

This way, the number that we
report by the benchmarking library is immune to
tiny little hiccups that happen every now and
then and might otherwise interfere with your numbers. All right, so let's
talk about how to go about using this library. So the first thing that
we want to recommend is– don't benchmark everything. Start with tracing. Start with profiling tools. Maybe you have measurement
going on on real user devices that tell you some
particular part is slow. Well, that's a good
place to start.

Benchmark what you know is slow
so that you can iterate on it and improve it. We generally
recommend benchmarking synchronous blocks, because
these are the easiest to measure and these are the
easiest to improve over time. So if you're measuring something
that's single-threaded, it is much more likely
to be stable and isolated from other tasks. There's no thread hopping. That means you're taking
the scheduler entirely out of the mix. And that way, for instance,
you might measure UI separately from network separately
from database and rendering.

We also generally recommend
fairly small blocks. These are faster to run, and here that means probably
less than 50 milliseconds. However, the loop itself
only has around 6 nanoseconds overhead. And this is running on a fairly
old device at half clocks. So you can measure really
small amounts of work with a benchmarking library. Another important
thing to remember is that the benchmarking library
is primarily for hot code. Because we run all of
your code with warmup and we run it in a
loop, that probably means that the code inside
is going to be JIT-ed.

Now, that's great if it's
something like work that's done by your RecyclerView, but if it's only run once in a while
by your app, it's not as likely to get JIT-ed. So be very careful when
you're benchmarking startup. We generally recommend
to only benchmark the code that is inside
of a loop during startup, so that it is more likely
to be measured correctly. Another thing to
keep in mind is to be aware of caches that might
be anywhere in your code, or even in someone else's code.

So here's a simple
example of a benchmark where we access a file and
check whether it exists. Maybe we observed that this
took a couple of milliseconds during startup. The problem is that the OS is going to notice that nothing has changed and just serve you a cached value every time. The benchmark is going to be very different from the behavior at startup.
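For example, something like this ends up measuring the cache rather than the cold path (context and the file name are placeholders, inside a test class like the earlier ones):

```kotlin
import java.io.File

@Test
fun fileExistsBenchmark() {
    val file = File(context.filesDir, "startup-config.json") // hypothetical file
    benchmarkRule.measureRepeated {
        // after the first iteration the OS has this lookup cached, so later
        // loops no longer reflect the cold cost observed at startup
        file.exists()
    }
}
```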
So be aware of this. You can sometimes take measures to create things differently, for example, like Dustin showed at the beginning with the database. Another thing to consider
is potentially avoid overparameterizing. So correctness tests, it's
really easy to, say, sweep over five different
variables, recognize that, OK, over all 4,000 of these
tests, none of them failed. Great. You get a pass. That is a much harder
task for benchmarking, because we're getting something
that isn't a simple pass/fail. More data can be more
difficult to deal with. We recommend to start targeting
benchmarks at real world parameters instead of maybe
going as enthusiastically towards parameterization as
you might during a unit test. One thing I really want
to re-emphasize, though, that Dustin mentioned
before, is, please, do not compare devices.

This library is not designed
to compare the performance of one device versus another. We're really focused
on comparing code, whether that's framework
code, whether that's app code, on the same device,
same operating system version. The library optimizes
for stability. There will be some factor of difference, which can vary from device to device, between what you see in a benchmark and what you see in reality. And that's because we do not
account for real-world clocks. For example, if the user
touches down on their device, starts scrolling, generally
the clocks will ramp.

If a user uses their
device for a while and it gets hot and thermal
throttles, it performs poorly, the clocks go down. None of that is accounted for
in the benchmarking library. And that's why we don't
recommend comparing devices with our library. So you might be
wondering, OK, how do I go about integrating this into my CI? So there are different
tiers of adoption that you can start with to
look at exactly how deeply you want to get into benchmarking.

The first is quick and local. It's very reasonable to do
a trace, write a benchmark, and measure the
performance, make a change, check in that change
without having to monitor the performance over
time and detect regressions. And in general, this is because,
unlike correctness tests, benchmarks usually get better
because you're deleting code. Now, that said, if
you can monitor them over time, potentially manually,
that's a totally reasonable way to go. And that's even better. But we want to recognize
that there's still value in manual
monitoring of benchmarks, even if you don't have
automatic regression detection. Because automatic regression
detection is a complex problem. However, if you do want to go
all the way to that point– and we recommend it– it is something to
recognize that it's not as simple as detecting, when
does my benchmark go down by, say, 1%? Because there's going to be all
sorts of times where it flakes a little bit, it goes
down a little bit, or you make a trade-off for one benchmark versus another, or you check in a feature
that just absolutely needs to make this one code
path slightly slower.

So if you're prepared to receive
emails for all those things, by all means, just
something to keep in mind. And so let's go through
a few closing notes. We do have, as Dustin
mentioned earlier, an NDK sample that shows how to
use some of our infrastructure together with C++ code. That's available on our
android-performance GitHub repository. And it actually wraps the
existing Google benchmark C++ library to get numbers directly
from infrastructure that you might already be using
for C++ benchmarking. It captures the results together
with Java and Kotlin benchmarks in our final output. And it applies all of our
Android-specific tricks regarding stabilizing clocks. So Dustin mentioned that this is
actually a fairly old library. So why haven't you
heard of it before? Well, this library
has been around since about 2016, used inside
of the Android platform. And we use it all
over the place. We use it for text, for
graphics, for views, for resources, for
SQLite, for optimizing all of these different components. But for a long
time it's been very difficult to use externally
because we kind of needed you to have a rooted
device to run this on.

That's easy for a
platform developer, much harder for
an app developer. But more recently,
we have overhauled it for non-rooted devices. We've switched over to
Kotlin, which gave us some really nice benefits in
terms of function inlining for that measurement
loop and allowed us to minimize overhead. So now libraries such as
Room, Navigation, and Slices all use our benchmarking
library to make improvements. So here's an example
from the platform from a couple of years ago. I think this was in
O when we noticed that, hey, toggling the
visibility of a relatively large tree of views
was kind of expensive.

So this was our CI
over the process of checking in a few changes. So at the beginning,
modifying these views took 350 microseconds. And we checked in
an optimization for how outlines were changed
when views were stored. And so it went
down a little bit. And we actually realized,
OK, most of the work here is actually
in View.invalidate. It's doing a lot of
work that it doesn't need to anymore in a modern
hardware-accelerated pipeline. So we checked in a
complete overhaul of that. And of course, we
quickly reverted it because it broke the world. But we did learn
a lot of things that we needed to
change along that path. And after we were able
to check that back in, we were able to get a
total improvement of 50%, taking that view
toggling from 0.35 milliseconds per, say, 64 views or
whatever, down to half of that, which is
a huge improvement. And we did this all before
we had automatic regression testing. So in summary here, we've seen
through all of these tricks that we've talked
about, benchmarking is a really complex problem.

It's really hard to
measure accurately. It's completely foiled
by clock instability. And background interference
can be a real source of pain when you're trying
to detect changes. That's why with
Jetpack Benchmarking we've provided a simple API. We bundled in all
the lessons we've learned about getting stable
numbers out of a device. And the alpha is available now.

Thank you so much for coming. If you remember only one
thing, d.android.com/benchmark. Thank you all so much. [MUSIC PLAYING]
