Chapter 2. Testing
AI for UI Testing
Just because certain tasks have historically required human effort does not mean that we won’t be able to automate
them someday. Once upon a time, we believed that tasks such as voice and image recognition, driving, and musical
composition were too difficult for computers to simulate. However, many of these tasks are now being automated
using the power of AI and machine learning. In some cases, AI is outperforming human experts in tasks such as
medical diagnosis, legal document analysis, and aerial combat tactics, among others. With that in mind, it really
shouldn’t surprise you that we’re leveraging AI for functional testing tasks that previously relied on the expertise of
human testers. Figure 2-1 illustrates how to train AI bots to perceive, explore, model, test, and learn software
functionality. It is important to note that even though learning through feedback is explicitly called out at the end,
the bots leverage machine learning at each stage of the process.
Perceive
A foundational step in functional UI test automation is having the ability to interact with the application’s screens,
controls, labels, and other widgets. Recall that traditional automation frameworks use the application’s DOM for
locating UI elements and that these DOM-based location strategies are highly sensitive to implementation changes.
Leveraging AI for identifying UI elements can help to overcome these drawbacks. Just like humans, AI bots
recognize what appears on an application screen independently of how it is implemented. In fact, there need not be a
DOM at all, as the interaction can be based solely on image recognition. AI, and more specifically a branch of AI
known as computer vision, gives us the test automation superpower of being able to perceive anything with a screen.
Furthermore, because the bots are trained on hundreds of thousands of images, UI design changes do not result in excessive test script maintenance. In many cases, AI-based test automation requires little to no maintenance after visual updates and redesigns.
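To make this concrete, here is a minimal sketch of image-based element location. It uses simple OpenCV template matching as a stand-in for a trained perception model, and the screenshot and widget image file names are placeholders:

import cv2

# Locate a UI element by its appearance rather than by a DOM selector.
# Template matching stands in for a trained computer-vision model; file names are hypothetical.
screenshot = cv2.imread("home_screen.png", cv2.IMREAD_GRAYSCALE)   # full-screen capture
template = cv2.imread("cart_button.png", cv2.IMREAD_GRAYSCALE)     # example image of the target widget

result = cv2.matchTemplate(screenshot, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

if max_val > 0.8:                      # confidence threshold
    h, w = template.shape
    center = (max_loc[0] + w // 2, max_loc[1] + h // 2)
    print(f"Cart button found at {center}, regardless of how the page is implemented.")
else:
    print("Element not found on this screen.")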
Explore and Model
Figure 2-2 provides an illustrative example of how to leverage goal-based reinforcement learning for exploring,
modeling, and testing a software application. To start, all the bot needs is the location of the application’s home
screen and for us to give it an objective. Let’s task the bot with navigating to the shopping cart. In our initial state
(a), the bot is on the HOME screen and has the goal of reaching the checkered flag on the CART screen. Don’t let
this visualization deceive you; for now, the bot only knows about the HOME screen and as a result has only this
single state in its application model. Recall from the previous subsection that it can recognize the screen and its
various widgets. Another prerequisite I introduce here is that the bots must be able to stimulate the application via input actions such as keystrokes, taps, clicks, and swipes. Before the bot takes any input actions, I give it an initial score of
zero. Scoring represents the reward system, a bag of treats, so to speak, which can be positive or negative. Now the
bot takes its first action. It randomly clicks a link and transitions to the PRODUCT screen (b). The bot has now seen
two screens, HOME and PRODUCT, and updates its application model with this information. My response to the
bot’s actions is that this isn’t really what I want—I am really looking for the CART—so I deduct 1 from the bot’s
score. From PRODUCT, the bot takes another random action, and this time it lands on the CART (c). Excellent!
This is exactly where I want the bot to be, so I reward the bot with 100 points. Following this path, the bot ends up
with a score of 0 – 1 + 100 = 99 points and a complete model of the application. Let’s call this exploration scenario
Episode 1.
Figure 2-2. Illustrative example of using bots to explore an app using reinforcement learning
Consider a second exploration scenario, where the bot once again starts on the HOME screen (a). However, instead
of navigating to PRODUCT, the bot takes an action that takes it directly to the CART (c). Applying the scoring
system, the bot earns 0 + 100 = 100 points for Episode 2. The bot essentially finds a path with a higher reward and,
moving forward, will follow that path to accomplish its task. In short, the bot learns by combining trial and error
with past experiences rather than taking a brute force approach, which as you may recall is computationally
infeasible. Goal-based reinforcement learning is practically applicable to a variety of testing tasks, making it an
extremely powerful technique for test automation.
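The scoring walkthrough in Episodes 1 and 2 maps naturally onto tabular Q-learning. The sketch below is purely illustrative; the screen names, actions, and rewards mirror the example above rather than any real bot implementation:

import random

# Tabular Q-learning over the example app model (HOME, PRODUCT, CART).
ACTIONS = {
    "HOME": ["go_to_product", "go_to_cart"],
    "PRODUCT": ["go_to_cart", "go_home"],
}
TRANSITIONS = {
    ("HOME", "go_to_product"): "PRODUCT",
    ("HOME", "go_to_cart"): "CART",
    ("PRODUCT", "go_to_cart"): "CART",
    ("PRODUCT", "go_home"): "HOME",
}
REWARDS = {"PRODUCT": -1, "CART": 100, "HOME": 0}   # CART is the goal screen

q = {(s, a): 0.0 for s, acts in ACTIONS.items() for a in acts}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    state = "HOME"
    while state != "CART":
        acts = ACTIONS[state]
        if random.random() < epsilon:                       # explore a random action
            action = random.choice(acts)
        else:                                               # exploit what has been learned
            action = max(acts, key=lambda a: q[(state, a)])
        next_state = TRANSITIONS[(state, action)]
        reward = REWARDS[next_state]
        future = max((q[(next_state, a)] for a in ACTIONS.get(next_state, [])), default=0.0)
        q[(state, action)] += alpha * (reward + gamma * future - q[(state, action)])
        state = next_state

# After training, the bot prefers the direct HOME -> CART path from Episode 2.
print(max(ACTIONS["HOME"], key=lambda a: q[("HOME", a)]))   # go_to_cart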
Test
Now that the bots can perceive, explore, and model the application, these capabilities come together for the greater
good of software testing. The bots are trained to generate specific types of input data and to recognize what expected and unexpected behaviors look like in a given context. Note that it may be easier for bots to identify some
types of issues than others. For example, an HTTP 404 error is an obvious indication that the application has thrown
an error. However, it is significantly harder to know that someone’s pay stub is incorrect because their taxes weren’t
calculated appropriately. Nonetheless, several researchers and practitioners are applying AI/ML research to
automatic functional UI test generation. This work ranges from generating inputs for individual fields to conducting
complete end-to-end test cases, including oracles.2 Although we have only scratched the surface in this area,
AI-based test generation approaches are slowly narrowing the test automation gap.
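To see why some oracles are harder than others, compare a trivial check for obvious technical failures with the kind of domain assertion a bot would struggle to learn on its own. The pay-stub fields and the flat 20% tax rule below are invented for the example:

# Easy oracle: obvious technical failures are cheap to detect automatically.
def obvious_failure(http_status: int, page_text: str) -> bool:
    return http_status >= 400 or "exception" in page_text.lower()

# Hard oracle: domain correctness needs knowledge the UI alone does not reveal.
# The field names and the flat 20% tax rule are hypothetical.
def pay_stub_is_correct(gross_pay: float, tax_withheld: float, net_pay: float) -> bool:
    expected_tax = round(gross_pay * 0.20, 2)
    return abs(tax_withheld - expected_tax) < 0.01 and abs(net_pay - (gross_pay - tax_withheld)) < 0.01

print(obvious_failure(404, "<html>Not Found</html>"))   # True: easy for a bot to flag
print(pay_stub_is_correct(2000.00, 380.00, 1620.00))    # False: catching this requires domain rules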
Learn
One of the most notable characteristics of AI- and ML-driven applications is the ability of the system to improve
how it performs a given task based on feedback. Having humans provide direct feedback on the bots’ actions—for
example, recognizing UI elements, generating inputs, or detecting bugs—makes the system better. Feedback
mechanisms allow humans to reinforce or rewire the AI brain that drives the bots’ behavior. As a result, the more
feedback your teams provide to the bots on the quality of their testing, the better the bots become at testing the
product, and ultimately the more value they provide to the testing team. Typically, feedback is incorporated into the
product UI itself. However, depending on the level of testing, feedback may come via mechanisms for updating
datasets more directly.
AI for API Testing
Consider a typical interaction between two services, illustrated in code below:
1. Service A sends an HTTP GET request to retrieve some data from Service B.
2. Service B processes the request and returns an HTTP status of 200, indicating success, along with a response body containing the requested data.
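In a conventional automated test, that exchange might look something like the following; the service URL and response fields are hypothetical:

import requests

# Sketch of the Service A -> Service B exchange described above.
response = requests.get("https://service-b.example.com/api/items/42", timeout=5)

assert response.status_code == 200              # success status from Service B
data = response.json()                          # response body with the requested data
assert data.get("id") == 42                     # basic shape check on the payload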
A common approach to automated API testing with AI is to record the service traffic from manual testing and then
use that information to train an ML-based test generation system. One of the key goals for this type of testing is to
verify that the communication sequences do not result in unhandled errors. This approach stems from the once
popular record-and-playback feature found in early functional test automation tools, so you’ll see this pattern in
other areas such as performance testing. However, although there are several available open source and commercial
API test automation tools on the market,3 few of them offer these new “record, learn, generate, and play”
capabilities. Since automated tests, like those for APIs, map naturally to interleaving sequences of inputs and
expected outputs, another AI-based approach that applies to API testing involves using long short-term memory (LSTM) networks for generating test sequences.4 Under this approach, you would train the network on string
sequences of abstract test cases that contain example service requests and their respective responses. Figure 2-3
depicts the workflow for developing and validating such a test flow generator using neural networks.
Figure 2-3. Automatic test case generation using neural network models
These are the major steps of the workflow in the context of API testing:
● Automatic source code analysis to generate unit tests that reflect program behavior and help to reduce
gaps in coverage.
● Integration with version control systems to monitor source code changes and keep tests up-to-date.
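To make the LSTM approach more concrete, here is a minimal PyTorch sketch of a next-step model trained on abstract request/response sequences. The token vocabulary and training data are invented for illustration; the systems cited above are considerably more sophisticated:

import torch
import torch.nn as nn

# Abstract "record and learn" data: interleaved service requests and responses.
VOCAB = ["GET /items", "200_OK", "POST /cart", "201_CREATED", "GET /cart", "4XX_ERROR"]
tok = {t: i for i, t in enumerate(VOCAB)}
sequences = [
    ["GET /items", "200_OK", "POST /cart", "201_CREATED", "GET /cart", "200_OK"],
    ["GET /items", "200_OK", "GET /cart", "200_OK"],
]

class TestFlowModel(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        out, _ = self.lstm(self.embed(x))
        return self.head(out)                    # next-token logits at every step

model = TestFlowModel(len(VOCAB))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                             # tiny training loop over the toy sequences
    for seq in sequences:
        ids = torch.tensor([[tok[t] for t in seq]])
        logits = model(ids[:, :-1])
        loss = loss_fn(logits.reshape(-1, len(VOCAB)), ids[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()

# Generate the most likely next step after an initial request.
start = torch.tensor([[tok["GET /items"]]])
print(VOCAB[model(start)[0, -1].argmax().item()])   # typically "200_OK"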
Of the three automation levels—UI, service, and unit—the unit level appears to be getting the least amount of
attention. In fact, the attention level seems to be the inverse of what you would expect from a community following
automation best practices like the testing pyramid.8 For example, after a quick survey of product offerings in the space, I found only one commercial product that uses AI for unit test generation. This pales in comparison to more
than 10 for functional UI test automation and 5 for API testing. However, AI technology to support improvements to
automated unit testing is definitely available, and I hope to see more progress in this area in the near future.
Figure 2-4 provides a sample report that compares the test results of a retail application with those of other applications in the
domain. The way this works is that the bots run a set of goal-based tests on key use case scenarios for each app within a given
category. The bots compute a percentile rank for each performance result so that you can compare the results for your app with all
other apps in your category.
Figure 2-4. Sample application benchmarking test report for a major app in the retail category
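The percentile ranking itself is simple arithmetic. Here is a minimal sketch with made-up launch-time measurements for apps in a category:

# Percentile-rank comparison behind the benchmark report.
# The metric (cold-start time in seconds) and the values are invented for illustration.
def percentile_rank(value, population):
    """Percentage of the population that this value performs at least as well as (lower is better)."""
    at_or_worse = sum(1 for other in population if other >= value)
    return 100.0 * at_or_worse / len(population)

category_times = [1.2, 1.8, 2.4, 2.9, 3.5, 4.1, 5.0]   # other retail apps in the category
my_app_time = 2.0

pct = percentile_rank(my_app_time, category_times)
print(f"Launch speed beats about {pct:.0f}% of the retail category.")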
The internal tool eliminates the need for manual rescripting, and now there is ongoing work that uses ML to drive smart rules for
parameterization and correlation. Several application performance-monitoring vendors are also touting features that include
AI-based performance, scalability, and resiliency testing.
You’ve probably already noticed that, although these guidelines are written with good intentions, parts of them are vague and
open to interpretation. For example, how efficient is “as efficient as possible”? Is there a rule to know when you can stop? What
about widgets other than text fields? While this may seem like nitpicking, the subjective nature of the guidance and the resulting
designs is what makes this problem so difficult. Furthermore, even when the guidelines are clear and precise, there are so many
variants to check that doing so in an application-specific way is extremely tedious.
AI is a great way to catch these issues because you can train the bots to examine the screen just like a designer, customer, or
reviewer. They don’t look at code or have app-specific checks but instead check all the visual elements on the screen against an
AI trained on example guideline violations labeled by humans. They find these issues almost instantly and in a repeatable manner
that avoids the errors of human memory and interpretation. With AI enabling the automatic validation of UI design guidelines,
there really is little reason for humans to look for the issues that machines can now identify.
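A rough sketch of how such a checker might be trained follows, assuming a folder of screenshots that human reviewers have labeled as compliant or violating; the directory layout, image size, and model architecture are placeholders:

import tensorflow as tf

# Train a small image classifier on screenshots labeled by human reviewers.
# Assumed layout: labeled_screens/compliant/*.png and labeled_screens/violation/*.png
train_ds = tf.keras.utils.image_dataset_from_directory(
    "labeled_screens", image_size=(224, 224), batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability that a screen violates a guideline
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)
# At test time, every captured screen gets scored the same way, instantly and repeatably.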
AI is proving to be an effective way to extend the capabilities of current accessibility-testing tools. By combining an AI-driven
testing platform with open source tools, you can train the bots to explore a website and evaluate its WCAG compliance. As the
bots explore the site, they conduct static accessibility checks using the open source tools and generate dynamic tests that mimic
users with disabilities. An interesting project, code-named Agent A11y,12 that employs this approach appears in the 2019
proceedings of the Pacific Northwest Software Quality Conference. A notable feature of Agent A11y is that, due to the large set
of WCAG checks the bots perform, the authors even use ML to correlate and coalesce the accessibility test results. Talk about
turning a problem on itself!
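As a rough sketch of the static-check half of that approach, the snippet below drives a page with Selenium and runs the open source axe-core ruleset against it. The target URL is a placeholder, and the dynamic tests that mimic users with disabilities are beyond the scope of this sketch:

import requests
from selenium import webdriver

# Run axe-core's WCAG checks on a page the bot has navigated to.
# The URL is a placeholder; a crawler would repeat this on every screen it explores.
AXE_JS = "https://cdn.jsdelivr.net/npm/axe-core/axe.min.js"   # CDN copy of the open source library

driver = webdriver.Chrome()
driver.get("https://www.example.com/checkout")
driver.execute_script(requests.get(AXE_JS, timeout=10).text)  # inject axe-core into the page

violations = driver.execute_async_script(
    "const done = arguments[arguments.length - 1];"
    "axe.run().then(results => done(results.violations));"
)
for v in violations:
    print(v["id"], "-", v["help"], f"({len(v['nodes'])} affected elements)")
driver.quit()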
Envision 70 testers in a classroom. These are no ordinary testers; they are professional, technical testers whose companies have sent them for a week of training at an international testing conference in the United States. They are confident enough to brave a full day of learning about AI and ML algorithms. This room is
full of great testers. Their opponent is a neural network—AI bots trained on data related to the questions that are about to come.
Now let’s go a step further and pretend that you are one of those testers and see if AI can beat you at your own game. We ask you
the following qualitative testing question: If you were looking at an application’s login screen, how would you know if you could
trust it or not? In other words, solely by looking at the user interface, could you rate an app’s trustworthiness? Take a moment to
think about it and then look at some of the example mobile login screens in Figure 2-6.
In Figure 2-6, the screens on the left are data samples of some of the least trusted apps, while those on the right are some of the
most trusted. Any thoughts? If it’s any consolation, the other 69 testers in the room are taking quite some time to think about it
too. There are no quick answers. A woman in the front row exclaims, “Foreign languages!” She explains that if the primary
region of the app store is the US, but the app is written in a foreign language, she wouldn’t trust it because she wouldn’t
understand what it was saying. Not a bad start, but we’ve already spent three minutes with 70 human minds thinking about this
problem in parallel. Now granted, it is not typical that you would engage 70 testers on this one problem, and there may be several
biases at play here. However, even if only one highly skilled tester produced the same result, it may not be a good use of their
time.
A couple more minutes go by, and then a hand goes up. A gentleman suggests that if there is a recognizable brand or logo on the
screen, he would probably trust the app more than apps without these features. So now, 70 people have spent five minutes of their
time, and we have two ideas for how to measure UI trustworthiness. That’s progress, but the room quickly becomes quiet again.
This is the point where the law of diminishing returns sets in. There are no more ideas past the 10-minute mark, and none by the time the group has to move on.
But how did the AI bots perform in the challenge? Prior to the class, back at my company’s headquarters, ML engineers trained a
neural network using trustworthiness data from real users. The data was the result of asking individuals to rate the trustworthiness
of a large set of login screens on a scale of 1 to 10. Once trained, the bots had the ability to simulate the human raters while
working on previously unseen samples. As part of the experiment, the engineers inspected the neural network to understand the
AI’s answer to the same question I asked the human testers.
Here is the AI’s explanation:
Foreign language
If the screen has foreign words/characters, it’s less trustworthy.
Brand recognition
If the screen has a popular brand image, it’s more trustworthy.
Number of elements
If the screen has a high number of elements, it’s less trustworthy.
Interestingly, for this task, the AI appears to be “smarter” than any one person. The bots produce three UI design factors that
relate to UI trustworthiness, whereas no single person came up with more than one factor. Furthermore, not only did the AI
discover an additional aspect of the application UI that correlates with trustworthiness, but it also gave a precise score, on a scale of 1 to 10, of how trustworthy it judged the screen to be.
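A toy version of such a rater might use the three factors the AI surfaced as hand-built features, with a simple linear model standing in for the real neural network; the feature values and ratings below are invented:

from sklearn.linear_model import LinearRegression

# Features per login screen: [has_foreign_text, has_known_brand, element_count]
X = [
    [1, 0, 42],   # foreign text, no recognizable brand, busy screen
    [0, 1, 12],   # known brand, clean screen
    [0, 0, 25],
    [1, 1, 30],
    [0, 1, 8],
]
y = [2, 9, 5, 6, 10]   # invented human trustworthiness ratings on a 1-10 scale

model = LinearRegression().fit(X, y)

new_screen = [[0, 1, 20]]   # no foreign text, recognizable brand, moderate element count
score = model.predict(new_screen)[0]
print(f"Predicted trustworthiness: {min(10.0, max(1.0, round(score, 1)))} / 10")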
With machine learning, the bots truly learn to simulate the judgment of the humans that provide training data for a given task. In
this testing experiment, the bots directly reflect how real users view trustworthiness, while as human testers we have to pretend
and emulate that empathy indirectly. How much better is it to have the oracle be real-world users versus testers trying to reverse
engineer and guess what the end user will think or feel?
Conclusion
AI-driven test automation is causing quite a stir in the software community due to its applicability to multiple levels and
dimensions of software testing. AI is testing user interfaces, services, and lower-level components and evaluating the
functionality, performance, design, accessibility, and trustworthiness of applications. With all of the activity and buzz around AI
for software testing, it feels like the beginning of a new era of test automation. AI is giving testing some much-needed
superpowers to help tackle challenges like automatic test generation.
1 Test.ai, “Case Study App Store Provider,” 2020.
2 Dionny Santiago, “A Model-Based AI-Driven Test Generation System” (master’s thesis, Florida International University, September 9, 2018).
3 Joe Colantonio, “Top API Testing Tools for 2020,” Test Guild, May 16, 2017.
5 Laurence Saes, “Unit Test Generation Using Machine Learning” (master’s thesis, Universiteit van Amsterdam, August 18, 2018).
6 Diffblue.
7 “GPT-3 Powers the Next Generation of Apps,” OpenAI, March 25, 2021.
9 Kaushal Dalvi, “End to End Performance Testing—Automated!” (paper presented at the STAREAST 2020 Conference, Orlando, Florida, May 2020).
10 Kaushal Dalvi, “End to End Performance Testing—Automated!”
11 Aleksander Bai, Heidi Mork, and Viktoria Stray, “A Cost-Benefit Analysis of Accessibility Testing in Agile Software Development: Results from a Multiple Case Study,” International Journal on Advances in Software 10, nos. 1–2 (2017): 96–107.
12 Keith Briggs et al., “Semi-Autonomous, Site-Wide A11Y Testing Using an Intelligent Agent,” PNSQC Proceedings, 2019.