

Chapter 2. Leveraging AI for Test Automation


Did you know that autonomous and intelligent agents, commonly referred to as bots, are already running tests on major
applications today? That’s right: leveraging AI for software testing is not a thing of the future; AI for test automation is already
here. The bots are in the building, and they’re not testing just one app but many apps in various application domains.1 In fact,
don’t be surprised if you find out that AI is already testing your own app! Chances are that, if you publish your app in one of the
major app stores, AI bots are already testing it. In this chapter, I’ll walk you through how AI tests software and demystify how
this technology really works when testing applications at different levels or for various quality attributes.

AI for UI Testing
Just because certain tasks have historically required human effort does not mean that we won’t be able to automate
them someday. Once upon a time, we believed that tasks such as voice and image recognition, driving, and musical
composition were too difficult for computers to simulate. However, many of these tasks are now being automated
using the power of AI and machine learning. In some cases, AI is outperforming human experts in tasks such as
medical diagnosis, legal document analysis, and aerial combat tactics, among others. With that in mind, it really
shouldn’t surprise you that we’re leveraging AI for functional testing tasks that previously relied on the expertise of
human testers. Figure 2-1 illustrates how to train AI bots to perceive, explore, model, test, and learn software
functionality. It is important to note that even though learning through feedback is explicitly called out at the end,
the bots leverage machine learning at each stage of the process.

Figure 2-1. Training AI to do functional UI testing

Perceive
A foundational step in functional UI test automation is having the ability to interact with the application’s screens,
controls, labels, and other widgets. Recall that traditional automation frameworks use the application’s DOM for
locating UI elements and that these DOM-based location strategies are highly sensitive to implementation changes.
Leveraging AI for identifying UI elements can help to overcome these drawbacks. Just like humans, AI bots
recognize what appears on an application screen independently of how it is implemented. In fact, there need not be a
DOM at all, as the interaction can be based solely on image recognition. AI, and more specifically a branch of AI
known as computer vision, gives us the test automation superpower of being able to perceive anything with a screen.
Furthermore, since we train the bots on hundreds and thousands of images, UI design changes do not result in
excessive test script maintenance. In many cases, AI-based test automation requires zero maintenance after visual
updates and redesigns.
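
To make this concrete, here is a minimal sketch of locating a UI element by its appearance rather than by a DOM locator. It uses classical OpenCV template matching as a simple stand-in for the trained computer vision models described above; the screenshot and template image file names are hypothetical.

```python
# A minimal sketch: locate a UI element by its appearance instead of a DOM locator.
# Classical template matching stands in here for a trained vision model.
# "screenshot.png" and "login_button.png" are hypothetical file names.
import cv2

def locate_element(screenshot_path: str, template_path: str, threshold: float = 0.8):
    """Return the (x, y) center of the best visual match, or None if below threshold."""
    screen = cv2.imread(screenshot_path, cv2.IMREAD_GRAYSCALE)
    template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
    result = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)
    if max_val < threshold:
        return None  # the element is not visible on this screen
    h, w = template.shape
    return (max_loc[0] + w // 2, max_loc[1] + h // 2)  # click/tap target

center = locate_element("screenshot.png", "login_button.png")
if center:
    print(f"Tap/click at {center}")
```

Because the match is purely visual, the same locator keeps working whether the button is rendered by a web page, a native app, or anything else with a screen.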

Explore and Model


Testers frequently explore the application’s functionality to discover its behavior, confirm specific facts, and look for
bugs. While exploring, they create mental models and refer back to these models to deal with uncertainty when the
application or its environment changes. Similarly, AI bots explore and build models of the application under test.
You give them a goal and they attempt to reach it by trial and error, a process known as reinforcement learning. An
easy way to understand how reinforcement learning works is to think about how you train a pet dog. To start, you
decide on the task you want the dog to accomplish—let’s say “stay.” You also need a bag of treats. If by chance you
say “stay” and the dog stays in place, you give it a treat. However, if the dog continues to move around, you keep the treat. You could even take it a step further by saying “bad dog” and showing your discontent.

Figure 2-2 provides an illustrative example of how to leverage goal-based reinforcement learning for exploring,
modeling, and testing a software application. To start, all the bot needs is the location of the application’s home
screen and for us to give it an objective. Let’s task the bot with navigating to the shopping cart. In our initial state
(a), the bot is on the HOME screen and has the goal of reaching the checkered flag on the CART screen. Don’t let
this visualization deceive you; for now, the bot only knows about the HOME screen and as a result has only this
single state in its application model. Recall from the previous subsection that it can recognize the screen and its
various widgets. Another prerequisite I introduce here is that the bots must be able to stimulate the application via input actions such as keystrokes, taps, clicks, and swipes. Before the bot takes any input actions, I give it an initial score of zero. Scoring represents the reward system, the bag of treats, so to speak, and rewards can be positive or negative. Now the
bot takes its first action. It randomly clicks a link and transitions to the PRODUCT screen (b). The bot has now seen
two screens, HOME and PRODUCT, and updates its application model with this information. My response to the
bot’s actions is that this isn’t really what I want—I am really looking for the CART—so I deduct 1 from the bot’s
score. From PRODUCT, the bot takes another random action, and this time it lands on the CART (c). Excellent!
This is exactly where I want the bot to be, so I reward the bot with 100 points. Following this path, the bot ends up
with a score of 0 – 1 + 100 = 99 points and a complete model of the application. Let’s call this exploration scenario Episode 1.

Figure 2-2. Illustrative example of using bots to explore an app using reinforcement learning

Consider a second exploration scenario, where the bot once again starts on the HOME screen (a). However, instead
of navigating to PRODUCT, the bot takes an action that takes it directly to the CART (c). Applying the scoring
system, the bot earns 0 + 100 = 100 points for Episode 2. The bot essentially finds a path with a higher reward and,
moving forward, will follow that path to accomplish its task. In short, the bot learns by combining trial and error
with past experiences rather than taking a brute force approach, which as you may recall is computationally
infeasible. Goal-based reinforcement learning is practically applicable to a variety of testing tasks, making it an
extremely powerful technique for test automation.
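
To illustrate the mechanics, here is a toy tabular Q-learning sketch of the HOME, PRODUCT, and CART example. The screens, actions, and reward values mirror the episodes above (-1 for landing on a non-goal screen, +100 for reaching the CART); it is a simplified illustration of goal-based reinforcement learning, not the internals of any particular testing product.

```python
# A toy tabular Q-learning sketch of the HOME -> PRODUCT -> CART example.
import random

SCREENS = ["HOME", "PRODUCT", "CART"]
# Hypothetical transition map: which screen each (screen, action) pair leads to.
TRANSITIONS = {
    ("HOME", "click_product_link"): "PRODUCT",
    ("HOME", "click_cart_icon"): "CART",
    ("PRODUCT", "click_cart_icon"): "CART",
    ("PRODUCT", "click_home_link"): "HOME",
}
ACTIONS = {s: [a for (src, a) in TRANSITIONS if src == s] for s in SCREENS}

def reward(screen: str) -> int:
    return 100 if screen == "CART" else -1  # the "bag of treats"

q = {key: 0.0 for key in TRANSITIONS}   # value learned for each (screen, action)
alpha, gamma, epsilon = 0.5, 0.9, 0.2   # learning rate, discount, exploration rate

for episode in range(200):
    screen = "HOME"
    while screen != "CART":
        # Epsilon-greedy: mostly exploit what the bot has learned, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS[screen])
        else:
            action = max(ACTIONS[screen], key=lambda a: q[(screen, a)])
        nxt = TRANSITIONS[(screen, action)]
        future = max((q[(nxt, a)] for a in ACTIONS[nxt]), default=0.0)
        q[(screen, action)] += alpha * (reward(nxt) + gamma * future - q[(screen, action)])
        screen = nxt

# After training, the bot prefers the direct HOME -> CART path (Episode 2 in the text).
print(max(ACTIONS["HOME"], key=lambda a: q[("HOME", a)]))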

Test
Now that the bots can perceive, explore, and model the application, these capabilities come together for the greater
good of software testing. The bots are trained to generate specific types of input data and can recognize what
expected and unexpected behaviors look like in given contexts. Note that it may be easier for bots to identify some
types of issues than others. For example, an HTTP 404 error is an obvious indication that the application has thrown
an error. However, it is significantly harder to know that someone’s pay stub is incorrect because their taxes weren’t
calculated appropriately. Nonetheless, several researchers and practitioners are applying AI/ML research to
automatic functional UI test generation. This work ranges from generating inputs for individual fields to conducting
complete end-to-end test cases, including oracles.2 Although we have only scratched the surface in this area,
AI-based test generation approaches are slowly narrowing the test automation gap.
Learn
One of the most notable characteristics of AI- and ML-driven applications is the ability of the system to improve
how it performs a given task based on feedback. Having humans provide direct feedback on the bots’ actions—for
example, recognizing UI elements, generating inputs, or detecting bugs—makes the system better. Feedback
mechanisms allow humans to reinforce or retrain the AI brain that drives the bots’ behavior. As a result, the more
feedback your teams provide to the bots on the quality of their testing, the better the bots become at testing the
product, and ultimately the more value they provide to the testing team. Typically, feedback is incorporated into the
product UI itself. However, depending on the level of testing, feedback may come via mechanisms for updating
datasets more directly.

AI for Service/API Testing


Automated service or API testing validates the way systems communicate via sequences of requests and responses.
For example, a typical communication exchange between two services, A and B, could be as follows:

1. Service A sends an HTTP GET request to retrieve some data from Service B.
2. Service B processes the request and returns an HTTP status of 200, indicating success along with a
response body containing the requested data.
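
As a concrete illustration, here is a minimal sketch of that exchange written as an automated check using Python’s requests library. The endpoint URL and the fields asserted in the response body are hypothetical.

```python
# A minimal sketch of the request/response exchange above as an automated check.
# The URL and expected fields are hypothetical.
import requests

def test_get_product_returns_data():
    response = requests.get("https://service-b.example.com/api/products/42", timeout=10)
    # Verify the communication sequence did not result in an unhandled error.
    assert response.status_code == 200
    body = response.json()
    assert "id" in body and "name" in body  # response body contains the requested data
```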

A common approach to automated API testing with AI is to record the service traffic from manual testing and then
use that information to train an ML-based test generation system. One of the key goals for this type of testing is to
verify that the communication sequences do not result in unhandled errors. This approach stems from the once
popular record-and-playback feature found in early functional test automation tools, so you’ll see this pattern in
other areas such as performance testing. However, although there are several available open source and commercial
API test automation tools on the market,3 few of them offer these new “record, learn, generate, and play”
capabilities. Since automated tests, like those for APIs, map naturally to interleaving sequences of inputs and
expected outputs, another AI-based approach that applies to API testing involves using long short-term memory
networks (LSTMs) for generating test sequences.4 Under this approach, you would train the network on string
sequences of abstract test cases that contain example service requests and their respective responses. Figure 2-3
depicts the workflow for developing and validating such a test flow generator using neural networks.

Figure 2-3. Automatic test case generation using neural network models

These are the major steps of the workflow in the context of API testing:

1. Model the API testing flow as a sequence problem.


2. Develop an abstract test language to support the API test flow model.
3. Create a test set to validate the adequacy of the language in describing API tests.
4. Curate and/or gather example handcrafted API test flows.
5. Train a neural network to generate valid test flow sentences that belong to the grammar of the abstract test language.

With an abstract test case generator for your APIs in place, you can now develop an engine that transforms those abstract, platform-independent tests into platform-specific tests for execution using any given communication protocol or technology.
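
The following is a minimal sketch of step 5, assuming a tiny, hypothetical abstract test language and a couple of handcrafted example flows. It trains a small Keras LSTM to predict the next token of a test flow and then samples a new flow; a real system would train on a large corpus of recorded flows and a much richer grammar.

```python
# A minimal sketch: train an LSTM to emit abstract API test flow tokens.
# The abstract test language and example flows here are hypothetical.
import numpy as np
from tensorflow import keras

flows = [
    ["START", "GET /products", "EXPECT 200", "POST /cart", "EXPECT 201", "END"],
    ["START", "GET /cart", "EXPECT 200", "DELETE /cart/1", "EXPECT 204", "END"],
]
vocab = sorted({tok for flow in flows for tok in flow})
tok2id = {t: i + 1 for i, t in enumerate(vocab)}   # reserve id 0 for padding
id2tok = {i: t for t, i in tok2id.items()}

max_len = max(len(f) for f in flows) - 1

def pad(seq, length):
    return [0] * (length - len(seq)) + seq  # left-pad with the reserved 0 id

# Build (prefix -> next token) training pairs from the example flows.
X, y = [], []
for flow in flows:
    ids = [tok2id[t] for t in flow]
    for i in range(1, len(ids)):
        X.append(pad(ids[:i], max_len))
        y.append(ids[i])
X, y = np.array(X), np.array(y)

model = keras.Sequential([
    keras.layers.Embedding(len(vocab) + 1, 16),
    keras.layers.LSTM(32),
    keras.layers.Dense(len(vocab) + 1, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X, y, epochs=300, verbose=0)

# Sample a new abstract test flow token by token until the END marker.
generated = ["START"]
while generated[-1] != "END" and len(generated) <= max_len:
    seq = np.array([pad([tok2id[t] for t in generated], max_len)])
    probs = model.predict(seq, verbose=0)[0]
    next_id = int(np.argmax(probs[1:])) + 1  # skip the padding id
    generated.append(id2tok[next_id])
print(generated)
```

The generated flow is still abstract; the transformation engine described above would map tokens like GET /products onto concrete calls for a specific protocol or technology.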

AI for Unit Testing


Researchers and practitioners are training AI to automatically write unit tests in high-level programming languages
like Java.5, 6 Much of the work in this area builds on advances in AI for generating text and natural language.
Initiatives like OpenAI’s GPT-3 combine natural language processing (NLP) and deep learning models to generate
text that is difficult to distinguish from human-written text.7 Here are two popular features of AI for unit-testing
tools and frameworks:

● Automatic source code analysis to generate unit tests that reflect program behavior and help to reduce gaps in coverage (see the sketch after this list).
● Integration with version control systems to monitor source code changes and keep tests up-to-date.
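
As a rough illustration of the first capability, here is a minimal sketch that generates characterization-style unit tests: it calls a target function with sample inputs and emits tests asserting whatever the function currently returns. The apply_discount function and its sample inputs are hypothetical; real tools go well beyond this, pairing the idea with deeper code analysis and smarter input selection.

```python
# A minimal sketch of generating characterization-style unit tests from observed behavior.
# The target function and sample inputs are hypothetical.

def apply_discount(price: float, percent: float) -> float:
    """Hypothetical production code under test."""
    return round(price * (1 - percent / 100), 2)

def generate_tests(func, sample_inputs):
    # (func.__module__ resolves to __main__ when run as a script; a real tool
    # would use the actual module path of the code under test.)
    lines = ["import unittest", f"from {func.__module__} import {func.__name__}", "",
             f"class Test{func.__name__.title().replace('_', '')}(unittest.TestCase):"]
    for i, args in enumerate(sample_inputs):
        observed = func(*args)  # record current behavior as the expected value
        lines.append(f"    def test_case_{i}(self):")
        lines.append(f"        self.assertEqual({func.__name__}{args!r}, {observed!r})")
    return "\n".join(lines)

print(generate_tests(apply_discount, [(100.0, 10.0), (19.99, 0.0), (5.0, 50.0)]))
```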

Of the three automation levels—UI, service, and unit—the unit level appears to be getting the least amount of
attention. In fact, the attention level seems to be the inverse of what you would expect from a community following
automation best practices like the testing pyramid.8 For example, after a quick survey of product offerings in this space, I found only one commercial product that uses AI for unit test generation. This pales in comparison to more
than 10 for functional UI test automation and 5 for API testing. However, AI technology to support improvements to
automated unit testing is definitely available, and I hope to see more progress in this area in the near future.

AI for Performance Testing


Tooling for performance test automation has been relatively stable for many years. Several organizations still follow
legacy load-testing practices that have a steep learning curve and involve a lot of manual steps. AI and ML are
propelling us into a future where you can rapidly gather and correlate performance test metrics and even generate
complete end-to-end performance tests that normally would require human experts. In this section, I summarize two
promising directions in the use of AI for performance testing.

Application Performance Benchmarking


With an AI-driven framework for functional UI automation in place, you can extend the bots’ capabilities to track key
application performance metrics. While the bots are exploring and testing, they collect data on the number of steps taken, load
times, CPU utilization, and more. With AI, not only can you see how your app performs in key scenarios, but you can also
compare its performance to similar applications from your competitors once the app is available, for example, in an app store.
This is possible because AI-driven tests are highly reusable across different apps within the same domain.

Figure 2-4 provides a sample report that compares the test results of a retail application with those of other applications in the
domain. The way this works is that the bots run a set of goal-based tests on key use case scenarios for each app within a given
category. The bots compute a percentile rank for each performance result so that you can compare the results for your app with all
other apps in your category.
Figure 2-4. Sample application benchmarking test report for a major app in the retail category
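
Here is a minimal sketch of that percentile-rank comparison, with made-up numbers: suppose the bots measured launch time (in seconds) for our retail app and for its category peers while running the same goal-based test.

```python
# A minimal sketch of the percentile-rank comparison described above.
# The launch times are made-up values collected by bots running the same
# goal-based test against apps in the same store category.
from scipy import stats

category_launch_times = [1.2, 1.8, 2.4, 2.9, 3.1, 3.6, 4.0, 5.2]  # peer retail apps
my_app_launch_time = 2.1

# `rank` is the percentage of peer apps at or below our launch time (faster or equal),
# so 100 - rank is the share of the category that our app outperforms.
rank = stats.percentileofscore(category_launch_times, my_app_launch_time)
print(f"Our app launches faster than {100 - rank:.0f}% of apps in its category.")
```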

Toward End-to-End Performance Testing with AI


Although practically useful, application performance benchmarking is not the same as end-to-end performance
testing. Recall from “Nonfunctional Test Automation” that system-level performance testing emulates production
traffic using hundreds and thousands of virtual concurrent users conducting real-world transactions. In a presentation
at the STAREAST 2020 testing conference, performance testing expert Kaushal Dalvi shared his team’s experience
building a tool to generate end-to-end performance tests.9 Figure 2-5 depicts a vision of their ambitious goal of
developing a self-service system that automatically produces LoadRunner scripts complete with parameterization,
correlation, and logic.

Figure 2-5. A vision of automated end-to-end performance testing using ML10

The internal tool eliminates the need for manual rescripting, and now there is ongoing work that uses ML to drive smart rules for
parameterization and correlation. Several application performance-monitoring vendors are also touting features that include
AI-based performance, scalability, and resiliency testing.

AI for Design Testing


A quick internet search on the topics of manual and automated testing is likely to return some blog posts, articles, and
presentations describing why test automation will never replace manual testing. One of the frequently used points to support this
argument is that it is not possible to automate things that require human judgment. Such qualitative assessments rely on people’s
experiences, perceptions, opinions, or feelings. UI design quality attributes like usability, accessibility, and trustworthiness all fall
under this category. However, advances in ML are demonstrating that it is possible for machines to simulate human judgment for
specific tasks, including UI design testing.

AI for Mobile Design Testing


Google and Apple publish guidelines to help developers ensure that Android and iOS mobile apps are well designed and deliver a
consistent user experience. Here is an example guideline:
When Possible, Present Choices
Make data entry as efficient as possible. Consider using a picker or table instead of a text field, for example, because it’s easier to
choose from a list of predefined options than to type a response.

You’ve probably already noticed that, although these guidelines are written with good intentions, parts of them are vague and
open to interpretation. For example, how efficient is “as efficient as possible”? Is there a rule to know when you can stop? What
about widgets other than text fields? While this may seem like nitpicking, the subjective nature of the guidance and the resulting
designs is what makes this problem so difficult. Furthermore, even when the guidelines are clear and precise, there are so many
variants to check that doing so in an application-specific way is extremely tedious.

AI is a great way to catch these issues because you can train the bots to examine the screen just like a designer, customer, or
reviewer. They don’t look at code or have app-specific checks but instead check all the visual elements on the screen against an
AI trained on example guideline violations labeled by humans. They find these issues almost instantly and in a repeatable manner
that avoids the errors of human memory and interpretation. With AI enabling the automatic validation of UI design guidelines,
there really is little reason for humans to look for the issues that machines can now identify.
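
Here is a minimal sketch of what training such a design checker could look like, assuming a folder of screenshots that humans have labeled as guideline violations or not (the design_screens directory layout is hypothetical). A small Keras convolutional network learns to flag screens that resemble known violations.

```python
# A minimal sketch, assuming screenshots labeled by humans, e.g.:
#   design_screens/violation/*.png  and  design_screens/ok/*.png
from tensorflow import keras

train_ds = keras.utils.image_dataset_from_directory(
    "design_screens", image_size=(224, 224), batch_size=16)

model = keras.Sequential([
    keras.layers.Rescaling(1.0 / 255),
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),  # probability the screen violates a guideline
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```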

AI for Web Accessibility Testing


In an effort to promote universal access to web technologies, the World Wide Web Consortium (W3C) developed a set of Web
Content Accessibility Guidelines (WCAG). These guidelines provide criteria to make software accessible to people with
physical, visual, cognitive, and learning disabilities. Not only do web development companies have a moral obligation to
construct web applications that provide universal access, but in most countries they have a legal obligation. Although several
tools support static WCAG web page analysis, present tools fall short in evaluating an entire application for accessibility.
Furthermore, current test automation techniques are capable of discovering only about 30% of WCAG Level A and Level AA
conformance issues.11

AI is proving to be an effective way to extend the capabilities of current accessibility-testing tools. By combining an AI-driven
testing platform with open source tools, you can train the bots to explore a website and evaluate its WCAG compliance. As the
bots explore the site, they conduct static accessibility checks using the open source tools and generate dynamic tests that mimic
users with disabilities. An interesting project that employs this approach, code-named Agent A11y,12 appears in the 2019 proceedings of the Pacific Northwest Software Quality Conference. A notable feature of Agent A11y is that, due to the large set
of WCAG checks the bots perform, the authors even use ML to correlate and coalesce the accessibility test results. Talk about
turning a problem on itself!
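
As a small taste of the static side of this approach, here is a minimal sketch of a single WCAG check (text alternatives for images, success criterion 1.1.1) of the kind the bots can run against every page they discover while exploring. Real tools bundle hundreds of such checks, plus the dynamic tests described above; the example.com URL is just a placeholder.

```python
# A minimal sketch of one static WCAG check: images should have text alternatives
# (WCAG success criterion 1.1.1).
import requests
from bs4 import BeautifulSoup

def missing_alt_text(url: str):
    """Return the <img> tags on a page that lack a text alternative."""
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    return [img for img in soup.find_all("img") if not img.get("alt")]

for img in missing_alt_text("https://example.com"):
    print("WCAG 1.1.1 issue, image without alt text:", img.get("src"))
```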

AI for UI Trustworthiness Testing


While coteaching a testing workshop, I had the opportunity to engage the participants in playing an intriguing game of human
testers versus AI bots. If you think about it, AI can beat humans in games like chess, Jeopardy, and Go, so why not software
testing? Let’s take a look at this game of testing and how it played out.

Envision 70 testers in a classroom. These are no ordinary testers; they are professional, technical testers whose companies have chosen to send them for a week of training at an international testing conference held in the United States. They are confident enough to brave a full day of learning about AI and ML algorithms. This room is full of great testers. Their opponent is a neural network: AI bots trained on data related to the questions that are about to come.

Now let’s go a step further and pretend that you are one of those testers and see if AI can beat you at your own game. We ask you
the following qualitative testing question: If you were looking at an application’s login screen, how would you know if you could
trust it or not? In other words, solely by looking at the user interface, could you rate an app’s trustworthiness? Take a moment to
think about it and then look at some of the example mobile login screens in Figure 2-6.

Figure 2-6. Rating the trustworthiness of an application based on its UI design

In Figure 2-6, the screens on the left are data samples of some of the least trusted apps, while those on the right are some of the
most trusted. Any thoughts? If it’s any consolation, the other 69 testers in the room are taking quite some time to think about it
too. There are no quick answers. A woman in the front row exclaims, “Foreign languages!” She explains that if the primary
region of the app store is the US, but the app is written in a foreign language, she wouldn’t trust it because she wouldn’t
understand what it was saying. Not a bad start, but we’ve already spent three minutes with 70 human minds thinking about this
problem in parallel. Now granted, it is not typical that you would engage 70 testers on this one problem, and there may be several
biases at play here. However, even if only one highly skilled tester produced the same result, it may not be a good use of their
time.

A couple more minutes go by, and then a hand goes up. A gentleman suggests that if there is a recognizable brand or logo on the
screen, he would probably trust the app more than apps without these features. So now, 70 people have spent five minutes of their
time, and we have two ideas for how to measure UI trustworthiness. That’s progress, but the room quickly becomes quiet again.
This is the point where the law of diminishing returns sets in. No new ideas emerge past the 10-minute mark, and in fact none come before the group has to move on.

But how did the AI bots perform in the challenge? Prior to the class, back at my company’s headquarters, ML engineers trained a
neural network using trustworthiness data from real users. The data was the result of asking individuals to rate the trustworthiness
of a large set of login screens on a scale of 1 to 10. Once trained, the bots had the ability to simulate the human raters while
working on previously unseen samples. As part of the experiment, the engineers inspected the neural network to understand the
AI’s answer to the same question I asked the human testers.
Here is the AI’s explanation:
Foreign language
If the screen has foreign words/characters, it’s less trustworthy.
Brand recognition
If the screen has a popular brand image, it’s more trustworthy.
Number of elements
If the screen has a high number of elements, it’s less trustworthy.

Interestingly, for this task, the AI appears to be “smarter” than any one person. The bots produce three UI design factors that
relate to UI trustworthiness, whereas no single person came up with more than one factor. Furthermore, not only did the AI
discover an additional aspect of the application UI that correlates to trustworthiness, but it gave a precise score of how
trustworthy it thought each screen was, on a scale of 1 to 10.
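
To make the idea tangible, here is a minimal sketch of learning a trustworthiness rater from human ratings. The real experiment trained a neural network on screen images; this sketch substitutes a simple linear model over hand-coded features that mirror the three factors above, with made-up training rows and ratings.

```python
# A minimal sketch of learning a trustworthiness rater from human ratings.
# Features mirror the three factors above: foreign text, brand presence, element count.
# The training rows and ratings are made up for illustration.
from sklearn.linear_model import LinearRegression

# Each row: [has_foreign_text, has_known_brand, num_ui_elements]
screens = [
    [1, 0, 25],
    [0, 1, 8],
    [0, 0, 12],
    [1, 0, 40],
    [0, 1, 15],
    [0, 0, 30],
]
human_ratings = [2, 9, 6, 1, 8, 4]  # trustworthiness on a 1-10 scale

model = LinearRegression().fit(screens, human_ratings)

# Rate a previously unseen login screen: no foreign text, known brand, 10 elements.
print(round(float(model.predict([[0, 1, 10]])[0]), 1))
```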

With machine learning, the bots truly learn to simulate the judgment of the humans that provide training data for a given task. In
this testing experiment, the bots directly reflect how real users view trustworthiness, while as human testers we have to pretend
and emulate that empathy indirectly. How much better is it to have the oracle be real-world users versus testers trying to reverse
engineer and guess what the end user will think or feel?

Conclusion
AI-driven test automation is causing quite a stir in the software community due to its applicability to multiple levels and
dimensions of software testing. AI is testing user interfaces, services, and lower-level components and evaluating the
functionality, performance, design, accessibility, and trustworthiness of applications. With all of the activity and buzz around AI
for software testing, it feels like the beginning of a new era of test automation. AI is giving testing some much-needed
superpowers to help tackle challenges like automatic test generation.
1 Test.ai, “Case Study App Store Provider,” 2020.
2 Dionny Santiago, “A Model-Based AI-Driven Test Generation System” (master’s thesis, Florida International University, September 9, 2018).
3 Joe Colantonio, “Top API Testing Tools for 2020,” Test Guild, May 16, 2017.

4 Dionny Santiago, “A Model-Based AI-Driven Test Generation System.”

5 Laurence Saes, “Unit Test Generation Using Machine Learning” (master’s thesis, Universiteit van Amsterdam, August 18, 2018).

6 Diffblue

7 “GPT-3 Powers the Next Generation of Apps,” OpenAI, March 25, 2021.

8 Mike Cohn, “The Forgotten Layer of the Test Automation Pyramid.”

9 Kaushal Dalvi, “End to End Performance Testing—Automated!” (paper presented at the STAREAST 2020 Conference, Orlando, Florida, May 2020).
10 Kaushal Dalvi, “End to End Performance Testing—Automated!”

11 Aleksander Bai, Heidi Mork, and Viktoria Stray, “A Cost-Benefit Analysis of Accessibility Testing in Agile Software Development Results from a Multiple Case Study,” International Journal on Advances in Software 10, nos. 1–2 (2017): 96–107.
12 Keith Briggs et al., “Semi-Autonomous, Site-Wide A11Y Testing Using an Intelligent Agent,” PNSQC Proceedings, 2019.
