Automated Testing using Image Recognition

OpenCV (3.x)
Java 8
Java Native Interface (JNI)
Automated Testing Framework


At the yearly BPMCamp[1] conference, developers were given the chance to show off ideas by presenting prototypes in a 2-hour open session. Having recently developed the Automated Testing Framework, I thought about a way to let business users design tests for BPM processes: they could build tests from what they actually interact with, meaning they could take screenshots of the UI elements that need testing, fill in the parameters (i.e. tell the testing system how to interact with each element, such as clicking or entering a string), and let the testing system handle the rest. This would require the testing system to behave the way a human would when performing manual testing: identify the element to test on the screen, interact with it, and finally verify that the results reflect what was expected. I achieved this using image recognition algorithms.

Preliminary Research

Preliminary research for this project revealed three related projects:


Sikuli

Sikuli is a software package that automates “anything that’s seen on the screen” and comes in different distributions. One of them, SikuliX, is a standalone package that can be scripted using a variety of languages. It can also be downloaded and included in any JVM-based project as a dependency for greater control. However, having initially been developed as a research project, its Java API is not well documented.


OpenCV

A cross-platform computer vision library that has been around since 2000. Having used this library in college before, I was somewhat familiar with its image recognition capabilities. At the time of this project, the library was mature, even including a GPU interface for some of its image recognition capabilities. OpenCV comes with a Java wrapper; however, this wrapper is limited, as it does not provide an API for all OpenCV functionality, such as the GPU interface.

Google Vision API

A comprehensive cloud offering by Google that performs various computer vision tasks, including optical character recognition (OCR). At the time of this project, the Google Vision platform was in its infancy and did not provide any examples of OCR performed on screenshot images.


I decided to go with OpenCV as I had experience using it before, and it provided a decent GPU interface for its image recognition algorithms, which would increase performance. However, this GPU interface was not available through the OpenCV Java API, which meant I had to use C++. Despite the challenges, this was a great opportunity for me to get back into C++ and learn how to use the Java Native Interface (JNI).

Architecture diagram showing the high level interaction between modules. Java calls to the C++ functions are made through JNI.

The first step was to implement the C++ applications - the element and text finders - which would be called by the Java application through JNI. Then, I had to extend the Automated Testing Framework (called “Application Executor” in the diagram above) to perform several simple tasks. If the element finder was being used, the following had to be done:

  1. Bring the app to be tested to the foreground.
  2. Take a screenshot.
  3. Send the screenshot and the image of the element to be found to the element finder, then start execution.
  4. Based on the response from the element finder, execute the test on the app in the foreground.

If the text finder was being used:

  1. Bring the app to be tested to the foreground.
  2. Take a screenshot.
  3. Send the screenshot and text to be found to the text finder, then start execution.
  4. Based on the response from the text finder, execute the test on the app in the foreground.
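The two flows above share the same shape, so they can be sketched in a few lines of Python. This is a hypothetical stand-in, not the framework's real API: the App class and find_element function below are invented placeholders (the actual finders are C++ programs called through JNI).

```python
from dataclasses import dataclass

@dataclass
class Match:
    found: bool
    x: int = 0
    y: int = 0

class App:
    # Hypothetical stand-in for the application under test.
    def bring_to_foreground(self):
        pass

    def take_screenshot(self):
        return "screenshot.png"

    def click(self, x, y):
        self.last_click = (x, y)

def find_element(screenshot, element_image):
    # Stand-in for the JNI call into the C++ element finder;
    # a real call would return the coordinates found on screen.
    return Match(found=True, x=120, y=48)

def run_test(app, element_image):
    app.bring_to_foreground()                    # step 1
    shot = app.take_screenshot()                 # step 2
    match = find_element(shot, element_image)    # step 3
    if match.found:                              # step 4
        app.click(match.x, match.y)
    return match.found
```

The text-finder flow is identical except that a string, rather than an element image, is handed to the native finder.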

Element Finder

System sequence diagram for the UI element finder, showing the interaction between different modules.

The element finder first checked whether the image of the UI element was smaller than a threshold; if so, it enlarged both the screenshot and the element image. Bicubic interpolation worked best for preserving the sharp edges of UI elements.

Bicubic (left) vs bilinear (right) interpolation of an input box element on the screen. Bicubic interpolation is better at preserving sharp edges.
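To illustrate the enlargement step, here is a small NumPy sketch of 1-D resampling with the Keys cubic kernel (a = -0.5, the Catmull-Rom spline). This is an illustration only; the real pipeline used OpenCV's 2-D bicubic resize, which applies this kind of kernel separably along rows and then columns.

```python
import numpy as np

def cubic_kernel(x, a=-0.5):
    # Keys cubic kernel; a = -0.5 gives the Catmull-Rom spline.
    x = np.abs(x)
    out = np.zeros_like(x, dtype=float)
    near = x <= 1
    far = (x > 1) & (x < 2)
    out[near] = (a + 2) * x[near] ** 3 - (a + 3) * x[near] ** 2 + 1
    out[far] = a * x[far] ** 3 - 5 * a * x[far] ** 2 + 8 * a * x[far] - 4 * a
    return out

def upscale_1d(samples, factor):
    # Resample a 1-D signal by an integer factor using the cubic kernel.
    # A 2-D image resize applies this along rows, then along columns.
    n = len(samples)
    xs = np.arange(n * factor) / factor        # output positions in input units
    out = np.empty(len(xs))
    for j, x in enumerate(xs):
        base = int(np.floor(x))
        taps = np.arange(base - 1, base + 3)   # the 4 nearest input samples
        weights = cubic_kernel(x - taps)
        idx = np.clip(taps, 0, n - 1)          # clamp at the image borders
        out[j] = np.dot(weights, samples[idx])
    return out
```

Because the kernel interpolates (it is 1 at offset 0 and 0 at other integers), the original samples survive the upscale exactly, which is part of why edges stay sharper than with bilinear filtering.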

Next, Speeded-Up Robust Features (SURF) - an optimized variant of the Scale-Invariant Feature Transform (SIFT) - was used to detect keypoints (i.e. areas of interest in the image) and compute descriptors for those keypoints (i.e. an internal representation that describes them, such as their gradient) in both the UI element image and the screenshot. These descriptors were then used to find the element in the screenshot. While other algorithms were available for this step, such as FAST and ORB, SURF[2] excelled at handling blurred images, and blurring tends to happen when images are scaled up.

Detection of keypoints, and where the element matches in the screenshot (right).

Next, a brute-force matcher was used to match the descriptors of the UI element image against those of the screenshot: it goes through all descriptors and finds the best match. While the brute-force matcher finds the best result, it can take a long time on larger datasets. In such cases, another matcher called FLANN may be used; FLANN uses approximations to find a good match (which may not be the best match) and is reportedly more efficient on datasets on the order of thousands of entries[3].
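A brute-force matcher is simple to sketch in NumPy. The hypothetical version below compares each descriptor in one set against every descriptor in the other by Euclidean distance, and adds Lowe's ratio test (a common companion to SURF/SIFT matching, not mentioned in the original pipeline) to discard ambiguous matches; OpenCV's BFMatcher plays this role in the actual implementation.

```python
import numpy as np

def brute_force_match(desc_a, desc_b, ratio=0.75):
    # desc_a, desc_b: (N, D) float arrays of descriptors (e.g. 64-D SURF).
    # For each descriptor in desc_a, find its nearest neighbour in desc_b.
    # Keep the pair only if the nearest distance is clearly smaller than
    # the second-nearest (Lowe's ratio test), which rejects weak matches.
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # distance to every candidate
        order = np.argsort(dists)
        best, second = order[0], order[1]
        if dists[best] < ratio * dists[second]:
            matches.append((i, int(best)))
    return matches
```

The nested comparison is what makes the brute-force approach O(N·M); FLANN trades that exhaustive search for approximate nearest-neighbour indexing.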

Finally, if a good number of matches were found (based on a preset threshold value), a perspective transform was applied to the element image, which allowed the proper screen coordinates to be obtained and sent back to the Java application, which then interacted with the element on the screen.
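The coordinate step can be sketched as follows: given the 3x3 homography estimated from the matches (e.g. via cv2.findHomography in an OpenCV pipeline), projecting the element image's corners into the screenshot yields its on-screen location. A minimal NumPy version, with a hypothetical helper name:

```python
import numpy as np

def project_corners(H, w, h):
    # Apply a 3x3 homography H to the four corners of a w-by-h element
    # image, returning their (x, y) positions in the screenshot.
    corners = np.array([[0, 0, 1],
                        [w, 0, 1],
                        [w, h, 1],
                        [0, h, 1]], dtype=float).T
    mapped = H @ corners
    mapped = mapped[:2] / mapped[2]   # perspective divide (homogeneous -> 2-D)
    return mapped.T                   # four (x, y) points
```

The perspective divide is what distinguishes a homography from a plain affine transform; for a pure translation it is a no-op.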

Text Finder

I suspected that using OCR to detect text in a screenshot could be an issue, since photos are generally much more detailed (200-300 ppi[4]) than screen-rendered text (72 ppi). Research done by others confirmed my suspicion: An evaluation of HMM-based Techniques for the Recognition of Screen Rendered Text.

The research paper above showed that Hidden Markov Model (HMM) based algorithms fared better at detecting screen-rendered text, as they don’t need to segment characters for detection. This matters because segmentation of characters is difficult, due to the anti-aliasing applied to screen-rendered text[5]. OpenCV offered two OCR options: OCRTesseract (a wrapper around Google’s Tesseract OCR library) and the HMM-based OCRHMMDecoder. Both required training data, which was also provided by OpenCV.

System sequence diagram for the text finder, showing the interaction between different modules.

Having decided to go with Tesseract OCR (due to its ease of use with OpenCV), the first step was to convert the screenshot to greyscale, then enlarge it using Lanczos interpolation[6] (as suggested by the research paper above), since Tesseract performs much better on higher-resolution images. Other OCR options existed, such as a pure HMM-based solution or ABBYY; however, HMM-based solutions were difficult to set up and train, while ABBYY was a proprietary solution not offered for standalone use.

Tesseract performs much better with higher resolution images.
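The greyscale conversion is just a weighted sum of the colour channels. Here is a NumPy sketch using the standard ITU-R BT.601 luma weights (the same weighting OpenCV's RGB-to-grey conversion uses); the Lanczos enlargement itself is best left to the library (cv2.resize with INTER_LANCZOS4 in an OpenCV pipeline).

```python
import numpy as np

def to_greyscale(rgb):
    # rgb: (h, w, 3) uint8 array. ITU-R BT.601 luma weights: green
    # contributes most because the eye is most sensitive to it.
    weights = np.array([0.299, 0.587, 0.114])
    return np.rint(rgb @ weights).astype(np.uint8)
```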

Next, the screenshot was binarized using Otsu’s method, which automatically determines the threshold by separating the image into its foreground and background pixels. Since a binarized image consists only of black and white pixels, this helped remove the anti-aliasing formed around the text during enlargement. Where there is white text on a black background, however, the image also needs to be inverted so that such text is detected as well (which can be performed as a separate sequence of steps).

A regular image (left) and its binarized version (right)
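Otsu's method picks the threshold that maximizes the between-class variance of the two pixel populations. A self-contained NumPy sketch (OpenCV's cv2.threshold with THRESH_OTSU performs this in a real pipeline):

```python
import numpy as np

def otsu_threshold(gray):
    # gray: 2-D uint8 array. Returns the threshold that maximizes the
    # between-class variance of foreground and background pixels.
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    total = hist.sum()
    cum_count = np.cumsum(hist)                     # pixels at or below each level
    cum_sum = np.cumsum(hist * np.arange(256))      # intensity mass below each level
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0 = cum_count[t - 1]       # background pixel count
        w1 = total - w0             # foreground pixel count
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_sum[t - 1] / w0
        mu1 = (cum_sum[255] - cum_sum[t - 1]) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray):
    # Map everything below the Otsu threshold to black, the rest to white.
    return (gray >= otsu_threshold(gray)).astype(np.uint8) * 255
```

For a cleanly bimodal image (e.g. dark text on a light background), the chosen threshold falls between the two intensity clusters.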

Next, the binarized screenshot was dilated to grow the white regions in the image (assuming white is the colour of the foreground objects), which allowed neighbouring text to merge, making it easier to find the contours of the merged text.

The screenshot (left), then dilated (middle), and finally contoured (right).
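Dilation slides a structuring element over the image and takes the local maximum, so white regions grow outward. A small NumPy sketch with a square kernel (a real pipeline would call cv2.dilate):

```python
import numpy as np

def dilate(binary, ksize=3, iterations=1):
    # binary: 2-D array of 0/255 values. Dilates with a ksize-by-ksize
    # square structuring element: each output pixel becomes the maximum
    # over its neighbourhood, so white (255) regions expand.
    pad = ksize // 2
    out = binary.copy()
    for _ in range(iterations):
        padded = np.pad(out, pad, mode="constant")
        shifted = np.stack([
            padded[dy:dy + out.shape[0], dx:dx + out.shape[1]]
            for dy in range(ksize) for dx in range(ksize)
        ])
        out = shifted.max(axis=0)
    return out
```

Running more iterations (or using a wider kernel) merges characters that sit further apart, at the cost of also merging separate lines of text.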

Next, a bounding box was created around each contoured area, allowing only the regions containing text to be extracted. After filtering out outliers (such as vertical text), these regions would later be given to Tesseract for analysis.

Red box outlining the area with text, found by drawing a bounding box around the contour.
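Extracting the text regions can be sketched as a connected-component pass: flood-fill each white region, take its bounding box, and drop boxes that are taller than wide (the vertical-text outliers mentioned above). The pure-NumPy version below is hypothetical; a real OpenCV pipeline would likely use cv2.findContours and cv2.boundingRect instead.

```python
import numpy as np
from collections import deque

def bounding_boxes(mask, min_aspect=1.0):
    # mask: 2-D array where nonzero = foreground. Returns (x, y, w, h)
    # boxes of 4-connected regions whose width/height ratio is at least
    # min_aspect, dropping tall, narrow (e.g. vertical-text) regions.
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy, sx] and not seen[sy, sx]:
                x0 = x1 = sx
                y0 = y1 = sy
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                while q:                      # breadth-first flood fill
                    y, x = q.popleft()
                    x0, x1 = min(x0, x), max(x1, x)
                    y0, y1 = min(y0, y), max(y1, y)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            q.append((ny, nx))
                bw, bh = x1 - x0 + 1, y1 - y0 + 1
                if bw / bh >= min_aspect:
                    boxes.append((x0, y0, bw, bh))
    return boxes
```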

Finally, the regions were given to Tesseract to extract text. If any matching text was found within these regions, its location was sent back to the Java application so that tests could interact with the app in the foreground.


Challenges

Despite being an experimental app, this project presented several challenges in the few weeks I spent finding a solution. One of them was bridging Java and C++ through JNI, where every native function must follow a strict naming and signature convention, as in the element finder’s entry point:

// Native entry point for the element finder, derived from the Java
// declaration in com.example.elementfinder.ElementFinder. The two
// jstring arguments carry the screenshot and the element image
// (e.g. as file paths); the returned jobject holds the match result.
JNIEXPORT jobject JNICALL Java_com_example_elementfinder_ElementFinder_findElementInScreenshot
        (JNIEnv *, jobject, jstring, jstring);


Although this was a difficult project, it is one that I greatly enjoyed experimenting with. I have learned about a plethora of image and text recognition algorithms, as well as the steps taken in utilizing such algorithms.


  1. BPMCamp is an annual conference held by BP3, where employees and customers are invited to watch presentations by BP3 employees and industry guests. BPM stands for “Business Process Management”. 

  2. A drawback of SURF is that it’s patented and requires permission to use commercially. 

  3. There is no official source that states what a “large dataset” entails. The “order of thousands” statement is based on other users’ observations of FLANN’s performance. 

  4. “ppi” stands for “Pixels per Inch”. It’s a way of measuring the pixel density of a screen or image. Higher pixel densities mean the image contains greater detail per square inch of area. 

  5. Anti-aliasing occasionally causes text to touch each other, which makes segmentation difficult. 

  6. Similar to cubic interpolation. 

  7. A 10 megapixel, CV_32FC3 format image (3 channels, each a 32-bit floating point value, or 96 bits per pixel) takes up about 995,328,000 bits, or ~124.4 MB. 
