Parsing and asserting inside a PDF document from Selenium in three easy steps


The Portable Document Format (PDF) is a file format used to present documents in a manner independent of application software, hardware, and operating systems. Each file encapsulates a complete description of a fixed-layout flat document.

Because of this, a PDF differs from an ordinary document file, and the data inside it is harder to edit.

The following use case is presented in this Selenium tutorial:

In a software project with Selenium-based automated tests, a test script needs to verify predefined expected results against actual results that are stored in a PDF file.

Why is PDF verification useful in automated tests?

PDFs are commonly used to download data from websites, such as user details, balance sheets, or Pro Forma Invoices for individual purchases. In a typical automated test scenario for an eCommerce WebApp, we would create an order, submit it, and then verify the result. This also includes asserting that the details in the Pro Forma Invoice (PI) match the result presented in the WebApp.

To assert this information, we need a PDF reader library that parses the entire document and loads its content into an object of type PDF that we can then inspect.

In order to implement this in Java, the following steps can be used:

1. Import the PDF reader library (in our case com.codeborne.pdftest) in the test class.

If you are using Maven to manage your project dependencies, add the pdf-test dependency (group com.codeborne, artifact pdf-test) to pom.xml.
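A dependency entry along these lines should work. The `pdf-test.version` property is a placeholder assumed here, not something defined by the library; set it to whichever release you use.

```xml
<!-- Sketch of the pom.xml entry; pdf-test.version is a hypothetical property
     that you define yourself in the <properties> section. -->
<dependency>
    <groupId>com.codeborne</groupId>
    <artifactId>pdf-test</artifactId>
    <version>${pdf-test.version}</version>
    <scope>test</scope>
</dependency>
```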

For further details on how PDF parsing is done, you can check the GitHub repository at: https://github.com/codeborne/pdf-test.

2. Download the PDF file to be parsed from the WebApp during the automated test. Create a path to the download location for the test script and add initial assertions.

In order to verify a file, we need to know its name and location to create a path to it.

On a Windows machine, the usual default download location for Chrome is “C:\Users\%username%\Downloads”; however, this can easily be configured.

The name of the file should follow a unique but predictable pattern, built from values such as the username, date of purchase, order number, or an incrementing download counter.

So we need to determine exactly how the file name is generated, save the relevant details in earlier steps of the test script, and use them to build a direct path to the downloaded PDF.

In the scenario used for this tutorial, the PDF file name is the concatenation of the user id and the order number.

If the tests run on separate machines in a Selenium Grid configuration, the Windows username will differ on each one. Instead of hardcoding the username in the download path, it is recommended to read it from the Windows username environment variable. In Java this is done by calling System.getenv("username") when creating the path to the file. The argument eComm is used to get the previously stored order number.
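The path construction described above can be sketched as follows. The class and method names, the `.pdf` extension, and the concatenation without a separator are assumptions made for illustration. The `eComm` object is not shown in the text, so the order number is passed in as a plain string.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Sketch: builds the expected location of a downloaded invoice PDF. */
final class InvoicePaths {

    private InvoicePaths() { }

    /** Assumed naming rule: user id followed directly by the order number. */
    static String fileName(String userId, String orderNumber) {
        return userId + orderNumber + ".pdf";
    }

    /** Resolves the expected file inside the given download directory. */
    static Path inDirectory(Path downloadDir, String userId, String orderNumber) {
        return downloadDir.resolve(fileName(userId, orderNumber));
    }

    /** Default Chrome download directory of the current Windows user. */
    static Path windowsDownloadDir() {
        // Windows treats variable names case-insensitively, so "username" also works there.
        String user = System.getenv("USERNAME");
        if (user == null) {
            throw new IllegalStateException("USERNAME environment variable is not set");
        }
        return Paths.get("C:\\Users", user, "Downloads");
    }
}
```

On a Grid node this could be called as `InvoicePaths.inDirectory(InvoicePaths.windowsDownloadDir(), userId, orderNumber)`. Keeping the directory as a parameter makes it easy to switch to a custom download location.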

After the path to the file has been created, it is recommended to add additional verification steps.
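One check worth considering at this point is that the download has actually finished before the file is used. The helper below is a hypothetical sketch that uses only the JDK and is not part of Selenium or pdf-test. The 100 ms polling interval is an arbitrary choice.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

/** Sketch: waits until a downloaded file is present on disk. */
final class DownloadWaiter {

    private DownloadWaiter() { }

    /** Returns true if the file exists and is non-empty before the timeout elapses. */
    static boolean waitForFile(Path file, Duration timeout) {
        long start = System.nanoTime();
        while (!isNonEmptyFile(file)) {
            if (System.nanoTime() - start >= timeout.toNanos()) {
                return false; // timed out
            }
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return true;
    }

    private static boolean isNonEmptyFile(Path file) {
        try {
            return Files.isRegularFile(file) && Files.size(file) > 0;
        } catch (IOException e) {
            return false; // for example, the file vanished between the two checks
        }
    }
}
```

A test could then fail early with a clear message when `waitForFile(path, Duration.ofSeconds(30))` returns false, instead of failing later while parsing.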

We should verify that the PDF can be opened in the browser for a quick overview. This simply simulates normal user behavior through the Selenium automated tests.

Once the PDF is displayed in the browser, you can register a listener, take a screenshot of it, store it, or send it via email as required.

It is also worth asserting that the browser is showing the correct URL.
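A minimal sketch of such a URL check follows. It works on plain strings so that it stays independent of the browser. In a Selenium test the first argument of `pointsTo` would typically come from `driver.getCurrentUrl()` after `driver.get(...)` has opened the file URL. The class and method names are hypothetical.

```java
import java.net.URI;
import java.nio.file.Path;

/** Sketch: helpers for checking that the browser shows the expected PDF. */
final class PdfUrlCheck {

    private PdfUrlCheck() { }

    /** Builds a file: URL that a browser can open for a local file. */
    static String fileUrl(Path pdf) {
        return pdf.toAbsolutePath().toUri().toString();
    }

    /** True if the path part of the URL ends with the expected file name. */
    static boolean pointsTo(String currentUrl, String expectedFileName) {
        // getPath() returns the decoded path, or null for opaque URIs
        String path = URI.create(currentUrl).getPath();
        return path != null && path.endsWith("/" + expectedFileName);
    }
}
```

Comparing only the file name keeps the assertion valid on every Grid node, since the directory part of the URL contains the machine-specific username.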

3. Create a PDF object and verify portions of its text.

Once we know the location of the file, we can create a PDF object and start the verification. The method loads all details from the document into an object of type PDF and verifies the key portions that are relevant to our test, such as customer name, order total, and order number.
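With pdf-test, the object is typically created with `new PDF(new File(path))` and checked with `assertThat(pdf, containsText(...))`. The sketch below illustrates the same kind of check on a plain string, so it runs without the library. In a real test the text would come from the parsed PDF object. The class name, method names, and sample values are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: checks that key fragments occur in text extracted from a PDF. */
final class PdfTextCheck {

    private PdfTextCheck() { }

    /** Collapses every run of whitespace (spaces, tabs, line breaks) into one space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the expected fragments that do NOT occur in the PDF text. */
    static List<String> missingFragments(String pdfText, List<String> expected) {
        String haystack = normalize(pdfText);
        List<String> missing = new ArrayList<>();
        for (String fragment : expected) {
            if (!haystack.contains(normalize(fragment))) {
                missing.add(fragment);
            }
        }
        return missing;
    }
}
```

Returning the list of missing fragments, rather than a single boolean, lets the test report every mismatch (customer name, order total, order number) in one failure message.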

Besides containsText, com.codeborne.pdftest offers other verification methods, such as containsExactText and containsTextCaseInsensitive, which differ in how strictly the text is compared (exact match and case-insensitive match, respectively).
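Judging by their names, these matchers differ in how strictly they compare text. The following sketch illustrates that difference on plain strings. It is not the library's implementation, and the method names are made up for this example.

```java
import java.util.Locale;

/** Sketch: three levels of strictness for a "contains" check on text. */
final class TextMatchModes {

    private TextMatchModes() { }

    private static String collapseSpaces(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    /** Whitespace-tolerant match: runs of whitespace count as a single space. */
    static boolean containsLoosely(String text, String expected) {
        return collapseSpaces(text).contains(collapseSpaces(expected));
    }

    /** Exact match: the substring must appear character for character. */
    static boolean containsExactly(String text, String expected) {
        return text.contains(expected);
    }

    /** Whitespace-tolerant match that also ignores letter case. */
    static boolean containsIgnoringCase(String text, String expected) {
        return collapseSpaces(text).toLowerCase(Locale.ROOT)
                .contains(collapseSpaces(expected).toLowerCase(Locale.ROOT));
    }
}
```

An exact comparison is the safest choice for values such as order numbers, while a tolerant comparison is more forgiving of line breaks that PDF text extraction tends to introduce.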
