
Black Hat USA 2020 Highlights: Portable Document Flaws 101


#2 Portable Document Flaws 101

Last time, I shared my thoughts on the novel TLS-based attack. This time, I want to cover research by Jens Müller (@jensvoid) on insecure (and lesser-known) PDF features. He presented the “Portable Document Flaws 101” session on the second day of the BHUSA 2020 conference. The presentation was demo-rich, and the author even left us with a couple of PoC documents.

PDF basics

We all know and use PDF files. The name stands for Portable Document Format, and the format is in fact quite portable. You can read (and often edit and create) PDFs on desktop, mobile, and web platforms, and documents mostly look the same no matter what software or operating system you are using.

The format was developed by Adobe in 1993, but it remained proprietary until 2008, when it was released as an open standard. Over the years, many features were added. Some of them are part of the core standard; others are extensions supported only by Adobe software. All 1.x specifications are backward inclusive, which makes conforming implementations quite heavy and complex. Version 2.0 deprecated some features and standardized newer Adobe extensions. This complexity makes PDF readers a common target for fuzzing, and it also explains why it’s important to sandbox PDF processing. Feature support differs between products too: not every reader implements JavaScript (e.g. you won’t find it implemented in the Google Chrome built-in viewer), and some methods are available in Adobe Acrobat DC but not in Adobe Reader.

Although we typically use dedicated software to view PDFs, it’s actually a text-based format. Go on, find any PDF file and open it in Notepad. You will see the header, the body, the cross-reference table, the trailer, and so on. Some elements may include binary content; otherwise, data is encoded using one of several methods. Dedicated software can analyze documents without rendering them but, to some degree, you can analyze a document with a regular text editor. Or better, a hex editor.
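To make the structure a bit more tangible, here is a minimal Python sketch that outlines the main parts of a PDF without rendering it. The sample document is hand-written for illustration (its cross-reference offsets are not exact), and the outline() helper is my own hypothetical name, not something from the talk. It only looks at plain-text markers, so objects hidden inside compressed streams would be missed.

```python
import re

# Tiny hand-written sample; the xref offsets are illustrative, not exact.
SAMPLE = b"""%PDF-1.7
1 0 obj
<< /Type /Catalog /Pages 2 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [] /Count 0 >>
endobj
xref
0 3
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
trailer
<< /Size 3 /Root 1 0 R >>
startxref
110
%%EOF
"""


def outline(data: bytes) -> dict:
    """Report the main structural parts of a PDF by scanning its raw bytes."""
    header = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    # Indirect objects start with "<number> <generation> obj" on their own line.
    objects = re.findall(rb"(?m)^(\d+) (\d+) obj\b", data)
    return {
        "version": header.group(1).decode() if header else None,
        "objects": [(int(num), int(gen)) for num, gen in objects],
        "has_xref": re.search(rb"(?m)^xref\b", data) is not None,
        "has_trailer": re.search(rb"(?m)^trailer\b", data) is not None,
        "has_eof": data.rstrip().endswith(b"%%EOF"),
    }


if __name__ == "__main__":
    for key, value in outline(SAMPLE).items():
        print(f"{key}: {value}")
```

For a real file, you would read it in binary mode and pass the bytes to outline(); a proper parser is still the right tool for anything beyond a first look.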

Example PDF file (source: BH presentation)

What’s particularly interesting is that a PDF’s header starts with the %PDF string. That’s how software can tell what format it is dealing with. However, the string doesn’t have to be placed at the very beginning of the file; anywhere within the first 1024 bytes will do. This makes PDF a great format for so-called polyglots: files that are valid in several formats at once. This can be useful during pentests to bypass content filters, since one can easily create a file that is a valid PDF document but also a valid Windows executable or ZIP archive. Some examples can be found in the truepolyglot repository.

Jens explained that PDFs often contain poorly redacted content. The output document may carry metadata that identifies the author or the software used to create it, and it may also contain previous versions of the document that are simply no longer directly referenced. The same often happens with “redacted” documents: the black square over the text is just another layer, and the unredacted text is still available in one of the text streams. Over the years, multiple signature attacks were also published: either parts of the document were not properly protected by the signature, or it was possible to steal a signature and use it to sign other documents.

The PDF standard contains the concept of actions. An action can navigate to a page within the document, open a URL, or run JS code. Actions are triggered by various events, such as opening a document or printing it. Jens presented an invocation tree that shows how multiple paths can be used to achieve the same result, and this is a core problem that implementations have to deal with. Mitigating a single path doesn’t solve the issue; it only removes one way of triggering it.
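The header rule mentioned above is easy to check by hand. Below is a small Python sketch that looks for the %PDF marker within the first 1024 bytes and flags files where it does not sit at offset 0. The function names are hypothetical, and real readers may be stricter or more lenient than this rule, so treat it as a triage heuristic only.

```python
def find_pdf_header(data: bytes, window: int = 1024) -> int:
    """Return the offset of the %PDF marker within the first `window` bytes, or -1."""
    return data[:window].find(b"%PDF")


def header_is_displaced(data: bytes) -> bool:
    """True when the marker exists but something else precedes it (polyglot candidate)."""
    return find_pdf_header(data) > 0


if __name__ == "__main__":
    plain = b"%PDF-1.4\n..."
    # ZIP local-file signature followed by padding, then a PDF header.
    mixed = b"PK\x03\x04" + b"\x00" * 100 + b"%PDF-1.4\n..."
    print(find_pdf_header(plain), header_is_displaced(plain))
    print(find_pdf_header(mixed), header_is_displaced(mixed))
```

A displaced header is not proof of anything malicious, but it is a cheap signal that the file deserves a closer look before a content filter lets it through.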

Call tree (source: BH presentation)
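Since the same result can be reached through several triggers and actions, a quick first step when triaging a document is to count the relevant PDF name objects in its raw bytes. The sketch below is a rough heuristic under a few assumptions: the list of names is my own selection, and the scan misses names that are hex-escaped or stored inside compressed object streams.

```python
import re
from collections import Counter

# Names of common triggers and actions; an illustrative selection, not a complete list.
NAMES = (b"/OpenAction", b"/AA", b"/JavaScript", b"/JS",
         b"/Launch", b"/URI", b"/SubmitForm", b"/GoToR")


def count_action_names(data: bytes) -> Counter:
    """Count occurrences of trigger and action names in raw PDF bytes."""
    counts = Counter()
    for name in NAMES:
        # A name ends at whitespace or a delimiter, so /JS must not match /JSON.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        counts[name.decode()] = len(re.findall(pattern, data))
    return counts


if __name__ == "__main__":
    sample = b"<< /Type /Catalog /OpenAction << /S /JavaScript /JS (app.alert(1)) >> >>"
    for name, hits in count_action_names(sample).items():
        print(name, hits)
```

A non-zero count only says that the feature is present somewhere in the file; whether a given reader actually follows that path is exactly what the invocation tree is about.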

During the talk, the following attack types were covered:

  • DoS
  • Information Disclosure
  • Data Manipulation
  • Code Execution

I won’t cover all of the presented attacks, only the ones I found most interesting. I highly recommend checking out Black Hat materials, though.

DoS attacks

Two types of DoS attacks were described. Both result in resource consumption and can hang the software or even the entire operating system. The first type is based on infinite loops, built from references or JS code (e.g. while(1) {}). The second type is based on compression bombs. The logic is similar to deflate bombs known from other formats, such as ZIP files, and is implemented using the FlateDecode filter. Attackers can also chain multiple compression filters and achieve a ratio of up to 1:18,576,846. In other words, 587 bytes of data are decompressed in memory into a 10 GB buffer. This will surely cause problems, and the bad news is that almost all PDF software is vulnerable to this attack. What’s worse, merely previewing the document in GNOME or Windows Explorer may trigger the attack and freeze the operating system.
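The effect of chaining compression filters can be reproduced at a harmless scale with Python’s zlib module, which implements the same deflate algorithm that FlateDecode relies on. This is only a sketch: it uses 4 MiB of zeros instead of gigabytes, the function names are hypothetical, and the size cap shown is one possible mitigation, not what any particular reader does.

```python
import zlib


def nested_deflate(payload: bytes, layers: int) -> bytes:
    """Compress `payload` with deflate `layers` times in a row."""
    out = payload
    for _ in range(layers):
        out = zlib.compress(out, 9)
    return out


def inflate_capped(blob: bytes, layers: int, limit: int) -> bytes:
    """Undo `layers` rounds of deflate, refusing any stage larger than `limit` bytes."""
    out = blob
    for _ in range(layers):
        # Ask for at most limit + 1 bytes so an oversized stage is detected early.
        stage = zlib.decompressobj().decompress(out, limit + 1)
        if len(stage) > limit:
            raise ValueError("decompressed data exceeds the allowed size")
        out = stage
    return out


payload = bytes(4 * 1024 * 1024)  # 4 MiB of zero bytes, small enough to be harmless
one_layer = nested_deflate(payload, 1)
two_layers = nested_deflate(payload, 2)

if __name__ == "__main__":
    print("original:", len(payload))
    print("one layer:", len(one_layer))
    print("two layers:", len(two_layers))
```

Running it shows the second layer shrinking the already compressed data again, which is why chained filters reach much higher ratios than a single one. The capped decompression illustrates the defensive side: bounding the output per stage stops the expansion before memory is exhausted.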

Information disclosure attacks

It turns out that there are many ways to trigger a URL opening. By opening an untrusted document, you may unintentionally connect to the attacker’s server, disclosing that the document was opened and revealing your IP address (which may deanonymize you). If the attacker’s web server requests NTLM-based authentication (guess what?), your PDF reader will leak NetNTLMv2 hashes. This allows for additional attacks, such as credential relaying or brute-forcing.

There were also issues related to revealing data entered into PDF forms. Tax forms and similar documents are often downloaded from various pages, not necessarily trusted locations. That seems OK: you are planning to fill in the document and then only print it or send it elsewhere. However, due to the form data leakage issue, the attacker may register an action on the “print” event and capture all entered information, which is then exfiltrated from your system.

The most interesting, though, is the ability to steal local files! Fortunately, only a few readers were found vulnerable to this attack. Basically, the standard says that it is possible to embed a local file into a PDF form and submit it to an arbitrary URL, and that is exactly what was presented. In the picture below, you can see how the content of C:\Windows\win.ini is transferred as part of a GET request.

Exfiltrated Windows file (source: BH presentation)

I asked Jens if JavaScript could be used to encode the payload (e.g. with Base64) before it is sent, and he confirmed that it is possible. That way, we are not limited to certain characters, and any kind of content could be exfiltrated.
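To show why the encoding step matters, here is a short sketch of the idea in Python rather than PDF JavaScript. It illustrates only the encoding: raw file content with line breaks and brackets becomes a URL-safe Base64 token that survives a GET request and can be decoded on the receiving side. The collector.example host, the d parameter, and both function names are made up for the example.

```python
import base64
from urllib.parse import parse_qs, urlencode, urlsplit


def to_query(url: str, content: bytes) -> str:
    """Build a GET URL that carries `content` as a URL-safe Base64 token."""
    token = base64.urlsafe_b64encode(content).decode("ascii")
    return f"{url}?{urlencode({'d': token})}"


def from_query(full_url: str) -> bytes:
    """Recover the original bytes from a URL produced by to_query()."""
    token = parse_qs(urlsplit(full_url).query)["d"][0]
    return base64.urlsafe_b64decode(token)


if __name__ == "__main__":
    # Sample text in the style of a win.ini file, including CRLF line endings.
    sample = b"; for 16-bit app support\r\n[fonts]\r\n"
    url = to_query("http://collector.example/log", sample)
    print(url)
    print(from_query(url) == sample)
```

The same round trip works for binary content, which is the point of the answer: once the payload is encoded, the character set of the original file no longer limits what can be sent.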

Code execution attacks

The PDF standard has yet another interesting feature: the launch action. This action launches an application, or opens or prints a document. The launched file can be a local file, but it can also be delivered along with the document as an embedded executable. The majority of applications simply don’t follow the standard here and don’t allow unconstrained code execution. Some of them do, though. It’s interesting that the standard doesn’t really take security concerns into account.

On Linux, the associated application is used to open the specified file. So a text file will be opened with the text editor, most of the files with the archive manager, and so on. Command injection is restricted because lstat() is first run on the path, but I think it could still be turned into controlled code execution. The bad news is that the file has to be on the disk already, but maybe polyglot PDFs could be used here?


The summary of identified flaws is presented in the following matrix. Some of them are already fixed, some of them remain unaddressed.

Affected software (source: https://github.com/RUB-NDS/PDF101)

The PoC documents are also available on the author’s GitHub. Make sure you understand what you are doing before opening the files in any PDF reader (including a browser’s built-in viewer). Some of the findings are very likely to be used in phishing campaigns. It’s worth taking a hard look at the software you use and deciding whether other readers should be used instead.


Adrian Denkiewicz, Cybersecurity Specialist at CQURE, Ethical Hacker, Penetration Tester, Red Teamer, Software Developer, and Trainer. Holder of OSCE, OSCP, OSWP, and CRTE certificates. Adrian is deeply interested in the offensive side of security, ranging from modern web attacks, through operating system internals, to low level exploit development.  Twitter: @a_denkiewicz
