Redaction is crucial for businesses. It ensures that confidential material is not viewed by the wrong people. This applies especially to those operating in the legal, financial, healthcare and governmental sectors, where data security is tightly regulated. We’ve talked about redaction previously here.
The legal team of Paul Manafort, former Trump campaign chairman, knows this all too well.
Recently, in a court document filed by Manafort’s lawyers, sensitive content was improperly redacted. The team, or possibly the Department of Justice, added a black box over the text, leaving the text items visible under the shape (and therefore in the source code too). This means that you can easily open the document, delete the black box, and reveal the sensitive information.
Diving a little deeper into the source of the document shows that the attempted redaction was done with PDFium:
It’s important to note this software does not support true redaction. This means that a lack of functionality ended up causing them a lot of trouble.
Of course, this isn’t the first time that improper redaction, combined with a general misunderstanding of what redaction actually is, has created a massive headache, and it certainly won’t be the last. Another example comes from the Medical Council of New South Wales.
In 2016, they made the same mistake in attempting to redact Protected Health Information (PHI) with black boxes. Unfortunately for them, Googlebot (Google’s web crawler) reads the underlying text rather than the visual rendering, so it indexed a report with the PHI data intact.
Another improper form of redaction that is all too often used involves changing the background color of text to black. Of course, if you don’t delete the text in this instance, it’s easily recoverable by removing the background color.
With that out of the way, let’s take a look at the Manafort document in question (link).
If we open the document and go to page 5, we are met with what looks like redacted content:
However, on closer inspection, we see that it’s in fact just a black box that is easily removed by selecting the box and deleting it:
So what is proper redaction?
Real redaction doesn’t just black out the general area where the now redacted text used to be. It completely removes every trace of the text ever being there. That means the text is gone from:
- Paragraphs, meaning it can’t be copied and pasted
- Source code, meaning it can’t be crawled or parsed
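The difference between covering text and removing it can be sketched with a toy model. This is not how a real PDF is stored or how any particular product works: the `TextItem` and `BlackBox` classes and the three helper functions are hypothetical names invented for illustration, and a page is simplified to a plain list of objects.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """A piece of text stored on the page (hypothetical, simplified)."""
    text: str


@dataclass
class BlackBox:
    """A filled rectangle drawn on the page; it carries no text."""


def extract_text(page):
    """Return all text stored on the page, ignoring any shapes drawn on top.

    This mimics what copy and paste, a crawler, or a parser would see.
    """
    return " ".join(obj.text for obj in page if isinstance(obj, TextItem))


def cover_with_box(page, index):
    """Improper redaction: add a box but leave the text item in place.

    `index` is unused on purpose; the box only changes how the page looks.
    """
    return page + [BlackBox()]


def true_redact(page, index):
    """Proper redaction: remove the text item itself, then add a box."""
    kept = [obj for i, obj in enumerate(page) if i != index]
    return kept + [BlackBox()]


page = [TextItem("Client:"), TextItem("Jane Doe"), TextItem("Status: open")]

covered = cover_with_box(page, 1)
redacted = true_redact(page, 1)

print(extract_text(covered))   # the name is still there
print(extract_text(redacted))  # the name is gone
```

Both versions would look the same on screen, but only the second one passes the two checks in the list above.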
It’s important to emphasize that redaction does not mean adding black boxes to a PDF.
As shown, this can cause major trouble. Here’s what they should have done to correctly redact the information:
How to redact a PDF
- Start up your PDF editor of choice – we recommend developing your own with our world-class true redaction functions in Foxit PDF SDK of course!
- Open your document and navigate to the content for redaction.
- Highlight the text as shown below in the red box and select Protect > Mark for Redaction.
- Next, hit Apply Redactions.
- Now your selected content is fully redacted and can no longer be accessed, even in the source code. Shown below is the correctly redacted content alongside the incorrectly redacted content:
This is just one way to redact content. Besides doing it manually, as shown above, there’s also the more automated approach to redacting, which involves pattern matching, page regions, and text searches.
Pattern matching redaction
To redact content by pattern matching, you must first establish the pattern you are looking for. For example, if you are looking to redact US social security numbers, your pattern will be in the format xxx-xx-xxxx. Once you have searched for this pattern, you simply mark and redact every place where it appears.
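As a rough illustration of the idea, here is a minimal sketch in Python that works on text already extracted from a document. It assumes the usual three-two-four digit grouping of US social security numbers; the names `SSN_PATTERN` and `redact_pattern` are made up for this example and are not part of any SDK.

```python
import re

# Assumed format: three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pattern(text, pattern=SSN_PATTERN, mark="[REDACTED]"):
    """Replace every match of `pattern` in `text` with `mark`."""
    return pattern.sub(mark, text)


sample = "Applicant SSN: 123-45-6789, phone 555-010-9999."
print(redact_pattern(sample))
```

The phone number in the sample is left alone because its digit grouping does not fit the pattern. In a real PDF workflow the matches would be marked and then removed from the file itself, not just replaced in an extracted string.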
Page region redaction
If you have documents that follow the same design and layout, you can redact based on the page region of the content. For example, if the phone number of your patients is always located in the final third of page 2 of a document, then that can be a basis for redaction.
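The region-based approach can be sketched as follows. This is a simplified model under stated assumptions, not an actual PDF API: `Rect`, `PlacedText`, and `redact_region` are hypothetical names, and coordinates are assumed to be fractions of the page size with the origin at the top left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; values are fractions of page width/height."""
    left: float
    top: float
    right: float
    bottom: float

    def intersects(self, other):
        """True when the two rectangles overlap by a non-zero area."""
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


@dataclass(frozen=True)
class PlacedText:
    """A text item with the page it sits on and its bounding box."""
    page: int
    box: Rect
    text: str


def redact_region(items, page, region):
    """Drop every text item on `page` whose box overlaps `region`."""
    return [it for it in items
            if not (it.page == page and it.box.intersects(region))]


# The final third of a page, top-left origin (assumption for this sketch).
FINAL_THIRD = Rect(0.0, 2 / 3, 1.0, 1.0)

items = [
    PlacedText(2, Rect(0.1, 0.10, 0.9, 0.15), "Visit summary"),
    PlacedText(2, Rect(0.1, 0.80, 0.5, 0.85), "Phone: 555-0100"),
    PlacedText(3, Rect(0.1, 0.80, 0.5, 0.85), "Page 3 footer"),
]

remaining = redact_region(items, page=2, region=FINAL_THIRD)
print([it.text for it in remaining])
```

Only the item in the final third of page 2 is removed; the same position on page 3 is untouched. This approach depends entirely on the layout being consistent, so it is worth spot-checking documents whose layout may have shifted.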
Text search redaction
If you are looking to redact a pre-defined series of text strings out of your PDF documents, you can program your application to actively search for them and redact them out of documents.
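A minimal sketch of this, again operating on extracted text: the function names `build_matcher` and `redact_terms` are invented for the example, and matching is assumed to be case-insensitive, which may or may not suit your documents.

```python
import re


def build_matcher(terms):
    """Compile one regex that matches any of the given literal strings."""
    # Longest terms first, so a longer term wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)


def redact_terms(text, terms, mark="[REDACTED]"):
    """Replace every occurrence of each term in `text` with `mark`."""
    if not terms:
        return text
    return build_matcher(terms).sub(mark, text)


terms = ["Jane Doe", "Acme Holdings"]
doc = "Payment from ACME Holdings to Jane Doe was approved by J. Smith."
print(redact_terms(doc, terms))
```

Escaping each term with `re.escape` keeps characters such as `.` or `(` in a name from being interpreted as regex syntax. Note that a fixed list only catches the exact strings you supply, so variants like "J. Doe" need their own entries.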
Foxit SDK is your expert partner in redaction. Our PDF SDK solution enables you to programmatically search for and censor sensitive information to keep your documents safe. See our redaction features here or get in contact with us below for a free trial.