Best Practices When Handling Personally Identifiable Information

Personally Identifiable Information (PII) is any information that can be used to identify an individual. Names, email addresses, dates of birth, and national ID numbers are all PII. This list is by no means exhaustive; different industries collect different types of PII.

PII needs special attention because preserving users’ privacy is vital. Strong privacy practices not only increase consumer trust but are often required by law. Some regulations (for example, the GDPR in the EU) require deleting a user’s PII when that user requests it, and that includes PII that has ended up in logs.

Unfortunately, without the proper processes and tools, it’s all too easy to write sensitive data into a log file inadvertently. This article covers best practices for working with PII in application code and during the logging process.

Working With PII in Application Code

Compartmentalize Sensitive Data

Sensitive data (including PII) should be confined to a narrow set of components in the system:

Store all PII in a dedicated set of tables (or, better, in a separate database or instance).
Use randomized internal keys to reference the PII record. Do not rely on hashing the PII alone: values such as email addresses and phone numbers are easy to guess, so a plain hash can be reversed by brute force or a lookup table on modern hardware.
Always consider how to handle data whose associated PII record has been deleted. For example, what should happen to the purchase history of a user who was removed from the system?
If we use an event-driven architecture (EDA), we should not include PII in events. Events should carry a reference that consumers resolve when needed, rather than replicating PII to an unknown number of services.
If PII must be shared, encrypt it with a separate encryption key for each user, and decrypt it only after fetching that user’s key on the fly. To remove a specific user’s PII, delete that user’s key; every encrypted copy of the PII then becomes unreadable.
Use the same per-user keys to encrypt each user’s PII in the database. When a backup is restored, only the PII of users whose keys still exist can be read (this assumes the keys themselves are not stored in the backup). The backups stay immutable while still satisfying the requirement to “remove” PII for specific users.
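
The per-user key idea in the last two items can be sketched as follows. This is a minimal Python illustration, not a production design: KeyStore, encrypt, decrypt, and read_pii are hypothetical names, and the cipher is a deliberately simple stand-in built from the standard library so the example runs without extra packages. Real code should use a vetted authenticated cipher (AES-GCM, for example) and a proper key management service.

```python
import hashlib
import secrets


class KeyStore:
    """Hypothetical per-user key store. Deleting a key 'shreds' that user's PII."""

    def __init__(self):
        self._keys = {}

    def key_for(self, user_id):
        # Create a random 256-bit key the first time we see a user.
        if user_id not in self._keys:
            self._keys[user_id] = secrets.token_bytes(32)
        return self._keys[user_id]

    def get(self, user_id):
        return self._keys.get(user_id)

    def delete(self, user_id):
        self._keys.pop(user_id, None)


def _keystream(key, nonce, length):
    # ILLUSTRATION ONLY: SHAKE-256 used as a keystream generator.
    # It stands in for a real cipher and provides no authentication.
    return hashlib.shake_256(key + nonce).digest(length)


def encrypt(key, plaintext):
    nonce = secrets.token_bytes(16)  # fresh nonce for every message
    data = plaintext.encode("utf-8")
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def decrypt(key, blob):
    nonce, data = blob[:16], blob[16:]
    stream = _keystream(key, nonce, len(data))
    return bytes(a ^ b for a, b in zip(data, stream)).decode("utf-8")


def read_pii(keys, user_id, blob):
    """Return the decrypted PII, or None if the user's key has been deleted."""
    key = keys.get(user_id)
    return None if key is None else decrypt(key, blob)


keys = KeyStore()
blob = encrypt(keys.key_for("u1"), "alice@example.com")  # may live in backups
before_delete = read_pii(keys, "u1", blob)  # readable while the key exists
keys.delete("u1")                           # the "removal" request
after_delete = read_pii(keys, "u1", blob)   # no key, so nothing to read
```

The encrypted blob can stay in events, databases, and immutable backups; only the small key store has to support deletion.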

By keeping PII in a dedicated zone, we reduce the chance of leftover data when a PII record is deleted. It also lets us apply stricter policies to PII without affecting the rest of the system.
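
As a rough illustration of a dedicated PII zone referenced by randomized keys, here is a small in-memory sketch in Python. PiiVault and the customer_ref field are made-up names for this example; in a real system the vault would be a separate table or database rather than a dictionary.

```python
import secrets


class PiiVault:
    """Hypothetical dedicated store: the only place where PII lives."""

    def __init__(self):
        self._records = {}

    def store(self, pii):
        # A random token reveals nothing about the PII it points to,
        # unlike a hash of the email address.
        token = secrets.token_urlsafe(16)
        self._records[token] = dict(pii)
        return token

    def get(self, token):
        return self._records.get(token)

    def delete(self, token):
        # Returns True if a record was actually removed.
        return self._records.pop(token, None) is not None


vault = PiiVault()
token = vault.store({"name": "Alice Example", "email": "alice@example.com"})

# Other parts of the system keep only the opaque token.
orders = [{"order_id": 1001, "customer_ref": token, "total": 42.50}]

# A deletion request touches a single place.
vault.delete(token)
```

After the deletion, the order history still exists but no longer leads to a person. Whether such orphaned records should be kept is the business decision mentioned in the list above.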

Keep Sensitive Data out of URLs

URLs, especially those of GET requests, are typically logged in multiple places (proxies, ISP servers, etc.). PII must be kept out of those URLs, or we will have trouble with compliance and with requests to remove a PII record. Here are some practices:

Identify PII early in the API design phase. Changing an API later is never a trivial task.
Don’t use sensitive fields (email addresses, phone numbers, etc.) as the unique identifier when querying via the API.
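
To make these practices concrete, the sketch below shows one possible shape for such an API from the client side, in Python. The host api.example.com, the /users/search endpoint, and both function names are assumptions made for the example: a user is addressed by an opaque reference in the URL, and a lookup by email sends the email in the request body instead of the path or query string.

```python
import json
from urllib.parse import quote

API_BASE = "https://api.example.com"  # hypothetical host


def user_url(user_ref):
    """Address a user by an opaque reference, never by email or phone."""
    return f"{API_BASE}/users/{quote(user_ref, safe='')}"


def search_by_email_request(email):
    """Describe a lookup by email as a POST so the email stays out of the URL."""
    return {
        "method": "POST",
        "url": f"{API_BASE}/users/search",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"email": email}),
    }


request = search_by_email_request("alice@example.com")
profile_url = user_url("k3J9xQ")
```

Moving the email into the body only helps if request bodies are not logged as well, so the review and redaction practices in the following sections still apply.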

Always Review Logging Statements in Code

People rarely think about PII or compliance while debugging a critical bug. However, logging statements that go unreviewed open a hole through which PII can leak into our logs. We should inspect every logging statement, especially before merging to the default branch (which is deployed to production). Let’s look at an example. Suppose we have a User class like this:

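class User
{
    private string name;      // PII
    private Address address;  // PII, and Address may have its own ToString
    private int id;

    // Called implicitly whenever a User is formatted into a log message,
    // so name and address end up in the log.
    public override string ToString()
    {
        return $"{id},{name},{address}";
    }
}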

Here is an easy checklist:

If the value to be logged is a primitive type or a string, make sure it is not PII. For example, in Console.WriteLine("User is {0}", userId);, if userId is a string, we must confirm that its value is not PII (an email address, say) before approving the merge.
If the value to be logged is an object:
Check whether its class overrides ToString. If it does, check the format used in that method and make sure it does not write PII fields (name, password, address, …) to the log. For example, in Console.WriteLine("User is {0}", user);, where user is a User object, the name (and possibly the address) is printed to the console and logged because of the ToString implementation.
Check every nested object in the logged variable. In our example, if the Address class also overrides ToString, the address will be exposed in the logs.
Wherever possible, use structured logging instead of string interpolation. It lets us apply custom masking or redaction policies by property name.
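
The structured logging item can be sketched as follows, here in Python with the standard logging module (the same idea carries over to structured logging libraries in C#). The PII_FIELDS set, the fields key passed through extra, and the RedactingJsonFormatter class are assumptions of this example, not a library convention.

```python
import io
import json
import logging

# Assumed redaction policy: property names that must never reach the log.
PII_FIELDS = {"name", "email", "address", "password"}


class RedactingJsonFormatter(logging.Formatter):
    """Emit each record as JSON, masking properties listed in PII_FIELDS."""

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        # Structured properties arrive via extra={"fields": {...}}.
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key in PII_FIELDS else value
        return json.dumps(payload, sort_keys=True)


stream = io.StringIO()  # stands in for a real log destination
handler = logging.StreamHandler(stream)
handler.setFormatter(RedactingJsonFormatter())

logger = logging.getLogger("pii_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# The message template contains no data; values travel as named properties.
logger.info("user created", extra={"fields": {"user_id": 42, "name": "Alice"}})
logged_line = stream.getvalue().strip()
```

Because every value has a name, the policy lives in one place. With string interpolation, the same name would already be baked into the message text before any filter could see it.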

Again, code review is a critical practice to protect us from inadvertently exposing PII. If the team has a merge request template, it’s worth having an item to check logging statements there.

Consider Using a Custom Serializer with Dedicated Attributes Attached to Types

This approach is more advanced than configuring a blacklist of property names. The article PCI and PII Compliance When Logging Data In Digital Transformation Projects describes a detailed example in Java. The same methodology can be applied to C# applications:

Create a custom serializer and configure the logging library to use it instead of the default one.
Create an attribute that can be applied to classes and fields. On a class, it signals that the class contains PII; on a field, it signals that the redaction policy applies to that field.
Make the custom serializer aware of the attribute so that it processes each object accordingly.
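
The three steps above can be sketched in Python, where dataclass field metadata plays the role that a custom attribute would play in C#. The pii() helper, the to_loggable function, and the Address and User shapes are illustrative assumptions rather than the referenced article's implementation.

```python
from dataclasses import dataclass, field, fields, is_dataclass

REDACTED = "[REDACTED]"


def pii():
    """Mark a dataclass field as PII (hypothetical helper)."""
    return field(metadata={"pii": True})


@dataclass
class Address:
    country: str
    street: str = pii()


@dataclass
class User:
    id: int
    address: Address
    name: str = pii()


def to_loggable(obj):
    """Serialize obj for logging, redacting fields marked with pii()."""
    if is_dataclass(obj) and not isinstance(obj, type):
        result = {}
        for f in fields(obj):
            if f.metadata.get("pii"):
                result[f.name] = REDACTED
            else:
                # Recurse so that nested objects are checked as well.
                result[f.name] = to_loggable(getattr(obj, f.name))
        return result
    if isinstance(obj, dict):
        return {key: to_loggable(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [to_loggable(value) for value in obj]
    return obj


user = User(id=7, address=Address(country="DE", street="1 Main St"), name="Alice")
safe = to_loggable(user)
```

The redaction decision sits next to the field definition, so a newly added PII field is protected as soon as it is marked, without touching any list of property names elsewhere.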

Automated Tests Are Our Friends

Automated tests can help us detect and prevent PII exposure on many levels:

Unit and integration tests can run the code with well-known test data and use regex patterns to check the captured log output, catching PII written to logs before changes are merged and deployed to production.
End-to-end tests can add steps that check whether PII is written to logs. For example, after submitting the contact form, an E2E test can check whether the user’s email and phone number appear in the log system.
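
A unit-level version of this check might look like the Python sketch below. The submit_contact_form function stands in for real application code, and the two regular expressions are deliberately simple assumptions; real patterns should be tuned to the data formats a project actually handles.

```python
import logging
import re
import unittest

# Simplified patterns for illustration; they will not catch every format.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def find_pii(text):
    """Return every email-like or phone-like substring found in text."""
    return EMAIL.findall(text) + PHONE.findall(text)


def submit_contact_form(email, phone):
    """Hypothetical code under test: it logs only a non-PII reference."""
    logging.getLogger("contact").info("contact form submitted, ref=%s", "c-1001")


class ContactFormLoggingTest(unittest.TestCase):
    def test_no_pii_in_logs(self):
        # assertLogs captures everything the "contact" logger emits.
        with self.assertLogs("contact", level="INFO") as captured:
            submit_contact_form("alice@example.com", "+1 555 010 9999")
        for line in captured.output:
            self.assertEqual(find_pii(line), [], line)
```

A loose phone pattern like this one also matches other long digit runs, such as timestamps, so in practice the check is usually applied to the message and properties rather than to the whole formatted line.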

PII in the Logging Platform

Filter by a Whitelist of Properties During Log Shipping

The log shipping layer (sometimes called the forwarder layer) can be configured with regular expressions or grok patterns to redact or mask data based on the property name. For example, the following fluentd configuration drops the whole log record if its message field matches the pattern credit_card.

<filter **> 
  @type grep 
  <exclude> 
    key message 
    pattern /credit_card/ 
  </exclude> 
</filter>
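
The configuration above excludes records that match a known pattern. The whitelist approach from the heading works the other way around: only properties that are explicitly allowed get forwarded. The logic is sketched below in Python rather than in fluentd syntax; ALLOWED_PROPERTIES and filter_record are names assumed for this example.

```python
# Assumed policy: the only properties that may leave the application host.
ALLOWED_PROPERTIES = {"timestamp", "level", "message", "request_id"}


def filter_record(record, allowed=ALLOWED_PROPERTIES):
    """Keep only whitelisted properties; everything else is dropped."""
    return {key: value for key, value in record.items() if key in allowed}


raw = {
    "timestamp": "2024-01-01T12:00:00Z",
    "level": "INFO",
    "message": "payment accepted",
    "request_id": "r-77",
    "credit_card": "4111 1111 1111 1111",  # test card number, never forwarded
    "email": "alice@example.com",
}
shipped = filter_record(raw)
```

A whitelist fails safe: a new property carrying PII is dropped by default until someone decides it may be shipped, whereas an exclusion pattern has to anticipate every sensitive name in advance. Neither approach inspects free text inside the message property.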

Remove PII from Existing Logs

Last but not least, a brownfield project usually already has PII in its logs before the practices above are applied. If retention rules allow it, we can expire the old logs and make sure the new ones follow our practices. Unfortunately, when logs must be retained (e.g., for compliance reasons), we cannot simply expire them; we usually have to process all our log data and remove or redact the unwanted records. That is one more reason to start treating PII correctly as early as possible.
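
For logs that have to be kept, a one-off scrubbing pass could look roughly like the Python sketch below. It assumes one log record per line, either as a JSON object or as plain text; PII_KEYS and the email pattern are assumptions, and nested JSON values are left untouched to keep the example short.

```python
import json
import re

REDACTED = "[REDACTED]"
PII_KEYS = {"email", "name", "phone"}  # assumed property names
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scrub_line(line):
    """Redact PII in one log line (a JSON object or plain text)."""
    try:
        record = json.loads(line)
    except ValueError:
        record = None
    if not isinstance(record, dict):
        # Plain-text line: fall back to pattern-based redaction.
        return EMAIL.sub(REDACTED, line)
    cleaned = {}
    for key, value in record.items():
        if key in PII_KEYS:
            cleaned[key] = REDACTED
        elif isinstance(value, str):
            # PII can also hide inside free-text values.
            cleaned[key] = EMAIL.sub(REDACTED, value)
        else:
            cleaned[key] = value
    return json.dumps(cleaned)


def scrub_lines(lines):
    """Lazily scrub an iterable of lines, e.g. an open log file."""
    for line in lines:
        yield scrub_line(line.rstrip("\n"))
```

Since the pass rewrites historical data, it is worth running it against a copy first and checking the result with the same patterns that the automated tests use.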

We’ve covered some practices to protect PII from exposure in application code and in the logging platform. There are more requirements and techniques for protecting this sensitive data at rest and in transit, but they are beyond the scope of this article. Ultimately, the most important thing is that every engineer keeps PII in mind throughout software development.