Data Masking for User Privacy with the NGINX JavaScript Module

by

The effect of wearing clothing made of a fabric designed to thwart face-recogntion software is similar to data masking in log files

The effect of wearing clothing made of a fabric designed to thwart face-recognition software is similar to data masking in log files
Fabric pattern that confuses facial-recognition software, designed by Adam Harvey

In October 2016, the Court of Justice of the European Union ruled that IP addresses are “personal information” and as such fall under the Data Protection Directive and General Data Protection Regulation (GDPR). For many website owners, this presents challenges for archiving and analyzing log files if the data leaves the EU. For data moving into the US, the EU‑US Privacy Shield provides some protection but faces legal challenges by privacy groups and governments who do not believe the level of protection is adequate.

However, protecting personal data in log files is not just an EU problem. For organizations with security certifications like ISO/ICE 27001, moving log files outside of the security realm where they were generated, say from network ops to marketing, can compromise the scope and compliance of the certification.

In this blog post, we describe some simple solutions for sanitizing NGINX Plus and NGINX log files so that they can be safely exported without exposing what is often called personally identifiable information (PII).

[Editor – This post is one of several that explore use cases for the NGINX JavaScript module, which was originally called nginScript. For a complete list, see Use Cases for the NGINX JavaScript Module.

The post is updated to use the js_import directive, which replaces the js_include directive in NGINX Plus R23 and later. For more information, see the reference documentation for the NGINX JavaScript module – the Example Configuration section shows the correct syntax for NGINX configuration and JavaScript files.]

Simple Approaches Do Not Work

The simplest approach to personal data protection is to strip IP addresses from logs before they are exported. This is easy to achieve with standard Linux command line tools, but log analysis systems may expect log files in a standard format and fail to import logs that omit the IP address field. Even if logs are successfully imported, the value of log processing may be significantly reduced if the analysis system relies on IP addresses to track a user across a site.

Another potential approach, substituting fake or random values for real IP addresses, results in log files that look complete, but the quality of log analysis is compromised because each log entry appears to originate from a different randomly generated IP address.

Masking the Client IP Address

The most effective solution is to use a technique called data masking to transform the real IP address into one that does not identify the end user but still allows correlation of website activity for a particular user. Data masking algorithms always produce the same pseudorandom value for a given input value in a way that ensures it cannot be converted back to the original input value. Every occurrence of an IP address is always transformed to the same pseudorandom value.

You can implement IP address masking in NGINX and NGINX Plus with the NGINX JavaScript module. The module is a unique JavaScript implementation for NGINX and NGINX Plus, designed specifically for server‑side use cases and per‑request processing. In this case we execute a small amount of JavaScript code to mask the client IP address as each request is logged.

Instructions for enabling the NGINX JavaScript module with NGINX and NGINX Plus appear at the end of this article.

NGINX and NGINX Plus Configuration for IP Address Masking

The log_format directive controls which information appears in the access logs. NGINX and NGINX Plus ship with a default log format called combined which produces log files that can be processed by most log‑processing tools.

For this configuration we create a new log format, masked, which is identical to the combined format except for the first field, where we replace the $remote_addr variable with $remote_addr_masked. The new variable is evaluated by executing our JavaScript code, which generates a masked version of the client IP address.

To have NGINX and NGINX Plus write access logs in this format we configure a server block with an access_log directive that specifies the masked format.

As we are going to use NGINX JavaScript to mask the client IP address, we use the js_import directive to specify the location of our JavaScript code. The js_set directive specifies the JavaScript function that will be executed when evaluating the $remote_addr_masked variable.

Notice that we use two access_log directives. The first uses the default log format to produce access logs that can be used by administrators for operational purposes. The second specifies the masked log format. With this configuration we write two access logs for each request – one for sysadmins and DevOps, and one for export.

Finally, the location block defines a very simple response using the return directive to show that data masking is working. In production, this would most likely contain a proxy_pass directive to direct requests to a backend server.

NGINX JavaScript Code for IP Address Masking

We construct masked IP addresses using three simple JavaScript functions. Dependent functions must appear first so we will discuss them in that order.

The essence of the data masking solution is to use a one‑way hashing algorithm to transform the client IP address. In this example we are using the FNV‑1a hash algorithm, which is compact, fast, and has reasonably good distribution characteristics. Its other advantage is that it returns a positive 32‑bit integer (the same size as an IPv4 address), which makes it trivial to present as an IP address. The fnv32a function is a JavaScript implementation of the FNV‑1a algorithm.

The i2ipv4 function converts a 32‑bit integer to an IPv4 address in quad‑dotted notation. It takes the hashed values from fnv32a() and provides a representation that “looks right” in our access log. Both IPv6 addresses and IPv4 addresses are represented in IPv4 format.

Finally, we have the maskRemoteAddress function, which is referenced by the js_set directive in the NGINX and NGINX Plus configuration above. It has a single parameter, req, which is the JavaScript object representing the HTTP request. The remoteAddress property contains the value of the client IP address (equivalent to the $remote_addr variable).

IP Address Masking in Action

With the above configuration in place, we can make a simple request to our server and check the response and the resultant access log entries.

$ curl http://localhost/
127.0.0.1 -> 8.163.209.30
$ sudo tail –lines=1 /var/log/nginx/access*.log
==> /var/log/nginx/access.log <== 127.0.0.1 - - [16/Mar/2017:19:08:19 +0000] "GET / HTTP/1.1" 200 26 "-" "curl/7.47.0" ==> /var/log/nginx/access_masked.log <== 8.163.209.30 – – [16/Mar/2017:19:08:19 +0000] “GET / HTTP/1.1” 200 26 “-” “curl/7.47.0”

Masking Personal Data in the Query String

Log files can also contain personal data other than IP addresses. Email addresses, postal addresses, and other identifiers can be passed as query‑string parameters and consequently be logged as part of the request URI. If your application passes personal data in this way, you can extend the NGINX JavaScript IP address‑masking solution to sanitize personal data in the query string.

NGINX and NGINX Plus Configuration for Query String Masking

Like the default combined format, the masked log format defined above for IP address masking logs the $request variable, which captures three components of a request: HTTP method, URI (including query string), and HTTP version. We need to mask only the query string, so in the interests of code efficiency we use a separate variable for each of the three components, transforming only the request URI (second component) with the $request_uri_masked variable and using standard variables ($request_method and $server_protocol) for the first and third components.

The server block requires another js_set directive to define how the $request_uri_masked variable is evaluated.

NGINX JavaScript Code for Query String Masking

We add the maskRequestURI JavaScript function to the mask_ip_uri.js file shown above. Like maskRemoteAddress(), maskRequestURI() depends on the fnv32a hashing function and so appears below it in the file.

The maskRequestURI function iterates through each key‑value pair in the query string, looking for specific keys that are known to contain personal data. For each of these keys, the value is transformed to a masked value.

Depending on the type of processing to be carried out on NGINX and NGINX Plus log files, the masked query string values may need to resemble genuine data. In the example above we have formatted zip to be five digits and email to comply with RFC 821. Other keys may require more sophisticated formatting or dedicated functions to construct them.

Query String Masking in Action

With these additions to our configuration, we can see query string masking in action.

$ curl “http://localhost/index.php?foo=bar&zip=90210&email=liam@nginx.com”
127.0.0.1 -> 8.163.209.30
$ sudo tail –lines=1 /var/log/nginx/access*.log
==> /var/log/nginx/access.log <== 127.0.0.1 - - [16/Mar/2017:20:05:55 +0000] "GET /index.php?foo=bar&zip=90210&email=liam@nginx.com HTTP/1.1" 200 26 "-" "curl/7.47.0" ==> /var/log/nginx/access_masked.log <== 8.163.209.30 - - [16/Mar/2017:20:05:55 +0000] "GET /index.php?foo=bar&zip=38643&email=2852675791@example.com HTTP/1.1″ 200 26 “-” “curl/7.47.0”

Summary

NGINX and NGINX Plus with the NGINX JavaScript module provide a simple and powerful solution for applying custom logic to request processing. In this article we demonstrated how NGINX JavaScript can be used to mask personal data so that log files can be written in a way that allows offline analysis without violating data protection requirements.

We would love to hear about how you are using NGINX JavaScript to solve business problems – please tell us about it in the comments section below.

To try NGINX Plus, start your free 30-day trial or contact us to discuss your use cases.


[ngx_snippet name=’njs-enable-instructions’]