Troubleshooting in Production Without Tuning the Error Log
Editor – This blog is one of several that discuss logging with NGINX and NGINX Plus. Please also see:
It’s also one of many blogs about use cases for the NGINX JavaScript module. For the complete list, see Use Cases for the NGINX JavaScript Module.
NGINX helps organizations of all sizes to run their mission‑critical websites, applications, and APIs. Regardless of your scale and choice of deployment infrastructure, running in production is not easy. In this article we talk about just one of the hard things about a production deployment – logging. More specifically, we discuss the balancing act of collecting the right amount of detailed logs for troubleshooting without being swamped with unnecessary data.
Logging Basics
NGINX provides two different logging mechanisms: access logging for client requests, and error logging for when things go wrong. These mechanisms are available in both the HTTP and Stream (TCP/UDP) modules, but here we focus on HTTP traffic. (There is also a third logging mechanism which uses the debug severity level, but we won’t discuss it here.)
A typical, default NGINX logging configuration looks like this.
http {
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log main; # Log using the 'main' format
error_log /var/log/nginx/error.log warn; # Log up to 'warn' severity level
...
}
The log_format
directive describes the contents and structure of the log entries created when the access_log
directive is included in the configuration. The example above is an extension of the Common Log Format (CLF) used by many web servers. With the error_log
directive, you specify the severity level of the messages to log, but not the content or format of entries, which are fixed. More on that in the next section.
Other noteworthy aspects of NGINX logging include:
- Logging directives are automatically inherited by lower‑level configuration contexts. For example, an
access_log
directive at thehttp
context is applied to allserver{}
blocks. - A logging directive in a child context overrides inherited directives.
- Multiple logging directives may exist in the same context. For example, two
access_log
directives might be used to create both a standard CLF log file and a second, more detailed log.
The Reality of Logging in Production
In general terms, we want to use the access log to provide analytics and usage statistics, and use the error log for failure detection and troubleshooting. But running a production system is seldom that simple. Here are some common challenges:
- Access logs lack sufficient detail for troubleshooting
- Error logs disclose good detail at the
info
severity level but that is too verbose for normal operations - The error log format is fixed and so cannot be customized to include variables of particular interest
- Entries in the error log don’t include the context of the request and are difficult to match with the corresponding access log entry
What’s more, changing the NGINX configuration to add or remove logging detail in a production environment may also require going through a change‑control process, and redeploying the configuration. Entirely safe, but somewhat cumbersome when troubleshooting a live issue such as, “why am I seeing a spike in 4xx
/5xx
errors?”. This is of course magnified when there are multiple NGINX instances handling the same traffic across a cluster.
Using a Second Access Log for Errors
Customizing the format of the access log to enrich the data collected for each request is a common approach for enhanced analytics, but doesn’t scale well for diagnostics or troubleshooting. Asking the main access log to do two jobs is a contrived solution, because we typically want a lot more information for troubleshooting than we do for regular analytics. Adding numerous variables to the main access log can dramatically increase log volumes with data that is only occasionally useful.
Instead we can use a second access log and write to it only when we encounter an error that warrants debugging. The access_log
directive supports conditional logging with the if
parameter – requests are only logged when the specified variable evaluates to a non‑zero, non‑empty value.
map $status $is_error {
400 1; # Bad request, including expired client cert
495 1; # Client cert error
502 1; # Bad gateway (no upstream server could be selected)
504 1; # Gateway timeout (couldn't connect to selected upstream)
default 0;
}
access_log /var/log/nginx/access_debug.log access_debug if=$is_error; # Diagnostic logging
access_log /var/log/nginx/access.log main;
With this configuration, we pass the $status
variable through a map
block to determine the value of the $is_error
variable, which is then evaluated by the access_log
directive’s if
parameter. When $is_error
evaluates to 1
we write a special log entry to the access_debug.log file.
However, this configuration doesn’t detect errors encountered during request processing that are ultimately resolved, and therefore result in status 200
OK
. One such example is when NGINX is load balancing between multiple upstream servers. If an error is encountered with the selected server then NGINX passes the request to the next server under the conditions configured by the proxy_next_upstream
directive. As long as one of the upstream servers responds successfully, then the client receives a successful response, which gets logged with status 200
. However, the user experience is likely to be poor due to the retries, and it may not be immediately obvious that an upstream server is unhealthy. After all, we logged a 200
.
When NGINX attempts to proxy to multiple upstream servers, their addresses are all captured in the $upstream_addr
variable. The same is true for other $upstream_*
variables, for example $upstream_status
which captures the response code from each attempted server. So when we see multiple entries in these variables, we know that something bad happened – we probably have a problem with at least one of the upstream servers.
How about we also write to access_debug.log when the request was proxied to multiple upstream servers?
map $upstream_status $multi_upstreams {
"~," 1; # Has a comma
default 0;
}
map $status $is_error {
400 1; # Bad request, including expired client cert
495 1; # Client cert error
502 1; # Bad gateway (no upstream server could be selected)
504 1; # Gateway timeout (couldn't connect to selected upstream)
default $multi_upstreams; # If we tried more than one upstream server
}
access_log /var/log/nginx/access_debug.log access_debug if=$is_error; # Diagnostic logging
access_log /var/log/nginx/access.log main; # Regular logging
Here we use another map
block to produce a new variable ($multi_upstreams
) whose value depends on the presence of a comma (,
) in $upstream_status
. A comma means there is more than one status code, and therefore more than one upstream server was encountered. This new variable determines the value of $is_error
when $status
is not one of the listed error codes.
At this point we have our regular access log and a special access_debug.log file that contains erroneous requests, but we haven’t yet defined the access_debug
log format. Let’s now ensure that we have all the data we need in the access_debug.log file to help us diagnose problems.
JSON Logging with JavaScript
Getting diagnostic data into access_debug.log is not difficult. NGINX provides over 100 variables related to HTTP processing, and we can define a special log_format
directive that captures as many of them as we need. However, there are a few downsides to building out a naïve log format for this purpose.
- It is a custom format – you need to train a log parser how to read it
- Entries can be very long and hard for humans to read during live troubleshooting
- You continuously need to refer to the log format in order to interpret entries
- It is not possible to log non‑deterministic values such as “all request headers”
We can address these challenges by writing log entries in a structured format such as JSON, using the NGINX JavaScript module (njs). JSON format is also widely supported by log processing systems such as Splunk, LogStash, Graylog, and Loggly. By offloading the log_format
syntax to a JavaScript function, we benefit from native JSON syntax, with access to all of the NGINX variables and additional data from the njs ‘r
’ object.
js_import conf.d/json_log.js;
js_set $json_debug_log json_log.debugLog;
log_format access_debug escape=none $json_debug_log; # Offload to njs
access_log /var/log/nginx/access_debug.log access_debug if=$is_error;
The js_import
directive specifies the file containing the JavaScript code and imports it as a module. The code itself can be found here. Whenever we write an access log entry that uses the access_debug
log format, the $json_debug_log
variable is evaluated. This variable is evaluated by executing the debugLog
JavaScript function as defined by the js_set
directive.
This combination of JavaScript code and NGINX configuration produces diagnostic logs that look like this.
$ tail --lines=1 /var/log/nginx/access_debug.log | jq
{
"timestamp": "2020-09-21T11:25:55+00:00",
"connection": {
"request_count": 1,
"elapsed_time": 0.555,
"pipelined": false,
"ssl": {
"protocol": "TLSv1.2",
"cipher": "ECDHE-RSA-AES256-GCM-SHA384",
"session_id": "b302f76a70dfec92f6bd72de5732692481ebecbbc69a4d81c900ae4dc928485c",
"session_reused": false,
"client_cert": {
"status": "NONE"
}
}
},
"request": {
"client": "127.0.0.1",
"port": 443,
"host": "foo.example.com",
"method": "GET",
"uri": "/one",
"http_version": 1.1,
"bytes_received": 87,
"headers": {
"Host": "foo.example.com:443",
"User-Agent": "curl/7.64.1",
"Accept": "*/*"
}
},
"upstreams": [
{
"server_addr": "10.37.0.71",
"server_port": 443,
"connect_time": null,
"header_time": null,
"response_time": 0.551,
"bytes_sent": 0,
"bytes_received": 0,
"status": 504
},
{
"server_addr": "10.37.0.72",
"server_port": 443,
"connect_time": 0.004,
"header_time": 0.004,
"response_time": 0.004,
"bytes_sent": 92,
"bytes_received": 4161,
"status": 200
}
],
"response": {
"status": 200,
"bytes_sent": 186,
"headers": {
"Content-Type": "text/html",
"Content-Length": "4161"
}
}
}
The JSON format enables us to have separate objects for information related to the overall HTTP connection (including SSL/TLS), request, upstreams, and response. Notice how the first upstream (10.37.0.71) returned status 504
(Gateway
Timeout)
before NGINX tried the next upstream (10.37.0.72), which responded successfully. The half‑second timeout (reported as response_time
in the first element of the upstreams
object) accounts for most of the overall latency for this successful response (reported as elapsed_time
in the connection
object).
Here is another example of a (truncated) log entry, for a client error caused by an expired client certificate.
{
"timestamp": "2020-09-22T10:20:50+00:00",
"connection": {
"ssl": {
"protocol": "TLSv1.2",
"cipher": "ECDHE-RSA-AES256-GCM-SHA384",
"session_id": "30711efbe047c38a98c2209cc4b5f196988dcf2d7f1f2c269fde7269c370432e",
"session_reused": false,
"client_cert": {
"status": "FAILED:certificate has expired",
"serial": "1006",
"fingerprint": "0c47cc4bd0fefbc2ac6363345cfbbf295594fe8d",
"subject": "emailAddress=liam@nginx.com,CN=test01,OU=Demo CA,O=nginx,ST=CA,C=US",
"issuer": "CN=Demo Intermediate CA,OU=Demo CA,O=nginx,ST=CA,C=US",
"starts": "Sep 20 12:00:11 2019 GMT",
"expires": "Sep 20 12:00:11 2020 GMT",
"expired": true,
...
"response": {
"status": 400,
"bytes_sent": 283,
"headers": {
"Content-Type": "text/html",
"Content-Length": "215"
}
}
Summary
By generating rich diagnostic data only when we encounter an error, we enable real‑time troubleshooting without needing to perform any reconfiguration. Successful requests are not impacted because the JavaScript code runs only when we detect an error at the logging phase, after the last byte was sent to the client.
The complete configuration is available on GitHub – why not try it in your environment? If you’re not already running NGINX Plus, start a free 30-day trial today or contact us to discuss your use cases.