What Structured Logging Actually Is
Structured logging is simple in concept: instead of writing log lines as free-form human-readable strings, you emit each log entry as a discrete data object — almost always JSON — where every piece of information lives in a named field. That's it. But the implications of that shift are enormous.
A traditional log line might look like this:
2024-03-15 14:22:31 INFO api-gateway - POST /api/v2/orders completed in 142ms for user 8841 [192.168.10.45]
A structured equivalent looks like this:
{
  "timestamp": "2024-03-15T14:22:31.482Z",
  "level": "info",
  "service": "api-gateway",
  "host": "sw-infrarunbook-01",
  "method": "POST",
  "path": "/api/v2/orders",
  "duration_ms": 142,
  "user_id": 8841,
  "client_ip": "192.168.10.45",
  "status_code": 200
}
Both convey the same information. But only one of them lets you run a query like duration_ms > 500 AND service = "api-gateway" and get meaningful results in seconds. The free-text version requires regex parsing that someone has to write and maintain, and it will break the moment a developer changes the wording of that log message. I've seen this happen more times than I can count, usually during an incident when everyone is already stressed.
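To make the contrast concrete, here's a minimal sketch of both approaches: filtering structured entries needs nothing beyond json.loads and named fields, while the free-text version needs a handwritten regex. The sample lines and field names are illustrative, borrowed from the example above.

```python
import json
import re

structured_lines = [
    '{"service": "api-gateway", "path": "/api/v2/orders", "duration_ms": 142}',
    '{"service": "api-gateway", "path": "/api/v2/orders", "duration_ms": 731}',
]

# Structured: parse once, filter on named fields; no pattern to maintain
slow = [
    entry for entry in map(json.loads, structured_lines)
    if entry["service"] == "api-gateway" and entry["duration_ms"] > 500
]

# Free text: a regex that silently breaks if the wording ever changes
text_line = "POST /api/v2/orders completed in 731ms for user 8841"
match = re.search(r"completed in (\d+)ms", text_line)
duration = int(match.group(1)) if match else None
```

If a developer rewords the message to "finished in 731ms", the regex returns nothing and the structured query is unaffected, which is the whole point.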
How JSON Structured Logging Works
The Core Schema
Every structured log entry should carry a baseline set of fields regardless of context. After running logs through Elasticsearch, Loki, and CloudWatch Logs Insights across several production environments, I've landed on this minimum viable schema as a solid foundation:
{
  "timestamp": "2024-03-15T14:22:31.482Z",
  "level": "info",
  "service": "payments-api",
  "host": "sw-infrarunbook-01",
  "env": "production",
  "message": "Payment authorization succeeded",
  "trace_id": "7f3a9c1b-4e2d-4f8a-b1c3-9d8e7f6a5b4c",
  "span_id": "a3f2c1b9"
}
The timestamp field should always be in ISO 8601 format with UTC timezone. Don't use Unix epoch integers: they're harder to read during an incident and require conversion in most query interfaces. The level field should be lowercase and consistent: use debug, info, warn, error, and fatal. Pick one convention and enforce it across every service. Mixing WARNING with warn and WARN across your stack is a small thing that creates a large amount of pain at query time.
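If you inherit services that already mix conventions, a small normalization shim at the ingestion edge can buy time while you standardize. This is a hypothetical helper, not part of any particular library:

```python
# Map common variants onto the five canonical lowercase levels
_CANONICAL = {
    "debug": "debug",
    "info": "info", "information": "info",
    "warn": "warn", "warning": "warn",
    "error": "error", "err": "error",
    "fatal": "fatal", "critical": "fatal",
}

def normalize_level(raw: str) -> str:
    """Collapse WARNING / warn / WARN etc. onto one canonical spelling.

    Unknown values fall back to "info" rather than failing the pipeline.
    """
    return _CANONICAL.get(raw.strip().lower(), "info")
```

The fallback choice is a judgment call; some teams prefer to route unrecognized levels to a quarantine stream instead of silently defaulting.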
Correlation IDs and Distributed Tracing
The trace_id field is where structured logging starts earning its keep in distributed systems. When a request enters your system at the API gateway running on sw-infrarunbook-01, you generate a trace ID and propagate it through every downstream call via HTTP headers, message queue metadata, gRPC metadata, whatever transport you're using. Every service that touches that request emits its log entries with the same trace_id.
Now when something goes wrong at 2am and you're looking at an alert from your payments service, you can pull all logs for that trace across every service with a single query. In my experience, this single practice cuts incident investigation time by more than half. Without it, you're manually correlating timestamps across different log streams and hoping the clocks are synchronized. They often aren't.
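One common way to implement this in Python is a contextvars-based holder, so every log call in a request's scope picks up the trace ID without threading it through function arguments. A sketch, with illustrative helper names (start_trace, log) that aren't from any specific library:

```python
import contextvars
import json
import uuid
from datetime import datetime, timezone
from typing import Optional

# Holds the current request's trace ID; contextvars keeps it isolated
# per task/thread, which is what you want under async servers.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_trace(incoming: Optional[str] = None) -> str:
    """Reuse the caller's trace ID (e.g. from an X-Trace-Id header) or mint one."""
    tid = incoming or str(uuid.uuid4())
    trace_id_var.set(tid)
    return tid

def log(level: str, message: str, **fields) -> str:
    """Emit one JSON log line that automatically carries the current trace ID."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "trace_id": trace_id_var.get(),
        "message": message,
        **fields,
    }
    return json.dumps(entry)
```

On outbound calls you'd read trace_id_var.get() and attach it to the outgoing header, closing the propagation loop.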
Contextual Fields and Bound Loggers
Good structured logging libraries support the concept of a logging context: a set of fields that get attached to every log entry within a given scope. In Go, this is done with context.Context and a logger like zerolog or zap. In Python, the structlog library handles this cleanly with bound loggers. The pattern looks like this in practice:
# structlog bound logger pattern (Python)
log = structlog.get_logger().bind(
    service="inventory-service",
    host="sw-infrarunbook-01",
    env="production",
    request_id="req-9f3a1c2d",
)
log.info("stock_check_started", sku="WIDGET-42", warehouse_id=7)
log.info("stock_check_completed", sku="WIDGET-42", available_units=143, duration_ms=28)
Both log entries automatically carry service, host, env, and request_id without repeating them in every call. This is the right way to do it. The alternative of passing a dozen fields manually on every log call leads to inconsistency and missing fields under pressure, which is exactly when you need those fields most.
Handling Nested Objects
One question that comes up constantly is whether to nest objects inside your JSON logs. Keep it flat where possible, and use one level of nesting only when it meaningfully groups related data. Here's the contrast:
# Cluttered flat structure
{
  "http_method": "GET",
  "http_path": "/api/orders",
  "http_status": 200,
  "http_duration_ms": 88,
  "db_query": "SELECT * FROM orders WHERE user_id = ?",
  "db_duration_ms": 34,
  "db_rows_returned": 12
}
# One level of grouping — cleaner and still queryable
{
  "http": {
    "method": "GET",
    "path": "/api/orders",
    "status": 200,
    "duration_ms": 88
  },
  "db": {
    "query": "SELECT * FROM orders WHERE user_id = ?",
    "duration_ms": 34,
    "rows_returned": 12
  }
}
Deeper nesting than one level causes headaches. Most log query languages handle one level fine: Elasticsearch flattens it to http.status, and Loki's LogQL can filter on it with | json. But three levels deep and you're fighting every query tool you'll ever use. Keep it sane.
Why It Matters in Production
If you're running a single service on a single host, structured logging is a convenience. Once you cross into multiple services, multiple hosts, or any kind of container orchestration, it becomes a necessity. The difference between diagnosing an incident in ten minutes versus two hours often comes down to whether you can query your logs or whether you're grepping through files.
Consider a real scenario: your monitoring system alerts that error rates on solvethenetwork.com's checkout flow have spiked to 8% over the past five minutes. With structured logs flowing into a centralized system, you can immediately run:
# Loki LogQL — filter errors and extract trace context
{service="checkout-service", env="production"}
| json
| level="error"
| line_format "{{.message}} | user={{.user_id}} | trace={{.trace_id}}"
Within seconds you see that all the errors share a payment_gateway field value of stripe-eu, suggesting a regional gateway issue rather than an application bug. You pivot to the payment service logs filtered by that same gateway value, confirm the pattern, and you're opening a vendor incident ticket, not staring at a wall of unstructured text trying to spot a trend manually.
Indexing Strategy and Storage Cost
There's another angle worth understanding: how your log storage and indexing system handles structured data. Elasticsearch creates inverted indexes on every field, which makes field-level queries fast but also means high-cardinality fields like user_id or trace_id consume significant index memory. This is a real operational concern at scale, not a theoretical one.
The practical solution is being intentional about log levels. Debug-level logs often contain the high-cardinality detail you need for deep debugging — keep them at a sampling rate or suppress them in production unless you're actively investigating. Info-level logs should contain enough context to understand what happened without logging every field of every object in your system.
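Sampling can be as simple as a probabilistic gate in front of debug emission. A sketch, where the 1-in-100 rate is an arbitrary example rather than a recommendation:

```python
import random

DEBUG_SAMPLE_RATE = 0.01  # keep roughly 1 in 100 debug entries

def should_emit(level: str) -> bool:
    """Always emit info and above; sample debug to control volume and index cost."""
    if level != "debug":
        return True
    return random.random() < DEBUG_SAMPLE_RATE
```

In practice you'd make the rate a runtime-tunable setting so you can crank it to 1.0 during an active investigation without a deploy.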
Grafana Loki takes a different approach: it indexes only labels, a small set of fields you designate, and stores the rest as raw log content parsed at query time. This trades some query-time performance for a much smaller index and lower storage cost, and it works well when your label cardinality is low. Knowing which system you're targeting should influence how you structure your fields.
Real-World Examples from Production
Application Error Logging
Error logs deserve special attention because they're the ones you're reading at 3am. They need to contain enough information to reproduce and understand the issue without requiring you to re-run the failing request. Here's what a well-structured error log looks like:
{
  "timestamp": "2024-03-15T03:14:22.819Z",
  "level": "error",
  "service": "order-processor",
  "host": "sw-infrarunbook-01",
  "env": "production",
  "message": "Failed to reserve inventory for order",
  "trace_id": "b2c3d4e5-f6a7-4b8c-9d0e-1f2a3b4c5d6e",
  "span_id": "c3d4e5f6",
  "order_id": "ord-88821",
  "user_id": 44291,
  "sku": "WIDGET-42",
  "requested_qty": 3,
  "error": {
    "type": "InsufficientStockError",
    "message": "Available: 1, Requested: 3",
    "stack": "order_processor.reserve_inventory:142 | order_processor.process_order:88 | worker.handle_message:31"
  }
}
Notice that the stack trace lives inside an error object rather than as a raw multiline string. Multiline strings are the enemy of structured logging. They break line-by-line parsing in most log shippers (Filebeat, Fluentd, Fluent Bit) and require explicit multiline parsing configuration that's easy to get wrong. Collapsing the stack trace into a pipe-delimited single-line string or a structured object avoids that entire class of problem.
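Collapsing a Python traceback into that pipe-delimited shape takes only a few lines using the standard traceback module; the exact format shown is illustrative:

```python
import traceback
from pathlib import Path

def collapse_stack(exc: BaseException) -> str:
    """Render an exception's traceback as one pipe-delimited line,
    e.g. 'order_processor.reserve_inventory:142 | worker.handle_message:31'."""
    frames = traceback.extract_tb(exc.__traceback__)
    return " | ".join(
        f"{Path(frame.filename).stem}.{frame.name}:{frame.lineno}"
        for frame in frames
    )
```

Call it in your top-level exception handler and put the result in error.stack; the shipper then sees one clean line per event.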
Infrastructure and Deployment Logging
System-level logging for infrastructure tooling follows the same principles. Here's a log entry from a deployment automation script running on sw-infrarunbook-01 under the infrarunbook-admin account:
{
  "timestamp": "2024-03-15T10:33:07.002Z",
  "level": "info",
  "service": "deploy-agent",
  "host": "sw-infrarunbook-01",
  "operator": "infrarunbook-admin",
  "action": "deploy_started",
  "target_service": "payments-api",
  "target_version": "v2.14.1",
  "target_hosts": ["10.10.1.21", "10.10.1.22", "10.10.1.23"],
  "strategy": "rolling",
  "canary_pct": 10,
  "deploy_id": "deploy-20240315-103307"
}
That deploy_id field is your correlation handle for grouping all log entries tied to this specific deployment. If something breaks during the rollout, you filter by deploy_id and get the complete picture: which hosts succeeded, which failed, what the error was, and how long each step took. This is the same trace ID concept applied to operational workflows rather than request flows.
Security and Audit Events
Security events need their own structured schema and typically warrant a separate log stream or index. Every authentication event, authorization decision, privilege escalation, and configuration change should log the actor, the action, the target, the outcome, and the source IP:
{
  "timestamp": "2024-03-15T09:14:55.331Z",
  "level": "warn",
  "service": "auth-service",
  "host": "sw-infrarunbook-01",
  "event_type": "authentication_failed",
  "actor": "infrarunbook-admin",
  "source_ip": "192.168.1.85",
  "target_resource": "/admin/api/users",
  "failure_reason": "invalid_mfa_token",
  "attempt_count": 3
}
Structured security logs integrate directly into SIEM correlation rules. When every field is discrete and queryable, alerting on behavioral patterns becomes straightforward: five authentication failures from the same source_ip within sixty seconds triggers an alert, no regex required.
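That sixty-second rule can be sketched as a sliding-window counter keyed by source_ip. The thresholds below mirror the example above, and the class name is illustrative:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
THRESHOLD = 5

class FailureWindow:
    """Track authentication failures per source IP in a sliding time window."""

    def __init__(self):
        self._events = defaultdict(deque)  # source_ip -> timestamps

    def record(self, source_ip: str, ts: float) -> bool:
        """Record one failure; return True if the alert threshold is crossed."""
        window = self._events[source_ip]
        window.append(ts)
        # Evict events that have aged out of the window
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        return len(window) >= THRESHOLD
```

A real SIEM evaluates rules like this against the indexed field directly; the point is that with discrete fields the detection logic is a counter, not a parser.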
Common Misconceptions
"JSON logs are too verbose and expensive"
This comes up from teams that haven't looked at their actual log volume costs carefully. Yes, JSON has more bytes per log entry than a terse text line. But log shippers like Fluent Bit compress output before forwarding, and most log storage systems compress stored data significantly. In practice, JSON log storage costs maybe 15-20% more than equivalent unstructured logs — and the operational savings from faster incident resolution dwarf that cost within the first week of a real incident. The verbosity argument doesn't survive contact with a production outage.
"We're a small team, we don't need this"
I've heard this right before an on-call engineer spent six hours grepping through text logs trying to find why three users couldn't complete checkout on a Friday evening. Structured logging is a one-time setup cost with permanent returns. The time to implement it is before your first bad incident, not during or after it.
"We log everything so we'll always have what we need"
Logging everything indiscriminately is its own failure mode. It bloats storage, inflates costs, and creates noise that buries the signals you actually care about. Worse, unrestricted logging is a PII and compliance risk — you'll inevitably log something you shouldn't have, like a password field or a raw request body containing payment data. Structure your logging intentionally: define what each log level means for your system, document your core field schema, and treat sensitive data explicitly with redaction before it hits your log pipeline.
"We'll add structure later when we scale"
Later never comes. Retrofitting structured logging into existing services while they're in production is painful. Every service ends up with slightly different conventions, every team uses slightly different field names, and now you have to normalize them at the aggregation layer or live with inconsistency forever. Build the schema first — even if it's just a shared logging library with sensible defaults. It takes a day. The inconsistency it prevents takes months to fix.
Defining a Field Schema Your Team Will Actually Use
The most valuable thing you can do today is define and document a standard field schema for your organization's logs. It doesn't need to be exhaustive. Start with the fields that appear in every service and lock those down. Here's a practical starting template:
# Mandatory fields — every log entry, every service
timestamp         string  ISO 8601 UTC
level             string  debug|info|warn|error|fatal
service           string  Logical service name
host              string  Hostname (e.g., sw-infrarunbook-01)
env               string  production|staging|development

# Strongly recommended
message           string  Human-readable event description
trace_id          string  Distributed trace correlation ID
span_id           string  Current span within the trace

# HTTP context (when applicable)
http.method       string  GET, POST, PUT, DELETE, etc.
http.path         string  URL path only — never log full URL with query params
http.status       int     HTTP response code
http.duration_ms  int     Response time in milliseconds

# Error context (when level is error or fatal)
error.type        string  Error class or type name
error.message     string  Error description
error.stack       string  Collapsed single-line stack trace
Enforce this schema through a shared internal logging library, not documentation alone. Documentation gets ignored. A library that only allows you to create compliant log entries gets used. Write it once, publish it to your internal package registry, make it the default in your service templates, and you've solved the consistency problem permanently.
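A minimal version of such a library can be a thin wrapper that refuses to construct a logger without the mandatory fields and rejects off-schema levels. A sketch against the template above, with an illustrative class name:

```python
import json
from datetime import datetime, timezone

MANDATORY = ("service", "host", "env")
LEVELS = {"debug", "info", "warn", "error", "fatal"}

class SchemaLogger:
    """Emit JSON log lines that always carry the mandatory schema fields."""

    def __init__(self, **base_fields):
        missing = [f for f in MANDATORY if f not in base_fields]
        if missing:
            raise ValueError(f"missing mandatory log fields: {missing}")
        self._base = base_fields

    def log(self, level: str, message: str, **fields) -> str:
        if level not in LEVELS:
            raise ValueError(f"invalid level: {level!r}")
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": level,
            "message": message,
            **self._base,
            **fields,
        }
        return json.dumps(entry)
```

Publish something like this internally and make your service templates import it by default; a constructor that throws on a missing field is far more persuasive than a schema document.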
Structured logging with JSON isn't a nice-to-have. In any system with more than one moving part, it's the difference between running a professional engineering operation and flying blind. The tooling is mature, the libraries cover every major language, and the return on investment shows up the first time something breaks in production and you actually know what happened — and why.
