Request logging should be done by an application gateway; it will do a more honest job of it. In applications, don't log anything except:
- application startup and shutdown (at INFO level)
- occurrences that indicate a bug in the system doing the logging (at ERROR level)
- occurrences that indicate a bug in another of your systems (at WARN level)
The last two require a lot of education and discipline to stop people from logging rare but valid circumstances, bad input originating outside your software, and "important" events. The answer to "but don't we want to know when..." arguments should always be, "If it's that important, emit it as a metric or store it in an appropriate datastore." Anything logged at WARN or ERROR should be something that can be addressed by fixing code.
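For instance, the "emit it as a metric" answer can be very small. Here is a minimal sketch with an in-process counter standing in for a real metrics client such as statsd or Prometheus; all names are illustrative assumptions:

```python
from collections import Counter

# Hypothetical in-process metrics registry; in practice this would be a
# statsd or Prometheus client. All names here are illustrative assumptions.
metrics = Counter()

MAX_BYTES = 1024 * 1024

def handle_upload(payload):
    """Reject oversized uploads without a WARN log line.

    Bad input originating outside your software is a valid, expected
    circumstance, so it becomes a countable, alertable metric rather
    than log noise.
    """
    if len(payload) > MAX_BYTES:
        metrics["uploads.rejected.too_large"] += 1
        return False
    metrics["uploads.accepted"] += 1
    return True
```

The rejection rate is now something you can graph and alert on, instead of a WARN line someone has to grep for.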
If you stick to this discipline, monitoring logs is valuable and reasonably simple. You can alert on any WARN and ERROR level logging. There's nothing like looking at a month of logs and seeing only actionable information and a handful of application lifecycle events. (There's also nothing like looking at six months of logs from a stable system and only seeing lifecycle events.)
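A sketch of what "alert on any WARN and ERROR" can look like in practice, assuming a hypothetical `page_oncall` callback standing in for your real alerting hook:

```python
import logging

class AlertOnWarnHandler(logging.Handler):
    """Fire an alert callback for every WARN/ERROR record.

    Sketch only: `page_oncall` is a hypothetical alerting hook; a real
    system would page via its actual on-call tooling.
    """
    def __init__(self, page_oncall):
        super().__init__(level=logging.WARNING)  # ignore INFO and below
        self.page_oncall = page_oncall

    def emit(self, record):
        self.page_oncall(f"{record.levelname}: {record.getMessage()}")

alerts = []
logger = logging.getLogger("app")
logger.setLevel(logging.INFO)
logger.addHandler(AlertOnWarnHandler(alerts.append))

logger.info("application startup complete")            # lifecycle event: no alert
logger.warning("upstream sent an impossible payload")  # bug elsewhere: alert
```

This only works because of the discipline above: if WARN is noisy, every alert from this handler is noise too.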
It is hard to stick to this discipline. You need a manager who believes in it and will back it up when engineers want to depart from it, and who will always prioritize fixing violations.
Why is it so hard to make engineers follow these rules? Because initially, it makes them feel incredibly anxious. It feels irresponsible to throw away so much "important" information instead of logging it. After working this way for a while, though, they realize that it actually forces a higher level of responsibility. Logging information in noisy logs might feel better than throwing it away, but it's not. It creates an illusion of having handled the information responsibly. By taking away the illusory option, you force people to make a real decision between throwing it away and saving it in an actionable way, by emitting it as a metric or storing it in a datastore to be processed.
I've been here, and having logs also emitted by the application/web servers is critical too. If you only have logging at the gateway and it emits a 503 or 504, did the request make it to the web server? Maybe? You have no signal. Seeing a timeout at the gateway/load-balancer while the web server logs that it serviced the request successfully but took 45s tells you something very important; without the web server log, that signal simply doesn't exist.
Despite having had to manage TBs of logs per day and sift through them at times, I'd rather have too many logs. However, we did not alert on any of them. All alerting was symptom-based via SLOs (error rate, latency). Logs were only used for debugging.
> If you only have it at the gateway, and it emits a 503 or 504, did the request make it to the web server? Maybe?
I was taught a perfect solution to this: only ever return 2xx/3xx from your web server (not applicable to most public APIs, though). My web server is not a gateway, not a resource-access mechanism as envisioned by people in the 90s. It's always an RPC server. REST is RPC. JSON is RPC. I let my outsourced CDN do HTTP-code shenanigans, but any apps I develop return 200 all the time. Want to know if it failed? Easy!
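A minimal sketch of what such a 200-only envelope might look like; the `ok`/`result`/`error` field names are assumptions, not a standard:

```python
import json

def rpc_response(ok, result=None, error=None):
    """Build an always-200 RPC envelope.

    The ok/result/error shape is an illustrative assumption, not a
    standard; the point is that the HTTP status is always 200 and
    failure lives entirely in the body.
    """
    body = {"ok": ok}
    if ok:
        body["result"] = result
    else:
        body["error"] = error  # e.g. {"code": "NOT_FOUND", "message": "..."}
    return 200, json.dumps(body)

status, body = rpc_response(False, error={"code": "NOT_FOUND", "message": "no such user"})
```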
More important, this is generally expected when calling any kind of API over HTTP.
Returning a 200 in an error situation makes it very difficult to diagnose errors. For example, the F12 debugging pane in my browser color-codes error responses; Fiddler does the same thing.
More important: languages and libraries have built-in mechanisms for handling these errors, which make error situations easier to handle (or deliberately not handle). For example, some C# libraries will always throw an exception on a 4xx or 5xx response. If your oddball API returns a 200 in an error situation, it breaks normal error-handling patterns. (And could make situations where "success": false is returned much harder to track down.)
And finally: because error situations are unusual, they often aren't encountered in the normal day-to-day debugging of your API's consumers. As a result, your oddball API's failure path is probably untested in whoever is calling your API. When your API fails (because it will, at some point), there's a high chance your consumer won't check the success flag and will parse the response as if it were successful. (Again, because you're not following normal HTTP semantics.) This will trigger unpredictable failures in your consumers.
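A sketch of that failure mode, assuming a 200-only envelope with illustrative field names: a consumer written against the happy path parses an error envelope as data and blows up.

```python
import json

# Sketch of the failure mode: a consumer written against the happy path of a
# 200-only API (envelope field names are illustrative assumptions).
def get_user_name(body):
    payload = json.loads(body)
    # No status code can signal failure -- everything is 200 -- so unless the
    # consumer remembers the "ok" flag, an error envelope parses as data.
    return payload["result"]["name"]  # KeyError on an error envelope

error_body = json.dumps({"ok": False, "error": {"code": "TIMEOUT"}})
try:
    get_user_name(error_body)
    outcome = "worked"
except KeyError:
    outcome = "unpredictable failure in the consumer"
```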
> Returning a 200 in an error situation makes it very difficult to diagnose errors. For example, the F12 debugging pane in my browser color-codes errors responses. Fiddler does the same thing too.
I strongly disagree; let me try to convince you why those tools are correct and why it's still wrong to return non-2xx error codes from the application.
The F12 debugging pane and related tools which color code non-2xx response codes are correct for returned documents, not for API requests. For returned documents, HTTP is the application protocol, for API requests HTTP is the transport medium! These tools are not able to make the distinction (yet) between "calling an API" and "retrieving a document".
In HTTP it's literally called the status code, not the error code.[1]
Look at it this way: when you're ssh'd into a remote host, errors from the application are reported within the TCP stream, not inside the TCP headers or the IP headers. Errors from ssh itself don't set error bits in the TCP stream or IP packets either. You'd be horrified if, within ssh, a shell error (say, 'cd /non-existent-dir') set error codes in fields of the TCP or IP headers.
So, yeah, errors in the application layer should not set error fields in the transport medium. When serving static files or documents, HTTP is the application layer, so it should report things like '404 not found'. When GETting an API endpoint results in the application not finding (for example) 'web.config', it should *not* return a 404. After all, the handler *was* found and the URL is correct, so don't overload the 404 for something else.
[1] Even curl handles it properly for the level it is working at. Since it is one layer higher than even the HTTP API (it returns a result to the shell that called it), it reports no error for non-2xx responses, because, from the shell's PoV, that's not an error.
> Languages / libraries have built-in mechanisms for handling these errors
And it's always easy for me to override this. I've been doing it this way for years - all of the startups I've worked at do it this way - and it's extremely successful.
> whoever is calling your API
Just internal use. For external use I agree meeting expectations is important. Users will expect a 400/500 for certain issues.
> Remember, you're breaking debugging tools like Fiddler when you do this...
> ...And, whoever inherits your code probably won't like working with it. Doing weirdo things like that can hurt your reputation in a team.
Well, it explains why web APIs are such a hot mess: the people creating the tools don't know the difference between transport errors and application errors, the developers creating the applications don't know the difference, and the developers calling into the application don't know the difference.
It's a pretty important difference, and in all other protocol stacks care is taken not to side-step any layer and fiddle directly with the transport layers from the application layer.
Right now, a client getting a 5xx response can't tell if the application had an error or if the proxy is misconfigured, because the application developers are sending proxy errors/server errors (5xx) and trampling all over the namespace of the transport medium.
The system appears to be well structured, but the conventions were all set by developers who were noobs. In 25 years of development, I don't think I ever saw a sockets program (client or server) where the application detected an error and set error bits in the IP datagram, and yet in web development I see this all the time.
Yup. I'm not sure I've perfectly figured out where to draw the line, but a 500 error should mean a 3rd party proxy failed to get a response from your server - not that you caught an exception in the server and specifically crafted an internal server error response.
Should you start HTTP responses with "ACK" because it's based on TCP? No. Then why start your HTTP responses with "HTTP/2 500" because your call to OpenAI timed out?
There's always a bit of a cultural adjustment needed for people who are onboarded to this paradigm (myself included! It took a year before I bought in). But everyone has universally either been converted or at least had to admit that the system works.
That still doesn't explain any actual problem; it explains what you think is the cause behind an unspecified problem, jumping right over the key question.
HTTP is a hypertext transfer protocol, designed as such. My application is an RPC front-end for a series of data stores, AI models, etc.
Using HTTP codes for your application's protocol is a mismatch. There are some fairly general-purpose HTTP error codes that can meet many applications' use cases, but the overlap is not perfect. You can get greater granularity by ignoring it entirely.
Going HTTP 200-only means any non-200 is now definitively from 3rd party middleware, a router, proxy, etc. The boolean check of "did I hit my application code?" is easy and useful.
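That boolean check might look like this; the `ok` envelope field is an illustrative assumption, not a standard:

```python
import json

# Sketch of the triage the 200-only convention enables.
def classify(status, body):
    if status != 200:
        return "infrastructure"  # proxy, load balancer, CDN -- never our code
    envelope = json.loads(body)
    if envelope.get("ok"):
        return "success"
    return "application"  # our code ran and reported an error in the body
```

Any non-200 can immediately be routed to the people who run the middleware, and any 200 to the people who own the application.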
It's easy and useful if you just return details in response bodies, which you can do with error responses as well as with 200s. That lets everything else that deals with your responses, including client code (which generally cares less about whether it hit your application code than about the things HTTP error codes communicate), handle errors without having to code for your bespoke method of error reporting.
I only agree with this for a request/response application.
I disagree for fully automated applications, for example message-driven ones. Logging key steps is important for situations where you need that post-mortem information after a problem already happened, because you often cannot reproduce the exact environment in which the problem was encountered. Numbers and timing information can be very useful in these systems, as well as information about what path the system will take (tracing) when there are many.
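For example, key-step logging with timing in a message handler might look like this sketch; the step names and fields are illustrative:

```python
import logging
import time

logger = logging.getLogger("worker")

def process_message(msg_id, payload):
    """Log the key steps of handling one message, with timing.

    Sketch only: step names and fields are illustrative, but the idea is
    to leave enough of a trail for a post-mortem on a run you cannot
    reproduce.
    """
    start = time.monotonic()
    logger.info("msg %s: received, route=%s", msg_id, payload.get("route"))
    # ... actual processing would happen here ...
    logger.info("msg %s: done in %.3fs", msg_id, time.monotonic() - start)
```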