My analysis
1.
In my experience, your issue is most likely due to the factors described in points 2 and 3 below.
Both hypotheses explain the observed symptoms well and are not mutually exclusive.
2. Exceeding HTTP header size limits on the OCI LB
2.1.
OCI Load Balancers enforce a strict limit on the size of response headers received from the backend.
This is particularly relevant during SSR, as SvelteKit generates large headers—notably Link headers (for preload assets) and Set-Cookie headers.
If the header size exceeds the OCI LB limit, the load balancer returns a 502.
2.2.
This hypothesis precisely explains the difference between SSR (large headers) and SPA navigation (small API response headers).
2.3.
This hypothesis also explains why direct access works: Nginx and browsers have higher limits than the OCI LB.
2.4.
According to Oracle (support.oracle.com/knowledge/Oracle%20Cloud/2603461_1.html), the maximum size of HTTP response headers from the backend is fixed at 8 KB.
Exceeding this limit results in a 502 error, even if the backend responds with 200 OK.
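To test this hypothesis, fetch an SSR page from the backend directly (bypassing the LB) and add up the header bytes. A small sketch, assuming Node 18+ (the 4 extra bytes per line approximate the ": " separator and the CRLF; the URL in the usage note is a placeholder):

```javascript
// Rough size estimate of a response's headers, to compare against the 8 KB limit.
// Uses the WHATWG Headers object available in Node 18+ and in browsers.
function headerBytes(headers) {
  let total = 0;
  for (const [name, value] of headers) {
    // each wire-format header line is "name: value\r\n" → 4 extra bytes per line
    total += name.length + value.length + 4;
  }
  return total;
}

// Usage sketch (placeholder URL):
// const res = await fetch('https://backend.example.com/some-ssr-page');
// console.log(headerBytes(res.headers)); // a value above 8192 would trip the OCI LB limit
```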
The OCI documentation does describe HTTP Header Rules in LB Rule Sets, but these rules only allow raising the limit for request headers (exceeding that limit causes a 400 error); crucially, they cannot be used to adjust the response-header limit.
Therefore, solving this problem requires reducing the size of the headers generated by SvelteKit during SSR (e.g., by optimizing Link preload headers or reducing the size/number of Set-Cookie headers).
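One way to shrink the headers is a SvelteKit server hook that strips the Link preload header before the response reaches the LB. This is only a minimal sketch, assuming the Link preload entries are what pushes you over the 8 KB limit; browsers still load the assets via the HTML, just without the preload hint:

```javascript
// src/hooks.server.js — a minimal sketch of a SvelteKit `handle` hook.
// Assumption: the oversized response headers are Link preload entries.
export const handle = async ({ event, resolve }) => {
  const response = await resolve(event);
  // Delete the potentially very large Link preload header so the total
  // response-header size stays under the OCI LB's 8 KB limit.
  response.headers.delete('link');
  return response;
};
```

The same hook is also the place to trim or consolidate Set-Cookie headers if those turn out to be the larger contributor.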
3. Mismatched Keep-Alive timeouts
3.1.
A 502 error can occur due to a race condition when reusing persistent connections (Keep-Alive).
The OCI LB maintains a connection pool with the backend (Nginx).
If Nginx closes a persistent connection (due to its keepalive_timeout) while the OCI LB still considers the connection active (because the LB-to-Backend Idle Timeout has not yet expired), a race condition occurs.
The LB may attempt to reuse the closed connection, resulting in a 502 error.
3.2.
This is a classic cause of intermittent 502 errors, which directly corresponds to your description «often spits out a 502».
3.3.
The problem occurs only through the LB because this connection management mechanism is not involved when accessing the backend directly.
3.4.
The Keep-Alive timeout on the backend (Nginx keepalive_timeout) must be greater than the LB-to-Backend Idle Timeout on the OCI LB.
According to the OCI documentation («Load Balancer Timeout Connection Settings»), the OCI LB closes connections to the backend that have been idle for more than 300 seconds (a fixed value for the LB-to-Backend Idle Timeout).
If the keepalive_timeout in Nginx is less than this value (e.g., the Nginx default value of 75s), this creates the exact conditions required for this race condition to occur.
The OCI documentation recommends setting the timeout on the backend to at least 310 seconds to prevent 502 errors.
Increasing the Listener Idle Timeout (which you have already tried) does not affect this mechanism, as it defines the idle time during the HTTP request/response phase, not between requests.
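If this hypothesis holds, the fix on the Nginx side is a one-line configuration change (310 seconds follows the Oracle recommendation quoted above; note that from Nginx's point of view the LB is the client, so the client-facing keepalive_timeout is the relevant directive):

```nginx
http {
    # Keep idle LB-to-Nginx connections open longer than the OCI LB's fixed
    # 300 s LB-to-Backend Idle Timeout (Oracle recommends >= 310 s), so Nginx
    # never closes a connection the LB still considers reusable.
    keepalive_timeout 310s;
}
```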
3.5.
This hypothesis effectively explains why the problem occurs predominantly during SSR (page refresh).
SSR requests (like a page refresh) are more likely to occur after a pause in activity, allowing connections to become idle.
If the idle time exceeds the Nginx keepalive_timeout but is less than the LB-to-Backend Idle Timeout (300s), the described race condition occurs.
SPA navigation, in contrast, generates frequent requests that keep the connection active and prevent the idle timeouts from being reached.