ci: run E2E backend under gunicorn instead of flask dev server

Both `cypress-run-all` and `playwright-run` started the Superset backend with `flask run --no-debugger -p $port`. The Flask development server is single-threaded and has no crash-recovery, so heavy tests (most notably `playwright/tests/dashboard/export.spec.ts:61` (Export YAML) and `dashboard-list.spec.ts:266` (Import zip)) can knock the backend offline for the rest of the run. Subsequent tests then cascade-fail with `ECONNREFUSED`, `socket hang up`, `Missing CSRF token`, and `page.goto: net::ERR_ABORTED; maybe frame was detached`.

Across the last 50 master runs of the E2E workflow, 6 failed (12%), every single one with this signature.

Switch both runners to gunicorn with the same shape used in `docker/entrypoints/run-server.sh`:

- `--workers 4 --worker-class gthread --threads 20`: concurrency that matches what the real product runs.
- `--timeout 120`: kill stuck workers instead of letting them hang the entire suite.
- `--max-requests 500 --max-requests-jitter 50`: recycle workers periodically so memory accumulation from long suites doesn't OOM the process.
- `--access-logfile - --error-logfile -`: keep the same per-run log capture pattern.

Only frontend (JS) coverage is captured in E2E (verified: `bashlib.sh` only instruments the JS assets), so multi-worker gunicorn doesn't break the existing coverage path.
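
The identical gunicorn invocation is added to both runners, so a possible follow-up would be to factor it into one shared helper. The sketch below is not part of this commit: the function name `run_e2e_backend` and the `GUNICORN_BIN` override are hypothetical, introduced only for illustration. The flags and the app target are copied from the diff.

```shell
# Hypothetical helper (not in this commit): runs gunicorn with the flags
# this change adds to both cypress-run-all() and playwright-run().
# GUNICORN_BIN is an assumed override so the command line can be inspected
# (e.g. GUNICORN_BIN=echo) without actually starting a server.
run_e2e_backend() {
  "${GUNICORN_BIN:-gunicorn}" \
    --bind "127.0.0.1:$1" \
    --workers 4 \
    --worker-class gthread \
    --threads 20 \
    --timeout 120 \
    --max-requests 500 \
    --max-requests-jitter 50 \
    --access-logfile - \
    --error-logfile - \
    "superset.app:create_app()"
}

# Possible call site, mirroring the redirections used in the diff:
#   nohup run_e2e_backend "$port" >"$flasklog" 2>&1 </dev/null &
#   local flaskProcessId=$!
```

One caveat with this shape: a function cannot be handed to `nohup` as an external command in every shell, so a real refactor might keep `nohup` inside the helper or build the argument list and call `nohup gunicorn ...` at the call site.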
2026-05-22 00:05:15 +00:00 · 2026-05-18 23:08:14 -05:00
parent 4ceefb7e40
commit a43f3a421b
1 changed files with 37 additions and 10 deletions
--- a/.github/workflows/bashlib.sh
+++ b/.github/workflows/bashlib.sh
@@ -175,9 +175,12 @@ cypress-run-all() {
  local APP_ROOT=$2
  cd "$GITHUB_WORKSPACE/superset-frontend/cypress-base"

-  # Start Flask and run it in background
-  # --no-debugger means disable the interactive debugger on the 500 page
-  # so errors can print to stderr.
+  # Start the Superset backend via gunicorn (not `flask run`). The Flask
+  # development server is single-threaded and has no crash-recovery, so
+  # heavy tests (dashboard import/export, SQL Lab) can knock it offline
+  # for the rest of the run — surfacing as `ECONNREFUSED` / `socket hang up`
+  # / `Missing CSRF token` cascades. Gunicorn gives us multiple workers,
+  # a request timeout, and worker-recycling under load.
  local flasklog="${HOME}/flask.log"
  local port=8081
  CYPRESS_BASE_URL="http://localhost:${port}"
@@ -187,7 +190,18 @@ cypress-run-all() {
  fi
  export CYPRESS_BASE_URL

-  nohup flask run --no-debugger -p $port >"$flasklog" 2>&1 </dev/null &
+  nohup gunicorn \
+    --bind "127.0.0.1:$port" \
+    --workers 4 \
+    --worker-class gthread \
+    --threads 20 \
+    --timeout 120 \
+    --max-requests 500 \
+    --max-requests-jitter 50 \
+    --access-logfile - \
+    --error-logfile - \
+    "superset.app:create_app()" \
+    >"$flasklog" 2>&1 </dev/null &
  local flaskProcessId=$!

  USE_DASHBOARD_FLAG=''
@@ -224,7 +238,9 @@ playwright-run() {
  local APP_ROOT=$1
  local TEST_PATH=$2

-  # Start Flask from the project root (same as Cypress)
+  # Start the Superset backend via gunicorn from the project root.
+  # See cypress-run-all() above for the rationale — the Flask dev server
+  # cannot survive the dashboard import/export tests under load.
  cd "$GITHUB_WORKSPACE"
  local flasklog="${HOME}/flask-playwright.log"
  local port=8081
@@ -235,7 +251,18 @@ playwright-run() {
  fi
  export PLAYWRIGHT_BASE_URL

-  nohup flask run --no-debugger -p $port >"$flasklog" 2>&1 </dev/null &
+  nohup gunicorn \
+    --bind "127.0.0.1:$port" \
+    --workers 4 \
+    --worker-class gthread \
+    --threads 20 \
+    --timeout 120 \
+    --max-requests 500 \
+    --max-requests-jitter 50 \
+    --access-logfile - \
+    --error-logfile - \
+    "superset.app:create_app()" \
+    >"$flasklog" 2>&1 </dev/null &
  local flaskProcessId=$!

  # Ensure cleanup on exit
@@ -243,10 +270,10 @@ playwright-run() {

  # Wait for server to be ready with health check
  local timeout=60
-  say "Waiting for Flask server to start on port $port..."
+  say "Waiting for gunicorn server to start on port $port..."
  while [ $timeout -gt 0 ]; do
    if curl -f ${PLAYWRIGHT_BASE_URL}/health >/dev/null 2>&1; then
-      say "Flask server is ready"
+      say "gunicorn server is ready"
      break
    fi
    sleep 1
@@ -254,8 +281,8 @@ playwright-run() {
  done

  if [ $timeout -eq 0 ]; then
-    echo "::error::Flask server failed to start within 60 seconds"
-    echo "::group::Flask startup log"
+    echo "::error::gunicorn server failed to start within 60 seconds"
+    echo "::group::Server startup log"
    cat "$flasklog"
    echo "::endgroup::"
    return 1