The previous @has_access decorator on ActivityDebugView.show was
redirecting authenticated-but-not-authorized requests to the home
page (FAB's access-denied default), which is what the user was seeing
instead of the React app. Then @login_required was tried briefly but
crashed because flask-login's redirect endpoint isn't wired up under
this app's auth backend.
Simplest fix: no decorator on the shell route. The shell page exposes
no data of its own — the React component fires API calls to
/api/v1/{resource}/{uuid}/activity/ which gate access via
raise_for_ownership on the path entity. Anonymous users would see
the React UI with 401 errors inline, which is correct UX for a debug
tool.
Verified: GET /activity-debug/dashboard/<uuid>/ → 200 + SPA shell HTML.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The React route at /activity-debug/{resource}/{uuid} was returning
a JSON 404 on fresh page-load because Superset's SPA model doesn't
use a Flask catch-all — every React-router path needs a corresponding
Flask view that calls render_app_template(). Adding the missing
piece.
* superset/views/activity_debug.py — new ActivityDebugView (uses
BaseSupersetView). Single @expose("/<resource>/<uuid>/") returning
the React shell; the in-app router (superset-frontend/src/views/
routes.tsx) handles the actual page rendering. @has_access keeps
it gated by login.
* superset/initialization/__init__.py — appbuilder.add_view_no_menu
registration alongside the other SPA shell views.
Throwaway by design; both edits tagged with "Throwaway: sc-107283"
comments for easy removal when the activity-view feature ships.
Verified: GET /activity-debug/dashboard/<uuid>/ returns 302 to
/login when unauthenticated (correct), 200 with the SPA shell when
logged in.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds /activity-debug/{dashboard,chart,dataset}/{uuid} for visual
verification of the sc-107283 activity-view endpoints. Lets you eyeball
the response shape without building a curl/jq habit:
* Controls for include / page / page_size / since / until
* Per-record cards showing entity_kind chip, source badge, kind tag,
tombstone affordance, issued_at, changed_by, summary, entity_uuid,
version_uuid, path, from/to values, impact
* Deleted entities render with strikethrough + "deleted" tag
* Soft-deleted state surfaces as an orange tag (always null until
sc-103157 lands deleted_at)
**Throwaway by design**: when the activity-view feature ships, delete
the lazy import + route entry from superset-frontend/src/views/routes.tsx
and remove the superset-frontend/src/pages/ActivityDebug/ directory.
Both edits are tagged with "Throwaway: sc-107283" comments so they're
easy to find.
The component uses @superset-ui/core/components (Card, Tag, Radio, Input,
Space, Typography, Loading, Empty) per CLAUDE.md conventions. TypeScript
clean, ruff/eslint/oxlint clean, no any types. Uses a native HTML
<select> for page_size — the @superset-ui Select wrapper's stricter prop
types weren't worth fighting for a debug tool.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Breaking change to the (still-unmerged) activity-view JSON contract:
``entity_kind`` now emits ``"dashboard"`` / ``"chart"`` / ``"dataset"``
instead of the Python class names ``"Dashboard"`` / ``"Slice"`` /
``"SqlaTable"``. The class names are developer-facing artifacts that
leaked the model layer (e.g. ``"Slice"`` predates the UI rename to
"chart"; ``"SqlaTable"`` predates "dataset"). User-facing JSON should
speak user language.
Implementation: a new ``_USER_FACING_KIND`` map translates at JSON
serialization time only (in ``_decorate_records``). Internal code keeps
the Python class-name form (``model_cls.__name__``) for dispatch — the
existing ``_NAME_COLUMN``, ``_NOT_FOUND_EXC``, ``_API_KIND_LABEL``,
``_can_read``, ``_compute_impact``, etc. all key off class names and
are unchanged. The translation happens at the single ``record["entity_kind"]
= ...`` assignment.
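A minimal sketch of the serialization-time translation, assuming the
map is a plain dict keyed by class name. The helper names
to_user_facing_kind and decorate, and the choice to raise on unknown
names, are illustrative and not taken from activity.py.

```python
# Illustrative stand-in: the real _USER_FACING_KIND may differ in detail.
_USER_FACING_KIND = {
    "Dashboard": "dashboard",
    "Slice": "chart",
    "SqlaTable": "dataset",
}


def to_user_facing_kind(class_name: str) -> str:
    """Translate an internal class name to the JSON-facing entity kind."""
    # Hypothetical helper; raising on unknown names is an assumption.
    try:
        return _USER_FACING_KIND[class_name]
    except KeyError:
        raise ValueError(f"unknown entity kind: {class_name!r}") from None


def decorate(record: dict) -> dict:
    """Copy a record, translating entity_kind at the output boundary only."""
    out = dict(record)
    out["entity_kind"] = to_user_facing_kind(record["entity_kind"])
    return out
```

Internal dispatch keeps using the class-name form; only the copy that
leaves the serializer carries the user-facing value.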
Schema validator ``ACTIVITY_ENTITY_KINDS`` updated to the new tuple.
Integration tests' response-shape assertions renamed via bulk sed; unit
tests testing internal helpers are unchanged (they operate on internal
api_kind / class names).
UPDATING.md example payload updated. Spec updates (spec.md, data-model.md,
contracts/activity-view.yaml) committed separately to the spec repo.
Full suite: 66 unit + 35 integration + 1 xfailed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T037 — Instrument the five phases of get_activity with stats_logger
timing metrics under the `superset.activity_view.<kind>.<phase>_ms`
key convention (plan §D-17). Wrap each phase with a stats_timing
context manager:
* superset.activity_view.<kind>.relationship_resolution_ms (_resolve_scope)
* superset.activity_view.<kind>.fetch_ms (_fetch_change_records)
* superset.activity_view.<kind>.visibility_filter_ms
* superset.activity_view.<kind>.denormalize_ms
* superset.activity_view.<kind>.decorate_ms
`<kind>` is the lowercased path-entity model class name (dashboard /
slice / sqlatable), enabling per-endpoint-family Grafana panels.
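The key convention can be sketched as follows. This is a
self-contained stand-in, not the code in this commit: phase_timer,
RecordingStatsLogger and metric_kind are hypothetical names, and the
real implementation wraps each phase with the existing stats_timing
context manager and stats_logger.

```python
import time
from contextlib import contextmanager
from typing import Iterator

_METRIC_PREFIX = "superset.activity_view"


class RecordingStatsLogger:
    """Stand-in stats logger that only records what it is given."""

    def __init__(self) -> None:
        self.timings: list[tuple[str, float]] = []

    def timing(self, key: str, value: float) -> None:
        self.timings.append((key, value))


@contextmanager
def phase_timer(
    logger: RecordingStatsLogger, kind: str, phase: str
) -> Iterator[None]:
    """Time one pipeline phase and emit <prefix>.<kind>.<phase>_ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Emit even when the phase raises, so failures still show up.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        logger.timing(f"{_METRIC_PREFIX}.{kind}.{phase}_ms", elapsed_ms)


def metric_kind(model_cls: type) -> str:
    """Lowercased class name, e.g. SqlaTable becomes "sqlatable"."""
    return model_cls.__name__.lower()


class SqlaTable:  # placeholder for the real model class
    pass


stats = RecordingStatsLogger()
with phase_timer(stats, metric_kind(SqlaTable), "fetch"):
    records = [1, 2, 3]  # the timed phase would run here
```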
T038 — Add request- and response-shape attributes via the existing
counter/gauge interface:
* superset.activity_view.<kind>.requests.include_<value> (incr)
* superset.activity_view.<kind>.requests.has_since_filter_<true|false> (incr)
* superset.activity_view.<kind>.page_size (gauge)
* superset.activity_view.<kind>.record_count (gauge — post-visibility-filter)
* superset.activity_view.<kind>.related_entity_count.charts (gauge)
* superset.activity_view.<kind>.related_entity_count.datasets (gauge)
Confirmed no PII: entity names, diff content, user identifiers — none
flow into the metric layer. Only counts and shape tags.
T050 — Cross-coupling sanity test (unit-scope): asserts
_METRIC_PREFIX == "superset.activity_view" so a code review catches
accidental drift from sc-103156's eventual "superset.versioning.*"
sibling namespace. Both endpoint families belong to one Grafana panel
under "superset.versioning.* OR superset.activity_view.*".
Full suite: 66 unit + 35 integration + 1 xfailed (T044 sc-103156
restore-kind dependency, unchanged).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* T042 test_activity_marks_hard_deleted_chart_with_tombstone (D-15)
Edit a chart, hard-delete it via db.session.delete(chart) + commit,
hit dashboard activity. Asserts tombstoned record has
entity_deleted=True, entity_uuid=None, entity_name preserved from
the last shadow row. The fixture's _cleanup tolerates missing slices
(uses Slice.id.in_(slice_ids), which silently skips ids that no longer
exist).
* T044 test_activity_surfaces_dashboard_restore_event (AV-015) — xfail
AV-015 requires sc-103156's restore code to emit a synthetic
kind="restore" change record with to_value carrying the source
version_uuid + label. sc-103156's restore_version() currently does
not — the diff capture surfaces the field changes as kind="field".
The activity layer correctly passes through whatever kind sc-103156
emits; the test will start passing once the upstream emission lands.
Marked strict=True so that once sc-103156 emits the record, the
unexpected pass is reported as a failure and we know to revisit.
* T051 test_activity_excludes_records_after_retention_prune (AV-010)
Edit a chart (creates a new version_transaction), backdate that
transaction's issued_at past the 30-day retention cutoff, run
_prune_old_versions_impl(retention_days=30). Assert the prune
removed ≥1 transaction AND the activity endpoint's filtered count
decreased.
* T039 test_activity_pagination_is_deterministic_and_disjoint
(SC-AV-002 pragmatic interpretation). The spec's stricter "no
skip/duplicate under concurrent writes" cannot hold with offset
pagination: each record inserted at the top shifts every later page
by one. Cursor pagination would solve this (deferred per plan §D-10).
Under offset, the testable guarantees are: (a) same request fired
twice produces identical pages; (b) page N and page N+1 are
disjoint under one request round. Both come from the
(issued_at DESC, transaction_id DESC, sequence DESC) sort.
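The two testable guarantees can be sketched in isolation. This is a
simplified model, assuming 0-indexed pages and that
(transaction_id, sequence) is unique per record so the three-column
sort is a total order; paginate, sort_key and the record shape are
hypothetical.

```python
from datetime import datetime

# Hypothetical records; only the three sort columns matter here.
RECORDS = [
    {"issued_at": datetime(2024, 5, 1, 12, 0), "transaction_id": 7, "sequence": 0},
    {"issued_at": datetime(2024, 5, 1, 12, 0), "transaction_id": 7, "sequence": 1},
    {"issued_at": datetime(2024, 5, 1, 12, 0), "transaction_id": 6, "sequence": 0},
    {"issued_at": datetime(2024, 4, 30, 9, 0), "transaction_id": 5, "sequence": 0},
    {"issued_at": datetime(2024, 4, 29, 9, 0), "transaction_id": 4, "sequence": 0},
]


def sort_key(record):
    return (record["issued_at"], record["transaction_id"], record["sequence"])


def paginate(records, page, page_size):
    """Offset pagination over a total order, all three keys descending."""
    ordered = sorted(records, key=sort_key, reverse=True)
    start = page * page_size  # assumes 0-indexed pages
    return ordered[start : start + page_size]


first = paginate(RECORDS, page=0, page_size=2)
second = paginate(RECORDS, page=1, page_size=2)
```

Because the key is a total order, the input order of the rows does not
affect the result, which gives (a); slicing one ordered list at
adjacent offsets gives (b).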
Full activity-view suite: 35 passed, 1 xfailed in 46s. The xfail is
T044 with the documented sc-103156 dependency.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three of the Phase 6 polish tasks landed together:
* T040 test_activity_ordering_is_stable_by_issued_at_then_transaction_id
— AV-006 deterministic-ordering contract. Iterates adjacent pairs in
the response and asserts (issued_at, transaction_id) is monotonically
non-increasing. Catches sort-order regressions: a random ordering
would almost certainly fail the pairwise check.
* T041 test_activity_page_size_caps_returned_records_at_200 pairs
with the existing pagination_clamps test. That test confirms
?page_size=500 doesn't return a 400; this one confirms the response
is bounded to ≤200 records as the OpenAPI contract guarantees.
* T053 verification: every fixture-mutating test in
activity_view_tests.py was already following the
try/finally + rollback convention (sweep verified 30 tests; zero
non-conformant). No remediation needed; documenting the
sweep result in the commit message.
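The T040 pairwise check amounts to something like the following
sketch. It assumes issued_at arrives as an ISO 8601 string that
datetime.fromisoformat accepts; ordering_key and is_non_increasing are
hypothetical names, not the test's actual helpers.

```python
from datetime import datetime


def ordering_key(record):
    """(issued_at, transaction_id) as a comparable tuple."""
    return (datetime.fromisoformat(record["issued_at"]), record["transaction_id"])


def is_non_increasing(records):
    """True when every adjacent pair is ordered newest-first (ties allowed)."""
    keys = [ordering_key(r) for r in records]
    return all(earlier >= later for earlier, later in zip(keys, keys[1:]))
```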
Full activity-view suite: 32/32 integration + 65/65 unit tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two structural changes from the second SQLAlchemy review. No behaviour
change; full activity-view suite (70 unit + 30 integration) still
green.
* Tidy 1 (Warning #2): switch _batch_chart_counts and
_batch_datasets_used_by_charts from positional row-unpack to
.mappings()-keyed access. Column-reorder-safe: a future edit that
reorders columns in the SELECT can't silently misassign values
during the Python loop. Matches the existing convention in
_select_change_rows_for_kinds.
* Tidy 2 (Suggestion #1): docstring on _seed_activity_history
explaining why it intentionally commits without rollback (unlike
the T053 convention in activity_view_tests.py). The seed IS the
setup, not the unit under test — the endpoint reads a realistic
post-commit state of the shadow tables.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two doc-shaped tasks landed together because they're both about making
the activity-view endpoints discoverable by external consumers.
T047 — UPDATING.md: new section under the existing entity-version-history
entry documenting the three activity-view endpoints (dashboard / chart /
dataset), their query params (since / until / include / page / page_size),
the response shape with all DTO fields, the silent permission filter
(AV-008), tombstone behaviour (D-15), and the no-feature-flag / no-
new-tables impact statement. Mirrors the depth of the sc-103156
versioning section above it.
T048 satisfied by the UPDATING.md entry: the activity-view feature
adds no new config keys (no SUPERSET_* env vars, no feature flag), and
the per-endpoint API reference is auto-generated from the YAML
docstrings via FAB's OpenAPI integration. The `/swagger/v1` page picks
up the activity endpoints automatically — verified by the new tests
below. sc-103156 followed the same pattern (UPDATING.md only, no
standalone config doc).
T049 — Three new tests in TestActivityOpenApiSpec verify FAB's OpenAPI
generation includes the activity endpoints with the right shape:
* test_three_activity_paths_appear_in_openapi — the three
/<uuid_str>/activity/ paths are surfaced in /api/v1/_openapi.
* test_activity_endpoints_document_query_params — since / until /
include / page / page_size are all declared, and include's enum is
exactly {"self", "related", "all"}.
* test_activity_endpoints_declare_200_response — 200 + 400/401/403/404
are all declared response codes.
base_api_tests.py::TestOpenApiSpec::test_open_api_spec already
validates the full spec's structural correctness on every CI run, so
malformed YAML in the activity-view docstrings would have been caught
upstream. The new tests add activity-specific assertions about the
endpoints' presence and parameter shape.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaced by T045's perf test (which failed at p95 4228ms vs 1500ms
budget) and an ad-hoc per-phase profile that pinpointed _resolve_scope
as accounting for ~1926ms of the wall clock.
The previous _resolve_dashboard_scope called _datasets_used_by_chart
once per slice on the dashboard. On a fixture with ~12 charts that's
12 sequential SQL round-trips for the same shape of query, each
returning a small result set. The new _batch_datasets_used_by_charts
takes the full set of slice_ids and returns
{slice_id: [(dataset_id, window), ...]} in one query; the singular
form _datasets_used_by_chart now forwards to the batch (used only by
_resolve_chart_scope which has a single slice).
Impact (measured on the same fixture):
* Wall-clock per request: 3955ms → 2257ms (43% less time)
* Total query count per request: 18,376 → 29 (driven by the
identity-map populating during the warmup; the genuine count for
the activity layer alone is ~13 SQL queries)
* T045 p95 over 50 runs: 4228ms → 2429ms
The remaining gap to the 1500ms budget is dominated by Python work
inside _fetch_change_records: post-filter and sort over ~26K
accumulated test-environment records. The spec's target scale is
"25 charts × 3 dataset windows" — the SQLite test environment has
accumulated far more state from prior test runs. T046 (index
migration) is deferred because the remaining bottleneck is Python-
side, not SQL-side; production data on Postgres with retention
pruning will not have this shape. Document and revisit when a
production-representative perf measurement is available.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two opt-in tests (SUPERSET_PERF_VALIDATION=1 to run, skipped otherwise)
under the existing PerfValidationTests class:
* test_av_sc001_activity_endpoint_p95_under_1500ms — seeds dense
history on the birth_names dashboard (~12 chart edits, 5 dataset
edits, 1 dashboard edit), then hits /api/v1/dashboard/<uuid>/activity/
50 times after warmup and asserts p95 < 1500ms per SC-AV-001.
* test_av_sc003_save_path_p95_unaffected_by_activity_view — under the
same fixture, runs 50 dashboard saves and re-measures the SC-004
version-capture overhead (50ms p95 budget). Confirms the activity-
view branch hasn't added a new save-path coupling that regresses
sc-103156's save-path budget. PASSES on the current branch — read-
only feature, no new save-path code paths.
Seed fixture differs from the spec's "25 charts × 3 dataset windows
each" target — the birth_names fixture has ~12 charts on a single
dataset, so we approximate by editing many charts plus the dataset
multiple times. A bespoke multi-dataset fixture would let us measure
against the spec's exact scale; that's future work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Surfaced by the SQLAlchemy review (Warning #1) and required before
T045's p95 perf budget can be met on dashboards with many historical
dataset edits.
Before: _decorate_records fired one COUNT(DISTINCT slice_id) query
per related dataset record via _compute_impact → _count_dashboard_
charts_pointing_at_dataset_at_tx. For a dashboard activity stream
with N dataset-edit records, that's N round-trips to compute impacts.
After: _decorate_records collects the distinct (dataset_id, target_tx)
pairs once via _collect_impact_pairs, fires a single batched query
via _batch_chart_counts that pulls the (slice, dataset, validity-
window) state for every relevant slice, and filters by validity in
Python per pair. The result is a {(dataset_id, target_tx): count}
mapping consumed by the new pure helper _impact_for_record per row.
Round-trip count drops from O(N records) to O(1) for the impact
calculation. The SQL stays small and dialect-portable — same per-kind
IN-clause + Python validity filter pattern as _fetch_change_records.
Trade-off documented in _batch_chart_counts' docstring: the SELECT
pulls (slice_id, datasource_id, two validity-window pairs) for every
slice ever on the dashboard whose dataset matches one of the
requested dataset_ids. For a busy dashboard with 100 slices and 50
versions each, that's ~5000 rows into Python vs N small COUNT scans.
For N > ~5 (which is typical) the batch wins.
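The Python-side validity filter can be sketched as below. This is a
simplification under stated assumptions: each row carries a single
half-open transaction window (the real query returns two window
pairs), and batch_chart_counts / within here are illustrative
re-implementations, not the functions in this commit.

```python
from collections import defaultdict
from typing import Optional

# (slice_id, dataset_id, start_tx, end_tx); end_tx=None means still valid.
Row = tuple[int, int, int, Optional[int]]


def within(start_tx: int, end_tx: Optional[int], target_tx: int) -> bool:
    """Half-open validity check: start_tx <= target_tx < end_tx."""
    return start_tx <= target_tx and (end_tx is None or target_tx < end_tx)


def batch_chart_counts(
    rows: list[Row], pairs: set[tuple[int, int]]
) -> dict[tuple[int, int], int]:
    """Count distinct slices pointing at each dataset at each target tx."""
    slices_by_pair: dict[tuple[int, int], set[int]] = defaultdict(set)
    for slice_id, dataset_id, start_tx, end_tx in rows:
        for pair_dataset_id, target_tx in pairs:
            if pair_dataset_id == dataset_id and within(start_tx, end_tx, target_tx):
                slices_by_pair[(pair_dataset_id, target_tx)].add(slice_id)
    return {pair: len(slices_by_pair[pair]) for pair in pairs}
```

The rows come from one query; every (dataset_id, target_tx) pair is
then answered from memory, which is where the O(1) round-trip count
comes from.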
Removed:
* _compute_impact (replaced by _impact_for_record + _collect_impact_pairs)
* _count_dashboard_charts_pointing_at_dataset_at_tx (replaced by
_batch_chart_counts)
Test changes: the three _compute_impact unit tests (no-impact paths)
become six _impact_for_record tests (positive count + four no-impact
paths + zero-count → None). Five new _collect_impact_pairs tests
cover dashboard/chart/dataset path branching plus dedupe and empty.
Full suite: 27/27 integration + 65/65 unit (was 57; +8 from the
restructure). No semantic regression on either side of the cut.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No-op docstring change. The function fires one SELECT per entity_kind
with entity_id IN (...) instead of one query with tuple_(entity_kind,
entity_id).in_(...). The latter is supported by Postgres but its SQL
emission across MySQL and SQLite is uneven (some dialects emit literal
ROW expressions, some emit a transformed multi-column form, some don't
support it at all in older versions). Three round-trips per request is
the correct trade-off given Superset's multi-dialect requirement.
Documenting now so a future reader doesn't reach for the tuple-IN
"optimisation" and silently break MySQL or SQLite. Surfaced by the
SQLAlchemy review.
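For illustration, the per-kind shape looks roughly like this when
written against the stdlib sqlite3 driver instead of SQLAlchemy. The
three-column table is a cut-down stand-in for version_changes, and
fetch_changes_per_kind is a hypothetical name.

```python
import sqlite3
from collections import defaultdict


def fetch_changes_per_kind(conn, targets):
    """One SELECT per entity_kind with entity_id IN (...); no tuple-IN."""
    ids_by_kind = defaultdict(list)
    for kind, entity_id in targets:
        ids_by_kind[kind].append(entity_id)

    rows = []
    for kind, ids in sorted(ids_by_kind.items()):
        placeholders = ", ".join("?" for _ in ids)
        sql = (
            "SELECT entity_kind, entity_id, transaction_id FROM version_changes "
            f"WHERE entity_kind = ? AND entity_id IN ({placeholders}) "
            "ORDER BY transaction_id"
        )
        rows.extend(conn.execute(sql, [kind, *ids]).fetchall())
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE version_changes "
    "(entity_kind TEXT, entity_id INTEGER, transaction_id INTEGER)"
)
conn.executemany(
    "INSERT INTO version_changes VALUES (?, ?, ?)",
    [("chart", 1, 10), ("chart", 2, 11), ("dashboard", 1, 12), ("dataset", 1, 13)],
)
rows = fetch_changes_per_kind(conn, [("chart", 1), ("dashboard", 1)])
```

Pairing the kind with its own IN list keeps ids from different kinds
apart: the dataset with id 1 is not returned even though the id
matches.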
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the dataset activity stream. Per AV-004 datasets have no
transitive layer in V2: dataset activity = dataset's own edits only,
regardless of whether charts use the dataset. The chart/dashboard
endpoints surface dataset edits in their *related* streams (already
shipped); the dataset endpoint stays narrowly self-only.
* T033 DatasetRestApi.activity — third application of the shared
resolve_endpoint_path_entity + parse_activity_query_params pattern.
Endpoint body is ~10 lines of real logic, same shape as T016 and
T028 but with SqlaTable as the path entity. The activity orchestrator
already short-circuits the related-entity resolution for datasets
(_resolve_scope returns [] for related when path_kind="SqlaTable"),
so the dataset endpoint inherits the V2 semantics for free.
* T034 TestDatasetActivityView class added to activity_view_tests.py.
* T035 test_dataset_activity_excludes_chart_edits — AS-1 of US3 /
AV-004 verbatim. Edit a chart that uses the dataset; assert the
chart edit does NOT appear in the dataset's activity stream
(datasets are read-only upstream of charts in V2).
* T036 test_dataset_activity_includes_dataset_self_edits — confirms
the positive path: editing the dataset's own description surfaces
a SqlaTable/self record.
* New test_dataset_activity_related_only_returns_empty — AV-004 in
contract form: ?include=related on a dataset returns an empty
result list and count=0 because there are no related entities to
draw from.
Plus the standard boundary tests (404 for unknown UUID, 400 for
malformed UUID / invalid include, 403 for non-owner, 200 envelope
shape).
Full activity-view suite: 27/27 integration tests + 57/57 unit tests
green. All three endpoint families live.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the chart cross-entity activity stream (US2). Builds on the
resolve_endpoint_path_entity helper from the prior tidy commit
(1d60513079), so the new endpoint is a small composition of: resolve
path entity → parse params → get_activity(Slice, ...) → serialize.
* T028 ChartRestApi.activity — @expose("/<uuid_str>/activity/") on the
chart blueprint. Decorator stack mirrors list_versions; registered
in include_route_methods. The MODEL_API_RW_METHOD_PERMISSION_MAP
entry "activity": "write" added in Phase 3 already covers this
endpoint family (the map is class-permission-agnostic).
* T029 TestChartActivityView class added to activity_view_tests.py,
following the same patterns as TestDashboardActivityView.
* T030 test_chart_activity_includes_dataset_edit_as_related — Given
a chart pointing at the birth_names dataset, When the dataset's
description is edited, Then the chart activity stream surfaces a
SqlaTable/related record (AS-1 of US2).
* T032 test_chart_activity_excludes_sibling_dashboards — Given the
chart is on a dashboard, When the dashboard's title is edited,
Then NO Dashboard records appear in the chart's activity. Per the
spec's Relationship Traversal section: charts don't see sideways
to dashboards they happen to be on.
* T031 (datasource-switch attribution) deferred — needs a second
dataset fixture which the birth_names environment doesn't provide
cleanly. Will land with a multi-dataset test harness in a follow-up.
Additional coverage: 404 for unknown UUID, 400 for malformed UUID,
400 for invalid include, 403 for non-owner, 200 envelope smoke,
chart-self-edit appears as source=self, include=self filter test.
Full suite: 19/19 integration tests + 57/57 unit tests green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the change easy, then make the easy change. Phase 4 (T028 chart
/activity/ endpoint, T033 dataset endpoint) duplicates the UUID-parse
+ find_active_by_uuid + raise_for_ownership dance from the existing
DashboardRestApi.activity. Extracting now keeps the next two endpoints
single-line affairs.
* New PathEntityResponseError carries a pre-built error Response.
Subclassing Exception (not ValueError) so the endpoint's catch is
precise and the standard hierarchy stays unpolluted.
* New resolve_endpoint_path_entity(api, model_cls, uuid_str) runs the
three-step preflight and raises PathEntityResponseError with the
right 400/404/403 response carried inside.
* DashboardRestApi.activity refactored to use the helper. 12 lines of
imports + try/except boilerplate collapse to a single try/except
block that maps every preflight failure to its response.
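A rough sketch of the pattern, with Flask and the DAO layer replaced
by stand-ins. StubResponse, the owners_by_uuid dict and the simplified
resolve_path_entity signature are all hypothetical; only the
exception-carries-a-response idea is taken from this commit.

```python
import uuid
from dataclasses import dataclass


@dataclass
class StubResponse:
    """Stand-in for a Flask response."""

    status: int
    message: str


class PathEntityResponseError(Exception):
    """Carries a pre-built error response out of the preflight."""

    def __init__(self, response: StubResponse) -> None:
        super().__init__(response.message)
        self.response = response


def resolve_path_entity(owners_by_uuid: dict, user: str, uuid_str: str) -> uuid.UUID:
    """Three-step preflight: parse the UUID, find the entity, check ownership."""
    try:
        entity_uuid = uuid.UUID(uuid_str)
    except ValueError:
        raise PathEntityResponseError(StubResponse(400, "Invalid UUID")) from None
    if entity_uuid not in owners_by_uuid:
        raise PathEntityResponseError(StubResponse(404, "Not found"))
    if user not in owners_by_uuid[entity_uuid]:
        raise PathEntityResponseError(StubResponse(403, "Forbidden"))
    return entity_uuid


def activity(owners_by_uuid: dict, user: str, uuid_str: str) -> StubResponse:
    """Endpoint body: one try/except maps every preflight failure."""
    try:
        entity_uuid = resolve_path_entity(owners_by_uuid, user, uuid_str)
    except PathEntityResponseError as exc:
        return exc.response
    return StubResponse(200, f"activity for {entity_uuid}")
```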
No behaviour change: 10/10 dashboard activity integration tests still
green. Pure structural change, per Beck's tidy-first discipline (the
T028 endpoint code follows in the next commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five structural changes — no behaviour change — applied as one commit
per the tidy-first discipline. All from the clean-code review of
e4070a4716.
* Tidy 1 — Fix _fetch_change_records sort key. Was calling .timestamp()
on issued_at with an `else 0` fallback; the fallback was dead defense
(column is non-null per sc-103156 schema) and .timestamp() interprets
tz-naive datetimes in the host's local timezone, so results vary by
machine. Sort on the datetimes directly.
* Tidy 2 — parse_activity_query_params now raises ActivityParamsError
(subclass of ValueError) instead of returning (Optional[dict],
Optional[str]). The tuple was forcing every caller into a defensive
`if error or params is None: return self.response_400(message=error
or "Invalid query parameters")`. The new shape — `try: params =
parse(...); except ActivityParamsError as exc: return
response_400(str(exc))` — is shorter, type-safe, and the contract
is enforced at the boundary.
* Tidy 3 — Test helper now uses Flask client's query_string= parameter
instead of f-string concatenation. Handles URL-encoding correctly
for the day a test passes a value containing & / = / + / etc.
* Tidy 4 — get_activity pipeline collapses to a single rolling
`records` variable instead of the mid-stream `raw / visible_raw /
enriched / visible` naming. Each function call's name documents
what the step does; no intermediate variable names needed.
* Tidy 5 — Extracted four per-parameter parsers: _parse_optional_iso,
_parse_include, _parse_page, _parse_page_size. The "parse one
parameter" concept now has a name. Cost: four small helpers (each
10-15 lines, one job). Benefit: parse_activity_query_params is a
10-line table-driven dispatcher.
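Two of the per-parameter parsers could look roughly like this. The
bodies are a sketch, not the committed code: the error messages, the
rejection of values below 1, and the constant names are assumptions.
The default of 25 and the clamp to 200 are the contract values.

```python
from datetime import datetime
from typing import Optional

DEFAULT_PAGE_SIZE = 25
MAX_PAGE_SIZE = 200


class ActivityParamsError(ValueError):
    """Raised when a query parameter cannot be parsed."""


def parse_optional_iso(name: str, raw: Optional[str]) -> Optional[datetime]:
    """Parse an optional ISO 8601 timestamp, tolerating a trailing Z."""
    if not raw:
        return None
    # datetime.fromisoformat rejects "Z" before Python 3.11, so rewrite it.
    text = raw[:-1] + "+00:00" if raw.endswith("Z") else raw
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        raise ActivityParamsError(f"{name} must be an ISO 8601 timestamp") from None


def parse_page_size(raw: Optional[str]) -> int:
    """Parse page_size, clamping to the maximum instead of rejecting."""
    if raw is None:
        return DEFAULT_PAGE_SIZE
    try:
        value = int(raw)
    except ValueError:
        raise ActivityParamsError("page_size must be an integer") from None
    if value < 1:  # assumption: non-positive sizes are rejected
        raise ActivityParamsError("page_size must be at least 1")
    return min(value, MAX_PAGE_SIZE)
```

A caller then needs only one except ActivityParamsError clause to map
any parameter failure to a 400.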
Test changes: parser unit tests now use pytest.raises(ActivityParamsError)
instead of unpacking the (params, error) tuple. Added one test
confirming ActivityParamsError subclasses ValueError so the standard
library exception hierarchy still catches it. Total unit tests: 57.
Integration tests still 10/10 green.
Deferred per the review: the UUID-parse + entity-find + ownership-
check dance is duplicated between activity / list_versions / get_version
and will grow to T028 / T033. Refactor when T028 lands — three real
callers > one prospective one.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the dashboard cross-entity activity stream (US1 MVP).
Closes T015-T018, T023-T024, T026 from sc-107283 tasks.md. T019-T022
(complex fixture choreography) and T025 (RBAC fixture for restricted
chart access) deferred to a follow-up; T025's logical coverage is
provided by the new unit tests for _can_read kind dispatch.
* T015 Rename _dashboard_related_scope → _resolve_dashboard_scope and
_chart_related_scope → _resolve_chart_scope, parallel to _resolve_scope
/ _resolve_path_entity. Aligns the codebase with the task spec names.
* T016 New activity(uuid_str) method on DashboardRestApi:
@expose("/<uuid_str>/activity/"), @protect + @safe + @statsd_metrics
+ @event_logger. Same row-level ownership check pattern as
/versions/. Registered in include_route_methods and mapped to "write"
in MODEL_API_RW_METHOD_PERMISSION_MAP so the can_write Dashboard
permission gates access (Alpha-non-owner gets 403 from
raise_for_ownership, not 404 from missing permission).
* New parse_activity_query_params helper in activity.py — shared parser
for since/until/include/page/page_size used by all three endpoint
families. Tolerates the Z suffix that datetime.fromisoformat rejects
on Python <3.11.
Silently clamps page_size to the contract max instead of rejecting.
Two correctness/scale bugs found and fixed under integration-test load:
* SQLite SQLITE_MAX_EXPR_DEPTH (1000) was tripping on dashboards with
many slices × many historical attachment windows. _fetch_change_records
used to emit one OR-clause per (entity_kind, entity_id, window) tuple;
now it issues one SELECT per kind with entity_id IN (...) and filters
by exact windows in Python via _row_within_any_window. SQL shape is
proportional to the number of kinds (≤3); the per-entity window
precision is preserved.
* _merge_entity_windows now unions overlapping/touching windows within
each entity via _union_windows. Sequential fixture loads create many
redundant Continuum shadow rows; without merging, the unfiltered
windows still produce many OR branches downstream.
* Pipeline-ordering fix in get_activity: visibility filter runs BEFORE
decoration. Decoration strips entity_id (not in API contract) and the
filter needs it — and dropping invisible records early also avoids
paying for name lookup + tombstone probes on records the requester
can't see (AV-008's silent-filter contract).
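The window union and the Python-side window filter can be sketched as
follows, assuming windows are half-open (start, end) pairs of integer
transaction ids with end=None meaning open-ended. These are
illustrative re-implementations, not the committed _union_windows and
_row_within_any_window.

```python
from typing import Optional

Window = tuple[int, Optional[int]]


def union_windows(windows: list[Window]) -> list[Window]:
    """Merge overlapping or touching half-open windows."""
    merged: list[Window] = []
    for start, end in sorted(windows, key=lambda w: w[0]):
        if merged:
            last_start, last_end = merged[-1]
            if last_end is None:
                break  # an open-ended window already covers everything later
            if start <= last_end:  # overlapping or touching
                new_end = None if end is None else max(last_end, end)
                merged[-1] = (last_start, new_end)
                continue
        merged.append((start, end))
    return merged


def row_within_any_window(tx: int, windows: list[Window]) -> bool:
    """True when tx falls inside at least one half-open window."""
    return any(
        start <= tx and (end is None or tx < end) for start, end in windows
    )
```

Merging first keeps the per-row check cheap: each fetched row is
tested against a handful of windows rather than one per shadow row.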
Tests:
* tests/integration_tests/versioning/activity_view_tests.py — 10 tests
for TestDashboardActivityView covering: 404 for unknown UUID, 400 for
malformed UUID / invalid include / invalid since, 403 for non-owner,
200 envelope smoke, chart-edit-appears-as-related, include=self
filter, include=related filter, page_size clamping.
* tests/unit_tests/versioning/test_activity.py — grew from 30 to 56
tests. New coverage: parse_activity_query_params (7 cases), _can_read
per-kind dispatch (4 cases — covers T025 at unit scope), _union_windows
(9 parametrized cases), _merge_entity_windows window-union case,
_row_within_any_window (6 cases).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five structural changes — no behaviour change — applied as one commit
per the tidy-first discipline (separate from the prior feat commits
that landed the primitives and orchestrator):
* Add tests/unit_tests/versioning/test_activity.py — 30 tests for the
pure-function helpers: _intersect_windows (11 boundary cases),
_resolve_scope branching (per kind + per include mode), entity-
window merging, AV-012 summary headlines, changed_by projection,
kind-translation round-trip, _can_read fall-through, _compute_impact
no-impact paths. Runs in <500ms, no DB, no Flask. The DB-touching
helpers wait for the integration suite (Phase 3+).
* Hoist _datasets_used_by_chart(slice_id) out of the inner attachment-
window loop in _dashboard_related_scope. Previously a chart with N
attachment windows fired N queries for identical data; now once per
slice. Removes a perf cliff at the history-rich case the spec
explicitly calls out (US1 AS-3).
* Delete the unused requesting_user parameter from
_filter_records_by_visibility and from get_activity. It was never
threaded through to the security manager calls (which read the user
from Flask-Login implicitly) — speculative future-shape that the
reader had to mentally trace through. If CLI/Celery bypass becomes
necessary, add it then with a real call site.
* Delete the unused module-level logger. No observability lands here
until T037/T038; reintroduce when those tasks add real logger calls.
* Fix the file docstring — it claimed _resolve_version_tables was
reused; it isn't. Trimmed to (find_active_by_uuid,
derive_version_uuid).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Completes Phase 2 of sc-107283. The activity-view query layer now has
one public entry point — get_activity(model_cls, entity_uuid, ...) —
plus the supporting visibility filter and DTO decorator.
* T011 _filter_records_by_visibility — silent permission filter per
AV-008. Batch-resolves live entities per kind, dispatches
can_access_dashboard / can_access_chart / can_access_datasource.
Tombstoned entities pass through (the decorator handles deleted
messaging; no navigable entity_uuid is exposed). _can_read helper
isolates the per-kind predicate dispatch.
* T012 _decorate_records — synthesizes the full ActivityRecord shape:
entity_kind (API form), entity_uuid (live lookup; null for
tombstones), entity_deleted, entity_deletion_state, source ("self"
vs "related"), summary (AV-012 headline), impact (via T010),
version_uuid (derive_version_uuid), changed_by (User projection).
Strips internal-only columns (entity_id, sequence, raw user cols)
from the final shape.
* T013 get_activity — top-level orchestrator. Branches by include
("self"/"related"/"all") via _resolve_scope; for related,
_dashboard_related_scope walks charts-on-dashboard then datasets-
used-by-chart and clips chart-on-dataset windows by attachment
windows. _chart_related_scope skips the dashboard hop. Datasets
return empty related scope per AV-004. _merge_entity_windows
collapses repeated (kind, id) entries to keep the OR-clause in
_fetch_change_records compact. Pagination is post-visibility-filter
per AV-008; defaults page_size=25, clamps to 200.
Tasks T004-T014 in tasks.md updated to [X]. The next phase opens user
story implementation (T015 dashboard scope helper + T016 endpoint).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the foundational query layer for the cross-entity activity view
in superset/versioning/activity.py. Pure helpers + single-purpose
shadow-table queries; no public entry point yet (T011-T013 follow).
* T004 _resolve_path_entity — UUID lookup via find_active_by_uuid;
typed 404 dispatch by model class (DashboardNotFoundError /
ChartNotFoundError / DatasetNotFoundError).
* T005 _charts_attached_to_dashboard — (slice_id, window) tuples from
dashboard_slices_version, op=2 rows excluded.
* T006 _datasets_used_by_chart — (datasource_id, window) tuples from
slices_version, filtered to datasource_type='table'.
* T007 _intersect_windows — pure half-open window intersection;
end=None acts as positive infinity. Unit-verified.
* T008 _fetch_change_records — workhorse query joining version_changes
+ version_transaction + ab_user, OR-clause over (kind, id, [windows]),
ordered by (issued_at DESC, transaction_id DESC, sequence DESC) per
AV-006. No DB-level pagination here — AV-008's silent permission
filter means count is computed after the visibility filter, so
pagination happens in T013.
* T009 _denormalize_entity_names — batch shadow lookup keyed by
(api_kind, entity_id, transaction_id); validity-strategy predicate.
Helper _resolve_names_for_kind extracted to keep cyclomatic
complexity under ruff's C901 threshold.
* T010 _compute_impact — dashboard+dataset path yields
{"charts": N}; all other path/related shapes yield None per
data-model.md §"impact computation".
* T014 _check_entity_tombstones — live-row existence + soft-delete
state probe; tolerates the pre-sc-103157 absence of a deleted_at
column via hasattr-style introspection on Table.c.
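The window intersection in T007 can be sketched as below. This is an illustrative version under stated assumptions, not the helper itself: windows are taken to be ``(start, end)`` pairs of integer transaction ids, and the name ``intersect_windows`` and the ``Window`` alias are hypothetical.

```python
from typing import Optional

# A window is (start, end), half-open: start is included, end is excluded.
# end=None means the window is still open (positive infinity).
Window = tuple[int, Optional[int]]


def intersect_windows(a: Window, b: Window) -> Optional[Window]:
    """Return the overlap of two half-open windows, or None if empty."""
    start = max(a[0], b[0])
    if a[1] is None:
        end = b[1]
    elif b[1] is None:
        end = a[1]
    else:
        end = min(a[1], b[1])
    if end is not None and start >= end:
        # Half-open windows that merely touch do not overlap.
        return None
    return (start, end)
```

For example, ``intersect_windows((1, 5), (3, None))`` yields ``(3, 5)``, while ``(1, 3)`` and ``(3, 8)`` share no transaction and yield ``None``.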
Kind translation isolated at the module boundary: version_changes
stores lowercase ("chart"/"dashboard"/"dataset"); the ActivityRecord
DTO returns class names ("Slice"/"Dashboard"/"SqlaTable"). The
_TABLE_KIND_TO_API / _API_KIND_TO_TABLE mappings translate at the
two crossings.
Phase 2 orchestration layer (T011-T013) lands in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply clean-code review polish on top of the Phase 1 scaffold before
Phase 2 builds on this surface:
* Replace magic-string field descriptions with ``validate.OneOf`` on
``entity_kind``, ``source``, ``entity_deletion_state``, and ``kind``.
Canonical lists hoisted to module-level tuples (``ACTIVITY_*``).
* Type ``path`` as ``fields.List(fields.String())`` and ``impact`` as
a new ``ActivityImpactSchema`` — replaces two ``fields.Raw`` ``Any``s
whose shape was previously documented only in prose.
* Resolve the ``ActivityChangedBySchema`` docstring contradiction —
the schema deliberately omits ``username`` so it is no longer
honest to call it "identical shape to VersionChangedBySchema".
* Fix the ``activity.py`` module docstring: one polymorphic
``get_activity`` function, not three entry points.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* T001 — new ``superset/versioning/activity.py`` module with ASF
header and docstring describing the cross-entity activity-view query
layer (companion to ``queries.py``). Empty otherwise; Phase 2 fills
it with the shared query helpers.
* T002 — empty test file
``tests/integration_tests/versioning/activity_view_tests.py`` with
ASF header, module docstring naming the three test classes that will
follow, and a ``SupersetTestCase`` import. Phase 3+ fills it with
test classes.
* T003 — three Marshmallow schemas added to
``superset/versioning/schemas.py``:
* ``ActivityChangedBySchema`` — User subset on each record.
* ``ActivityRecordSchema`` — one change record (per data-model.md).
* ``ActivityResponseSchema`` — envelope ``{result, count}``.
Fields mirror data-model.md §"``ActivityRecord`` DTO" line-for-line;
metadata descriptions are FAB/Swagger-ready.
No new endpoints, no listeners, no DB writes — pure scaffolding. The
three schemas are unused until Phase 3 (T016) wires the first endpoint.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Squashes the three migrations that captured the versioning schema as
the feature was developed:
* ``56cd24c07170 add_entity_version_history_tables`` — version_transaction
+ dashboards_version / slices_version / tables_version parent shadows.
* ``e1f3c5a7b9d0 add_version_changes_table`` — the field-level diff log.
* ``f7a2b3c4d5e6 spike_add_child_continuum_shadow_tables`` —
table_columns_version / sql_metrics_version / dashboard_slices_version
child shadows.
Combined into a single migration ``2026-05-28_19-50_56cd24c07170_
add_versioning_tables.py`` that creates the same 8 tables in dependency
order (sequence + version_transaction first, parent shadows, version_
changes, child shadows last) and drops them in reverse on downgrade.
Why:
* All three migrations represent one logical feature ("enable
versioning") — applying any subset leaves a broken intermediate
state.
* The third migration was literally named "spike_..."; the iterative
shape was a development artifact, not a design intent.
* Downstream operators get one migration to apply / reverse and one
file to review.
* The dev DB was rolled back through the three downgrades, the new
migration was applied via ``superset db upgrade``, and column-count
spot checks confirmed matching per-table column counts (4/9/21/26/35/19/15/5).
The ``revision`` hash is reused from the original first migration
(``56cd24c07170``) for chain continuity. ``down_revision`` stays
``2bee73611e32`` (sc-105349 composite-PK tip), unchanged from the
original chain's head.
Net: -70 lines (605 deleted, 535 added), same schema, one Alembic step
instead of three.
The diff engine now recurses into nested dicts inside JSON-blob fields
(``json_metadata``, ``Slice.params`` non-list sub-keys, layout node
``meta``) via a shared ``_recursive_leaf_diff`` helper, emitting one
``ChangeRecord`` per atomic leaf change rather than one record carrying
the whole top-level sub-tree on both sides.
Concretely, a dashboard header text change from "VERSION 2!" to
"HEADER!" used to emit a single record at path ``["edit", "header",
"HEADER-id-1"]`` with the entire layout node duplicated on both sides
of the diff. It now emits a single record at path ``["edit", "header",
"HEADER-id-1", "text"]`` with just the changed string as ``from_value``
/ ``to_value``. Multi-leaf edits produce one record per leaf.
Each call site passes its own depth cap via a named constant:
* ``_LAYOUT_META_DIFF_DEPTH = 3`` — layout meta is presentation state
(text, sizes, colors), shallow by nature.
* ``_JSON_METADATA_DIFF_DEPTH = 6`` — native filter configuration can
go five levels deep (``defaultDataMask.filterState.value``).
* ``_SLICE_PARAMS_DIFF_DEPTH = 6`` — adhoc filter sub-queries and pivot
options can be similarly deep.
A cap is a usefulness bound (granularity that's meaningful in a
timeline), not a safety bound. Cap-on-dict-vs-dict emits a debug log so
production tuning can see when a cap is too tight for typical data.
Lists are treated as opaque leaves: positional paths break under
reorder. Lists with stable identity (adhoc filters, metrics, dataset
columns) already have natural-key walkers (``_diff_list_by_natural_key``)
that emit per-element records with the right identity; those are
unchanged.
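The recursion, depth cap, and opaque-list policy described above can be sketched as follows. This is a simplified stand-in for ``_recursive_leaf_diff``, not its actual body: the function name, signature, and tuple-shaped records are hypothetical, and an absent key is treated the same as ``None``.

```python
from typing import Any

# (path, from_value, to_value), a simplified stand-in for ChangeRecord.
Record = tuple[list[str], Any, Any]


def recursive_leaf_diff(
    old: Any, new: Any, path: list[str], max_depth: int
) -> list[Record]:
    """Emit one record per changed leaf under two JSON-like values."""
    if old == new:
        return []
    both_dicts = isinstance(old, dict) and isinstance(new, dict)
    if not both_dicts or max_depth <= 0:
        # Scalars, lists (opaque), type mismatches, or cap reached:
        # emit a single record carrying both values whole.
        return [(path, old, new)]
    records: list[Record] = []
    for key in sorted(old.keys() | new.keys()):
        # Assumption of this sketch: a missing key reads as None.
        records.extend(
            recursive_leaf_diff(
                old.get(key), new.get(key), path + [key], max_depth - 1
            )
        )
    return records
```

With the header example, diffing ``{"text": "VERSION 2!"}`` against ``{"text": "HEADER!"}`` under the path ``["edit", "header", "HEADER-id-1"]`` produces one record whose path ends in ``"text"``.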
Backward compatibility:
* Top-level scalar fields, child-collection records (column/metric),
M2M slice membership, and layout add/remove/move records all keep
their existing path/from/to shape. The change is scoped to JSON-blob
recursion.
* All 56 existing diff unit tests pass without modification. 13 new
tests cover the Shape B behaviour (helper directly, type mismatches,
opaque list policy, depth cap, nested ``json_metadata``, layout
edit, ``Slice.params`` deep unknown key, cross-node aggregation).
* Existing integration tests in ``change_records_tests.py`` assert on
scalar-field changes only — unaffected.
Restoration is unaffected: the restore path reads from Continuum
shadow tables, not from ``version_changes``. This change is purely
about audit-log granularity.
Spec coverage: tasks.md T047d (added in spec repo). Rationale: plan.md
"Key Design Decision: Leaf-level recursive diff for JSON-blob fields
(Shape B)". Conventions: data-model.md "Leaf-level recursion".
The ``_pin_audit_columns`` helper added in 1b0083bc03 relies on a
SA-version-dependent behavior: calling ``attributes.flag_modified(parent,
"changed_by_fk")`` causes the next UPDATE statement to use the
in-memory value rather than invoking the column's ``onupdate=callable``.
This is the mechanism that prevents a stale ``g.user.id`` from being
written into the parent's ``changed_by_fk`` when the synthetic
flag-flush triggers an UPDATE during an autoflush at a time when the
test user has already been deleted from ``ab_user`` — the cascade that
broke CI in sc-103156 T062.
Four unit tests, no app context required (they use a minimal in-memory
SQLite engine + declarative_base):
* ``test_flag_modified_suppresses_onupdate_callable``: the positive
invariant — the in-memory value lands in the DB and ``onupdate`` is
NOT called. If SA ever changes this semantic, the test fails.
* ``test_onupdate_does_fire_without_flag_modified``: negative sanity
check — confirms the test setup is realistic. Without
``flag_modified``, ``onupdate`` fires as expected.
* ``test_pin_audit_columns_skips_missing_attribute``: the helper
tolerates parents without audit columns (e.g., non-AuditMixin
models). No raise.
* ``test_pin_audit_columns_tolerates_invalid_request_error``: SA
raises ``InvalidRequestError`` from ``flag_modified`` when the
attribute is unloaded in instance state (the freshly-constructed
``session.new`` case). The helper must catch and skip — this is the
production-path safety net mentioned in the docstring.
Uses ``expire_on_commit=False`` for the positive invariant test so the
column stays loaded; the production-path case where the attribute is
expired is the ``InvalidRequestError`` branch covered separately.
Closes review item W2 from the sqlalchemy-review pass on
sc-103156-versioning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four unit tests pin the behavior the validator added in 46f138326b:
* Valid UUID-format strings get coerced to ``uuid.UUID`` instances —
the primary contract that makes importers, test fixtures, and ad-hoc
``Model(uuid=...)`` construction see a consistent type.
* Already-``UUID`` values pass through unchanged (the validator must
not re-wrap; ``is`` identity check pins this).
* Non-UUID-shaped strings pass through unchanged. This keeps test
mocks that use placeholder strings (e.g.
``tests/unit_tests/mcp_service/dashboard/test_dashboard_schemas.py::test_dashboard_schema``
with ``"dashboard-uuid-7"``) working. Without this contract, those
tests would break under the validator and need to migrate to
``uuid.uuid4()``. If the contract is ever tightened to raise, this
test catches the regression and signals the migration cost.
* ``None`` passes through unchanged.
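The four cases above can be sketched as a plain function. The real validator is a ``@validates("uuid")`` method on ``UUIDMixin``; ``coerce_uuid`` below is a hypothetical standalone equivalent of the coercion logic only.

```python
import uuid
from typing import Any


def coerce_uuid(value: Any) -> Any:
    """Coerce UUID-shaped strings to uuid.UUID; pass everything else through."""
    if isinstance(value, str):
        try:
            return uuid.UUID(value)
        except ValueError:
            # Not UUID-shaped (e.g. a placeholder in a test mock): leave as-is.
            return value
    # Already a UUID, None, or any other type: returned unchanged.
    return value
```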
Closes review item S3 from the sqlalchemy-review pass on
sc-103156-versioning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``ChartRestApi.list_versions`` (and the dashboard / dataset siblings)
resolves the entity via ``VersionDAO.find_active_by_uuid`` to run the
``raise_for_ownership`` check from T056, then calls
``VersionDAO.list_versions(model_cls, entity_uuid)`` — which internally
runs the SAME ``find_active_by_uuid`` query a second time. The lookup
filters on ``uuid`` (a unique non-PK column), so it isn't served from
SQLAlchemy's identity map — every call is a real SELECT.
Add an optional ``entity=`` keyword to ``list_versions``,
``get_version``, and ``resolve_version_uuid`` so callers that have
already resolved the path entity can skip the redundant lookup. The
public signature stays backward-compatible (kwarg-only).
Threading ``entity`` through ``get_version`` also avoids a third
redundant lookup inside the ``resolve_version_uuid`` call it makes.
Net: 6 fewer SELECT queries on the ``/versions/`` and
``/versions/<uuid>/`` hot paths (1 per call × 6 endpoints). All
``find_active_by_uuid`` callsites in the API layer were already
present from T056; this commit just plumbs them through.
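The shape of the change can be sketched with a stand-in DAO that counts lookups instead of running SELECTs. ``FakeDAO`` and its dict-based entity are hypothetical; only the keyword-only ``entity=`` parameter mirrors the description above.

```python
from typing import Optional


class FakeDAO:
    """Illustrative stand-in; each lookup would be a real SELECT in production."""

    def __init__(self) -> None:
        self.lookups = 0

    def find_active_by_uuid(self, entity_uuid: str) -> dict:
        self.lookups += 1
        return {"uuid": entity_uuid}

    def list_versions(
        self, entity_uuid: str, *, entity: Optional[dict] = None
    ) -> list[str]:
        # Only resolve the entity when the caller has not already done so.
        if entity is None:
            entity = self.find_active_by_uuid(entity_uuid)
        return [entity["uuid"]]
```

A handler that has already resolved the entity for the ownership check passes it along (``dao.list_versions(uuid, entity=entity)``), while existing callers that omit the keyword keep the old behaviour.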
Closes review item W3 from the sqlalchemy-review pass on
sc-103156-versioning.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reset session state (``rollback`` + ``expire_all``) before and after each
test in this class. Defensive hygiene against the Postgres-only multi-
test cascade documented in spec task T062 — a previous test in the
full-suite ordering can leave Continuum's shadow-table session attributes
in a state where the restore command's ``@transaction`` boundary raises
mid-flush and the API returns 422 "Dataset could not be updated."
Note: this *doesn't* fix the cascade itself (the polluter is writing to
the DB, not just the session; reproducer remains 3 failures on Postgres
full-suite ordering). The root cause is queued as T062. This change just
guarantees that whatever state this class's tests inherit, they start
from a clean session view.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The version-history integration tests follow the codebase's dominant
pattern: a trailing ``# Cleanup`` block at the end of each test that
restores the fixture entity's mutated scalar fields (``description``,
``slice_name``, ``dashboard_title``). The pattern is brittle — when any
intermediate assertion fails, the cleanup is never reached and the
fixture stays polluted with the test's last edit, cascading state into
later tests in the same suite run.
This bit hard in CI's full-suite Postgres ordering: a failure in
``test_restore_applies_scalar_field`` left ``description="restore-test
v2"`` on the birth_names dataset, polluting downstream tests including
the RLS virtual-dataset cascade. The same shape is latent in the chart
and dashboard files.
Convert every mutating test in the three version-history files to wrap
the mutation block in ``try:`` and the restoration in ``finally:``,
with a ``db.session.rollback()`` at the top of the finally to clear any
half-flushed state left by a 422 response. Re-query the entity by id
inside the finally so the restore writes against a live object rather
than a stale handle the failing path may have expired or detached.
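The try/finally conversion can be sketched as below, with hypothetical stand-ins (``Fixture``, ``FakeSession``, ``run_mutating_test``) in place of the real models and ``db.session``. The re-query-by-id step is omitted because the stand-in has no database.

```python
from typing import Callable


class Fixture:
    """Stand-in for a fixture entity with one mutable scalar field."""

    def __init__(self) -> None:
        self.description = "original"


class FakeSession:
    """Stand-in for db.session; records rollbacks instead of performing them."""

    def __init__(self) -> None:
        self.rollbacks = 0

    def rollback(self) -> None:
        self.rollbacks += 1


def run_mutating_test(
    fixture: Fixture, session: FakeSession, body: Callable[[Fixture], None]
) -> None:
    original = fixture.description
    try:
        body(fixture)  # mutation plus assertions; may raise
    finally:
        # Clear any half-flushed state first, then restore the fixture.
        session.rollback()
        fixture.description = original
```

Because the restoration sits in ``finally``, a failing assertion inside ``body`` still leaves the fixture at its original value.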
Pure hygiene improvement — doesn't fix the underlying Postgres failure
chain (still triggered by ``model_tests.py``'s interaction with later
tests in the full-suite ordering), but prevents this set of tests from
amplifying any future cascade.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ``@validates("uuid")`` coercion added in 46f138326b now ensures
``UUIDMixin.uuid`` always returns a ``uuid.UUID`` instance after
assignment, regardless of input type. This is the right semantic
(matches what ``UUIDType`` would emit on a fresh DB read) but it
broke three pre-existing tests that compared the attribute against
string literals — those assertions only passed before because the
post-INSERT refresh path either didn't run or ran inconsistently for
the model under test.
Update those three assertions to compare ``UUID`` to ``UUID`` so the
test contract matches the now-consistent attribute shape:
* ``test_import_database`` and ``test_import_database_no_creds`` —
``database.uuid`` now matches the coerced ``UUID`` literal.
* ``test_load_parquet_table_sets_uuid_on_new_table`` — ``tbl.uuid``
same treatment.
Also harden the validator to pass non-UUID-shaped strings through
unchanged (instead of raising), so test mocks that use placeholder
strings like ``"dashboard-uuid-7"`` keep working — the SQL bind layer
will surface a clearer error if such a value is ever written to the
DB.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``sqlalchemy_utils.UUIDType`` only coerces in ``process_bind_param`` /
``process_result_value`` — string-to-UUID conversion happens on SQL
write / read, not on Python attribute assignment. Importers and ad-hoc
construction (e.g., ``SqlMetric(uuid="00000000-...")`` in the
``test_import_dataset`` unit test) leave the in-memory attribute as a
``str`` until the next DB round-trip refreshes it.
For non-versioned mappers a post-INSERT attribute expiration usually
refreshes the value before the caller reads it. With SQLAlchemy-
Continuum versioning on a child mapper (``TableColumn``, ``SqlMetric``),
the expire-on-INSERT behaviour shifts enough that the assertion
``metric.uuid == uuid.UUID(...)`` fails with a string-vs-UUID inequality.
Bisection confirmed: removing ``__versioned__`` from either child
restores the implicit coercion path; adding it reproduces the failure.
Rather than chase the Continuum/SA flush interaction, add a defensive
``@validates("uuid")`` to ``UUIDMixin`` so the attribute is always a
``UUID`` at runtime regardless of where the value came from. This
aligns the in-memory shape with what ``UUIDType`` would emit on a
fresh DB read, and the coercion is the same one the SQL layer would
apply on the next bind anyway.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``_force_parent_dirty_on_child_change`` calls ``flag_modified`` on a
versioned parent so Continuum records a parent shadow row whenever a
versioned child changes. The resulting UPDATE on ``tables`` /
``dashboards`` / ``slices`` inherits ``AuditMixin``'s
``onupdate=get_user_id`` on ``changed_by_fk`` and ``onupdate=datetime.now``
on ``changed_on`` — SQLAlchemy calls ``get_user_id()`` at flush time
and stamps whoever ``g.user`` resolves to.
When the flush is autoflush-triggered during a *previous* test's
teardown (or the fixture cleanup between two tests), ``g.user`` can
still point at a user that's already been deleted from ``ab_user``.
The parent UPDATE then fails the FK to ``ab_user``, surfacing as
``IntegrityError: FOREIGN KEY constraint failed`` /
``ForeignKeyViolation``. In CI's full-suite ordering this fires from
two distinct places — ``test_rls_filter_alters_no_role_user_birth_names_query``
teardown and ``TestDatasetRestoreApi::test_restore_applies_scalar_field``
mid-restore (the latter returning 422 "Dataset could not be updated.").
Fix: also ``flag_modified`` ``changed_by_fk`` and ``changed_on`` on
the parent. The columns now have dirty attribute history, so
SQLAlchemy uses the in-memory (previously-committed, valid) values
instead of invoking ``onupdate``. The synthetic parent UPDATE carries
the existing audit values; the FK violation goes away. The behaviour
on a real user-driven save is unchanged — those code paths go through
the normal write path with a live ``g.user`` and the audit columns
update as before.
Extracted ``_pin_audit_columns`` helper to keep the caller under
ruff's ``C901`` complexity ceiling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI's ruff-format would collapse the two-clause if expression onto a
single line — local ruff at v0.9.7 reformats the same way. Pure
formatting, no behavior change. Counterpart of ``6de8811873``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same fix as ``7caf9862be`` (the dashboard restore test), now applied to
the dataset restore test. CI's
integration DB accumulates Continuum shadow rows across fixture teardown
cycles, and ``TestDatasetRestoreApi::test_restore_applies_scalar_field``
picks the first row whose ``description`` matches the original. In CI
that match can land on an ``operation_type=2`` (DELETE) row from a prior
test run; Continuum's ``revert()`` then restores the dataset to its
deleted state and the restore API returns 422 "Dataset could not be
updated."
Filter ``operation_type != 2`` so the test only ever targets INSERT or
UPDATE shadow rows.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SIP's *Authorization layering* commitment promises that all nine
version endpoints enforce both model-level access (``@protect()``) AND
row-level access (``security_manager.raise_for_ownership(entity)``).
Restore endpoints already do this via ``BaseRestoreVersionCommand.validate``,
but ``list_versions`` and ``get_version`` (six endpoints across chart,
dashboard, and dataset) skipped the row-level check — a non-owning user
with ``can_write`` on the model could list or read the shadow rows of
any entity they could reach by UUID.
Each affected handler now looks up the entity via
``VersionDAO.find_active_by_uuid`` (which already runs inside the DAO),
returns 404 if it's gone, then calls ``raise_for_ownership`` and maps a
``SupersetSecurityException`` to 403. Admins bypass automatically via
the existing FAB mechanism. The extra lookup filters on ``uuid`` (a
unique non-PK column), so it is one additional SELECT per call rather
than an identity-map hit.
Tests:
- Add ``test_list_versions_denies_non_owner`` (chart, dashboard, dataset)
and ``test_restore_denies_non_owner`` (chart) — Alpha has ``can_write``
on each model but isn't an owner of the admin-owned fixture, so the
row-level branch is the only thing that can reject. The dataset row-
level test is alongside the existing model-level Gamma test, which
only exercises the ``@protect()`` denial path.
- Unskip the three placeholder tests on chart and dashboard that were
blocked waiting for this gap to close (``@pytest.mark.skip(reason="...
no built-in no-write user to exercise the 403 branch...")``).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pairs with ``25f1922b2d temp(versioning): demo version-history dropdowns``,
which adds ``uuid`` to the dashboard-list page's ``select_columns`` so the
per-row VersionHistoryDropdown can call ``/api/v1/dashboard/<uuid>/versions/``.
``DashboardList.test.tsx`` asserts the list-fetch URL via
``toMatchInlineSnapshot``, so the snapshot needs the new column. Revert
this entry along with the dropdown component (same lifecycle as the temp
parent commit).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The integration SQLite DB accumulates Continuum shadow rows across
fixture teardown cycles. ``TestDashboardRestoreApi::test_restore_applies_scalar_field``
picks the first version row whose snapshot matches the original title,
and that match can land on an ``operation_type=2`` (DELETE) row left
over from a prior test run. Continuum's ``revert()`` then dutifully
restores the entity to its deleted state — the dashboard row is gone
after the restore returns 200, and the post-restore re-query fails
with ``NoResultFound``.
Filter ``operation_type != 2`` so the test only ever targets
INSERT or UPDATE shadow rows, which is also what a real user would
pick from the version-history dropdown.
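The selection rule can be sketched over plain dicts. ``pick_restore_target`` and the row shape are hypothetical; the value 2 for DELETE follows the ``operation_type=2`` reading given above.

```python
from typing import Optional

DELETE = 2  # Continuum operation_type for a DELETE shadow row


def pick_restore_target(rows: list[dict], original_title: str) -> Optional[dict]:
    """Return the first non-DELETE shadow row matching the original title."""
    for row in rows:
        if row["operation_type"] != DELETE and row["dashboard_title"] == original_title:
            return row
    return None
```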
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous fix (9c2391d1b4) gated the force-parent-dirty hook on
is_modified(child) for ALL session collections (dirty/new/deleted).
That was over-restrictive: is_modified checks attribute history,
and deletion is a state transition with no attribute history —
so deleted children evaluated as not-modified and the parent
wasn't flagged. The change-records listener then didn't see the
deletion and no removal record was emitted.
Symptom: test_restore_emits_full_child_diff_in_one_transaction
failed expecting a column-removed change record after a restore
that removed the column; instead only the parent's scalar fields
appeared in observed paths.
Refine: apply the is_modified filter ONLY to persistent rows in
session.dirty. session.new (creation) and session.deleted
(removal) are always real content changes by virtue of their
session-collection membership — no is_modified check needed (and
in deletion's case, the check returns the wrong answer).
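The refined gating can be sketched as a pure function over the three session collections. ``children_forcing_parent_dirty`` is a hypothetical name, and ``is_modified`` is passed in as a callable rather than imported from Continuum.

```python
from typing import Any, Callable, Iterable


def children_forcing_parent_dirty(
    dirty: Iterable[Any],
    new: Iterable[Any],
    deleted: Iterable[Any],
    is_modified: Callable[[Any], bool],
) -> list[Any]:
    """Return the children whose presence should flag their parent."""
    # Persistent rows in session.dirty need real attribute-history changes.
    forcing = [child for child in dirty if is_modified(child)]
    # Creation and removal always count; no is_modified check is applied.
    forcing.extend(new)
    forcing.extend(deleted)
    return forcing
```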
Pre-commit (previous) flagged I001 unsorted-imports on the
backward-compat façade. Two queries imports merged into one
block (the aliased ``derive_version_uuid as _derive_version_uuid``
moves inline rather than living in its own block), and the
restore-side names sorted: ``_RESTORE_RELATIONS``,
``_stamp_audit_fields_for_restore``, ``restore_version``.
Pure mechanical reformatting; no behaviour change.
_force_parent_dirty_on_child_change was firing whenever ANY
TableColumn or SqlMetric of the parent appeared in
session.dirty / new / deleted — even when the child was there for
non-content reasons:
- Lazy-load side effects when a relationship is touched
- M2M relationship-cascade artifacts (e.g. RLS setUp doing
rls_entry.tables.extend([dataset]) triggers cascade behavior
that pulls children into the session)
- AuditMixin auto-bumps from earlier code paths
- Reverter side passes during restore
Force-touching the parent in those cases produced an incidental
UPDATE tables SET description=…, changed_on=…, changed_by_fk=…
whose changed_by_fk value or autoflush ordering tripped FK
integrity on some dialects. Symptoms:
- test_rls_filter_alters_no_role_user_birth_names_query → FK
IntegrityError on autoflush during a query
- test_restore_applies_scalar_field → 422 "Dataset could not be
updated" during restore
Fix: gate on Continuum's is_modified(child), which returns True
only when a non-excluded versioned column on the child has
SQLAlchemy attribute-history changes. New objects (session.new)
and genuinely-modified rows still flag the parent; phantom-dirty
rows do not.
The intended hook semantics — "child edit forces a parent shadow
row" — are preserved: a column-description edit through the
dataset API still triggers is_modified True, still flags the
parent. See test_dataset_column_edit_creates_parent_version.
flag_modified(parent, "uuid") was producing FK integrity failures via
the column's BLOB/BINARY round-trip: SQLAlchemy logs the param as
``<memory at 0x…>`` and the UUID round-trip doesn't always match the
in-memory value byte-for-byte. Symptom: in scenarios where the parent
is already going to flush (Reverter applying historical state during
restore, RLS test triggering autoflush during a query), our added
``uuid`` UPDATE column tripped the FK check.
Pick ``description`` instead — plain Text column on all three
versioned parent classes (Dashboard, Slice, SqlaTable), no
TypeDecorator, no marshaling layer. Flagging it round-trips its
current value safely. Fallback chain ``description → uuid → col_keys[0]``
keeps the original deterministic-pick property for forks/subclasses
that excluded ``description``.
Should unblock test_restore_applies_scalar_field and the
test_rls_filter_alters_no_role_user_birth_names_query autoflush
error.
- factory.py: TID251 banned ``import json``; switch to
``from superset.utils import json`` (project convention).
- factory.py: ruff format reflow on _matches_previous_version.
- version_restore.py: ruff format collapse on restore_version call.
CI was pinning a different ruff version than my local uvx default;
re-ran against ruff==0.9.7 (the version in requirements/development.txt)
which surfaced these.
The previous attempt (d0520f6766) was too aggressive: skipping when
parent is in session.dirty/new/deleted bypassed the
persistent-and-clean case the hook EXISTS for. Some upstream code
paths put the dataset in session.dirty *before* this listener fires
(API controllers touching audit fields, etc.), so the
session-membership pre-check made us silently no-op on the very
scenario the hook needs to handle. CI symptom:
test_dataset_column_edit_creates_parent_version showed before=317,
after=317 (parent shadow not written).
Restore the unconditional flag_modified and catch the specific
InvalidRequestError that fires only for the session.new case
(uuid default callable hasn't populated state yet). Other states
fall through to the original behavior:
- persistent + clean → flag_modified succeeds, parent goes dirty,
Continuum picks it up, SkipUnmodifiedPlugin keeps the row via
_has_dirty_versioned_children. ✓
- persistent + dirty → flag_modified is harmless (already dirty).
- session.new → InvalidRequestError, skip (parent INSERTs anyway).
- session.deleted → flag_modified may or may not raise; if it does,
we skip; if not, the delete dominates.
Should unblock test_dataset_column_edit_creates_parent_version,
test_get_version_returns_historical_snapshot_with_children, and
test_restore_with_column_edits_reverts_columns.
- ruff: import sort + E501 reflow on the parent-state guard in
baseline.py
- ruff format: function-signature collapse and join-chain reflow in
queries.py
- auto-walrus: two ``entity_kind = …; if … is not None:`` patterns
in queries.py converted to assignment-expressions
When one ORM flush touches multiple versioned entities (dashboard +
slice + dataset all save at tx=X), each gets a shadow row sharing
that tx. If only the dashboard is later edited at tx=Y, the
dashboard row at tx=X is closed (end_tx=Y) while slice/dataset rows
stay live at tx=X. Retention then preserves tx=X (slice/dataset are
live there) and prunes tx=Y. The dashboard's closed row at tx=X
survives step 1, then its end_transaction_id=Y trips the FK when
step 2 deletes version_transaction row Y.
Fix: extend the shadow-row delete to also match end_transaction_id
IN tx_ids. Live rows have end_tx=NULL so they're never matched by
either predicate. Closed rows that touch a pruned tx at either
endpoint are pruned together — consistent with retention semantics
(any tx in the row's lifespan is gone, so the row's chain is broken
anyway).
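The failure and the fix can be reproduced on a toy schema with the standard-library ``sqlite3`` module. The two-table schema, the helper names, and the single-id ``prune`` (instead of ``IN tx_ids``) are simplifications assumed for this sketch, not the real retention code.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Toy schema: one shadow table with FKs on both transaction columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE version_transaction (id INTEGER PRIMARY KEY);
        CREATE TABLE dashboards_version (
            id INTEGER NOT NULL,
            transaction_id INTEGER NOT NULL REFERENCES version_transaction (id),
            end_transaction_id INTEGER REFERENCES version_transaction (id)
        );
        INSERT INTO version_transaction (id) VALUES (1), (2), (3);
        INSERT INTO dashboards_version VALUES (10, 1, 2);    -- closed by tx 2
        INSERT INTO dashboards_version VALUES (10, 2, 3);    -- closed by tx 3
        INSERT INTO dashboards_version VALUES (10, 3, NULL); -- live
        """
    )
    return conn


def prune(conn: sqlite3.Connection, tx_id: int, match_end_tx: bool) -> None:
    """Step 1 deletes shadow rows, step 2 deletes the transaction row."""
    where = "transaction_id = ?"
    params = [tx_id]
    if match_end_tx:
        # The fix: also match rows that were closed by the pruned transaction.
        where += " OR end_transaction_id = ?"
        params.append(tx_id)
    try:
        conn.execute(f"DELETE FROM dashboards_version WHERE {where}", params)
        conn.execute("DELETE FROM version_transaction WHERE id = ?", (tx_id,))
        conn.commit()
    except sqlite3.IntegrityError:
        conn.rollback()
        raise
```

With ``match_end_tx=False``, pruning transaction 2 raises ``sqlite3.IntegrityError`` because the row closed by transaction 2 survives step 1. With ``match_end_tx=True`` both closed rows go and only the live row remains.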
Unblocks test_retention_prunes_old_rows on sqlite, mysql, postgres.
The force-parent-dirty listener was calling attributes.flag_modified
on every parent reachable from a dirty child — including parents
themselves in session.new (e.g. brand-new SqlaTable + brand-new
TableColumns from POST /api/v1/dataset/). flag_modified rejects
unloaded attributes, and a session.new SqlaTable's uuid (default=uuid4
fires at flush time) is unloaded until then. CI caught this with
InvalidRequestError cascading into 422s across dataset creation /
upload / Playwright dataset specs.
The hook is only needed for the persistent-and-clean case (child
edited, parent's own scalars untouched, dropdown otherwise empty).
Anything in session.new will flush anyway; anything in session.dirty
is already flagged; session.deleted shouldn't be touched. Short-
circuit before the flag_modified call.
Unblocks test-sqlite, test-mysql, test-postgres (previous), and
playwright dataset specs.
Three small follow-ups surfaced by aminghadersohi's review of the
SoftDeleteMixin PR (#39977) that apply equally here:
- H1: cache _child_to_parent_registry() with functools.cache. Called
twice per save flush; mapping depends only on import-time model
classes, so unbounded cache is the right shape (no invalidation).
- M5: tighten _CHILD_BASELINE_HANDLERS type from dict[str, Any] to
dict[str, Callable[[Session, Any, int], None]] via a named alias.
Mypy now catches a future broken handler signature.
- M3/M4: explain the inline-import pattern once in the module
docstrings of baseline.py and changes.py. Both modules use
pylint disable=import-outside-toplevel uniformly because they
load during init_versioning() before mappers are configured;
the per-callsite "why" comments would just repeat the same
reason. Module-level explanation + a hint to comment unusual
cases is the cleaner shape.
M6 (listener placement) doesn't apply — init_versioning() already
runs inside init_app_in_ctx(). M8 (loose OpenAPI schema in
*/api.py docstrings) is real but its own change.
Extends the existing docstring note ("the orphan is swept by retention")
with the reasoning behind not cleaning it up in the same flush. The
inline-delete is appealing in principle but would couple this plugin
to the change-records listener's buffer state via the ON DELETE
CASCADE on ``version_changes.transaction_id``: both listeners would
have to agree that the flush produced nothing before the version_transaction
row could be dropped safely. The orphan's ~40-byte storage cost +
retention's correct-by-construction handling (orphans have no parent
shadow, so they're never in the "preserve" set) make the coordination
overhead not worth it.
Captures the design decision in the file where the next reader will
look for it.
Pure file shuffle, zero behaviour change. Reorders ``baseline.py`` so it
reads top-down by level of abstraction (newspaper-article rule): the
public entry point at the top, supporting helpers descending below.
Before: 14 private helpers, then ``register_baseline_listener`` at the
bottom. A reader opening the file met the leaf builders first and had
to accumulate context before finding the call site.
After (top-down):
- Entry point: ``register_baseline_listener`` + inner ``capture_baseline``
- High-level helpers used by ``capture_baseline``:
``_force_parent_dirty_on_child_change``,
``_collect_parents_to_baseline``,
``_child_to_parent_registry``,
``_version_table_for``,
``_shadow_row_count``,
``_insert_baseline_and_children``
- Mid-level builders:
``_insert_baseline_row``,
``_baseline_children_for_parent``
- Per-entity child handlers + their dispatch table:
``_baseline_dataset_children``,
``_baseline_dashboard_children``,
``_CHILD_BASELINE_HANDLERS``
- Leaf builders:
``_insert_child_baseline_rows``,
``_baseline_attached_slices``,
``_insert_synthetic_slice_baseline``
Three section-divider comments mark the abstraction levels. The
``_CHILD_BASELINE_HANDLERS`` dict literal stays after its referenced
handlers (module-level literals evaluate at import time and need names
already bound); a comment now flags this constraint.
Function bodies are byte-for-byte unchanged; ``git log -L`` on any
function shows only its relocation. 96 unit tests pass.
baseline.py:_insert_baseline_row and changes.py:_read_pre_state both
issued the same "read a single row through ``session.connection()``
inside ``with session.no_autoflush:``" pattern. Same five-line block,
same intent ("read the pre-flush state without triggering the in-flight
edit's flush").
Promoted to ``superset.versioning.utils.read_row_outside_flush(session,
table, entity_id)``. Companion to ``single_flush_scope`` — they sit
next to each other in utils.py and frame the two directions of the
"don't autoflush mid-listener" pattern.
Returns ``dict[str, Any]`` (or ``None``) so callers can't accidentally
hold a cursor-bound ``RowMapping`` past the listener boundary. Both
call sites get shorter by ~5 lines.
Also picks up Decimal stringification in the changes.py docstring
update (was listed in the W4 commit but the docstring still said
"(datetime, UUID, bytes)" — now matches the implementation).
Behaviour unchanged. 96 unit tests pass.
After the SRP split (8c9cf36) put both functions in the same module
~150 lines apart, their overlap became visible: same JOIN of
version_table → version_transaction → ab_user, same baseline-first
ordering, same user-row → ``changed_by`` projection, same lookup
``_ENTITY_KIND_BY_CLASS_NAME.get(model_cls.__name__)``. About 30 lines
of duplication.
Six small helpers extracted at the module top:
- ``_resolve_version_tables(model_cls)`` returns ``(ver_tbl, tx_tbl, user_tbl)``
- ``_version_with_tx_user_join(ver_tbl, tx_tbl, user_tbl)`` builds the join
- ``_baseline_first_ordering(ver_tbl)`` returns the order-by tuple
- ``_user_select_cols(user_tbl)`` returns the user-column list with
``user_id`` as the stable label (normalises the prior asymmetry
where ``list_versions`` labelled it ``user_id`` and ``get_version``
labelled it ``_user_id`` to dodge a column-name collision — the
``user_id`` label collides with neither)
- ``_changed_by_from_row(row)`` projects user columns onto the API shape
- ``_entity_kind_for(model_cls)`` resolves the change-records taxonomy lookup
Both call sites get shorter and read as what they do (build query / project
user / build row) rather than how. Behavior unchanged; no test changes.
Also two small inline tidyings while in the file:
- Replace the ternary
``changes_by_tx = list_change_records_batch(...) if entity_kind else {}``
with an explicit two-line if-statement in both functions. The ternary
buries the decision; the if-statement reads as one thought.
- Inline the one-shot ``meta_cols`` set declaration in ``get_version``
into the ``if col.name in {...}`` check that uses it three lines later.
Net: about 110 lines → about 80 lines across the two functions, plus
a small helper section at the top.
Cleanup pass from the SQLAlchemy + migration code review. Eight items,
all in the "warnings / suggestions" tier — no behaviour change visible
to the API, but each closes a real correctness, perf, or maintainability
concern surfaced in review.
baseline.py
- Delete unused ``_get_user_id`` (W1). The function wrapped a broad
``except Exception: # noqa: S110`` swallow that hid bugs; grep
confirmed no callers anywhere. The legitimate audit-field paths
(``row.get("changed_by_fk")`` etc.) already drive the
``version_transaction.user_id`` write.
- Batch ``_baseline_attached_slices`` from O(N) round-trips to
three queries (W2): one membership SELECT, one existing-shadow
SELECT, one bulk live-row SELECT for the missing ids. The previous
per-slice ``COUNT(*)`` + ``SELECT`` was a measurable first-save
hotspot on dashboards with many charts. Drops the now-unused
``_slice_has_shadow`` helper.
- Pick a stable column name for ``flag_modified`` in
``_force_parent_dirty_on_child_change`` (W3). ``uuid`` is on all
three versioned parent classes and excluded by none, so the
flagged attribute is deterministic across SQLAlchemy versions /
mapper-config orders instead of depending on
``versioned_column_properties(parent)[0]``. Falls back to the
first available column for forks that exclude ``uuid``.
changes.py
- Add ``Decimal`` handling to ``_jsonable`` (W4) — ``json.dumps``
rejects ``Decimal``, so any numeric column (e.g. ``SqlMetric.currency``
contents, or fork/plugin Decimal columns) would crash the bulk
insert. Stringify rather than ``float()`` to preserve precision;
the diff engine compares ``from_value`` / ``to_value`` by string
equality after this coercion so both sides round-trip identically.
queries.py
- Promote the inline ``{0: "baseline", 1: "update", 2: "delete"}``
dict to module-level ``_OP_TYPE_LABELS`` (W7). The literal was
duplicated across ``list_versions`` and ``get_version``; the third
caller is one bug fix away.
- Comment on ``resolve_version_uuid``'s Python-side ``derive_version_uuid``
loop (W8) — no portable SQL form for UUIDv5 across PostgreSQL /
MySQL / SQLite, iteration count is bounded by the retention
window. Flags the place to revisit if retention is ever disabled
(``=0``) on a heavily-edited entity.
migrations/2026-05-01_23-36 (composite-PK)
- Belt-and-braces guard in ``_downgrade_mysql_table`` (W6): asserts
``t.name in AFFECTED_TABLES`` before interpolating into the
backtick-quoted ALTER statements. The invariant was already
structurally implied (callers iterate ``AFFECTED_TABLES``), but
making it load-bearing means a future refactor can't slip an
arbitrary table name through.
(W5 was verified-no-change: grepped ``tests/`` for ``metadata.create_all``
callers that exercise versioning tables; none. The cascade-FK
gap on ``version_changes.transaction_id`` is already documented
in ``tests/integration_tests/versioning/change_records_tests.py:27-32``.)
62 versioning unit tests pass.
VersionDAO carried five distinct concerns under one class — UUID
derivation, version metadata queries, change-record loading,
single-version snapshot retrieval, and restore orchestration. Bob's
"and" test (the clean-code review flagged this as the next structural
fix after the dead-code purge) gives ~600 lines of "queries about
versioned state of one entity AND the workflow that mutates it."
Splits the read and write sides into purpose-built modules:
- ``superset/versioning/queries.py`` — UUID derivation
(``VERSION_UUID_NAMESPACE``, ``derive_version_uuid``) + read-side
helpers (``find_active_by_uuid``, ``current_version_number``,
``current_live_transaction_id``, ``current_live_version_uuid``,
``list_versions``, ``resolve_version_uuid``, ``get_version``,
``list_change_records_batch``). ~475 lines.
- ``superset/versioning/restore.py`` — write-side (``restore_version``,
``_stamp_audit_fields_for_restore``, ``_RESTORE_RELATIONS``).
~140 lines. Depends only on ``queries.find_active_by_uuid`` and
``utils.single_flush_scope``.
- ``superset/daos/version.py`` — collapsed to an ~85-line backward-compat
façade that re-exports both modules under a single ``VersionDAO``
class via ``staticmethod`` aliases. The module also re-exports
``VERSION_UUID_NAMESPACE`` and ``derive_version_uuid`` at module level
so the ~10 existing callers (api.py handlers, command classes, the
ETag emitter, integration tests) don't have to change their imports.
New code is encouraged to import from the sub-modules directly.
The functions themselves are unchanged byte-for-byte aside from
internal call sites being rewritten from ``VersionDAO.foo`` to the bare
function name (since they now live as module-level functions, not
class methods).
One unit-test mock target moved: ``test_restore_version_returns_none_for_unknown_entity``
now patches ``superset.versioning.restore.find_active_by_uuid`` (the
actual call site) instead of ``VersionDAO.find_active_by_uuid`` (which
is now just an alias).
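Why the mock target had to move can be seen in a single-file sketch of the façade. The function bodies are stand-ins, and the real functions live in separate modules; only the ``staticmethod`` alias pattern is taken from this commit.

```python
from typing import Any, Optional


def find_active_by_uuid(entity_uuid: str) -> Optional[dict[str, Any]]:
    """Stand-in for the read-side helper."""
    return {"uuid": entity_uuid}


def restore_version(entity_uuid: str) -> Optional[dict[str, Any]]:
    # Resolves the helper by its module-level name at call time,
    # not through the VersionDAO alias.
    return find_active_by_uuid(entity_uuid)


class VersionDAO:
    """Backward-compat façade: old call sites keep using VersionDAO.foo(...)."""

    # staticmethod() stops the alias from being bound as an instance method.
    find_active_by_uuid = staticmethod(find_active_by_uuid)
    restore_version = staticmethod(restore_version)
```

Replacing ``VersionDAO.find_active_by_uuid`` in a test leaves ``restore_version`` untouched, because the function never goes through the class attribute.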
Each of the three modules now has one reason to change. When the
sc-103157 soft-delete pass adds the ``deleted_at IS NULL`` filter to
``find_active_by_uuid``, it touches only ``queries.py``. When a
per-entity-type restore Strategy replaces the string-keyed
``_RESTORE_RELATIONS`` dispatch, it touches only ``restore.py``.
DashboardList demo dropdown previously instructed the user to "Reload
the page to see the change" after a restore. The URL the user
returns to may still carry ``?native_filters_key=…`` /
``permalink_key`` / ``form_data_key`` from a prior session — those
point at server-cached snapshots (in ``key_value`` and the
filter-state cache) captured before the restore. On rehydration the
cached state is merged on top of the restored ``json_metadata``,
masking the rollback (e.g. dashboard-level colour-scheme restore
appears not to take effect).
Replaces the alert + manual reload with a direct ``window.location.href``
navigation to ``/superset/dashboard/<uuid>/`` — drops all URL params,
forcing hydration from the freshly restored DB state.
Also regenerates ``package-lock.json`` to pick up the ``zod 4.4.1 →
4.4.3`` bump that master's ``package.json`` already reflects.
(``temp(versioning)`` prefix per the demo dropdown's status — this
file is not part of V1 scope per ADR-005; the V2 UI SIP owns the
actual restore UI surface.)
Two coupled clean-code review fixes:
(1) Rename ``VersionDAO._find_active_entity_by_uuid`` →
``find_active_by_uuid``. The leading-underscore + three
``# pylint: disable=protected-access`` suppressions in the restore
commands were the smell of a wrongly-private API. The method is a
perfectly reasonable public DAO operation; dropping the underscore
removes the suppressions.
(2) Collapse ``RestoreChartVersionCommand``, ``RestoreDashboardVersionCommand``,
``RestoreDatasetVersionCommand`` onto a shared
``BaseRestoreVersionCommand`` (``superset/commands/version_restore.py``).
The three classes were textbook copy-paste — identical except for
the model class and three exception types. Each subclass now declares
``model_cls`` + ``not_found_exc`` + ``forbidden_exc`` and overrides
``run()`` with one ``@transaction(reraise=<failed_exc>)``-decorated
line delegating to ``self._do_restore()``. ~80 lines per file →
~45 lines per file; one shared workflow instead of three drift sources.
The api.py imports of ``RestoreChartVersionCommand`` /
``RestoreDashboardVersionCommand`` / ``RestoreDatasetVersionCommand`` are
unchanged — public class names preserved.
The full-Continuum spike (ADR-004 revised) replaced the JSON-snapshot
restore path with Continuum's native Reverter and removed the
``dataset_snapshots`` / ``dashboard_snapshots`` tables from the
migration chain. Seven VersionDAO methods and two module-level
helpers that read/wrote those tables stayed in the code anyway and
went unused — dead code that looked live.
Worse, ``VersionDAO.get_version`` still read from
``dataset_snapshots`` in its SqlaTable branch. On any environment
where the snapshot tables don't exist (current production behavior),
``GET /api/v1/dataset/<uuid>/versions/<version_uuid>/`` raised
``OperationalError``. The branch is rewritten to read column and
metric state from Continuum's child shadow tables
(``table_columns_version`` / ``sql_metrics_version``) via the
existing ``_shadow_rows_valid_at`` helper.
Deleted:
- ``_deserialize_snapshot_value`` (module helper)
- ``_coerce_snapshot_list`` (module helper)
- ``RESTORE_EXCLUDE_FIELDS`` (constant — only referenced by deleted code
and a docstring)
- ``VersionDAO._restore_dataset_children``
- ``VersionDAO._parse_slice_ids_json``
- ``VersionDAO._apply_dashboard_slices``
- ``VersionDAO._restore_dashboard_children``
- ``VersionDAO._apply_snapshot_children``
The corresponding ~17 unit tests in
``tests/unit_tests/daos/test_version_dao.py`` are removed alongside.
Stale docstring references in ``versioning/changes.py`` and
``versioning/diff.py`` that pointed at the retired snapshot tables are
also cleaned up.
Also strips an 8-line comment block in ``restore_version`` that
duplicated the docstring of ``_stamp_audit_fields_for_restore``.
Net: −290 lines from ``daos/version.py``; a production-shape bug
fixed; dead code that looked live is gone.
VersionDAO.restore_version previously called Continuum's Reverter
once per relation in a split-revert loop with flush + expire between
calls. That closed an autoflush race in the Reverter when multiple
relations were reverted at once, but split one logical restore across
multiple Continuum transactions — and once the change-records listener
was wired up, the listener's tx-dedup guard skipped the second pass,
silently dropping child-addition records from version_changes. A
restore that re-added a calculated column would render as an empty
"Baseline" entry in the dropdown.
Replaces the split-revert with a single ``target_version.revert(relations=relations)``
call wrapped in a new ``single_flush_scope(db.session)`` context
manager (``superset/versioning/utils.py``). The context manager
suppresses autoflush inside the block and issues one trailing flush
on clean exit; on exception, the trailing flush is skipped so the
session's normal rollback path handles cleanup. The same autoflush window
is closed with one Continuum transaction instead of N, and the change-records
listener sees the complete shadow state in one after_flush pass.
The wrapper carries the full autoflush-race / cascade-add rationale
in its docstring so the restore_version call site can be a short
6-line block referencing it.
Integration coverage: ``test_restore_emits_full_child_diff_in_one_transaction``.
SQLAlchemy doesn't mark a parent as dirty when only its children
(``TableColumn`` / ``SqlMetric`` on ``SqlaTable``) are modified.
Continuum's UnitOfWork only creates operations for entities in
``session.dirty``, so a column-only edit produces shadow rows in
``table_columns_version`` but no parent shadow row in
``tables_version``. ``VersionDAO.list_versions`` queries the parent
shadow, so the version dropdown is empty for child-only saves —
exactly the failure mode reported as "I edited a column description
but no version appeared."
Extends ``register_baseline_listener`` with a new before-flush hook
``_force_parent_dirty_on_child_change`` that walks the existing
``_child_to_parent_registry`` and ``attributes.flag_modified(parent,
<first non-excluded versioned column>)`` whenever a versioned child
is dirty / new / deleted but the parent's own scalars haven't been
touched. The flag puts the parent in ``session.dirty`` so Continuum's
UoW creates a parent UPDATE operation; the resulting shadow row's
scalar columns mirror the previous version (only the children
actually changed), and the row exists to anchor the transaction in
the parent's version chain.
``SkipUnmodifiedPlugin._is_no_op_update`` was updated in this commit's
predecessor to recognize the "scalars match but children dirty" case
via ``_has_dirty_versioned_children`` so the forced parent UPDATE
isn't skipped.
Integration coverage: ``test_dataset_column_edit_creates_parent_version``.
Continuum's no-op suppression compared post-flush column values
byte-for-byte against the previous live shadow row. For
``Dashboard.json_metadata`` that produced false-positive version rows
on saves where the user authored nothing — the frontend re-stamps
``map_label_colors`` (regenerated from the ``LabelsColorMap``
singleton) on every save, plus ``chart_configuration`` /
``global_chart_configuration`` / ``show_chart_timestamps`` /
``color_namespace`` (derived from the current chart set), so two
consecutive identical saves produce different bytes for the column.
The diff engine already excluded those keys via
``DASHBOARD_JSON_METADATA_AUDIT_KEYS`` when computing change records;
the skip-plugin diverged.
Adds a ``_COLUMN_NORMALIZERS`` registry keyed on
``(class_name, column_name)`` that maps to a per-column normalizer
applied to both pre- and post-image before equating. The first
entry parses ``Dashboard.json_metadata`` as JSON and drops the
audit-key set before comparing. The same registry is the extension
point for analogous transient fields on charts and datasets.
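A sketch of how such a registry might look. The audit keys are the ones named above; the function names, the fallbacks for empty or unparseable values, and the ``columns_equal`` wrapper are assumptions made for illustration.

```python
import json
from typing import Any, Callable, Optional

DASHBOARD_JSON_METADATA_AUDIT_KEYS = frozenset(
    {
        "map_label_colors",
        "chart_configuration",
        "global_chart_configuration",
        "show_chart_timestamps",
        "color_namespace",
    }
)


def _normalize_dashboard_json_metadata(raw: Optional[str]) -> Any:
    """Parse the column and drop frontend-stamped keys before comparing."""
    if not raw:
        return {}  # assumption: None and "" both mean "no metadata"
    try:
        parsed = json.loads(raw)
    except ValueError:
        return raw  # unparseable: fall back to comparing the raw text
    if not isinstance(parsed, dict):
        return parsed
    return {
        key: value
        for key, value in parsed.items()
        if key not in DASHBOARD_JSON_METADATA_AUDIT_KEYS
    }


_COLUMN_NORMALIZERS: dict[tuple[str, str], Callable[[Any], Any]] = {
    ("Dashboard", "json_metadata"): _normalize_dashboard_json_metadata,
}


def columns_equal(class_name: str, column_name: str, before: Any, after: Any) -> bool:
    """Compare pre- and post-image after applying the column's normalizer, if any."""
    normalize = _COLUMN_NORMALIZERS.get((class_name, column_name), lambda v: v)
    return normalize(before) == normalize(after)
```

Comparing parsed dicts also makes key order irrelevant, so a re-serialised but otherwise identical payload counts as unchanged.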
Promotes ``_DASHBOARD_JSON_METADATA_AUDIT_KEYS`` to a public name
(``DASHBOARD_JSON_METADATA_AUDIT_KEYS``) so the skip-plugin can import
it from ``superset.versioning.diff`` without reaching across a
leading-underscore boundary.
Integration coverage: ``test_map_label_colors_only_change_does_not_create_version``.
The v1 import pipeline previously wrote dashboard ↔ chart membership
via raw Core DML (``db.session.execute(delete(dashboard_slices)…)`` +
``db.session.execute(insert(dashboard_slices)…)``). With Continuum's
M2M tracker enabled by the versioning feature, those Core writes
emit malformed shadow INSERTs into ``dashboard_slices_version`` —
the tracker can't see the composite-PK columns through the Core
layer and produces rows with only ``(transaction_id, operation_type)``
populated, triggering a ``NOT NULL`` violation on
``(dashboard_id, slice_id)``.
Rewrites both import paths (``ImportAssetsCommand._import`` in
``commands/importers/v1/assets.py`` and ``ImportDashboardsCommand._import``
in ``commands/dashboard/importers/v1/__init__.py``) to use ORM-level
``dashboard.slices = [...]`` reassignment followed by an explicit
``db.session.flush()``. The explicit flush is necessary to land the
M2M rows before any subsequent autoflush fires an inner-flush event
handler that would reset the relationship change (cf. the SAWarning
``Attribute history events accumulated on N previously clean instances
within inner-flush event handlers have been reset``).
The unit tests previously called ``_import`` directly twice in the same
session — production wraps ``run()`` in ``@transaction`` so each invocation
gets its own DB+Continuum transaction. Added ``db.session.commit()`` between
calls in ``test_import_adds_dashboard_charts``,
``test_import_removes_dashboard_charts``, and
``test_dashboard_import_with_overwrite_replaces_charts`` so the tests
mirror production semantics; otherwise the second call's M2M shadow
inserts conflict with the first call's on
``UNIQUE (dashboard_id, slice_id, transaction_id)``.
Adds debug-only ``VersionHistoryDropdown`` widgets to the chart,
dashboard, and dataset list pages so the version surface can be
exercised from the UI during the spike. Each row's actions column
gets a clock-icon dropdown that fetches ``/api/v1/{resource}/<uuid>/
versions/`` on click, lists the ten most recent versions with a
formatted change-log summary, and offers per-version restore via
``POST .../versions/<uuid>/restore``.
Strings are wrapped in ``t('...')`` with placeholder formatting
(e.g. ``t('Added %(kind)s "%(name)s"', { kind, name })``) so
translators can reorder verbs and nouns rather than concatenating
fragments. ``KIND_LABELS`` is a static map keying English layout
kinds (``chart``, ``row``, ``column``, ``tab``, ``markdown``, etc.)
to ``t(...)``-extractable labels. Empty change lists render as
"Baseline" rather than "No changes recorded" since the empty case
is overwhelmingly the ``operation_type=0`` baseline row.
Locale-aware date rendering: ``new Date(iso).toLocaleString(lang)``
where ``lang`` comes from ``document.documentElement.lang`` (set
by ``src/views/App.tsx`` from the bootstrap ``locale``), so dates
follow the user's chosen Superset locale rather than the browser's.
French translations for the new strings are appended to
``superset/translations/fr/LC_MESSAGES/messages.po`` (Ajouté,
Supprimé, Modifié, Version initiale, kind labels, …). Run
``npm run build-translation`` and ``pybabel compile -l fr`` to
regenerate the JSON / MO packs.
This commit is **demo-only** per ADR-005 (V1 is backend-only). It
is intentionally marked ``temp`` so it can be reverted before the
PR splits — the production V1 ships without UI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in the no-op-suppression behavior implemented by
``SkipUnmodifiedPlugin`` (which lives in ``superset/versioning/factory.py``
shipping with the foundation commit). Five integration tests:
1. Owners-only edit doesn't mint a version row — exercises the
case where every dirty column is an excluded relationship.
2. Re-save with identical scalar values doesn't mint a row —
exercises the json_metadata re-serialise path where
``set_dash_metadata`` rewrites the column to a different byte
sequence with identical parsed content; the plugin must compare
post-flush values against the prior shadow row to detect this.
3. Real scalar change DOES mint a row — guards against the plugin
over-suppressing.
4. Same assertion on a Slice (covers the ``String`` column path on
a different entity type).
5. ``json_metadata`` sub-key edit DOES mint a row — covers the
``MediumText`` column path past the plugin's content-equality
check.
Tests are designed so a column-type change in the parent entities
(e.g. flipping ``json_metadata`` from ``MediumText`` to ``JSON``)
will fail one of these if the plugin's Python ``!=`` comparison
breaks for the new type.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Helper module that derives the strong-validator ``ETag`` value from
an entity's current live ``version_uuid`` and attaches it to a
Flask response. Two functions:
- ``set_version_etag(response, version_uuid)`` — direct path used by
PUT handlers that already compute ``new_version_uuid`` (see the
REST API commit two prior). Cheap; no extra query.
- ``set_version_etag_by_uuid(response, model_cls, entity_uuid)`` —
used by version endpoints that operate on ``entity_uuid``; looks
up ``entity_id`` then derives ``version_uuid`` via ``VersionDAO``.
Costs one extra ``SELECT id WHERE uuid = ?``; documented in the
docstring so callers prefer the cheap variant when they have the
id already.
Integration tests cover all three entity types and four endpoint
shapes (entity GET, save PUT, version-list GET, single-version GET)
plus the entity-with-no-versions edge case (header is correctly
absent).
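The cheap variant reduces to formatting one header. In this sketch the function takes a plain headers mapping instead of a Flask response so it runs without Flask; that signature is an assumption, not the helper's real one.

```python
import uuid
from typing import MutableMapping, Optional


def set_version_etag(
    headers: MutableMapping[str, str], version_uuid: Optional[uuid.UUID]
) -> None:
    """Attach a strong ETag (quoted, no W/ prefix); skip when there is no version."""
    if version_uuid is None:
        # Entity has no versions yet: leave the header absent.
        return
    headers["ETag"] = f'"{version_uuid}"'
```

A strong validator is the quoted value on its own; a weak one would carry a ``W/`` prefix.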
The ETag is wired into the API endpoints in the REST-API commit
(group 3) and the CORS ``expose_headers: ["ETag"]`` ships with the
retention commit (group 4) since both touch ``superset/config.py``.
Locking enforcement (``If-Match`` → 412) is explicitly NOT in this
change — deferred to the follow-up UI SIP per Open Question §7.
``ETag`` is informational in v1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a scheduled Celery task that prunes version history older than
``SUPERSET_VERSION_HISTORY_RETENTION_DAYS`` (default 30; settable
via env var; ``0`` disables retention entirely).
**Task** — ``superset.tasks.version_history_retention.prune_old_versions``:
1. Computes ``cutoff = utcnow() - timedelta(days=N)``.
2. Selects ``version_transaction.id`` rows with ``issued_at <
cutoff`` and filters out any tx whose parent shadow includes a
live row (``end_transaction_id IS NULL``). The live row is the
only preservation rule — closed historical rows including the
baseline (``operation_type=0``) age out. Per-entity minimum-history
floor is an open question tracked in ``future-work.md``.
3. Deletes rows owned by the transactions selected in step 2 from each
parent shadow table (``dashboards_version`` / ``slices_version`` /
``tables_version``).
4. Deletes child-shadow rows for the same transactions
(``table_columns_version`` / ``sql_metrics_version`` /
``dashboard_slices_version``).
5. Drops the ``version_transaction`` rows selected in step 2. The
``version_changes`` rows cascade via the FK from the previous
commit.
Idempotent and safely retried on partial failure.
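Steps 1 and 2 can be sketched over in-memory rows. The function name and the dict-shaped rows are hypothetical, the real task issues SQL instead, and the sketch uses timezone-aware datetimes where the task description says ``utcnow()``.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Optional


def prunable_transaction_ids(
    transactions: list[dict[str, Any]],
    parent_shadow_rows: list[dict[str, Any]],
    retention_days: int,
    now: Optional[datetime] = None,
) -> set[int]:
    """Return tx ids older than the cutoff that own no live parent shadow row."""
    if retention_days <= 0:
        return set()  # 0 disables retention entirely
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # A live row has end_transaction_id IS NULL; its tx is always preserved.
    live_tx_ids = {
        row["transaction_id"]
        for row in parent_shadow_rows
        if row["end_transaction_id"] is None
    }
    return {
        tx["id"]
        for tx in transactions
        if tx["issued_at"] < cutoff and tx["id"] not in live_tx_ids
    }
```

Because the selection depends only on current table contents, running it again after a partial failure yields the transactions still left to delete.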
**Schedule** — ``superset/config.py`` adds the task to the default
``CeleryConfig.beat_schedule`` (nightly at 03:00). Operators who
override ``CeleryConfig`` in their ``superset_config.py`` need to
merge this entry — see UPDATING.md.
Also adds ``"expose_headers": ["ETag"]`` to the default
``CORS_OPTIONS`` so cross-origin browser clients can read the
``ETag`` header introduced in the next commit. (Co-located here
because both touch ``superset/config.py``; the ETag mechanism
itself ships in the next commit.)
**Auto-discovery** — ``superset/tasks/celery_app.py`` adds
``version_history_retention`` to its late-imports so Celery's
auto-discovery picks up the task.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Exposes the version surface as three new endpoints per entity type
(chart, dashboard, dataset), each carrying the standard Superset
decorator stack (``@protect()``, ``@safe``, ``@statsd_metrics``,
``@event_logger.log_this_with_context``) so they appear in FAB's
``action_log`` alongside other audited operations.
| Method | Path | Purpose |
|---|---|---|
| GET | ``/api/v1/{resource}/<uuid>/versions/`` | List version history (oldest-first; per entry: ``version_uuid``, ``version_number``, ``transaction_id``, ``operation_type``, ``issued_at``, ``changed_by``, ``changes`` array) |
| GET | ``/api/v1/{resource}/<uuid>/versions/<version_uuid>/`` | Read-only snapshot of the entity at the requested version (scalar fields plus ``columns`` / ``metrics`` for datasets) |
| POST | ``/api/v1/{resource}/<uuid>/versions/<version_uuid>/restore`` | Replay the snapshot onto the live entity via Continuum's ``Reverter`` (non-destructive — produces a new version row stamping the restoring user via the standard save path) |
``<version_uuid>`` is a deterministic ``UUIDv5(entity_uuid,
transaction_id)`` so it's stable across replicas and retention
pruning. Authorisation reuses the resource's existing ``can_write``
permission; workspace admins can list / restore any entity.
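A hedged sketch of the derivation. The namespace value and the ``"<entity_uuid>:<transaction_id>"`` name format are placeholders assumed here; the source only states that the result is a UUIDv5 over the entity UUID and transaction id.

```python
import uuid

# Placeholder namespace for this sketch, not the project's constant.
VERSION_UUID_NAMESPACE = uuid.UUID("00000000-0000-0000-0000-000000000000")


def derive_version_uuid(entity_uuid: uuid.UUID, transaction_id: int) -> uuid.UUID:
    """Derive a stable version UUID from the entity UUID and transaction id."""
    # uuid5 hashes (namespace, name) with SHA-1, so the same inputs give the
    # same UUID on every replica. The name format is an assumption.
    return uuid.uuid5(VERSION_UUID_NAMESPACE, f"{entity_uuid}:{transaction_id}")
```

Since the inputs are the entity UUID and the transaction id, not a positional version number, pruning older rows does not change the UUID of the versions that remain.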
**Restore commands** — ``superset/commands/{chart,dashboard,dataset}/
restore_version.py`` wrap ``VersionDAO.restore_version`` in the
standard ``@transaction()`` boundary. The command invokes the
``Reverter`` once per related collection (split-revert pattern, with
``flush + expire`` between calls) so a multi-relation restore
doesn't trip Continuum's autoflush race that would otherwise mark
half the collection as ``state.deleted=True`` mid-revert.
**Save responses** — ``PUT /api/v1/{resource}/<pk>`` is updated to
include ``old_version`` / ``new_version`` (0-based numbers),
``old_transaction_id`` / ``new_transaction_id`` (stable across
pruning), and ``old_version_uuid`` / ``new_version_uuid`` body
fields so callers can correlate a save with its resulting version
row. The ``ETag`` response header in the next commit is built on
top of this, but the body fields stay — they predate the header
and remain useful for clients that don't read response headers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a structured per-field change log alongside the foundational
shadow tables. Each save flush emits zero or more ``version_changes``
rows describing what changed relative to the previous version, with
shape ``[{kind, path, from_value, to_value, sequence}]`` keyed to
``version_transaction.id`` (FR-016 .. FR-021).
**Schema** — ``version_changes`` table, FK to ``version_transaction``
with ``ON DELETE CASCADE`` so retention drops dependent records
without explicit cleanup. Composite unique index on
``(transaction_id, entity_kind, entity_id, sequence)`` so the
listener can write monotonically and downstream readers see a
deterministic order.
**Diff engine** (``superset/versioning/diff.py``) — pure-function
diffing of pre-/post-state pairs:
- ``diff_scalar_fields`` for ordinary columns; emits one record per
changed field with JSON-safe ``from_value`` / ``to_value``.
- ``diff_json_field`` for ``json_metadata`` and ``params``, walking
the parsed structure and emitting per-sub-key records. Honours
an ``exclude_keys`` set
(``_DASHBOARD_JSON_METADATA_AUDIT_KEYS``: ``chart_configuration``,
``global_chart_configuration``, ``map_label_colors``,
``show_chart_timestamps``, ``color_namespace``;
``_CHART_PARAMS_AUDIT_KEYS``) so frontend-stamped sub-keys that
mutate on every save don't dominate the change log (FR-022).
- ``diff_dashboard_layout`` walks ``position_json`` structurally
and emits ``[verb, kind, id]`` records (verbs ``add``, ``remove``,
``move``, ``edit``; kinds from a ``CHART``/``ROW``/``COLUMN``/etc.
→ english map) so a UI can render "Added chart 'Foo'" without
re-parsing JSON. ``HEADER_ID`` is suppressed because it duplicates
the ``dashboard_title`` scalar record.
- ``fold_dashboard_layout_with_chart_changes`` deduplicates layout
records against M2M / chart-membership records by UUID so an
add-and-attach doesn't appear twice.
- ``_values_equivalent`` treats ``None`` and ``""`` as equal; this
matches the save path's habit of normalising nullable strings to
the empty string.
**Listener** — ``superset/versioning/changes.py`` registers a
``before_flush`` listener that captures pre-state for each dirty
entity and an ``after_flush`` listener that runs the diff engine
against the post-state and writes ``version_changes`` rows under
the resolved ``transaction_id``. Tracks processed transaction ids
on ``session.info`` so re-firings within a single transaction
(autoflush triggered by mid-commit queries) don't double-insert and
trip the unique constraint. Reads child rows via raw SELECT against
``table_columns`` / ``sql_metrics`` rather than ``dataset.columns``
because the live collection is stale during the restore path's raw
DELETE+INSERT cycle.
**Endpoint surface** — ``VersionDAO.list_change_records_batch``
batches the lookup across multiple transactions with a single
``WHERE transaction_id IN (...)`` query so the version-list
endpoint avoids N+1 round-trips. ``list_versions`` / ``get_version``
return entries with a populated ``changes`` array (empty for
``operation_type=0`` baseline rows).
**Tests** — ``test_diff.py`` covers the diff engine shape (39
unit cases across scalar, JSON, layout, child-collection, and
fold paths). ``change_records_tests.py`` exercises the listener
end-to-end with realistic save flows. ``perf_validation_tests.py``
is the T044 harness for SC-002/3/4 (list endpoint p95 < 1s,
restore < 3s, save overhead < 50ms).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds SQLAlchemy-Continuum as a dependency and wires it as the
canonical capture mechanism for chart, dashboard, and dataset edits.
**Schema** — three Alembic migrations, leaving the chain at one
foundation revision plus one child-shadow revision:
- ``version_transaction`` (renamed from Continuum's default
``transaction``; SQL-reserved-word workaround) carries the per-save
``user_id`` / ``issued_at`` and is the join target for all shadow
rows. Auto-incrementing PK; user_id has no FK so import / Celery /
CLI saves can write rows without an active Flask user.
- Parent shadow tables for the three entity types:
``dashboards_version``, ``slices_version``, ``tables_version``.
- Child shadow tables for dataset children + dashboard M2M:
``table_columns_version``, ``sql_metrics_version``,
``dashboard_slices_version`` (composite PK on the M2M shadow,
matching the live ``dashboard_slices`` reshape from
sc-105349-composite-association-pks).
**Models** — ``Dashboard``, ``Slice``, ``SqlaTable`` (and dataset
children ``TableColumn`` / ``SqlMetric``) gain ``__versioned__``
class attributes. The exclude lists carry both M2M relationships
(``owners``, ``roles``, ``dashboards``) and the ``AuditMixin``
columns (``changed_on`` / ``created_on`` / ``changed_by_fk`` /
``created_by_fk`` plus ``last_saved_at`` / ``last_saved_by_fk``
on ``Slice``) so auto-bumped audit fields cannot trigger a
version row on their own (FR-025).
**Plugins** — ``superset/versioning/factory.py`` ships three
Continuum plugins:
- ``VersionTransactionFactory`` renames the transaction table and
appends the unconditional ``user_id`` column.
- ``VersioningFlaskPlugin`` sources the acting user from Superset's
``g.user`` rather than ``flask_login.current_user`` (Superset's
JWT auth populates ``g.user`` but leaves ``current_user``
anonymous on API routes).
- ``SkipUnmodifiedPlugin`` filters Continuum's UPDATE operations,
marking content-equivalent re-saves as ``processed=True`` so they
don't mint no-op shadow rows (FR-026; see follow-up commits for
the test). Lives in this commit because it shares the file with
the other plugins.
**Save-path glue** — a ``before_flush`` baseline listener
(``superset/versioning/baseline.py``) inserts an ``operation_type=0``
shadow row the first time a pre-existing entity is saved, including
the slice-baseline-under-dashboard pattern that gives the dashboard
M2M shadow a row to join against. ``UpdateDashboardCommand`` wraps
its body in ``no_autoflush`` so ``process_tab_diff`` /
``process_native_filter_diff`` don't fire intermediate flushes that
would mint extra version rows. ``DatasetDAO.update_columns`` is
rewritten as a natural-key upsert keyed on ``column_name`` so child
edits flow through ORM events Continuum sees.
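The natural-key upsert idea can be sketched in plain Python. This is only an illustration under stated assumptions: ``Column`` is a hypothetical stand-in for ``TableColumn``, ``upsert_columns`` is not the real ``DatasetDAO.update_columns``, and dropping columns that are absent from the payload is an assumed policy. The point it shows is that matched children are mutated in place (so the ORM would see an update rather than a delete plus insert).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Column:
    """Hypothetical stand-in for a dataset child row such as TableColumn."""

    column_name: str
    type: str


def upsert_columns(existing: List[Column], payload: List[Dict[str, str]]) -> List[Column]:
    """Merge ``payload`` into ``existing`` using ``column_name`` as the natural key."""
    by_name = {col.column_name: col for col in existing}
    result: List[Column] = []
    for item in payload:
        col = by_name.get(item["column_name"])
        if col is None:
            # No match on the natural key: treat as a new child row.
            col = Column(column_name=item["column_name"], type=item["type"])
        else:
            # Match: mutate the same object so its identity is preserved.
            col.type = item["type"]
        result.append(col)
    # Assumed policy: columns missing from the payload are dropped.
    return result
```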
**DAO** — ``superset/daos/version.py`` exposes the read API used by
the version endpoints in the next commits:
``current_version_number`` (0-based index, unstable under retention
pruning), ``current_live_transaction_id`` (stable across pruning),
``current_live_version_uuid`` (deterministic UUIDv5), plus
``list_versions`` / ``get_version`` / ``restore_version`` and a
batch ``list_change_records_batch`` for N+1 avoidance.
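What "deterministic UUIDv5" buys can be shown with the standard library. This is a minimal sketch, not the DAO's code: the namespace value, the ``version_uuid`` name, and the ``"<entity uuid>:<transaction id>"`` name format are all assumptions made for illustration.

```python
import uuid

# Hypothetical namespace; the namespace used by the real DAO is not shown here.
VERSION_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "superset/versioning")


def version_uuid(entity_uuid: uuid.UUID, transaction_id: int) -> uuid.UUID:
    """Derive a stable identifier from an entity UUID and a transaction id."""
    # Assumed name format: "<entity uuid>:<transaction id>".
    return uuid.uuid5(VERSION_NAMESPACE, f"{entity_uuid}:{transaction_id}")
```

Because UUIDv5 is a hash of namespace plus name, the same entity and transaction always map to the same identifier, with no stored state needed to look it up again.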
**Initialization** — ``superset/initialization/__init__.py`` wires
``init_versioning()`` after ``make_versioned()`` runs and the
versioned mappers are configured. Registers the baseline listener
plus the change-record listener (the latter's body lives in the
next commit but the registration site is here because it shares
the init function).
**Tests** — version-capture and version-list integration tests for
each entity type, plus a ``VersionDAO`` unit test suite. Retention
test uses a backdated ``issued_at`` so it can drive
``_prune_old_versions_impl`` synchronously.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three improvements from @aminghadersohi's review on
apache/superset#39859:
1. **`fk["name"]` unguarded in ``_downgrade_mysql_table`` re-add loop**
The drop loop gates on ``if fk_name := fk.get("name"):`` but the
re-add loop accessed ``fk["name"]`` unconditionally in an f-string.
MySQL/InnoDB always assigns FK names, so this branch was defensive,
but the asymmetry was confusing. Symmetrized via ``continue`` at the
top of the re-add loop.
2. **``ondelete`` whitelist before raw-SQL interpolation**
The value comes from MySQL's ``information_schema`` (not user
input), but interpolating a reflected string into raw SQL without a
guard left a "what if an unexpected value appears" footgun. Added
``_VALID_ONDELETE_ACTIONS`` (the four SQL-standard actions) and a
``RuntimeError`` when an unexpected value is reflected.
3. **Direct ALTER on PostgreSQL for tables with pre-existing UNIQUE**
``recreate="always"`` is dialect-agnostic — on PostgreSQL it
triggers ``CREATE TABLE AS SELECT → DROP → RENAME`` holding
``ACCESS EXCLUSIVE`` for the full table-copy duration. For a
multi-million-row ``dashboard_slices``, that lock window can be
noticeable. The reflected UNIQUE constraint has a stable name on
PostgreSQL (default ``<table>_<cols>_key`` convention), so dropping
it directly and then running the structural change as a direct ALTER
avoids the copy entirely.
The drop by reflected name is wrapped in a new
``_drop_redundant_unique_by_name()`` helper. Postgres takes the
direct path; MySQL keeps ``recreate="always"`` because InnoDB binds
FKs to the UNIQUE's underlying index for back-reference (``DROP
CONSTRAINT`` on the UNIQUE there raises ``ERROR 1553``); SQLite
keeps ``recreate="always"`` because unnamed UNIQUEs reflect with
``name=None`` and can't be dropped by name.
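Items 1 and 2 can be sketched together. This is an illustration, not the migration's code: ``build_readd_fk_sql`` is a hypothetical helper, the dict keys follow the shape SQLAlchemy's ``Inspector.get_foreign_keys()`` returns, and the four whitelisted actions are an assumption about which four the commit means.

```python
from typing import Optional

# Assumption: the four actions in the whitelist are the ones InnoDB accepts.
_VALID_ONDELETE_ACTIONS = frozenset({"CASCADE", "SET NULL", "RESTRICT", "NO ACTION"})


def build_readd_fk_sql(table: str, fk: dict) -> Optional[str]:
    """Return an ALTER TABLE statement that re-adds ``fk``, or None to skip it."""
    fk_name = fk.get("name")
    if not fk_name:
        # Mirror the drop loop: an unnamed FK was never dropped, so skip it.
        return None

    sql = (
        f"ALTER TABLE {table} ADD CONSTRAINT {fk_name} "
        f"FOREIGN KEY ({', '.join(fk['constrained_columns'])}) "
        f"REFERENCES {fk['referred_table']} ({', '.join(fk['referred_columns'])})"
    )
    ondelete = (fk.get("options") or {}).get("ondelete")
    if ondelete:
        action = ondelete.upper()
        if action not in _VALID_ONDELETE_ACTIONS:
            # Refuse to interpolate anything outside the whitelist into raw SQL.
            raise RuntimeError(f"Unexpected ON DELETE action reflected: {ondelete!r}")
        sql += f" ON DELETE {action}"
    return sql
```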
Verified end-to-end: downgrade-then-upgrade against MySQL with
~12M total junction rows seeded completes in ~1m 41s (within the
range of the prior measurements).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Justin Park (@justinpark) reported on apache/superset#39859:
MySQLdb.OperationalError: (1832, "Cannot change column 'dashboard_id':
used in a foreign key constraint 'fk_dashboard_roles_dashboard_id_dashboards'")
Root cause: ``batch_op.alter_column(fk1, nullable=False)`` for the six
non-UNIQUE association tables emits ``ALTER COLUMN`` on a column that
participates in an FK constraint. MySQL 8 rejects this with ERROR 1832
when the table has data — even when the change is just ``NULL`` →
``NOT NULL`` and the column is already part of a freshly-added
composite primary key (which InnoDB has just made implicitly NOT NULL
anyway). The error fires on populated tables only; CI's ``test-mysql``
shard runs against empty tables and so didn't catch this, while a
real production-shaped install does.
The ``alter_column`` was only ever needed for SQLite, where composite
``PRIMARY KEY`` does not promote constituent columns to ``NOT NULL``
(a long-standing SQLite quirk — only ``INTEGER PRIMARY KEY`` does).
PostgreSQL and MySQL implicitly promote PK columns to ``NOT NULL`` as
part of ``ADD PRIMARY KEY``, so the explicit step is unnecessary on
both — and on MySQL it's actively broken on populated tables.
Fix: extract the ``alter_column`` pair into a helper
``_enforce_not_null_for_sqlite()`` that no-ops on Postgres and MySQL.
Both branches of the per-table upgrade (the ``recreate="always"`` path
for the two UNIQUE-bearing tables, and the direct-ALTER path for the
other six) now call the helper instead of inlining the
``alter_column``.
Verified end-to-end: downgrade-then-upgrade against MySQL with
~12M total junction rows (10M dashboard_slices + 1M each
slice_user/dashboard_user + 100K dashboard_roles) completes in
1m 39s with no ERROR 1832. The 44 in-memory SQLite tests still pass.
Considered Justin's alternative (drop FKs on MySQL across all eight
tables, unifying the two branches) but rejected as more invasive —
it would require capturing FK metadata and explicitly re-creating
the FKs for the six non-recreate tables, since they don't go through
the ``copy_from`` path that re-creates FKs automatically. The
SQLite-only approach is more targeted: it removes the operation that
MySQL rejects rather than working around the rejection.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the stress-test seed script with an optional duplicate-row
injection step, used to measure the empirical cost of the migration's
``_dedupe_by_min_id`` phase.
Usage: after running the normal seed at a given scale, add
``--dirty-duplicates-pct 5`` (or any non-zero value) to inject that
percentage of duplicate ``(fk1, fk2)`` rows into each non-UNIQUE
junction (slice_user, dashboard_user, dashboard_roles —
dashboard_slices is skipped because its UNIQUE constraint, present
both pre- and post-migration, rejects duplicates).
Pre-condition: requires the DB to be at the pre-migration revision
(33d7e0e21daa). The post-migration composite PK rejects duplicates,
so attempting to inject on the upgraded schema errors out.
Empirical result on MySQL @ 10M dashboard_slices + ~2.1M other
junction rows + 105K injected duplicates (5% on the 3 non-UNIQUE
tables):
Upgrade time: 1m 36s vs clean baseline 1m 37s
→ dedupe cost is within measurement noise; the table-scan that
the migration already performs dominates whether or not
duplicates exist.
This empirically confirms what the cost-model predicted: the
``_dedupe_by_min_id`` GROUP BY scan is the dominant cost of that
phase, and the actual per-duplicate DELETE is negligible.
NULL-FK injection deliberately skipped — would require altering the
six non-UNIQUE FK columns from NOT NULL back to nullable (the
migration's downgrade keeps them NOT NULL by design), which adds
per-backend ALTER complexity for a code path that's structurally
identical in cost shape (DELETE WHERE col IS NULL is the same scan
shape as the dedupe scan).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add ``scripts/seed_junction_load.py``, a backend-agnostic script that
bulk-inserts synthetic parent rows (dashboards, slices, users, roles,
tables, dbs) and many-to-many junction rows for the four largest
association tables targeted by the composite-PK migration:
``dashboard_slices``, ``slice_user``, ``dashboard_user``,
``dashboard_roles``.
Designed for measuring migration runtime at varying scales — run with
a series of size flags (100K / 1M / 5M / 10M for the target table)
and time the migration at each scale to verify the predicted
``O(N log N)`` extrapolation against real numbers.
Properties:
- **Reproducible**: a deterministic cross-product walk through parent IDs
produces a stable pair sequence, so every run generates the same rows.
- **Idempotent**: re-running with the same target is a no-op; with a
higher target, only new rows are added.
- **Backend-agnostic**: connects via Superset's standard ``DATABASE_*``
env vars (or ``SUPERSET__SQLALCHEMY_DATABASE_URI``). Branches on
dialect for ``BINARY(16)`` vs ``UUID`` vs TEXT/BLOB UUID columns.
- **Batched**: bulk INSERT 10K rows per statement.
- **Per-phase timing**: logs elapsed wall time for the parents phase,
the junctions phase as a whole, and per junction-table.
- **Avoidance set**: loads existing junction pairs into a Python set
so re-runs on top of pre-existing data don't collide on the
uniqueness constraint.
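The reproducible, idempotent, batched, and avoidance-set properties above combine roughly as follows. This is a sketch under assumptions, not an excerpt from ``scripts/seed_junction_load.py``: ``new_pairs`` and ``batches`` are hypothetical names, and the real script's walk order may differ.

```python
from itertools import islice, product
from typing import Iterable, Iterator, List, Sequence, Set, Tuple

Pair = Tuple[int, int]


def new_pairs(
    left_ids: Sequence[int],
    right_ids: Sequence[int],
    existing: Set[Pair],
    target: int,
) -> Iterator[Pair]:
    """Yield pairs in a fixed cross-product order until the table would hold ``target`` rows."""
    needed = target - len(existing)
    if needed <= 0:
        return  # already at or above the target: the re-run is a no-op
    # Skip pairs already present so inserts never collide on uniqueness.
    fresh = (pair for pair in product(left_ids, right_ids) if pair not in existing)
    yield from islice(fresh, needed)


def batches(pairs: Iterable[Pair], size: int = 10_000) -> Iterator[List[Pair]]:
    """Group pairs into lists of at most ``size`` for one bulk INSERT each."""
    it = iter(pairs)
    while chunk := list(islice(it, size)):
        yield chunk
```

Raising the target on a later run only yields the pairs beyond those already stored, which is what makes incremental scale-ups (100K, then 1M, and so on) cheap.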
Usage (inside the Superset container):
docker exec superset-superset-1 \
/app/.venv/bin/python /app/scripts/seed_junction_load.py \
--dashboard-slices 1000000
Defaults target a "large multi-team install" shape: 1M
``dashboard_slices``, 100K each ``slice_user`` / ``dashboard_user``,
10K ``dashboard_roles``. Override per-table via flags.
Tested locally on MySQL (the user's current eval stack):
- 200/100/100/50 row mini-run produced expected counts.
- Re-running at the same target is a no-op (idempotent).
- ``--dry-run`` plans without writing.
Junction tables not yet covered (``sqlatable_user``, ``rls_filter_*``,
``report_schedule_user``) are typically small in production and
require additional parent seeding (RLS filters, report schedules)
that wasn't worth the scope here. Adding them is straightforward by
extending ``JUNCTIONS`` and writing the corresponding parent seeder.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix two follow-on issues reported when starting the dev stack with
docker-compose-mysql.yml:
1. ``superset-init`` step 4 (load-examples) fails with
``MySQLdb.OperationalError: (2002, "Can't connect to server on 'db'")``
because the analytics-examples DB connection inherits ``EXAMPLES_PORT=5432``
(Postgres port) from ``docker/.env``. The override flipped
``DATABASE_DIALECT`` to ``mysql+mysqldb`` but left the EXAMPLES_*
group on Postgres defaults, so the URI became
``mysql+mysqldb://examples:examples@db:5432/examples`` — MySQL
container has no listener on 5432.
Fix: add ``EXAMPLES_HOST/PORT/DB/USER/PASSWORD`` and a complete
``SUPERSET__SQLALCHEMY_EXAMPLES_URI`` to the ``mysql-env`` anchor.
2. The Postgres init scripts under
``docker/docker-entrypoint-initdb.d/`` (``cypress-init.sh``,
``examples-init.sh``) get mounted on the MySQL container too —
compose merges volume lists. They invoke ``psql`` which doesn't
exist in the MySQL image, abort with ``psql: command not found``,
and prevent the ``examples`` DB from being created.
Fix: add a MySQL-specific init script
``docker/mysql-init/examples-init.sql`` that creates the
``examples`` database and user, and mount it at
``/docker-entrypoint-initdb.d`` in the override. Compose's
later-takes-precedence rule on duplicate volume targets displaces
the Postgres init dir, so the MySQL container only sees the
MySQL-compatible script.
(Used a plain duplicate-target mount rather than the ``!override``
tag because pre-commit's ``check-yaml`` doesn't recognize Compose's
custom YAML tags.)
Recovery for an existing failed MySQL stack: ``docker compose -f
docker-compose.yml -f docker-compose-mysql.yml down``, then
``docker volume rm superset_db_home_mysql`` (so the new init script
runs on the next fresh boot), then ``up`` again.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``docker-compose-mysql.yml``, a compose-override file that swaps
the default Postgres metadata DB for MySQL 8 with one extra ``-f``
flag:
docker compose -f docker-compose.yml -f docker-compose-mysql.yml up
Useful for evaluating dialect-specific behaviour (e.g., the runtime
cost of DDL migrations on a deployment whose production metadata DB
is MySQL — the question raised by review feedback on this PR).
Mirrors the connection settings used by CI's ``test-mysql`` shard:
``mysql+mysqldb`` dialect, charset ``utf8mb4`` with binary_prefix.
Host port defaults to 13306 (configurable via ``DATABASE_PORT_MYSQL``)
to avoid colliding with a native MySQL install on 3306.
A separate volume (``db_home_mysql``) keeps MySQL data isolated from
the Postgres ``db_home`` volume, so switching between the two with
``-f`` flag toggles doesn't corrupt either side.
The Postgres-specific init scripts under
``docker/docker-entrypoint-initdb.d/`` are not mounted on the MySQL
service (they are postgres-only). Examples / cypress fixtures still
load via ``superset-init``'s post-startup steps, which run
``superset load-examples`` against whichever metadata DB is in use.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of the PostgreSQL diagnostic queries added in 11148779ed,
adapted for MySQL/InnoDB. One important difference: InnoDB rebuilds
the clustered index on every PK change, so all eight tables undergo
a full table rebuild on MySQL — not just the two that go through
the explicit ``recreate="always"`` path. The lock-window estimate
query is updated to cover all eight rather than just two, and the
"migration_path" column makes the rebuild expectation explicit
("direct ALTER (still rebuilds InnoDB clustered index)").
Other notes:
- ``information_schema.TABLES.TABLE_ROWS`` is an InnoDB estimate,
analogous to PostgreSQL's ``reltuples``; documented inline.
- ``KEY_COLUMN_USAGE`` carries both sides of the FK in a single
row on MySQL, so the external-FK pre-flight check is simpler
than the PostgreSQL version (no joins between three views).
- The aggregated dedupe query is portable standard SQL; included
verbatim for copy-paste convenience.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a "Sizing the maintenance window on PostgreSQL" sub-section to the
operator runbook. The simple per-table COUNT/duplicate/NULL queries
that were already there are dialect-portable but only count rows;
operators on PostgreSQL with large deployments need to characterize
the migration's runtime cost before scheduling it.
Adds four diagnostic queries:
- Per-table size, row count (from pg_class.reltuples), and which
migration path each table will take (recreate-rewrite vs direct
ALTER). Sizes the work concretely.
- Aggregated duplicate-row roll-up: dup_groups + total rows_dropped
per table. Replaces eight separate per-table queries with one
consolidated result for audit/dump-before-apply decisions.
- External-FK pre-flight check (the same one the migration runs at
upgrade time and aborts on). Lets operators surface any blocking
external reference ahead of the maintenance window. Should be
empty on a stock install.
- Lock-window estimate for the two full-rewrite tables, using
pg_relation_size and a conservative 100 MB/s rewrite throughput
assumption. The other six use direct ALTER and are dominated by
composite-index build time (seconds for low-millions-of-rows
tables).
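The lock-window estimate for the two full-rewrite tables is a single division. A minimal sketch of that arithmetic, assuming "100 MB/s" is read as 100 MiB/s and that the size comes from ``pg_relation_size``; the function name is hypothetical.

```python
# Assumption: "100 MB/s" is interpreted as 100 MiB/s.
REWRITE_THROUGHPUT_BYTES_PER_SEC = 100 * 1024 * 1024


def estimate_rewrite_seconds(
    relation_size_bytes: int,
    throughput: int = REWRITE_THROUGHPUT_BYTES_PER_SEC,
) -> float:
    """Estimate how long a full table rewrite holds its lock, in seconds."""
    return relation_size_bytes / throughput
```

Under these assumptions a 1 GiB table works out to roughly ten seconds of rewrite time; index rebuilds come on top of that.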
Prompted by reviewer feedback on apache/superset#39859 from a large
deployment asking how to size the maintenance window. The original
pre-flight queries are kept for cross-dialect operators (MySQL,
SQLite) since the new queries use PostgreSQL-specific catalog views.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI cypress + playwright shards were red with:
ERROR [flask_migrate] Error: Multiple head revisions are present
for given argument 'head'
The recent rebase onto master pulled in
``33d7e0e21daa_add_semantic_layers_and_views.py`` (from PR #37815,
"semantic layer extension"), which had been authored against
``ce6bd21901ab`` as its parent — the same parent our migration
referenced. After the rebase both migrations point at
``ce6bd21901ab``, producing two heads and breaking ``flask db
upgrade head`` for any downstream consumer (CI's Cypress / Playwright
shards spin up a real Superset instance via ``superset db upgrade``,
which is why those shards failed first; the integration shards run
against a precomputed schema and didn't surface this).
Fix: chain our migration after the semantic-layer migration by
pointing ``down_revision`` at ``33d7e0e21daa``. The chain is now
linear:
... → ce6bd21901ab → 33d7e0e21daa (semantic layers)
→ 2bee73611e32 (composite PK, this PR)
Verified with ``superset db heads`` (returns single head
``2bee73611e32``) and the local migration test suite (44 passed,
1 skipped).
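Why the rebase produced two heads can be seen with a toy model of the revision graph. This is a sketch, not Alembic's implementation: ``find_heads`` is a hypothetical helper, and ``ce6bd21901ab`` is given no parent only to keep the example small.

```python
from typing import Dict, Optional, Set


def find_heads(down_revisions: Dict[str, Optional[str]]) -> Set[str]:
    """Return the revisions that no other revision names as its parent."""
    parents = {parent for parent in down_revisions.values() if parent is not None}
    return set(down_revisions) - parents


# After the rebase: both migrations point at the same parent.
before_fix = {
    "ce6bd21901ab": None,  # simplification: earlier history omitted
    "33d7e0e21daa": "ce6bd21901ab",
    "2bee73611e32": "ce6bd21901ab",
}

# After the fix: the composite-PK migration chains after semantic layers.
after_fix = {**before_fix, "2bee73611e32": "33d7e0e21daa"}
```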
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Found by running fresh-install + round-trip against a real SQLite DB:
6 of the 8 affected tables had FK columns that were originally
declared nullable. PostgreSQL and MySQL implicitly promote the
constituent columns of an ``ALTER TABLE ... ADD PRIMARY KEY`` to
``NOT NULL``; SQLite does not (a long-standing SQLite quirk: only a
single-column ``INTEGER PRIMARY KEY`` can never hold NULL, while the
columns of a composite PK stay nullable). Result: a fresh SQLite
install would accept
``INSERT INTO dashboard_slices VALUES (NULL, 5)`` despite both columns
being part of the composite PK.
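The quirk is easy to reproduce with the standard-library ``sqlite3`` module. This is a minimal sketch against a throwaway in-memory table; only the table and column names are borrowed from the migration, and ``composite_pk_accepts_null`` is a hypothetical helper.

```python
import sqlite3


def composite_pk_accepts_null(declare_not_null: bool) -> bool:
    """Return True if SQLite accepts a NULL in a composite-PK column."""
    not_null = " NOT NULL" if declare_not_null else ""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dashboard_slices ("
        f"dashboard_id INTEGER{not_null}, "
        f"slice_id INTEGER{not_null}, "
        "PRIMARY KEY (dashboard_id, slice_id))"
    )
    try:
        conn.execute("INSERT INTO dashboard_slices VALUES (NULL, 5)")
        return True
    except sqlite3.IntegrityError:
        # Only raised when the columns carry an explicit NOT NULL.
        return False
    finally:
        conn.close()
```

Calling it with ``declare_not_null=False`` shows the pre-fix behaviour (the NULL row is accepted); with ``declare_not_null=True`` the insert fails with ``IntegrityError``, which is the shape the explicit ``alter_column`` restores.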
Our integration tests previously masked this: the test fixture seeds
columns with ``nullable=False``, so the post-upgrade NOT NULL
assertion passed regardless of whether the migration enforced it.
Fix: add explicit ``batch_op.alter_column(fk, nullable=False)`` for
both FK columns inside the per-table batch_alter_table block. On
PostgreSQL and MySQL this is a no-op (PK already implies NOT NULL);
on SQLite it adds the missing NOT NULL declaration so a fresh
install matches the data-model.md "After" contract.
Verified end-to-end:
- Postgres + MySQL: column shape unchanged (still NOT NULL)
- SQLite fresh install + round-trip: all 8 tables now have NOT NULL
on FK columns, ``INSERT (NULL, 5)`` correctly rejected with
IntegrityError on dashboard_slices, dashboard_user, sqlatable_user
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two MySQL-only failures in the downgrade path, found by running the
full migration history against a fresh MySQL 8 container:
1. ``MySQLdb.OperationalError: (1553, "Cannot drop index 'PRIMARY':
needed in a foreign key constraint")``. InnoDB uses the composite
PK index to back the FK on the leftmost column. The downgrade
tried to drop the composite PK before dropping the FKs, orphaning
the FK's backing index. PostgreSQL and SQLite create separate
indexes for FK columns and don't trip on this.
2. ``Field 'id' doesn't have a default value`` on subsequent INSERT.
``sa.Identity(always=False)`` only emits ``AUTO_INCREMENT`` on
MySQL when the column is created with ``primary_key=True`` — our
portable path adds the column first then creates the PK separately,
so MySQL leaves the column without auto-generation. Existing rows
would all collide on id=0; future inserts fail because no default.
Postgres' ``GENERATED BY DEFAULT AS IDENTITY`` and SQLite's
``INTEGER PRIMARY KEY`` rowid alias don't have this gap.
Fix: extract ``_downgrade_mysql_table()`` that emits the canonical
MySQL idiom — drop FKs, then a single ALTER combining
``DROP PRIMARY KEY, ADD COLUMN id INT NOT NULL AUTO_INCREMENT,
ADD PRIMARY KEY (id)`` (which backfills existing rows with sequential
ids and preserves AUTO_INCREMENT), restore the redundant UNIQUE on
the 2 tables that originally had it, and re-add the FKs with their
original names. Postgres and SQLite keep the existing portable
``batch_alter_table`` path.
Raw SQL is unavoidable for the combined-ALTER form; per the
constitution it's allowed for dialect-specific DDL with no SQLA
equivalent, with triple-quoted strings for legibility.
Verified end-to-end: upgrade → downgrade → upgrade against a fresh
MySQL 8 container with INSERT-without-id sanity check showing the
restored ``id`` column auto-increments correctly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI test-mysql failed with:
MySQLdb.OperationalError: (1826, "Duplicate foreign key constraint
name 'fk_dashboard_slices_slice_id_slices'")
Root cause: MySQL scopes foreign-key constraint names per-database,
not per-table (PostgreSQL and SQLite scope per-table). The
``batch_alter_table(... recreate="always", copy_from=...)`` path
used for ``dashboard_slices`` and ``report_schedule_user`` builds
``_alembic_tmp_<table>`` carrying the original FK names from
``copy_from`` while the original table still holds those names — MySQL
rejects the temp-table creation with ERROR 1826.
Fix: on MySQL only, drop the original FK constraints by name before
the ``batch_alter_table`` runs. The ``copy_from`` re-creates them on
the rebuilt table with their original names, so the post-migration
shape is unchanged. On PostgreSQL and SQLite the original code path
still runs unchanged.
Local SQLite tests (44 passed, 1 skipped) still pass; CI will validate
on MySQL.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Beto's review comments on apache/superset#39859: replace
``sa.text(f"...")`` SQL construction in the three pre-flight helpers
(``_delete_null_fk_rows``, ``_dedupe_by_min_id``, ``_assert_no_duplicates``)
with SQLAlchemy core constructs (``sa.delete``, ``sa.select``,
``sa.func``, ``.subquery()``, ``.notin_()``).
A small ``_table_clause()`` helper builds a lightweight ``TableClause``
exposing the columns the queries reference; the three helpers consume
it. Removes all ``# noqa: S608`` comments — they are no longer needed
because there is no string-interpolated SQL.
Verified the compiled SQL is identical on Postgres, MySQL, and SQLite,
including the MySQL ERROR 1093 workaround (the inner aggregation is
wrapped in a derived table via ``.subquery()``, producing
``... NOT IN (SELECT keep_id FROM (SELECT min(id) ...) AS keep_min)``).
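The shape of the wrapped-subquery dedupe can be tried out locally. This sketch uses raw SQL through the standard-library ``sqlite3`` module rather than the SQLAlchemy core constructs the commit introduces, and runs against a throwaway in-memory ``slice_user`` table; ``dedupe_demo`` is a hypothetical name.

```python
import sqlite3

# The inner aggregation sits in a derived table (keep_min). On MySQL that
# extra layer is what avoids ERROR 1093; SQLite accepts it either way.
DEDUPE_SQL = """
DELETE FROM slice_user
WHERE id NOT IN (
    SELECT keep_id FROM (
        SELECT MIN(id) AS keep_id
        FROM slice_user
        GROUP BY user_id, slice_id
    ) AS keep_min
)
"""


def dedupe_demo() -> list:
    """Seed duplicate (user_id, slice_id) pairs, dedupe, and return survivors."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE slice_user ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, slice_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO slice_user (id, user_id, slice_id) VALUES (?, ?, ?)",
        [(1, 1, 10), (2, 1, 10), (3, 2, 10), (4, 1, 10)],
    )
    conn.execute(DEDUPE_SQL)
    rows = conn.execute(
        "SELECT id, user_id, slice_id FROM slice_user ORDER BY id"
    ).fetchall()
    conn.close()
    return rows
```

For each duplicated pair only the row with the lowest ``id`` survives, which matches the ``MIN(id)`` keep-original policy the helper's name describes.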
Also drops the redundant ``f`` prefix on the two non-interpolating
lines of the ``_check_no_external_fks_to_id`` error message.
44 migration tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five operator-experience improvements from the second review pass:
1. ``TABLES_WITH_NULLABLE_FKS`` is now explicitly documented as an
informational set that is not consulted at runtime; the comment
explains the previous ``dashboard_roles`` omission was the bug
that motivated the always-run cleanup.
2. ``_delete_null_fk_rows`` docstring updated to match the
"always run" semantics (was still claiming "called only on tables
in TABLES_WITH_NULLABLE_FKS").
3. ``_check_no_external_fks_to_id`` now documents its scope
limitation: ``Inspector.get_table_names()`` returns the default
schema only, so cross-schema FKs in non-standard multi-schema
PostgreSQL deployments would not be caught. The single-schema
case (Superset's documented deployment) is fully covered.
4. ``_dedupe_by_min_id`` now logs a sample of up to 10 discarded
``(fk1, fk2, id)`` tuples at WARN before deletion, so operators
can audit which rows the ``MIN(id)`` policy drops. The keep-
original policy is correct in practice but discards later
re-grants on ownership tables; the sample makes that visible.
5. ``UPDATING.md`` documents the upgrade/downgrade primary-key
name divergence (``pk_<table>`` vs ``<table>_pkey``) so
operators using schema-comparison tools don't mistake it for
migration drift.
No schema or runtime-behaviour changes. All 44 migration tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two cleanups from PR review:
1. ``dashboard_roles.dashboard_id`` was created nullable in revision
e11ccdd12658 but was missing from ``TABLES_WITH_NULLABLE_FKS``. A
production database with a stray NULL ``dashboard_id`` row would have
failed the PK-add with a cryptic constraint violation. Fix by running
the NULL-FK cleanup on every affected table — it is a no-op DELETE on
tables whose FK columns are already NOT NULL, and it eliminates the
risk of further drift in the hardcoded set. ``dashboard_roles`` is
added to the documentation set, which the runtime no longer consults.
2. The unit-test parent-table name for ``rls_filter_roles`` and
``rls_filter_tables`` was ``rls_filter`` (does not exist) instead of
the real parent ``row_level_security_filters``. Test passes either
way (the in-memory FK is self-consistent), but the parameter is now
accurate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace synthetic id INTEGER PRIMARY KEY with composite PRIMARY KEY (fk1, fk2)
on the eight pure-junction tables: dashboard_roles, dashboard_slices,
dashboard_user, report_schedule_user, rls_filter_roles, rls_filter_tables,
slice_user, sqlatable_user. The redundant UNIQUE(fk1, fk2) on dashboard_slices
and report_schedule_user is dropped (subsumed by the new PK).
Migration handles dialect quirks: copy_from for tables with pre-existing
UNIQUE (so SQLite's anonymous-constraint reflection doesn't matter), wrapped-
subquery dedupe for MySQL (ERROR 1093), sa.Identity(always=False) on downgrade
to backfill the restored id column without NOT NULL violations, and distinct
PK names per direction (pk_<table> on upgrade, <table>_pkey on downgrade) to
avoid round-trip index-name collisions on Postgres.
ORM Table() definitions updated to match. UPDATING.md entry added with
operator runbook (BI-tool impact, pre-flight inventory queries, dedupe-row-
loss notice, pg_dump workaround, FK-NOT-NULL downgrade asymmetry note).
Tests: 8 schema-shape assertions (post-upgrade), 8 duplicate-rejection unit
tests, 8 distinct-pair sanity tests, 1 round-trip + idempotency test
(in-memory SQLite via Alembic MigrationContext).
Continuum-restore verification against the new shape is out of scope for this
PR; it is the responsibility of the versioning epic (sc-103156).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Joe Li <joe@preset.io>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: codeant-ai-for-open-source[bot] <244253245+codeant-ai-for-open-source[bot]@users.noreply.github.com>
Co-authored-by: Diego Pucci <diegopucci.me@gmail.com>
We kindly ask you to include the following information in your report:
- Apache Superset version that you are using
- A sanitized copy of your `superset_config.py` file or any config overrides
- Detailed steps to reproduce the vulnerability
**Submission Standards & AI Policy**
To ensure engineering focus remains on verified risks and to manage high reporting volumes, all reports must meet the following criteria:
- Plain Text Format: In accordance with Apache guidelines, please provide all details in plain text within the email body. Avoid sending PDFs, Word documents, or password-protected archives.
- Mandatory AI Disclosure: If you utilized Large Language Models (LLMs) or AI tools to identify a flaw or assist in writing a report, you must disclose this in your submission so our triage team can contextualize the findings.
- Human-Verified PoC: All submissions must include a manual, step-by-step Proof of Concept (PoC) performed on a supported release. Raw AI outputs, hypothetical chat transcripts, or unverified scanner logs will be closed as Invalid.
We kindly ask you to include the following information in your report to assist our developers in triaging and remediating issues efficiently:
- Version/Commit: The specific version of Apache Superset or the Git commit hash you are using.
- Configuration: A sanitized copy of your `superset_config.py` file or any config overrides.
- Environment: Your deployment method (e.g., Docker Compose, Helm, or source) and relevant OS/Browser details.
- Impacted Component: Identification of the affected area (e.g., Python backend, React frontend, or a specific database connector).
- Expected vs. Actual Behavior: A clear description of the intended system behavior versus the observed vulnerability.
- Detailed Reproduction Steps: Clear, manual steps to reproduce the vulnerability.
**Vulnerability Definition**
Apache Superset considers a security vulnerability to be a demonstrable issue that has meaningful impact on confidentiality, integrity, or availability beyond the intended security model. Low-impact boundary variations or technical edge cases in existing access controls may be classified as hardening improvements rather than vulnerabilities, even if exploitable.
**Out of Scope Vulnerabilities**
To prioritize engineering efforts on genuine architectural risks, the following scenarios are explicitly out of scope and will not be issued a CVE:
- **Attacks requiring Admin privileges**: (e.g., CSS injection, template manipulation, dashboard ownership overrides, or modifying global system settings). Per the CVE vulnerability definition in CNA Operational Rules 4.1, a qualifying vulnerability must allow violation of a security policy. The Admin role is a fully trusted operational boundary defined by Apache Superset's security policy; actions within this boundary do not violate that policy and are therefore considered intended capabilities 'by design,' not vulnerabilities.
- **Brute Force and Rate Limiting**: Reports targeting a lack of resource exhaustion protections, generic rate-limiting, or volumetric Denial of Service (DoS) attempts.
- **Theoretical attack vectors**: Issues without a demonstrable, reproducible exploit path.
- **Non-Exploitable Findings**: Missing security headers, generic banner disclosures, or descriptive error messages that do not lead to a direct, documented exploit.
- **User enumeration**: API responses, timing differences, or error messages that reveal whether user accounts, IDs, dashboards, or datasets exist.
- **Information disclosure (low impact)**: Software version disclosure, generic error messages, stack traces without sensitive data exposure, or system configuration details that don't enable further exploitation.
- **Resource exhaustion requiring authentication**: Denial of Service attacks that require valid user credentials and don't bypass rate limiting or resource controls.
- **Missing security headers**: Without demonstration of a concrete exploit scenario that leverages the missing header.
**Outcome of Reports**
Reports that are deemed out-of-scope for a CVE but represent valid security best practices or hardening opportunities may be converted into public GitHub issues. This allows the community to contribute to the general hardening of the platform even when a specific vulnerability threshold is not met.
Note that Apache Superset is not responsible for any third-party dependencies that may
have security issues. Any vulnerabilities found in third-party dependencies should be
reported to the maintainers of those projects. Results from security scans of Apache
Superset dependencies found on its official Docker image can be remediated at release time
by extending the image itself.
**Vulnerability Aggregation & CVE Attribution**
In accordance with MITRE CNA Operational Rules (4.1.10, 4.1.11, and 4.2.13), Apache Superset issues CVEs based on the underlying architectural root cause rather than the number of affected endpoints or exploit payloads.
- Aggregation: If multiple exploit vectors stem from the same programmatic failure or shared vulnerable code, they must be aggregated into a single, comprehensive report.
- Independent Fixes: Separate CVEs will only be assigned if the vulnerabilities reside in decoupled architectural modules and can be fixed independently of one another.
Reports that fail to aggregate related findings will be merged during triage to ensure an accurate and defensible CVE record.
**Your responsible disclosure and collaboration are invaluable.**
Looks like your PR contains new `.js` or `.jsx` files:
```
${{steps.check.outputs.js_files_added}}
```
As decided in [SIP-36](https://github.com/apache/superset/issues/9101), all new frontend code should be written in TypeScript. Please convert the above files to TypeScript, then re-request review.
A modern, enterprise-ready business intelligence web application.
### Documentation
- **[User Guide](https://superset.apache.org/user-docs/)** — For analysts and business users. Explore data, build charts, create dashboards, and connect databases.
- **[Administrator Guide](https://superset.apache.org/admin-docs/)** — Install, configure, and operate Superset. Covers security, scaling, and database drivers.
- **[Developer Guide](https://superset.apache.org/developer-docs/)** — Contribute to Superset or build on its REST API and extension framework.
[**Why Superset?**](#why-superset) |
[**Supported Databases**](#supported-databases) |
[**Installation and Configuration**](#installation-and-configuration) |
# Part 2: Verify RSA key - this is the same as running `gpg --verify {release}.asc {release}` and comparing the RSA key and email address against the KEYS file # noqa: E501
@@ -24,6 +24,211 @@ assists people when migrating to a new version.
## Next
### Entity version history for charts, dashboards, and datasets
Saves of charts, dashboards, and datasets now automatically produce a version history — browsable and restorable via new API endpoints. No frontend UI in this release; the backend plumbing is the deliverable.
**New endpoints** (per entity type — same pattern for `chart`, `dashboard`, and `dataset`):
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/api/v1/{resource}/<uuid>/versions/` | List the entity's version history (0-based `version_number`, `version_uuid`, `issued_at`, `changed_by`) |
| `GET` | `/api/v1/{resource}/<uuid>/versions/<version_uuid>/` | Get a single version snapshot (scalar fields at that version; plus `columns` / `metrics` for datasets) |
| `POST` | `/api/v1/{resource}/<uuid>/versions/<version_uuid>/restore` | Restore the entity to the state captured by that version |
`<version_uuid>` is a deterministic `UUIDv5` derived from the entity's UUID and the Continuum transaction id — stable across replicas and retention pruning. Authorisation reuses the resource's existing `can_write` permission; workspace admins can list/restore any entity.
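As a rough illustration of why a `UUIDv5` is stable across replicas, the sketch below derives one with Python's standard `uuid.uuid5`. The helper name `derive_version_uuid`, the choice of the entity UUID as namespace, and the decimal encoding of the transaction id are assumptions made for this example; the actual derivation in Superset may differ.

```python
import uuid


def derive_version_uuid(entity_uuid: uuid.UUID, transaction_id: int) -> uuid.UUID:
    """Hypothetical helper: derive a stable UUIDv5 for one version of an entity.

    Assumption: the entity UUID is the namespace and the decimal
    transaction id is the name. The real encoding is not specified here.
    """
    return uuid.uuid5(entity_uuid, str(transaction_id))


entity = uuid.UUID("12345678-1234-5678-1234-567812345678")  # placeholder UUID
first = derive_version_uuid(entity, 42)
again = derive_version_uuid(entity, 42)
other = derive_version_uuid(entity, 43)

# The same inputs give the same UUID on any host; a new transaction id gives a new one.
print(first == again, first != other)
```

Because the value depends only on its two inputs, no coordination between replicas is needed to agree on it.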
**Version response shape — `changes` array:**
Each entry returned by `GET /api/v1/{resource}/<uuid>/versions/` and `GET .../versions/<version_uuid>/` includes a `changes` array describing what changed relative to the previous version:
The array is empty for baseline (`operation_type=0`) transactions. `kind` enumerates structured record types (`field`, layout-walker records for dashboards, dataset child diffs for `TableColumn` / `SqlMetric`); `path` is a dotted JSON-pointer-style locator; `from_value` / `to_value` are JSON-safe scalars or compact records.
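A client consuming the `changes` array might model each entry roughly as follows. This is a minimal sketch that uses only the field names mentioned above (`kind`, `path`, `from_value`, `to_value`); the `ChangeRecord` type, the `summarize_changes` helper, and the sample values are hypothetical.

```python
from collections import Counter
from typing import Any, TypedDict


class ChangeRecord(TypedDict):
    """One entry of the `changes` array (field names taken from the text above)."""

    kind: str
    path: str
    from_value: Any
    to_value: Any


def summarize_changes(changes: list[ChangeRecord]) -> dict[str, int]:
    """Count change records per `kind`; an empty (baseline) array gives {}."""
    return dict(Counter(change["kind"] for change in changes))


# Illustrative payload only; these values are invented for the example.
sample: list[ChangeRecord] = [
    {"kind": "field", "path": "slice_name", "from_value": "Old", "to_value": "New"},
    {"kind": "field", "path": "description", "from_value": None, "to_value": "Notes"},
]
print(summarize_changes(sample))  # {'field': 2}
```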
**Save-response and ETag headers:**
- Save responses (`PUT /api/v1/{resource}/<pk>`) include `old_version_uuid` and `new_version_uuid` body fields so the client can correlate a save with its resulting version row.
- All entity GETs (`GET /api/v1/{chart,dashboard,dataset}/<pk>`), version-list GETs, single-version GETs, and save responses emit an `ETag: "<version_uuid>"` header reflecting the entity's current live version. The default `CORS_OPTIONS` now sets `expose_headers: ["ETag"]` so cross-origin browser clients can read the header. **No `If-Match` enforcement in v1** — `ETag` is informational; concurrent-edit detection is deferred to a follow-up SIP.
- **Operators overriding `CORS_OPTIONS` in `superset_config.py` MUST include `"expose_headers": ["ETag"]`** (or merge with the default) for cross-origin clients to read the ETag. A bare `CORS_OPTIONS = {"origins": [...]}` will silently drop the expose-headers default.
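One way to avoid silently dropping the default when overriding `CORS_OPTIONS` is to merge `"ETag"` into whatever options you already set. The helper below is a sketch for `superset_config.py`; `with_etag_exposed` is a hypothetical name and the origin is a placeholder.

```python
from typing import Any


def with_etag_exposed(cors_options: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the CORS options whose expose_headers includes "ETag"."""
    merged = dict(cors_options)
    headers = merged.get("expose_headers", [])
    # Accept a single header name given as a plain string.
    if isinstance(headers, str):
        headers = [headers]
    headers = list(headers)
    if "ETag" not in headers:
        headers.append("ETag")
    merged["expose_headers"] = headers
    return merged


# In superset_config.py (placeholder origin):
CORS_OPTIONS = with_etag_exposed({"origins": ["https://app.example.com"]})
```

Writing `"expose_headers": ["ETag"]` directly in the dictionary works just as well; the helper only guards against forgetting it when other headers are exposed too.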
**Behaviour changes on save:**
- Every save of a chart, dashboard, or dataset produces one new version row. Rows preserve the full post-save state (scalar fields for all three entity types; `TableColumn` / `SqlMetric` children for datasets; `dashboard_slices` chart membership for dashboards — children versioned via SQLAlchemy-Continuum shadow tables `table_columns_version`, `sql_metrics_version`, and `dashboard_slices_version`).
- First save after an entity already exists in the DB creates a retroactive baseline version so the UI can show "what this looked like before I edited it."
- Tags, owners, and roles are **not** versioned in v1 (ADR-005). A restore leaves those at their live values.
**New config key:**
| Key | Default | Purpose |
|---|---|---|
| `SUPERSET_VERSION_HISTORY_RETENTION_DAYS` | `30` | Versions older than this many days are pruned by a nightly Celery beat task (`superset.tasks.version_history_retention.prune_old_versions`). Each entity's live row (`end_transaction_id IS NULL`) is always preserved; closed historical rows including the baseline age out with the rest. Set to `0` to disable retention entirely. |
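To make the retention window concrete, the sketch below computes the cutoff timestamp implied by a given setting. It assumes that `0` means nothing is pruned; `retention_cutoff` is a hypothetical helper and not the code of the Celery task.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retention_cutoff(now: datetime, retention_days: int) -> Optional[datetime]:
    """Return the timestamp before which closed versions become prunable.

    Assumption: retention_days == 0 disables pruning, so there is no cutoff.
    """
    if retention_days == 0:
        return None
    return now - timedelta(days=retention_days)


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
print(retention_cutoff(now, 30))  # 2025-01-01 00:00:00+00:00
print(retention_cutoff(now, 0))   # None
```

In `superset_config.py` the key itself is a plain integer, for example `SUPERSET_VERSION_HISTORY_RETENTION_DAYS = 90`.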
**Impact on external integrations:**
- New tables populated on every save — `dashboards_version`, `slices_version`, `tables_version` (parent shadow tables for the three entity types), `table_columns_version`, `sql_metrics_version`, `dashboard_slices_version` (child shadow tables), plus the shared `version_transaction` and `version_changes` tables. External tooling that queries Superset's DB directly will see writes to these tables proportional to save traffic.
- Existing entity endpoints (`GET`/`PUT /api/v1/{chart,dashboard,dataset}/<pk>`) gain an `ETag` response header and the save response gains `old_version_uuid` / `new_version_uuid` body fields. No existing fields are removed or repurposed.
- Version capture is always active — no feature flag.
### Cross-entity activity stream for charts, dashboards, and datasets
A read-only companion to the version-history endpoints (above). Each entity type gains an `/activity/` endpoint that returns a chronological stream of edits — the entity's own edits plus, for dashboards and charts, transitive edits to related entities during their association windows.
**New endpoints** (per entity type):
| Method | Path | Purpose |
|---|---|---|
| `GET` | `/api/v1/dashboard/<uuid>/activity/` | Dashboard own edits + edits to charts attached during their dashboard window + edits to datasets those charts pointed at during their chart window |
| `GET` | `/api/v1/chart/<uuid>/activity/` | Chart own edits + edits to datasets the chart pointed at during association |
| `GET` | `/api/v1/dataset/<uuid>/activity/` | Dataset own edits only (no transitive layer in V2) |
**Query parameters** (all optional):
| Param | Type | Default | Purpose |
|---|---|---|---|
| `since` | ISO 8601 datetime | — | Lower bound on `issued_at` |
| `until` | ISO 8601 datetime | — | Upper bound on `issued_at` |
| `include` | `self` \| `related` \| `all` | `all` | Filter to only the entity's own edits, only related edits, or both |
`count` is the total record count *after* the silent permission filter (see below), not the raw query size.
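The query parameters above can be combined in a request URL as sketched below, using only the standard library. The host and UUID are placeholders and `activity_url` is a hypothetical helper; authentication is left out.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

ALLOWED_INCLUDE = {"self", "related", "all"}


def activity_url(
    base_url: str,
    resource: str,
    entity_uuid: str,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
    include: Optional[str] = None,
) -> str:
    """Build the activity endpoint URL; omitted parameters use server defaults."""
    if include is not None and include not in ALLOWED_INCLUDE:
        raise ValueError(f"include must be one of {sorted(ALLOWED_INCLUDE)}")
    params = {}
    if since is not None:
        params["since"] = since.isoformat()  # ISO 8601 datetime
    if until is not None:
        params["until"] = until.isoformat()
    if include is not None:
        params["include"] = include
    path = f"{base_url.rstrip('/')}/api/v1/{resource}/{entity_uuid}/activity/"
    return f"{path}?{urlencode(params)}" if params else path


url = activity_url(
    "https://superset.example.com",  # placeholder host
    "dashboard",
    "12345678-1234-5678-1234-567812345678",  # placeholder UUID
    since=datetime(2024, 1, 1, tzinfo=timezone.utc),
    include="self",
)
print(url)
```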
**Authorisation:** reuses the resource's existing `can_write` permission. Workspace admins can read any entity's activity stream. The endpoint runs `raise_for_ownership` on the path entity — non-owners get `403`.
**Silent permission filter (AV-008):** records whose source entity the requesting user can't read are silently dropped — no placeholder, no count contribution. The frontend cannot distinguish "no activity" from "you can't see this activity."
**Tombstones (AV-009 / D-15):** when an activity record references a hard-deleted source entity, the record still appears with `entity_deleted: true`, `entity_uuid: null`, and `entity_name` recovered from the last shadow row.
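The two behaviours above can be pictured with a small sketch: unreadable records are dropped before counting, and tombstoned records keep a recovered name. The `can_read` callback, the `result` key, and the helper names are assumptions for illustration; only `count`, `entity_deleted`, `entity_uuid`, and `entity_name` come from the text.

```python
from typing import Any, Callable

Record = dict[str, Any]


def visible_activity(records: list[Record], can_read: Callable[[Record], bool]) -> dict[str, Any]:
    """Drop records the user cannot read; `count` reflects the filtered list."""
    kept = [record for record in records if can_read(record)]
    return {"count": len(kept), "result": kept}


def display_name(record: Record) -> str:
    """Label a record, marking tombstones whose source entity was hard-deleted."""
    name = record.get("entity_name") or "unknown"
    return f"{name} (deleted)" if record.get("entity_deleted") else name


records = [
    {"entity_uuid": "a", "entity_name": "Sales", "entity_deleted": False, "readable": True},
    {"entity_uuid": "b", "entity_name": "Secret", "entity_deleted": False, "readable": False},
    {"entity_uuid": None, "entity_name": "Old chart", "entity_deleted": True, "readable": True},
]
response = visible_activity(records, can_read=lambda record: record["readable"])
print(response["count"], [display_name(r) for r in response["result"]])
# 2 ['Sales', 'Old chart (deleted)']
```

Note that nothing in the response reveals that a record was filtered out, which matches the "no placeholder, no count contribution" rule.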
**Impact on external integrations:**
- Pure read-only. No new tables, no new columns, no migrations. Reads sc-103156's shadow tables and the `version_changes` table.
- No new save-path code paths — perf-validation gate confirms the activity-view branch does not regress sc-103156's SC-004 50ms-overhead budget.
- No feature flag; the endpoints are always available once sc-103156's version-history feature is enabled.
### Granular Export Controls
A new feature flag `GRANULAR_EXPORT_CONTROLS` introduces three fine-grained permissions that replace the legacy `can_csv` permission:
When the feature flag is enabled, these permissions are enforced on both the frontend (disabled buttons with tooltips) and backend (403 responses from API endpoints). When disabled, legacy `can_csv` behavior is preserved.
**Migration behavior:** All three new permissions are granted to every role that currently has `can_csv`, preserving existing access. Admins can then selectively revoke individual export permissions from specific roles as needed.
### Deck.gl MapBox viewport and opacity controls are functional
The Deck.gl MapBox chart's **Opacity**, **Default longitude**, **Default latitude**, and **Zoom** controls were previously non-functional — changing them had no effect on the rendered map. These controls are now wired up correctly.
**Behavior change for existing charts:** Previously, the viewport controls had hard-coded default values (`-122.405293`, `37.772123`, zoom `11` — San Francisco) that were stored in each chart's `form_data` but never applied. The map always used `fitBounds` to center on the data. With this fix, those stored values are now respected, which means existing MapBox charts may open centered on the old default coordinates instead of fitting to data bounds.
**To restore fit-to-data behavior:** Open the chart in Explore, clear the **Default longitude**, **Default latitude**, and **Zoom** fields in the Viewport section, and re-save the chart.
### Combined datasource list endpoint
Added a new combined datasource list endpoint at `GET /api/v1/datasource/` to serve datasets and semantic views in one response.
- The endpoint is available to users with at least one of `can_read` on `Dataset` or `SemanticView`.
- Semantic views are included only when the `SEMANTIC_LAYERS` feature flag is enabled.
- The endpoint enforces strict `order_column` validation and returns `400` for invalid sort columns.
### ClickHouse minimum driver version bump
The minimum required version of `clickhouse-connect` has been raised to `>=0.13.0`. If you are using the ClickHouse connector, please upgrade your `clickhouse-connect` package. The `_mutate_label` workaround that appended hash suffixes to column aliases has also been removed, as it is no longer needed with modern versions of the driver.
### MCP Tool Observability
MCP (Model Context Protocol) tools now include enhanced observability instrumentation for monitoring and debugging:
**Two-layer instrumentation:**
1. **Middleware layer** (`LoggingMiddleware`): Automatically logs all MCP tool calls with `duration_ms` and `success` status in the audit log (Action Log UI, logs table).
2. **Sub-operation tracking**: All 19 MCP tools include granular `event_logger.log_context()` blocks for tracking individual operations like validation, database writes, and query execution.
**Security note:** Sensitive parameters (passwords, API keys, tokens) are automatically redacted in logs as `[REDACTED]`.
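The kind of redaction described in the security note can be sketched as a key-based filter applied before parameters are logged. This is not the MCP service's implementation: the marker list and the `redact` helper are assumptions made for this example.

```python
from typing import Any

# Assumed substrings that mark a parameter name as sensitive.
SENSITIVE_MARKERS = ("password", "api_key", "token", "secret")


def redact(params: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of params with sensitive values replaced by "[REDACTED]"."""
    cleaned: dict[str, Any] = {}
    for key, value in params.items():
        if any(marker in key.lower() for marker in SENSITIVE_MARKERS):
            cleaned[key] = "[REDACTED]"
        elif isinstance(value, dict):
            cleaned[key] = redact(value)  # recurse into nested parameter objects
        else:
            cleaned[key] = value
    return cleaned


print(redact({"username": "alice", "password": "hunter2", "db": {"access_token": "abc"}}))
# {'username': 'alice', 'password': '[REDACTED]', 'db': {'access_token': '[REDACTED]'}}
```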
### Distributed Coordination Backend
A new `DISTRIBUTED_COORDINATION_CONFIG` configuration provides a unified Redis-based backend for real-time coordination features in Superset. This backend enables:
- **Pub/sub messaging** for real-time event notifications between workers
- **Atomic distributed locking** using Redis SET NX EX (more performant than database-backed locks)
- **Event-based coordination** for background task management
The distributed coordination backend is used by the Global Task Framework (GTF) for abort notifications and task completion signaling, and will eventually replace `GLOBAL_ASYNC_QUERIES_CACHE_BACKEND` as the standard signaling backend. Configuring it is recommended for Redis-enabled production deployments.
Example configuration in `superset_config.py`:
```python
DISTRIBUTED_COORDINATION_CONFIG = {
    "CACHE_TYPE": "RedisCache",
    "CACHE_KEY_PREFIX": "signal_",
    "CACHE_REDIS_URL": "redis://localhost:6379/1",
    "CACHE_DEFAULT_TIMEOUT": 300,
}
```
See `superset/config.py` for complete configuration options.
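For readers unfamiliar with the `SET NX EX` locking primitive mentioned above, here is a minimal sketch of the acquire step. It is not Superset's implementation: `try_acquire_lock` is a hypothetical helper, and `client` is assumed to be a `redis-py` client (or anything with a compatible `set` method).

```python
import uuid
from typing import Any, Optional


def try_acquire_lock(client: Any, name: str, ttl_seconds: int) -> Optional[str]:
    """Try to take the lock once; return an owner token on success, else None.

    `client.set(name, value, nx=True, ex=ttl)` maps to `SET name value NX EX ttl`:
    the key is written only if it does not exist, and it expires after the TTL
    so a crashed worker cannot hold the lock forever.
    """
    token = uuid.uuid4().hex
    acquired = client.set(name, token, nx=True, ex=ttl_seconds)
    return token if acquired else None


# Usage with redis-py (needs a reachable Redis, so it is left commented out):
# import redis
# client = redis.Redis.from_url("redis://localhost:6379/1")
# token = try_acquire_lock(client, "lock:demo", ttl_seconds=30)
```

The check and the write happen in a single Redis command, which is what makes the acquisition atomic. Releasing safely additionally requires verifying the token before deleting the key, which is omitted here.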
### WebSocket config for GAQ with Docker
[35896](https://github.com/apache/superset/pull/35896) and [37624](https://github.com/apache/superset/pull/37624) updated documentation on how to run and configure Superset with Docker. Specifically for the WebSocket configuration, a new `docker/superset-websocket/config.example.json` was added to the repo, so that users could copy it to create a `docker/superset-websocket/config.json` file. The existing `docker/superset-websocket/config.json` was removed and git-ignored, so if you're using GAQ / WebSocket make sure to:
@@ -219,6 +424,246 @@ See `superset/mcp_service/PRODUCTION.md` for deployment guides.
### Composite primary keys on many-to-many association tables
The eight M:N association tables listed below have been changed from a synthetic surrogate `id INTEGER PRIMARY KEY` to a composite `PRIMARY KEY (fk1, fk2)` on the two foreign-key columns. The `id` column is dropped, and the two tables that previously carried a redundant `UNIQUE (fk1, fk2)` constraint have that constraint removed (it is now subsumed by the composite primary key).
**Affected tables and their composite-PK column pairs:**
**Impact on external readers:** Any BI tool, custom report, backup script, or external integration that references these tables by their old surrogate `id` column (e.g., `SELECT id FROM dashboard_slices WHERE …`, `WHERE dashboard_slices.id IN (…)`) will break. Update such queries to project or filter on the FK pair (`dashboard_id, slice_id`) instead. The FK columns themselves are unchanged.
**Pre-flight inventory queries.** Before applying the upgrade, operators are encouraged to run the queries below against their database to assess what the migration will change. Two classes of pre-existing data are not preserved by the migration: duplicate `(fk1, fk2)` rows (the migration keeps `MIN(id)` and deletes the rest) and rows with `NULL` in either FK column (the migration deletes them, since FK columns are promoted to `NOT NULL` for the composite PK). Compliance- or audit-sensitive operators should also `\copy` (Postgres) or `SELECT … INTO OUTFILE` (MySQL) the affected rows for their own records before upgrading.
```sql
-- Duplicate (fk1, fk2) pairs (the migration will keep MIN(id) per group, delete the rest)
SELECT dashboard_id, role_id, COUNT(*) FROM dashboard_roles GROUP BY dashboard_id, role_id HAVING COUNT(*) > 1;
SELECT dashboard_id, slice_id, COUNT(*) FROM dashboard_slices GROUP BY dashboard_id, slice_id HAVING COUNT(*) > 1;
SELECT user_id, dashboard_id, COUNT(*) FROM dashboard_user GROUP BY user_id, dashboard_id HAVING COUNT(*) > 1;
SELECT user_id, report_schedule_id, COUNT(*) FROM report_schedule_user GROUP BY user_id, report_schedule_id HAVING COUNT(*) > 1;
SELECT role_id, rls_filter_id, COUNT(*) FROM rls_filter_roles GROUP BY role_id, rls_filter_id HAVING COUNT(*) > 1;
SELECT table_id, rls_filter_id, COUNT(*) FROM rls_filter_tables GROUP BY table_id, rls_filter_id HAVING COUNT(*) > 1;
SELECT user_id, slice_id, COUNT(*) FROM slice_user GROUP BY user_id, slice_id HAVING COUNT(*) > 1;
SELECT user_id, table_id, COUNT(*) FROM sqlatable_user GROUP BY user_id, table_id HAVING COUNT(*) > 1;
-- Rows with a NULL in either FK (the migration will delete these)
SELECT COUNT(*) FROM dashboard_roles WHERE dashboard_id IS NULL OR role_id IS NULL;
SELECT COUNT(*) FROM dashboard_slices WHERE dashboard_id IS NULL OR slice_id IS NULL;
SELECT COUNT(*) FROM dashboard_user WHERE user_id IS NULL OR dashboard_id IS NULL;
SELECT COUNT(*) FROM report_schedule_user WHERE user_id IS NULL OR report_schedule_id IS NULL;
SELECT COUNT(*) FROM rls_filter_roles WHERE role_id IS NULL OR rls_filter_id IS NULL;
SELECT COUNT(*) FROM rls_filter_tables WHERE table_id IS NULL OR rls_filter_id IS NULL;
SELECT COUNT(*) FROM slice_user WHERE user_id IS NULL OR slice_id IS NULL;
SELECT COUNT(*) FROM sqlatable_user WHERE user_id IS NULL OR table_id IS NULL;
```
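To see what the two cleanup rules mean for actual rows, the sketch below replays them on an in-memory SQLite copy of a tiny `dashboard_slices` table. It imitates the documented behaviour (delete rows with a `NULL` FK, keep `MIN(id)` per duplicate pair) and is not the migration's own code; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dashboard_slices ("
    "id INTEGER PRIMARY KEY, dashboard_id INTEGER, slice_id INTEGER)"
)
conn.executemany(
    "INSERT INTO dashboard_slices (id, dashboard_id, slice_id) VALUES (?, ?, ?)",
    [(1, 10, 100), (2, 10, 100), (3, 10, 101), (4, None, 102)],
)

# Rule 1: rows with a NULL in either FK column are deleted.
conn.execute(
    "DELETE FROM dashboard_slices WHERE dashboard_id IS NULL OR slice_id IS NULL"
)
# Rule 2: for each duplicate (fk1, fk2) pair, only the MIN(id) row survives.
conn.execute(
    "DELETE FROM dashboard_slices WHERE id NOT IN ("
    "SELECT MIN(id) FROM dashboard_slices GROUP BY dashboard_id, slice_id)"
)

remaining = conn.execute(
    "SELECT id, dashboard_id, slice_id FROM dashboard_slices ORDER BY id"
).fetchall()
print(remaining)  # [(1, 10, 100), (3, 10, 101)]
```

Row 2 (a duplicate of row 1) and row 4 (a `NULL` FK) are gone, which is exactly the data the inventory queries above let you review beforehand.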
**Sizing the maintenance window on PostgreSQL.** The queries above are dialect-portable but only count rows. Operators on PostgreSQL can run the diagnostic queries below to characterize the migration's runtime cost ahead of time: per-table row count and on-disk size, an aggregated duplicate roll-up, the external-FK pre-flight check (the migration runs the same check and aborts if it returns rows), and a rewrite-time estimate for the two tables that go through the slower full-table-rebuild path.
```sql
-- Per-table size, row count, and which migration path each will take.
-- Two tables ("dashboard_slices", "report_schedule_user") have a
-- redundant UNIQUE constraint that the migration drops via a full
-- table rewrite (op.batch_alter_table(recreate="always")). The other
-- six use direct ALTER TABLE, which is much cheaper.
WITH affected(name, has_unique) AS (
    VALUES
        ('dashboard_roles', false),
        ('dashboard_slices', true),
        ('dashboard_user', false),
        ('report_schedule_user', true),
        ('rls_filter_roles', false),
        ('rls_filter_tables', false),
        ('slice_user', false),
        ('sqlatable_user', false)
)
SELECT
    a.name AS table_name,
    CASE WHEN a.has_unique THEN 'recreate (full rewrite)'
         ELSE 'direct ALTER' END AS migration_path,
    c.reltuples::bigint AS estimated_rows,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size,
    pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,
    pg_size_pretty(pg_indexes_size(c.oid)) AS index_size
FROM affected a
JOIN pg_class c ON c.relname = a.name AND c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC;
```
```sql
-- Aggregated duplicate-row roll-up.
-- "dup_groups" is the number of (fk1, fk2) pairs that appear more
-- than once; "rows_dropped" is the total number of rows the
-- migration will delete during the dedupe pass (it keeps MIN(id) per
-- group and discards the rest).
SELECT 'dashboard_roles' AS t, COUNT(*) AS dup_groups, SUM(c) - COUNT(*) AS rows_dropped
    FROM (SELECT COUNT(*) c FROM dashboard_roles GROUP BY dashboard_id, role_id HAVING COUNT(*) > 1) g
UNION ALL SELECT 'dashboard_slices', COUNT(*), SUM(c) - COUNT(*)
    FROM (SELECT COUNT(*) c FROM dashboard_slices GROUP BY dashboard_id, slice_id HAVING COUNT(*) > 1) g
UNION ALL SELECT 'dashboard_user', COUNT(*), SUM(c) - COUNT(*)
    FROM (SELECT COUNT(*) c FROM dashboard_user GROUP BY user_id, dashboard_id HAVING COUNT(*) > 1) g
UNION ALL SELECT 'report_schedule_user', COUNT(*), SUM(c) - COUNT(*)
    FROM (SELECT COUNT(*) c FROM report_schedule_user GROUP BY user_id, report_schedule_id HAVING COUNT(*) > 1) g
UNION ALL SELECT 'rls_filter_roles', COUNT(*), SUM(c) - COUNT(*)
    FROM (SELECT COUNT(*) c FROM rls_filter_roles GROUP BY role_id, rls_filter_id HAVING COUNT(*) > 1) g
UNION ALL SELECT 'rls_filter_tables', COUNT(*), SUM(c) - COUNT(*)
    FROM (SELECT COUNT(*) c FROM rls_filter_tables GROUP BY table_id, rls_filter_id HAVING COUNT(*) > 1) g
UNION ALL SELECT 'slice_user', COUNT(*), SUM(c) - COUNT(*)
    FROM (SELECT COUNT(*) c FROM slice_user GROUP BY user_id, slice_id HAVING COUNT(*) > 1) g
UNION ALL SELECT 'sqlatable_user', COUNT(*), SUM(c) - COUNT(*)
    FROM (SELECT COUNT(*) c FROM sqlatable_user GROUP BY user_id, table_id HAVING COUNT(*) > 1) g
ORDER BY rows_dropped DESC NULLS LAST;
```
```sql
-- External-FK pre-flight check.
-- The migration runs the equivalent check at upgrade time and aborts
-- if any external FK references one of the soon-to-be-removed `id`
-- columns. Running it ahead of time lets you discover (and migrate)
-- any such reference before the maintenance window. On a stock
-- Superset install this should return zero rows. (Default schema
-- only; multi-schema deployments need to broaden the lookup.)
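SELECT
    rc.constraint_name,
    kcu.table_schema || '.' || kcu.table_name AS referencing_table,
    kcu.column_name AS referencing_column,
    ccu.table_name AS referenced_table,
    ccu.column_name AS referenced_column
FROM information_schema.referential_constraints rc
JOIN information_schema.key_column_usage kcu
    ON kcu.constraint_name = rc.constraint_name
    AND kcu.constraint_schema = rc.constraint_schema
JOIN information_schema.constraint_column_usage ccu
    ON ccu.constraint_name = rc.constraint_name
    AND ccu.constraint_schema = rc.constraint_schema
-- Assumed completion (the original query is cut off at this point):
-- keep only FKs that reference the `id` column of an affected table.
WHERE ccu.column_name = 'id'
    AND ccu.table_name IN (
        'dashboard_roles', 'dashboard_slices', 'dashboard_user',
        'report_schedule_user', 'rls_filter_roles', 'rls_filter_tables',
        'slice_user', 'sqlatable_user'
    );
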
-- Lock-window estimate for the two full-rewrite tables.
-- recreate="always" takes ACCESS EXCLUSIVE on the table for the full
-- rewrite. Use heap size combined with your hardware's effective
-- write throughput (~100-200 MB/s on commodity SSD; faster on NVMe)
-- to size the maintenance window. The other six tables use direct
-- ALTER and are dominated by composite-index build time, typically
-- seconds for tables in the low millions of rows.
SELECT
    c.relname AS table_name,
    pg_size_pretty(pg_relation_size(c.oid)) AS heap_size,
    pg_relation_size(c.oid) / 1024 / 1024 AS heap_size_mb,
    ROUND(pg_relation_size(c.oid) / 1024 / 1024 / 100.0, 1) AS est_rewrite_seconds_at_100mbs
FROM pg_class c
WHERE c.relname IN ('dashboard_slices', 'report_schedule_user');
```
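The arithmetic behind the rewrite-time estimate is simple enough to redo for your own throughput figure. The sketch below mirrors the SQL expression (whole megabytes divided by MB/s, rounded to one decimal); `estimate_rewrite_seconds` is a hypothetical helper and the 5 GiB heap size is an invented example.

```python
def estimate_rewrite_seconds(heap_bytes: int, throughput_mb_per_s: float = 100.0) -> float:
    """Estimate the full-rewrite duration for a table of the given heap size."""
    # Whole megabytes, matching the integer division in the SQL above.
    heap_mb = heap_bytes // (1024 * 1024)
    return round(heap_mb / throughput_mb_per_s, 1)


five_gib = 5 * 1024**3
print(estimate_rewrite_seconds(five_gib))         # 51.2 (at 100 MB/s)
print(estimate_rewrite_seconds(five_gib, 200.0))  # 25.6 (at 200 MB/s)
```

Treat the result as a lower bound for the lock window: index rebuilds and WAL traffic add to it.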
**Sizing the maintenance window on MySQL.** Equivalent diagnostic queries for MySQL/InnoDB. One important difference from PostgreSQL: InnoDB rebuilds the clustered index on every PK change, so *all eight* tables undergo a full table rebuild on MySQL — not just the two that go through the explicit `recreate="always"` path. The lock-window estimate query below therefore covers all eight tables.
```sql
-- Per-table size, row count, and which migration path each will take.
-- TABLE_ROWS is an InnoDB estimate (analogous to PostgreSQL's reltuples);
-- run SELECT COUNT(*) per table for an exact count if needed.
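SELECT
    TABLE_NAME AS table_name,
    CASE WHEN TABLE_NAME IN ('dashboard_slices', 'report_schedule_user')
         THEN 'recreate (explicit, drops UNIQUE)'
         ELSE 'direct ALTER (still rebuilds InnoDB clustered index)'
    END AS migration_path,
    -- Assumed completion (the original query is cut off after the CASE).
    TABLE_ROWS AS estimated_rows,
    DATA_LENGTH + INDEX_LENGTH AS total_bytes
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME IN ('dashboard_roles', 'dashboard_slices', 'dashboard_user', 'report_schedule_user',
                     'rls_filter_roles', 'rls_filter_tables', 'slice_user', 'sqlatable_user');
```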
**Restoring an old `pg_dump` (or equivalent) against the new schema.** A dump taken before the migration includes `INSERT` statements that populate the now-removed `id` column. Restoring such a dump against the post-migration schema will fail. The supported workaround is to dump only the schema and reference data, then re-create the M:N associations from application data after restore — for example with `pg_dump --exclude-table-data` (or per-table `--exclude-table-data=dashboard_slices` etc.) for the eight junction tables, restore the rest, then run a one-shot script that re-INSERTs `(fk1, fk2)` pairs derived from your application export. Operators who need to restore an old dump verbatim should restore against a pre-migration Superset and then re-run the upgrade.
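The per-table `--exclude-table-data` variant of that workaround can be scripted as below. The sketch only builds the `pg_dump` argument list for the eight junction tables; the database name and output path are placeholders, and `pg_dump_args` is a hypothetical helper.

```python
JUNCTION_TABLES = [
    "dashboard_roles",
    "dashboard_slices",
    "dashboard_user",
    "report_schedule_user",
    "rls_filter_roles",
    "rls_filter_tables",
    "slice_user",
    "sqlatable_user",
]


def pg_dump_args(database: str, output_path: str) -> list[str]:
    """Build a pg_dump command that keeps schema but skips junction-table rows."""
    args = ["pg_dump", "--format=custom", f"--file={output_path}"]
    # Table definitions are still dumped; only the row data is excluded.
    args += [f"--exclude-table-data={table}" for table in JUNCTION_TABLES]
    args.append(database)
    return args


print(" ".join(pg_dump_args("superset", "superset.dump")))  # placeholder names
```

Pass the list to `subprocess.run(args, check=True)` to execute it; connection options such as host and user are omitted here.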
**Intentional downgrade asymmetry.** The migration's `downgrade()` restores the surrogate `id` column and (for `dashboard_slices` and `report_schedule_user`) the original `UNIQUE (fk1, fk2)` constraint, but it does **not** restore the original `NULL`-allowed state on the FK columns — they remain `NOT NULL`. This is intentional: under SQLAlchemy's `secondary=` semantics, a `NULL` in either FK column of a junction table is meaningless (it cannot participate in either side of the relationship). Operators downgrading are not expected to need this restored. The asymmetry is documented for completeness so that round-trip schema diffs are not mistaken for migration bugs.
**Constraint-name divergence between upgrade and downgrade.** The composite primary key created on upgrade is named `pk_<table>` (Alembic's default for `op.create_primary_key("pk_<table>", ...)`), while the surrogate `id` primary key restored on downgrade is named `<table>_pkey` (PostgreSQL's default convention for `PrimaryKeyConstraint("id")`). The two names alternate so that a round-trip (upgrade → downgrade → upgrade) does not collide on a pre-existing constraint name. Operators using schema-comparison tools (e.g. `pg_diff`, `migra`) against a downgraded database may see this as drift versus a fresh-install schema. It is cosmetic — no application code references either constraint name.
## 6.0.0
- [33055](https://github.com/apache/superset/pull/33055): Upgrades Flask-AppBuilder to 5.0.0. The AUTH_OID authentication type has been deprecated and is no longer available as an option in Flask-AppBuilder. OpenID (OID) is considered a deprecated authentication protocol - if you are using AUTH_OID, you will need to migrate to an alternative authentication method such as OAuth, LDAP, or database authentication before upgrading.
- [34871](https://github.com/apache/superset/pull/34871): Fixed Jest test hanging issue from Ant Design v5 upgrade. MessageChannel is now mocked in test environment to prevent rc-overflow from causing Jest to hang. Test environment only - no production impact.
@@ -237,14 +682,14 @@ Note: Pillow is now a required dependency (previously optional) to support image
- [33116](https://github.com/apache/superset/pull/33116) In ECharts Series charts (e.g. Line, Area, Bar, etc.), the `x_axis_sort_series` and `x_axis_sort_series_ascending` form data items have been renamed to `x_axis_sort` and `x_axis_sort_asc`.
There's a migration added that can potentially affect a significant number of existing charts.
- [32317](https://github.com/apache/superset/pull/32317) The horizontal filter bar feature is now out of testing/beta development and its feature flag `HORIZONTAL_FILTER_BAR` has been removed.
- [31590](https://github.com/apache/superset/pull/31590) Marks the beginning of intricate work around supporting dynamic Theming, and breaks support for [THEME_OVERRIDES](https://github.com/apache/superset/blob/732de4ac7fae88e29b7f123b6cbb2d7cd411b0e4/superset/config.py#L671) in favor of a new theming system based on AntD V5. Likely this will be in disrepair until settling over the 5.x lifecycle.
- [32432](https://github.com/apache/superset/pull/32432) Moves the List Roles FAB view to the frontend and requires `FAB_ADD_SECURITY_API` to be enabled in the configuration and `superset init` to be executed.
- [34319](https://github.com/apache/superset/pull/34319) Drill to Detail and Drill By are now supported in Embedded mode, and also with the `DASHBOARD_RBAC` FF. If you don't want to expose these features in Embedded / `DASHBOARD_RBAC`, make sure the roles used for Embedded / `DASHBOARD_RBAC` don't have the required permissions to perform D2D actions.
## 5.0.0
- [31976](https://github.com/apache/superset/pull/31976) Removed the `DISABLE_LEGACY_DATASOURCE_EDITOR` feature flag. The previous value of the feature flag was `True` and now the feature is permanently removed.
- [32000](https://github.com/apache/superset/pull/32000) Removes the `CSV_UPLOAD_MAX_SIZE` config; use your web server to control file upload size.
- [31959](https://github.com/apache/superset/pull/31959) Removes the following endpoints from data uploads: `/api/v1/database/<id>/<file type>_upload` and `/api/v1/database/<file type>_metadata`, in favour of a new one (details in the PR), and simplifies permissions.
- [31844](https://github.com/apache/superset/pull/31844) The `ALERT_REPORTS_EXECUTE_AS` and `THUMBNAILS_EXECUTE_AS` config parameters have been renamed to `ALERT_REPORTS_EXECUTORS` and `THUMBNAILS_EXECUTORS` respectively. A new config flag `CACHE_WARMUP_EXECUTORS` has also been introduced to be able to control which user is used to execute cache warmup tasks. Finally, the config flag `THUMBNAILS_SELENIUM_USER` has been removed. To use a fixed executor for async tasks, use the new `FixedExecutor` class. See the config and docs for more info on setting up different executor profiles.
- [31894](https://github.com/apache/superset/pull/31894) Domain sharding is deprecated in favor of HTTP2. The `SUPERSET_WEBSERVER_DOMAINS` configuration will be removed in the next major version (6.0)
@@ -252,7 +697,7 @@ Note: Pillow is now a required dependency (previously optional) to support image
- [31774](https://github.com/apache/superset/pull/31774): Fixes the spelling of the `USE-ANALAGOUS-COLORS` feature flag. Please update any scripts/configuration item to use the new/corrected `USE-ANALOGOUS-COLORS` flag spelling.
- [31582](https://github.com/apache/superset/pull/31582) Removed the legacy Area, Bar, Event Flow, Heatmap, Histogram, Line, Sankey, and Sankey Loop charts. They were all automatically migrated to their ECharts counterparts with the exception of the Event Flow and Sankey Loop charts which were removed as they were not actively maintained and not widely used. If you were using the Event Flow or Sankey Loop charts, you will need to find an alternative solution.
- [31198](https://github.com/apache/superset/pull/31198) Disallows by default the use of the following ClickHouse functions: "version", "currentDatabase", "hostName".
- [29798](https://github.com/apache/superset/pull/29798) Since 3.1.0, the initial schedule for an alert or report was mistakenly offset by the specified timezone's relation to UTC. The initial schedule should now begin at the correct time.
- [30021](https://github.com/apache/superset/pull/30021) The `dev` layer in our Dockerfile no longer includes Firefox binaries, only Chromium, to reduce bloat and Docker build time.
- [30099](https://github.com/apache/superset/pull/30099) Translations are no longer included in the default docker image builds. If your environment requires translations, you'll want to set the docker build arg `BUILD_TRANSLATIONS=true`.
- [31262](https://github.com/apache/superset/pull/31262) NOTE: deprecated `pylint` in favor of `ruff` as our only Python linter. This only affects development workflows (not the release itself). `ruff` should cover the most important rules and be much faster, but some linting rules that were enforced before may not be enforced in exactly the same way.
@@ -265,7 +710,7 @@ Note: Pillow is now a required dependency (previously optional) to support image
- [25166](https://github.com/apache/superset/pull/25166) Changed the default configuration of `UPLOAD_FOLDER` from `/app/static/uploads/` to `/static/uploads/`. It also removed the unused `IMG_UPLOAD_FOLDER` and `IMG_UPLOAD_URL` configuration options.
- [30284](https://github.com/apache/superset/pull/30284) Deprecated `GLOBAL_ASYNC_QUERIES_REDIS_CONFIG` in favor of the new `GLOBAL_ASYNC_QUERIES_CACHE_BACKEND` configuration. To leverage Redis Sentinel, set `CACHE_TYPE` to `RedisSentinelCache`, or use `RedisCache` for standalone Redis.
- [31961](https://github.com/apache/superset/pull/31961) Upgraded React from version 16.13.1 to 17.0.2. If you are using custom frontend extensions or plugins, you may need to update them to be compatible with React 17.
- [31260](https://github.com/apache/superset/pull/31260) Docker images now use `uv pip install` instead of `pip install` to manage the Python environment. Most Docker-based deployments will be affected, whether you derive from one of the published images or have a custom bootstrap script that installs Python libraries (drivers).
### Potential Downtime
@@ -342,7 +787,7 @@ Note: Pillow is now a required dependency (previously optional) to support image
- [26462](https://github.com/apache/superset/issues/26462): Removes the Profile feature given that it's not actively maintained and not widely used.
- [26377](https://github.com/apache/superset/pull/26377): Removes the deprecated Redirect API that supported short URLs used before the permalink feature.
- [26329](https://github.com/apache/superset/issues/26329): Removes the deprecated `DASHBOARD_NATIVE_FILTERS` feature flag. The previous value of the feature flag was `True` and now the feature is permanently enabled.
- [25510](https://github.com/apache/superset/pull/25510): Reinforces that any newly defined Python data format (other than epoch) must adhere to the ISO 8601 standard (enforced by way of validation at the API and database level) after a previous relaxation to include slashes in addition to dashes. From now on when specifying new columns, dataset owners will need to use a SQL expression instead to convert their string columns of the form %Y/%m/%d etc. to a `DATE`, `DATETIME`, etc. type.
- [26372](https://github.com/apache/superset/issues/26372): Removes the deprecated `GENERIC_CHART_AXES` feature flag. The previous value of the feature flag was `True` and now the feature is permanently enabled.
WEBDRIVER_BASEURL=f"http://superset_app{os.environ.get('SUPERSET_APP_ROOT','/')}/"# When using docker compose baseurl should be http://superset_nginx{ENV{BASEPATH}}/ # noqa: E501
Each section maintains its own version history and can be versioned independently.
@@ -36,23 +37,45 @@ Each section maintains its own version history and can be versioned independentl
To create a new version for any section, use the Docusaurus version command with the appropriate plugin ID or use our automated scripts:
#### Before You Cut
The cut snapshots whatever's on disk into a frozen historical version, including auto-generated content (database pages from `superset/db_engine_specs/`, API reference from `static/resources/openapi.json`, component pages from Storybook stories). The cut script refreshes these via `generate:smart` before snapshotting, but the **`databases.json` diagnostics file** needs special care to capture full detail:
1. **Canonical release cut**: download the `database-diagnostics` artifact from a green `Python-Integration` run on master, place it at `docs/src/data/databases.json`, then run the cut script with `--skip-generate` to preserve it. This is what the production deploy uses and includes full Flask-context diagnostics (driver versions, feature support matrix, etc.).
2. **Local dev cut**: just run the script normally. `generate:smart` will regenerate `databases.json` using your local Flask environment, which is accurate to whatever drivers/extras you have installed but typically less complete than the CI artifact.
3. **No Flask available**: also fine. The database generator falls back to AST parsing of engine spec files. The MDX pages are still correct; only the diagnostics JSON is leaner.
Also: confirm `master` CI is green, and that your local checkout matches the SHA you intend to cut from.
#### Using Automated Scripts (Required)
**⚠️ Important:** Always use these custom commands instead of the native Docusaurus commands. These scripts ensure that both the Docusaurus versioning system AND the `versions-config.json` file are updated correctly, AND that auto-generated content is refreshed before snapshotting.
```bash
# Main Documentation
yarn version:add:docs 1.2.0
yarn version:add:user_docs 1.2.0
# Developer Portal
yarn version:add:developer_portal 1.2.0
# Admin Docs
yarn version:add:admin_docs 1.2.0
# Component Playground (when enabled)
# Developer Docs
yarn version:add:developer_docs 1.2.0
# Component Playground
yarn version:add:components 1.2.0
```
What the script does:
1. Refreshes auto-generated content via `generate:smart` (database pages, API reference, component pages).
2. Calls `yarn docusaurus docs:version` (or the per-section equivalent) to snapshot the section.
3. Freezes any data-file imports (`@site/static/*.json`, `../../data/*.json`) into a snapshot-local `_versioned_data/` dir so the historical version doesn't silently mutate when the source files change.
4. Adjusts relative import paths (`../../src/...` → `../../../src/...`) for files now one directory deeper.
5. Updates `versions-config.json` and `<section>_versions.json`.
**Do NOT use** the native Docusaurus commands directly (`yarn docusaurus docs:version`), as they will:
- ❌ Create version files but NOT update `versions-config.json`
- ❌ Skip auto-gen refresh, freezing whatever was on disk
- ❌ Skip data-import freezing, leaving the snapshot pointed at live data
- ❌ Cause versions to not appear in dropdown menus
- ❌ Require manual fixes to synchronize the configuration
@@ -75,7 +98,7 @@ If creating versions manually, you'll need to:
Users can configure automated alerts and reports to send dashboards or charts to an email recipient or Slack channel.
- *Alerts* are sent when a SQL condition is reached
- *Reports* are sent on a schedule
Alerts and reports are disabled by default. To turn them on, you'll need to change configuration settings and install a suitable headless browser in your environment.
## Requirements
### Commons
#### In your `superset_config.py` or `superset_config_docker.py`
- `"ALERT_REPORTS"` [feature flag](/admin-docs/configuration/configuring-superset#feature-flags) must be set to `True`.
- `beat_schedule` in CeleryConfig must contain schedule for `reports.scheduler`.
- At least one of the following must be configured, depending on what you want to use:
- emails: `SMTP_*` settings
- Slack messages: `SLACK_API_TOKEN`
- Users can customize the email subject by including date code placeholders, which will automatically be replaced with the corresponding UTC date when the email is sent. To enable this functionality, activate the `"DATE_FORMAT_IN_EMAIL_SUBJECT"` [feature flag](/admin-docs/configuration/configuring-superset#feature-flags). This enables date formatting in email subjects, preventing all reporting emails from being grouped into the same thread (optional for the reporting feature).
- Use date codes from [strftime.org](https://strftime.org/) to create the email subject.
- If no date code is provided, the original string will be used as the email subject.
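The date-code substitution described above can be illustrated with Python's standard `strftime`. This is a minimal sketch, not Superset's implementation: the `render_subject` helper and the sample subject strings are hypothetical and exist only to show how a subject with and without date codes behaves.

```python
from datetime import datetime, timezone


def render_subject(subject: str, now: datetime) -> str:
    """Expand strftime date codes in an email subject using the UTC date."""
    # A subject without any date code is returned unchanged.
    if "%" not in subject:
        return subject
    # Convert to UTC first so the rendered date does not depend on local time.
    return now.astimezone(timezone.utc).strftime(subject)


sent_at = datetime(2024, 3, 5, 14, 30, tzinfo=timezone.utc)
print(render_subject("Weekly sales report %Y-%m-%d", sent_at))
print(render_subject("Weekly sales report", sent_at))
```

Because each send date produces a different subject, mail clients are less likely to group every report into one thread.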
##### Disable dry-run mode
Screenshots will be taken but no messages actually sent as long as `ALERT_REPORTS_NOTIFICATION_DRY_RUN = True`, its default value in `docker/pythonpath_dev/superset_config.py`. To disable dry-run mode and start receiving email/Slack notifications, set `ALERT_REPORTS_NOTIFICATION_DRY_RUN` to `False` in [superset config](https://github.com/apache/superset/blob/master/docker/pythonpath_dev/superset_config.py).
#### In your `Dockerfile`
You'll need to extend the Superset image to include a headless browser. Your options include:
- Use Playwright with Chromium: this is the recommended approach as of version 4.1.x or greater. Playwright always uses Chromium — the `WEBDRIVER_TYPE` config setting has no effect when Playwright is active. A working example of a Dockerfile that installs these tools is provided under "Building your own production Docker image" on the [Docker Builds](/admin-docs/installation/docker-builds#building-your-own-production-docker-image) page. Enable the `PLAYWRIGHT_REPORTS_AND_THUMBNAILS` feature flag in your config to activate it.
- Use Firefox (Selenium): you'll need to install geckodriver and Firefox. Set `WEBDRIVER_TYPE` to `"firefox"` in your `superset_config.py`.
- Use Chrome (Selenium): you'll need to install Chrome. Set `WEBDRIVER_TYPE` to `"chrome"` in your `superset_config.py`.
In Superset versions <=4.0.x, users installed Firefox or Chrome themselves, and that approach was documented here.
Only the worker container needs the browser.
### Slack integration
To send alerts and reports to Slack channels, you need to create a new Slack Application on your workspace.
1. Connect to your Slack workspace, then head to [https://api.slack.com/apps](https://api.slack.com/apps).
2. Create a new app.
3. Go to "OAuth & Permissions" section, and give the following scopes to your app:
- `incoming-webhook`
- `files:write`
- `chat:write`
- `channels:read`
- `groups:read`
4. At the top of the "OAuth and Permissions" section, click "install to workspace".
5. Select a default channel for your app and continue.
(You can post to any channel by inviting your Superset app into that channel).
6. The app should now be installed in your workspace, and a "Bot User OAuth Access Token" should have been created. Copy that token into the `SLACK_API_TOKEN` variable of your `superset_config.py`.
7. Ensure the feature flag `ALERT_REPORT_SLACK_V2` is set to `True` in `superset_config.py`.
8. Restart the service (or run `superset init`) to pull in the new configuration.
Note: when you configure an alert or a report, the Slack channel list takes channel names without the leading '#' e.g. use `alerts` instead of `#alerts`.
#### Large Slack Workspaces (10k+ channels)
For workspaces with many channels, fetching the complete channel list can take several minutes and may encounter Slack API rate limits. Add the following to your `superset_config.py`:
When a report includes file attachments (CSV, PDF, or PNG screenshots), the request is sent as `multipart/form-data` instead. In that case, each top-level payload field (`name`, `text`, `description`, `url`) becomes its own form field, and nested structures like `header` are serialized as a JSON-encoded string in their own field. Every attachment is added as a repeated form field named `files`:
Webhook consumers should branch on `Content-Type`: parse the body as JSON when `application/json`, or read the individual form fields (decoding `header` as JSON) when `multipart/form-data`.
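A consumer following this advice might look roughly like the sketch below. It assumes your web framework has already split a multipart request into text form fields (passed here as `form`) and that file parts under `files` are handled separately; the `parse_webhook` name and its signature are hypothetical.

```python
import json
from typing import Any, Mapping


def parse_webhook(
    content_type: str, body: bytes, form: Mapping[str, str]
) -> dict[str, Any]:
    """Return the report payload regardless of how it was encoded."""
    # Ignore parameters such as "; charset=utf-8" or "; boundary=...".
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "application/json":
        return json.loads(body)
    if media_type == "multipart/form-data":
        payload: dict[str, Any] = dict(form)
        # Nested structures arrive as a JSON-encoded string in their own field.
        if "header" in payload:
            payload["header"] = json.loads(payload["header"])
        return payload
    raise ValueError(f"Unsupported Content-Type: {content_type}")
```

Both branches return the same dictionary shape, so the rest of the consumer does not need to know whether attachments were present.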
#### HTTPS Enforcement
To require HTTPS webhook URLs (recommended for production), set:
```python
ALERT_REPORTS_WEBHOOK_HTTPS_ONLY = True
```
When enabled, Superset rejects webhook configurations that use `http://` URLs.
#### Retry Behavior
Superset automatically retries webhook deliveries on `429 Too Many Requests` and `5xx` server errors using exponential backoff.
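To make the retry rule concrete, here is a small sketch of which status codes qualify and what an exponential backoff schedule looks like. The base delay, growth factor, and cap are illustrative assumptions; the text above does not state the values Superset uses.

```python
def should_retry(status_code: int) -> bool:
    """Retry on 429 Too Many Requests and on any 5xx server error."""
    return status_code == 429 or 500 <= status_code <= 599


def backoff_delays(
    attempts: int, base: float = 1.0, factor: float = 2.0, cap: float = 60.0
) -> list[float]:
    """Delays in seconds before each retry; all defaults are assumed values."""
    return [min(cap, base * factor**n) for n in range(attempts)]


print(backoff_delays(4))  # [1.0, 2.0, 4.0, 8.0]
```

A webhook receiver that is temporarily overloaded can return `429` or `503` and expect the delivery to be attempted again, while a `4xx` such as `404` is not retried under this rule.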
### Kubernetes-specific
- You must have a `celery beat` pod running. If you're using the chart included in the GitHub repository under [helm/superset](https://github.com/apache/superset/tree/master/helm/superset), you need to put `supersetCeleryBeat.enabled = true` in your values override.
- You can see the dedicated docs about [Kubernetes installation](/admin-docs/installation/kubernetes) for more details.
### Docker Compose specific
#### You must have in your `docker-compose.yml`
- A Redis message broker
- PostgreSQL DB instead of SQLite
- One or more `celery worker`
- A single `celery beat`
This process also works in a Docker Swarm environment; you would just need to add `deploy:` to the Superset, Redis, and Postgres services along with your specific configs for your swarm.
### Detailed config
The following configurations need to be added to the `superset_config.py` file. This file is loaded when the image runs, and any configurations in it will override the default configurations found in the `config.py`.
You can find documentation about each field in the default `config.py` in the GitHub repository under [superset/config.py](https://github.com/apache/superset/blob/master/superset/config.py).
You need to replace default values with your custom Redis, Slack and/or SMTP config.
Superset uses Celery beat and Celery worker(s) to send alerts and reports.
- The beat is the scheduler that tells the worker when to perform its tasks. This schedule is defined when you create the alert or report.
- The worker will process the tasks that need to be performed when an alert or report is fired.
In the `CeleryConfig`, only the `beat_schedule` is relevant to this feature; the rest of the `CeleryConfig` can be changed to suit your needs.
Alternatively, you can assign a function to `ALERT_MINIMUM_INTERVAL` and/or `REPORT_MINIMUM_INTERVAL`. This is useful to dynamically retrieve a value as needed:
For security, Superset rewrites external links in alert/report email HTML so
they go through a warning page before the user is navigated to the external
site. Internal links (matching your configured base URL) are not affected.
```python
# Disable external link redirection entirely (default: True)
ALERT_REPORTS_ENABLE_LINK_REDIRECT = False
```
The feature uses `WEBDRIVER_BASEURL_USER_FRIENDLY` (or `WEBDRIVER_BASEURL`)
to determine which hosts are internal.
## Troubleshooting
There are many reasons that reports might not be working. Try these steps to check for specific issues.
### Confirm feature flag is enabled and you have sufficient permissions
If you don't see "Alerts & Reports" under the *Manage* section of the Settings dropdown in the Superset UI, you need to enable the `ALERT_REPORTS` feature flag (see above). Enable another feature flag and check to see that it took effect, to verify that your config file is getting loaded.
Log in as an admin user to ensure you have adequate permissions.
### Check the logs of your Celery worker
This is the best source of information about the problem. In a docker compose deployment, you can do this with a command like `docker logs superset_worker --since 1h`.
### Check web browser and webdriver installation
To take a screenshot, the worker visits the dashboard or chart using a headless browser, then takes a screenshot. If you are able to send a chart as CSV or text but can't send as PNG, your problem may lie with the browser.
If you are handling the installation of the headless browser on your own, do your own verification to ensure that the headless browser opens successfully in the worker environment.
### Send a test email
One symptom of an invalid connection to an email server is receiving an error of `[Errno 110] Connection timed out` in your logs when the report tries to send.
Confirm via testing that your outbound email configuration is correct. Here is the simplest test, for an unauthenticated SMTP email service running on port 25. If you are sending over SSL, for instance, study how [Superset's codebase sends emails](https://github.com/apache/superset/blob/master/superset/utils/core.py#L818) and then test with those commands and arguments.
Start Python in your worker environment, replace all example values, and run:
- Some cloud hosts disable outgoing unauthenticated SMTP email to prevent spam. For instance, [Azure blocks port 25 by default on some machines](https://learn.microsoft.com/en-us/azure/virtual-network/troubleshoot-outbound-smtp-connectivity). Enable that port or use another sending method.
- Use another set of SMTP credentials that you verify works in this setup.
### Browse to your report from the worker
The worker may be unable to reach the report. It will use the value of `WEBDRIVER_BASEURL` to browse to the report. If that route is invalid, or presents an authentication challenge that the worker can't pass, the report screenshot will fail.
Check this by attempting to `curl` the URL of a report that you see in the error logs of your worker. For instance, from the worker environment, run `curl http://superset_app:8088/superset/dashboard/1/`. You may get different responses depending on whether the dashboard exists - for example, you may need to change the `1` in that URL. If there's a URL in your logs from a failed report screenshot, that's a good place to start. The goal is to determine a valid value for `WEBDRIVER_BASEURL` and determine if an issue like HTTPS or authentication is redirecting your worker.
In a deployment with authentication measures enabled like HTTPS and Single Sign-On, it may make sense to have the worker navigate directly to the Superset application running in the same location, avoiding the need to sign in. For instance, you could use `WEBDRIVER_BASEURL="http://superset_app:8088"` for a docker compose deployment, and set `"force_https": False,` in your `TALISMAN_CONFIG`.
### Duplicate report deliveries
In some deployment configurations a scheduled report can be delivered more than once around its planned time. This typically happens when more than one process is responsible for running the alerts & reports schedule (for example, multiple schedulers or Celery beat instances). To avoid duplicate emails or notifications:
- Ensure that only a **single scheduler/beat process** is configured to trigger alerts and reports for a given environment.
- If you run **multiple Celery workers**, verify that there is still only one component responsible for scheduling the report tasks (workers should execute tasks, not schedule them independently).
- Review your deployment/orchestration setup (for example systemd, Docker, or Kubernetes) to make sure the alerts & reports scheduler is **not started from multiple places by accident**.
## Scheduling Queries as Reports
You can optionally allow your users to schedule queries directly in SQL Lab. This is done by adding
extra metadata to saved queries, which are then picked up by an external scheduler (like
[Apache Airflow](https://airflow.apache.org/)).
To allow scheduled queries, add the following to `SCHEDULED_QUERIES` in your configuration file:
```python
SCHEDULED_QUERIES = {
# This information is collected when the user clicks "Schedule query",
# and saved into the `extra` field of saved queries.
[react-jsonschema-form](https://github.com/mozilla-services/react-jsonschema-form) and will add a
menu item called “Schedule” to SQL Lab. When the menu item is clicked, a modal will show up where
the user can add the metadata required for scheduling the query.
This information can then be retrieved from the endpoint `/api/v1/saved_query/` and used to
schedule the queries that have `schedule_info` in their JSON metadata. For schedulers other than
Airflow, additional fields can be easily added to the configuration file above.
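On the scheduler side, selecting the queries to run could be sketched as follows. The field names `extra` and `schedule_info` come from the description above; the overall response shape (a list of saved-query dictionaries whose `extra` is either a JSON string or an already-decoded object) and the `queries_to_schedule` helper are assumptions made for illustration.

```python
import json
from typing import Any


def queries_to_schedule(saved_queries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the saved queries that carry scheduling metadata."""
    scheduled = []
    for query in saved_queries:
        extra = query.get("extra") or {}
        # Assumption: `extra` may arrive as a JSON-encoded string.
        if isinstance(extra, str):
            try:
                extra = json.loads(extra)
            except json.JSONDecodeError:
                continue
        if isinstance(extra, dict) and extra.get("schedule_info"):
            scheduled.append(query)
    return scheduled


sample = [
    {"id": 1, "extra": '{"schedule_info": {"output_table": "t"}}'},
    {"id": 2, "extra": "{}"},
    {"id": 3, "extra": None},
]
print([q["id"] for q in queries_to_schedule(sample)])  # [1]
```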
:::resources
- [Tutorial: Automated Alerts and Reporting via Slack/Email in Superset](https://dev.to/ngtduc693/apache-superset-topic-5-automated-alerts-and-reporting-via-slackemail-in-superset-2gbe)
- [Blog: Integrating Slack alerts and Apache Superset for better data observability](https://medium.com/affinityanswers-tech/integrating-slack-alerts-and-apache-superset-for-better-data-observability-fd2f9a12c350)
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
*/}
---
title: AWS IAM Authentication
sidebar_label: AWS IAM Authentication
sidebar_position: 15
---
# AWS IAM Authentication for AWS Databases
Superset supports IAM-based authentication for **Amazon Aurora** (PostgreSQL and MySQL) and **Amazon Redshift**. IAM auth eliminates the need for database passwords — Superset generates a short-lived auth token using temporary AWS credentials instead.
Cross-account IAM role assumption via STS `AssumeRole` is supported, allowing a Superset deployment in one AWS account to connect to databases in a different account.
## Prerequisites
- Enable the `AWS_DATABASE_IAM_AUTH` feature flag in `superset_config.py`. IAM authentication is gated behind this flag; if it is disabled, connections using `aws_iam` fail with *"AWS IAM database authentication is not enabled."*
```python
FEATURE_FLAGS = {
"AWS_DATABASE_IAM_AUTH": True,
}
```
- `boto3` must be installed in your Superset environment:
```bash
pip install boto3
```
- The Superset server's IAM role (or static credentials) must have permission to call `sts:AssumeRole` (for cross-account) or the same-account permissions for the target service:
- **Redshift Serverless**: `redshift-serverless:GetCredentials` and `redshift-serverless:GetWorkgroup`
- SSL must be enabled on the Aurora / Redshift endpoint (required for IAM token auth).
## Configuration
IAM authentication is configured via the **encrypted_extra** field of the database connection. Access this field in the **Advanced** → **Security** section of the database connection form, under **Secure Extra**.
**3. Configure the database connection in Superset** using the `role_arn` and `external_id` from the trust policy (as shown in the configuration example above).
## Credential Caching
STS credentials are cached in memory keyed by `(role_arn, region, external_id)` with a 10-minute TTL. This reduces the number of STS API calls when multiple queries are executed with the same connection. Tokens are refreshed automatically before expiry.
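The caching behavior can be pictured with a small in-memory TTL cache keyed by the same tuple. This is a sketch under stated assumptions, not Superset's code: the `CredentialCache` class, the `fetch` callback, and the injectable clock are hypothetical, and the sketch refreshes only once the TTL has elapsed rather than ahead of expiry.

```python
import time
from typing import Callable, Optional

CacheKey = tuple[str, str, Optional[str]]


class CredentialCache:
    """In-memory cache keyed by (role_arn, region, external_id)."""

    def __init__(
        self,
        ttl_seconds: float = 600.0,  # 10 minutes, as described above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[CacheKey, tuple[float, dict]] = {}

    def get_or_fetch(
        self,
        role_arn: str,
        region: str,
        external_id: Optional[str],
        fetch: Callable[[], dict],
    ) -> dict:
        key = (role_arn, region, external_id)
        now = self._clock()
        entry = self._entries.get(key)
        # Reuse the cached credentials while they are still within the TTL.
        if entry is not None and entry[0] > now:
            return entry[1]
        # Otherwise call STS (represented by `fetch`) and store the result.
        credentials = fetch()
        self._entries[key] = (now + self._ttl, credentials)
        return credentials
```

Keying on all three values means two connections that assume the same role in different regions, or with different external IDs, never share credentials.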
Caching can be configured by providing dictionaries in
`superset_config.py` that comply with [the Flask-Caching config specifications](https://flask-caching.readthedocs.io/en/latest/#configuring-flask-caching).
The following cache configurations can be customized in this way:
- Dashboard filter state (required): `FILTER_STATE_CACHE_CONFIG`.
- Explore chart form data (required): `EXPLORE_FORM_DATA_CACHE_CONFIG`
- Metadata cache (optional): `CACHE_CONFIG`
- Charting data queried from datasets (optional): `DATA_CACHE_CONFIG`
For example, to configure the filter state cache using Redis:
```python
FILTER_STATE_CACHE_CONFIG = {
'CACHE_TYPE': 'RedisCache',
'CACHE_DEFAULT_TIMEOUT': 86400,
'CACHE_KEY_PREFIX': 'superset_filter_cache',
'CACHE_REDIS_URL': 'redis://localhost:6379/0'
}
```
## Dependencies
To use dedicated cache stores, additional Python libraries must be installed:
- For Redis: we recommend the [redis](https://pypi.python.org/pypi/redis) Python package
- Memcached: we recommend using [pylibmc](https://pypi.org/project/pylibmc/) client library as
`python-memcached` does not handle storing binary data correctly.
These libraries can be installed using pip.
## Fallback Metastore Cache
Note that some form of Filter State and Explore caching is required. If either of these caches
is undefined, Superset falls back to using a built-in cache that stores data in the metadata
database. While it is recommended to use a dedicated cache, the built-in cache can also be used
to cache other data.
For example, to use the built-in cache to store chart data, use the following config:
```python
DATA_CACHE_CONFIG = {
"CACHE_TYPE": "SupersetMetastoreCache",
"CACHE_KEY_PREFIX": "superset_results", # make sure this string is unique to avoid collisions
The cache timeout for charts may be overridden by the settings for an individual chart, dataset, or
database. Each of these configurations will be checked in order before falling back to the default
value defined in `DATA_CACHE_CONFIG`.
Note that by setting the cache timeout to `-1`, caching for charting data can be disabled, either
per chart, dataset or database, or by default if set in `DATA_CACHE_CONFIG`.
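The lookup order described above can be sketched as a simple fallback chain. The function names are hypothetical, and treating `None` as "not set" is an assumption about how an unset timeout is represented.

```python
from typing import Optional


def resolve_cache_timeout(
    chart: Optional[int],
    dataset: Optional[int],
    database: Optional[int],
    default: int,
) -> int:
    """Return the first timeout that is set, checking chart, dataset, database."""
    for timeout in (chart, dataset, database):
        if timeout is not None:
            return timeout
    # Nothing more specific is set: use the value from DATA_CACHE_CONFIG.
    return default


def caching_enabled(timeout: int) -> bool:
    """A timeout of -1 disables caching for charting data."""
    return timeout != -1


print(resolve_cache_timeout(None, 3600, 600, 86400))  # 3600
```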
## SQL Lab Query Results
Caching for SQL Lab query results is used when async queries are enabled and is configured using
`RESULTS_BACKEND`.
Note that this configuration does not use a flask-caching dictionary for its configuration, but
instead requires a cachelib object.
See [Async Queries via Celery](/admin-docs/configuration/async-queries-celery) for details.
## Caching Thumbnails
This is an optional feature that can be turned on by activating its [feature flag](/admin-docs/configuration/configuring-superset#feature-flags) on config:
```python
FEATURE_FLAGS = {
"THUMBNAILS": True,
"THUMBNAILS_SQLA_LISTENERS": True,
}
```
By default thumbnails are rendered per user, and will fall back to the Selenium user for anonymous users.
To always render thumbnails as a fixed user (`admin` in this example), use the following configuration:
```python
from superset.tasks.types import FixedExecutor
THUMBNAIL_EXECUTORS = [FixedExecutor("admin")]
```
For this feature you will need a cache system and celery workers. All thumbnails are stored on cache
and are processed asynchronously by the workers.
An example config where images are stored on S3 could be:
```python
from flask import Flask
from s3cache.s3cache import S3Cache
...
class CeleryConfig(object):
broker_url = "redis://localhost:6379/0"
imports = (
"superset.sql_lab",
"superset.tasks.thumbnails",
)
result_backend = "redis://localhost:6379/0"
worker_prefetch_multiplier = 10
task_acks_late = True
CELERY_CONFIG = CeleryConfig
def init_thumbnail_cache(app: Flask) -> S3Cache:
return S3Cache("bucket_name", 'thumbs_cache/')
THUMBNAIL_CACHE_CONFIG = init_thumbnail_cache
```
Using the above example cache keys for dashboards will be `superset_thumb__dashboard__{ID}`. You can
Thumbnail and screenshot endpoints return `ETag` response headers based on the cached content digest. Clients can use conditional requests to avoid downloading unchanged images:
```
GET /api/v1/chart/42/thumbnail/
If-None-Match: "abc123..."
→ 304 Not Modified (if unchanged)
→ 200 OK (with new image if changed)
```
This is particularly useful for embedded dashboards and external integrations that periodically poll for updated screenshots — unchanged thumbnails return immediately with no payload.
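The conditional-request exchange can be modeled with a few lines of Python. This sketch only illustrates the general `ETag` / `If-None-Match` pattern: using SHA-256 as the content digest and the helper names are assumptions, since the text above does not say how Superset computes the digest.

```python
import hashlib
from typing import Optional


def make_etag(image: bytes) -> str:
    """Build a quoted ETag from a digest of the cached image (SHA-256 assumed)."""
    return '"' + hashlib.sha256(image).hexdigest() + '"'


def conditional_response(
    image: bytes, if_none_match: Optional[str]
) -> tuple[int, bytes]:
    """Return (status, body): 304 with no payload when the client copy is current."""
    if if_none_match is not None and if_none_match.strip() == make_etag(image):
        return 304, b""
    return 200, image


cached = b"png-bytes"
etag = make_etag(cached)
print(conditional_response(cached, etag)[0])  # 304
print(conditional_response(cached, None)[0])  # 200
```

A polling client stores the `ETag` from its last `200` response and sends it back in `If-None-Match` on the next request.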
## Distributed Coordination Backend
Superset supports an optional distributed coordination backend (`DISTRIBUTED_COORDINATION_CONFIG`) for
high-performance distributed operations. This configuration enables:
- **Distributed locking**: Moves lock operations from the metadata database to Redis, improving
performance and reducing metastore load
- **Real-time event notifications**: Enables instant pub/sub messaging for task abort signals and
completion notifications instead of polling-based approaches
:::note
This requires Redis or Valkey specifically—it uses Redis-specific features (pub/sub, `SET NX EX`)
that are not available in general Flask-Caching backends.
:::
### Configuration
The distributed coordination backend uses Flask-Caching style configuration for consistency with other cache
backends. Configure `DISTRIBUTED_COORDINATION_CONFIG` in `superset_config.py`:
Individual lock acquisitions can override this value when needed.
### Database-Only Mode
When `DISTRIBUTED_COORDINATION_CONFIG` is not configured, Superset uses database-backed operations:
- **Locking**: Uses the KeyValue table with periodic cleanup of expired entries
- **Event notifications**: Uses database polling instead of pub/sub
While database-backed operations work reliably, the Redis backend is recommended for production
deployments where low latency and reduced database load are important.
:::resources
- [Blog: The Data Engineer's Guide to Lightning-Fast Superset Dashboards](https://preset.io/blog/the-data-engineers-guide-to-lightning-fast-apache-superset-dashboards/)
- [Blog: Accelerating Dashboards with Materialized Views](https://preset.io/blog/accelerating-apache-superset-dashboards-with-materialized-views/)
At the very least, you'll want to change `SECRET_KEY` and `SQLALCHEMY_DATABASE_URI`. Continue reading for more about each of these.
## Specifying a SECRET_KEY
### Adding an initial SECRET_KEY
Superset requires a user-specified SECRET_KEY to start up. This requirement was [added in version 2.1.0 to force secure configurations](https://preset.io/blog/superset-security-update-default-secret_key-vulnerability/). Add a strong SECRET_KEY to your `superset_config.py` file like:
Properly setting up a metadata store is beyond the scope of this documentation. We recommend
using a hosted managed service such as [Amazon RDS](https://aws.amazon.com/rds/) or
[Google Cloud Databases](https://cloud.google.com/products/databases?hl=en) to handle
service and supporting infrastructure and backup strategy.
:::
To configure the Superset metastore, set the `SQLALCHEMY_DATABASE_URI` config key in `superset_config`
to the appropriate connection string.
## Running on a WSGI HTTP Server
While you can run Superset on NGINX or Apache, we recommend using Gunicorn in async mode. This
enables impressive concurrency and is fairly easy to install and configure. Please refer to the
documentation of your preferred technology to set up this Flask WSGI application in a way that works
well in your environment. Here’s an async setup known to work well in production:
```
gunicorn \
-w 10 \
-k gevent \
--worker-connections 1000 \
--timeout 120 \
-b 0.0.0.0:6666 \
--limit-request-line 0 \
--limit-request-field_size 0 \
--statsd-host localhost:8125 \
"superset.app:create_app()"
```
Refer to the [Gunicorn documentation](https://docs.gunicorn.org/en/stable/design.html) for more
information. _Note that the development web server (`superset run` or `flask run`) is not intended
for production use._
If you're not using Gunicorn, you may want to disable the use of `flask-compress` by setting
`COMPRESS_REGISTER = False` in your `superset_config.py`.
Currently, the Google BigQuery Python SDK is not compatible with `gevent`, because `gevent` dynamically monkeypatches the Python core library.
So, when you use a `BigQuery` data source in Superset, you have to use a `gunicorn` worker type other than `gevent`.
## HTTPS Configuration
You can configure HTTPS upstream via a load balancer or a reverse proxy (such as nginx) and do SSL/TLS Offloading before traffic reaches the Superset application. In this setup, local traffic from a Celery worker taking a snapshot of a chart for Alerts & Reports can access Superset at a `http://` URL, from behind the ingress point.
You can also configure [SSL in Gunicorn](https://docs.gunicorn.org/en/stable/settings.html#ssl) (the Python webserver) if you are using an official Superset Docker image.
## Configuration Behind a Load Balancer
If you are running Superset behind a load balancer or reverse proxy (e.g. NGINX or ELB on AWS), you
may need to use a healthcheck endpoint so that your load balancer knows whether your Superset
instance is running. This is provided at `/health`, which returns a 200 response containing “OK”
if the webserver is running.
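As a rough illustration, a probe against this endpoint could look like the sketch below. The helper name `is_superset_healthy`, the example base URL, and the injectable `opener` parameter are assumptions made for this example and are not part of Superset.

```python
import urllib.error
import urllib.request


def is_superset_healthy(base_url, opener=urllib.request.urlopen, timeout=5):
    """Return True if GET <base_url>/health answers 200 with a body containing "OK".

    `opener` is injectable so the check can be exercised without a live server.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and "OK" in body
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status
        return False
```

A load balancer performs the equivalent check natively; a script like this is mainly useful for smoke tests after a deployment, e.g. `is_superset_healthy("http://localhost:8088")`.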
If the load balancer is inserting `X-Forwarded-For/X-Forwarded-Proto` headers, you should set
`ENABLE_PROXY_FIX = True` in the superset config file (`superset_config.py`) to extract and use the
headers.
If the reverse proxy provides SSL encryption, you may need to set the
`X-Forwarded-Proto` header explicitly. For the Apache webserver this can be set as follows:
```
RequestHeader set X-Forwarded-Proto "https"
```
## Configuring the application root
*Please be advised that this feature is in BETA.*
Superset supports running the application under a non-root path. The root path
prefix can be specified in one of three ways:
- Customizing the [Flask entrypoint](https://github.com/apache/superset/blob/master/superset/app.py#L29)
by passing the `superset_app_root` variable; or
- Setting the `SUPERSET_APP_ROOT` environment variable to the desired prefix; or
- Setting the `APPLICATION_ROOT` config in your `superset_config.py` file.
Note, the prefix should start with a `/`.
### Customizing the Flask entrypoint
To configure a prefix, e.g `/analytics`, pass the `superset_app_root` argument to
`create_app` when calling flask run either through the `FLASK_APP`
# Will allow user self registration, allowing to create Flask users from Authorized User
AUTH_USER_REGISTRATION = True
# The default user self registration role
AUTH_USER_REGISTRATION_ROLE = "Public"
```
In case you want to assign the `Admin` role on new user registration, it can be assigned as follows:
```python
AUTH_USER_REGISTRATION_ROLE = "Admin"
```
If you encounter the [issue](https://github.com/apache/superset/issues/13243) of not being able to list users from the Superset main page settings, although a newly registered user has an `Admin` role, please re-run `superset init` to sync the required permissions. Below is the command to re-run `superset init` using docker compose.
```
docker-compose exec superset superset init
```
Then, create a `CustomSsoSecurityManager` that extends `SupersetSecurityManager` and overrides
`oauth_user_info`:
```python
import logging
from superset.security import SupersetSecurityManager
class CustomSsoSecurityManager(SupersetSecurityManager):
For public OAuth2 clients that cannot securely store a client secret, enable Proof Key for Code Exchange (PKCE) by adding `code_challenge_method` to the `remote_app` configuration:
```python
OAUTH_PROVIDERS = [
{
'name': 'myProvider',
'remote_app': {
'client_id': 'myClientId',
'client_secret': 'mySecret', # may be empty for pure public clients
PKCE (`S256`) is recommended for all OAuth2 flows, even when a client secret is present, as it protects against authorization code interception attacks.
## LDAP Authentication
FAB supports authenticating user credentials against an LDAP server.
To use LDAP you must install the [python-ldap](https://www.python-ldap.org/en/latest/installing.html) package.
See [FAB's LDAP documentation](https://flask-appbuilder.readthedocs.io/en/latest/security.html#authentication-ldap)
for details.
## Mapping LDAP or OAUTH groups to Superset roles
`AUTH_ROLES_MAPPING` in Flask-AppBuilder is a dictionary that maps LDAP/OAUTH group names to FAB roles.
It is used to assign roles to users who authenticate using LDAP or OAuth.
### Mapping OAUTH groups to Superset roles
The following `AUTH_ROLES_MAPPING` dictionary would map the OAUTH group "superset_users" to the Superset roles "Gamma" as well as "Alpha", and the OAUTH group "superset_admins" to the Superset role "Admin".
```python
AUTH_ROLES_MAPPING = {
"superset_users": ["Gamma","Alpha"],
"superset_admins": ["Admin"],
}
```
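To make the effect of the mapping concrete, the sketch below resolves a list of group names to role names. It is an illustration only and not Flask-AppBuilder's implementation; the helper name `roles_for_groups` is hypothetical.

```python
AUTH_ROLES_MAPPING = {
    "superset_users": ["Gamma", "Alpha"],
    "superset_admins": ["Admin"],
}


def roles_for_groups(groups, mapping=AUTH_ROLES_MAPPING):
    """Collect the role names mapped from a user's groups; unmapped groups are ignored."""
    roles = set()
    for group in groups:
        roles.update(mapping.get(group, []))
    return sorted(roles)
```

Under this sketch, a user in both groups ends up with the union of the mapped roles.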
### Mapping LDAP groups to Superset roles
The following `AUTH_ROLES_MAPPING` dictionary would map the LDAP DN "cn=superset_users,ou=groups,dc=example,dc=com" to the Superset roles "Gamma" as well as "Alpha", and the LDAP DN "cn=superset_admins,ou=groups,dc=example,dc=com" to the Superset role "Admin".
Note: This requires `AUTH_LDAP_SEARCH` to be set. For more details, please see the [FAB Security documentation](https://flask-appbuilder.readthedocs.io/en/latest/security.html).
### Syncing roles at login
You can also use the `AUTH_ROLES_SYNC_AT_LOGIN` configuration variable to control how often Flask-AppBuilder syncs the user's roles with the LDAP/OAUTH groups. If `AUTH_ROLES_SYNC_AT_LOGIN` is set to True, Flask-AppBuilder will sync the user's roles each time they log in. If `AUTH_ROLES_SYNC_AT_LOGIN` is set to False, Flask-AppBuilder will only sync the user's roles when they first register.
## Flask app Configuration Hook
`FLASK_APP_MUTATOR` is a configuration function that you can provide in your environment. It receives
the app object and can alter it in any way. For example, add `FLASK_APP_MUTATOR` to your
`superset_config.py` to set the session cookie expiration time to 24 hours:
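A minimal sketch of such a mutator is shown here. It relies on Flask's standard `before_request` hook and `session.permanent` flag; the helper name `make_session_permanent` is an illustrative choice.

```python
from datetime import timedelta

# Lifetime applied to sessions that are marked permanent
PERMANENT_SESSION_LIFETIME = timedelta(hours=24)


def make_session_permanent():
    """Mark the current session as permanent so the lifetime above applies."""
    from flask import session  # imported lazily; Flask is a Superset dependency

    session.permanent = True


def FLASK_APP_MUTATOR(app):
    """Register the hook so it runs before every request."""
    app.before_request(make_session_permanent)
```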
To support a diverse set of users, Superset has some features that are not enabled by default. For
example, some users have stronger security restrictions than others, so Superset lets
administrators enable or disable certain features through configuration. Feature owners can add optional
functionality to Superset that affects only the subset of users who enable it.
You can enable or disable features with flags in `superset_config.py`:
```python
FEATURE_FLAGS = {
'PRESTO_EXPAND_DATA': False,
}
```
A current list of feature flags can be found in the [Feature Flags](/admin-docs/configuration/feature-flags) documentation.
## Security Configuration
### HASH_ALGORITHM
Controls the hashing algorithm used for internal checksums and cache keys (thumbnails, cache keys, etc.). The default is `sha256`, which satisfies environments with stricter compliance requirements (e.g., FedRAMP). Set it to `md5` to retain the legacy behavior from older Superset deployments:
```python
HASH_ALGORITHM = "sha256" # default; set to "md5" for legacy behavior
```
A companion `HASH_ALGORITHM_FALLBACKS` list (default: `["md5"]`) lets UUID lookups fall back to older algorithms, which enables gradual migration without breaking existing entries. Set it to `[]` for strict mode (use only `HASH_ALGORITHM`).
:::note
This setting affects internal Superset operations only, not user passwords or authentication tokens. Changing it in an existing deployment may invalidate cached values but does not require a database migration.
:::
## SQL Lab Query History Pruning
SQL Lab query history is stored in the metadata database and is **not** pruned by default. To trim older rows, enable the `prune_query` Celery beat task by uncommenting it in `CELERY_BEAT_SCHEDULE` and choosing a retention window:
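A sketch of what such an entry could look like follows. The task name and the `retention_period_days` keyword come from this page; the daily cadence and the 180-day window are assumed example values, so compare against the commented-out entry in `superset/config.py` before using it.

```python
from datetime import timedelta

# Example only: add the "prune_query" entry to your existing schedule
# instead of replacing entries that are already defined there.
CELERY_BEAT_SCHEDULE = {
    "prune_query": {
        "task": "prune_query",
        # Assumed cadence: once a day (Celery beat also accepts crontab schedules)
        "schedule": timedelta(days=1),
        # Assumed retention window: keep 180 days of query rows
        "kwargs": {"retention_period_days": 180},
    },
}
```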
Adjust `retention_period_days` to control how long query rows are kept. Companion opt-in tasks (`prune_logs`, `prune_tasks`) exist for pruning the logs and tasks tables; see the commented-out examples in `superset/config.py`. Without enabling these tasks, the metadata database will grow unbounded over time.
:::resources
- [Blog: Feature Flags in Apache Superset](https://preset.io/blog/feature-flags-in-apache-superset-and-preset/)
The Superset CLI allows you to import and export datasources from and to YAML. Datasources include
databases. The data is expected to be organized in the following hierarchy:
:::info
Superset's ZIP-based import/export also covers **dashboards**, **charts**, and **saved queries**, exercised through the UI and REST API. The [Dashboard Import Overwrite Behavior](#dashboard-import-overwrite-behavior) and [UUIDs in API Responses](#uuids-in-api-responses) sections below document the behavior shared across all asset types.
:::
```text
├──databases
| ├──database_1
| | ├──table_1
| | | ├──columns
| | | | ├──column_1
| | | | ├──column_2
| | | | └──... (more columns)
| | | └──metrics
| | | ├──metric_1
| | | ├──metric_2
| | | └──... (more metrics)
| | └── ... (more tables)
| └── ... (more databases)
```
:::note
When you export a database connection, the `masked_encrypted_extra` field (used for sensitive connection parameters such as service account JSON, OAuth tokens, and other encrypted credentials) is included in the export. When importing on another instance, these values are decrypted and re-encrypted using the destination instance's `SECRET_KEY`. Ensure the receiving instance has a valid `SECRET_KEY` configured before importing.
:::
## Exporting Datasources to YAML
You can print your current datasources to stdout by running:
```bash
superset export_datasources
```
To save your datasources to a ZIP file run:
```bash
superset export_datasources -f <filename>
```
By default, default (null) values are omitted. Use the `-d` flag to include them. If you want back
references to be included (e.g. a column including the ID of the table it belongs to), use the `-b` flag.
Alternatively, you can export datasources using the UI:
1. Open **Sources -> Databases** to export all tables associated with one or more databases
   (or **Tables** to export one or more tables).
2. Select the items you would like to export.
3. Click **Actions -> Export** to export the selection to YAML.
4. If you want to import an item that you exported through the UI, you will need to nest it inside
   its parent element, e.g. a database needs to be nested under `databases`, and a table needs to be nested
   inside a database element.
In order to obtain an **exhaustive list of all fields** you can import using the YAML import run:
```bash
superset export_datasource_schema
```
As a reminder, you can use the `-b` flag to include back references.
## Importing Datasources
In order to import datasources from a ZIP file, run:
```bash
superset import_datasources -p <path / filename>
```
The optional username flag **-u** sets the user used for the datasource import. The default is 'admin'. Example:
When importing a dashboard ZIP with the **overwrite** option enabled, any existing charts that are part of the dashboard are **replaced** rather than duplicated. This applies to:
- Charts whose UUID matches a chart already present in the target instance
- The full chart configuration (query, visualization type, columns, metrics) is replaced by the imported version
If you import without the overwrite flag, existing charts with conflicting UUIDs are left unchanged and the import skips those objects. Use overwrite when you want to push a fully updated dashboard (including chart definitions) from a development or staging environment to production.
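The difference between the two modes can be sketched with plain dictionaries keyed by UUID. This is an illustration of the behavior described above, not Superset's import code; the function name `import_charts` and the dictionary shape are assumptions.

```python
def import_charts(existing, incoming, overwrite=False):
    """Merge imported charts into existing ones, keyed by UUID.

    Both arguments map chart UUID -> chart configuration (a dict).
    Returns a new dict; the inputs are not modified.
    """
    result = dict(existing)
    for uuid, config in incoming.items():
        if uuid in result and not overwrite:
            continue  # conflicting UUID without overwrite: keep the existing chart
        result[uuid] = config  # new chart, or full replacement when overwriting
    return result
```

Note that the replacement is wholesale: with `overwrite=True` the imported configuration takes the place of the existing one rather than being merged field by field.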
## UUIDs in API Responses
The REST API POST endpoints for **datasets**, **charts**, and **dashboards** include the auto-generated `uuid` field in the response body:
```json
{
"id": 42,
"uuid": "b8a8d5c3-1234-4abc-8def-0123456789ab",
...
}
```
UUIDs remain stable across import/export cycles and can be used for cross-environment workflows — for example, recording a UUID when creating a chart in development and using it to identify the matching chart after importing into production.
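As a sketch of that workflow, the snippet below matches a chart in a second environment by the UUID recorded at creation time. The sample payloads and the helper name `find_by_uuid` are made up for illustration; only the `id` and `uuid` fields come from the response shape shown above.

```python
def find_by_uuid(assets, uuid):
    """Return the first asset dict whose "uuid" matches, or None."""
    return next((asset for asset in assets if asset.get("uuid") == uuid), None)


# Recorded from the POST response in development (sample data)
dev_response = {"id": 42, "uuid": "b8a8d5c3-1234-4abc-8def-0123456789ab"}

# Charts listed from production after the import (sample data)
prod_charts = [
    {"id": 7, "uuid": "11111111-2222-4333-8444-555555555555"},
    {"id": 913, "uuid": "b8a8d5c3-1234-4abc-8def-0123456789ab"},
]

match = find_by_uuid(prod_charts, dev_response["uuid"])
```

The numeric `id` differs between environments, which is why the UUID is the safer key for cross-environment references.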
## Legacy Importing Datasources
### From older versions of Superset to current version
When using Superset 4.x.x to import datasources exported from an older version (2.x.x or 3.x.x), use the `legacy_import_datasources` command, which expects a JSON file or a directory of JSON files. The options are `-r` for recursive and `-u` for specifying a user. Example of a legacy import without options:
```bash
superset legacy_import_datasources -p <path or filename>
```
### From older versions of Superset to older versions
When using an older version of Superset (2.x.x or 3.x.x), the command is `import_datasources`. Both ZIP and YAML files are supported, and the `VERSIONED_EXPORT` feature flag switches between them: when `VERSIONED_EXPORT` is `True`, `import_datasources` expects a ZIP file, otherwise YAML. Example:
```bash
superset import_datasources -p <path or filename>
```
When `VERSIONED_EXPORT` is `False`, if you supply a path all files ending with **yaml** or **yml** will be parsed. You can apply
additional flags (e.g. to search the supplied path recursively):
```bash
superset import_datasources -p <path> -r
```
The sync flag **-s** takes parameters in order to sync the supplied elements with your file. Be
careful: this can delete the contents of your metadata database. Example:
# MCP Server Deployment & Authentication
Superset includes a built-in [Model Context Protocol (MCP)](https://modelcontextprotocol.io/) server that lets AI assistants -- Claude, ChatGPT, and other MCP-compatible clients -- interact with your Superset instance. Through MCP, clients can list dashboards, query datasets, execute SQL, create charts, and more.
This guide covers how to run, secure, and deploy the MCP server.
:::tip Looking for user docs?
See **[Using AI with Superset](/user-docs/using-superset/using-ai-with-superset)** for a guide on what AI can do with Superset and how to connect your AI client.
Both containers share the same `superset_config.py`, so authentication settings, database connections, and feature flags stay in sync.
### Multi-Pod (Kubernetes)
For high-availability deployments, configure Redis so that replicas share session state:
```mermaid
flowchart TD
LB["Load Balancer"] --> M1["MCP Pod 1"]
LB --> M2["MCP Pod 2"]
LB --> M3["MCP Pod 3"]
M1 --> R[("Redis<br/>(session store)")]
M2 --> R
M3 --> R
M1 --> DB[("Postgres")]
M2 --> DB
M3 --> DB
```
**superset_config.py:**
```python
MCP_STORE_CONFIG = {
"enabled": True,
"CACHE_REDIS_URL": "redis://redis-host:6379/0",
"event_store_max_events": 100,
"event_store_ttl": 3600,
}
```
When `CACHE_REDIS_URL` is set, the MCP server uses a Redis-backed EventStore for session management, allowing replicas to share state. Without Redis, each pod manages its own in-memory sessions and stateful MCP interactions may fail when requests hit different replicas.
---
## Configuration Reference
All MCP settings go in `superset_config.py`. Defaults are defined in `superset/mcp_service/mcp_config.py`.
### Core
| Setting | Default | Description |
|---------|---------|-------------|
| `MCP_SERVICE_HOST` | `"localhost"` | Host the MCP server binds to |
| `MCP_SERVICE_PORT` | `5008` | Port the MCP server binds to |
| `MCP_SERVICE_URL` | `None` | Public base URL for MCP-generated links (set this when behind a reverse proxy) |
| `MCP_DEBUG` | `False` | Enable debug logging |
| `MCP_DEV_USERNAME` | -- | Superset username for development mode (no auth) |
| `MCP_RBAC_ENABLED` | `True` | Enforce Superset's role-based access control on MCP tool calls. When `True`, each tool checks that the authenticated user has the required FAB permission before executing. Disable only for testing or trusted-network deployments. |
| `MCP_DISABLED_TOOLS` | `set()` | Set of tool names to remove from the MCP server at startup. Disabled tools are never advertised to AI clients during tool discovery. Useful when a custom extension tool should replace a built-in Superset tool. See [Disabling built-in tools](#disabling-built-in-tools). |
| `MCP_USER_RESOLVER` | `None` | Custom function `(app, access_token) -> username` to extract a Superset username from a validated JWT token. When `None`, the default resolver checks `preferred_username`, `username`, `email`, and `sub` claims in that order. |
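
A custom resolver might look like the sketch below. It assumes the validated token exposes its JWT claims as a dict-like `claims` attribute and that your identity provider issues a `upn` claim; both are assumptions for illustration, so adapt them to the token object your deployment actually passes in.

```python
def resolve_username_from_claims(app, access_token):
    """Hypothetical resolver: prefer a `upn` claim, then fall back to `email`.

    Returns None when neither claim is present; decide how your deployment
    should treat that case.
    """
    claims = getattr(access_token, "claims", None) or {}
    for key in ("upn", "email"):
        value = claims.get(key)
        if value:
            return value
    return None


# superset_config.py
MCP_USER_RESOLVER = resolve_username_from_claims
```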
### Response Size Guard
Limits response sizes to prevent exceeding LLM context windows:
| `event_store_max_events` | `100` | Maximum events retained per session |
| `event_store_ttl` | `3600` | Event TTL in seconds |
### Tool Search
By default the MCP server exposes a lightweight tool-search interface instead of advertising every tool at once. This reduces the initial context sent to the LLM by ~70%, which lowers cost and latency. The AI client discovers tools on demand by calling `search_tools` and then invokes them via `call_tool`.
```python
MCP_TOOL_SEARCH_CONFIG = {
"enabled": True,
"strategy": "bm25", # "bm25" (natural language) or "regex"
| `max_results` | `5` | Maximum tools returned per search query |
| `always_visible` | See above | Tools that always appear in `list_tools`, regardless of search |
| `include_schemas` | `False` | When `False` (default, "summary mode"), search results omit `inputSchema` entirely and include a lightweight `parameters_hint` listing top-level parameter names. Set to `True` to include the full `inputSchema` in search results. Full schemas are always used when a tool is actually invoked via `call_tool`. |
| `compact_schemas` | `True` | Strip `$defs` / `$ref` and replace with `{"type": "object"}` in search results to reduce token cost. Only takes effect when `include_schemas=True` — ignored in summary mode. |
| `max_description_length` | `300` | Truncate tool descriptions in search results (0 = no truncation). Applies in both summary and full-schema modes. |
:::tip
Set `enabled: False` to revert to the traditional "show all tools at once" behavior, which some clients or workflows may prefer.
:::
Tool search reduces the initial token cost from ~15–20K tokens (full catalog) down to ~4–5K tokens (pinned tools + search interface), roughly 70–80% savings at the start of each conversation.
### Session & CSRF
These values are flat-merged into the Flask app config used by the MCP server process:
```python
MCP_SESSION_CONFIG = {
"SESSION_COOKIE_HTTPONLY": True,
"SESSION_COOKIE_SECURE": False,
"SESSION_COOKIE_SAMESITE": "Lax",
"SESSION_COOKIE_NAME": "superset_session",
"PERMANENT_SESSION_LIFETIME": 86400,
}
MCP_CSRF_CONFIG = {
"WTF_CSRF_ENABLED": True,
"WTF_CSRF_TIME_LIMIT": None,
}
```
---
## Access Control
### RBAC Enforcement
The MCP server respects Superset's full role-based access control (RBAC). Every authenticated user can only access the data and operations their Superset roles permit — the same rules that apply in the Superset UI apply through MCP.
Each tool declares one or more required FAB permissions. The table below maps tool groups to their permission requirements:
| Tool group | Required FAB permission |
|------------|------------------------|
| `list_charts`, `get_chart_info`, `get_chart_data`, `get_chart_preview`, `generate_chart`, `update_chart` | `can_read` on `Chart` (read), `can_write` on `Chart` (mutate) |
| `list_dashboards`, `get_dashboard_info`, `generate_dashboard`, `add_chart_to_existing_dashboard` | `can_read` on `Dashboard` (read), `can_write` on `Dashboard` (mutate) |
| `list_datasets`, `get_dataset_info`, `create_virtual_dataset` | `can_read` on `Dataset` (read), `can_write` on `Dataset` (mutate) |
| `list_databases`, `get_database_info` | `can_read` on `Database` |
| `execute_sql` | `can_execute_sql_query` on `SQLLab` |
| `open_sql_lab_with_context` | `can_read` on `SQLLab` |
| `save_sql_query` | `can_write` on `SavedQuery` |
| `health_check` | None (public) |
To disable RBAC checking globally (for trusted-network deployments or testing), set:
```python
# superset_config.py
MCP_RBAC_ENABLED = False
```
:::warning
Disabling RBAC removes all permission checks from MCP tool calls. Only do this on isolated, internal deployments where all MCP users are trusted admins.
:::
### Audit Log
All MCP tool calls are recorded in Superset's action log. You can view them at **Settings → Action Log** (admin only). Each log entry records:
- The tool name (e.g., `mcp.generate_chart.db_write`)
- The authenticated user
- A timestamp
This makes MCP activity fully auditable alongside regular Superset activity. The action log uses the same event logger as the rest of Superset, so existing log ingestion pipelines (e.g., sending logs to Elasticsearch or a SIEM) capture MCP events automatically.
### Middleware Pipeline
Every MCP request passes through a middleware stack before reaching the tool function. The default stack (assembled in `build_middleware_list()` in `server.py`) is:
| Middleware | Purpose | Default |
|------------|---------|---------|
| `StructuredContentStripperMiddleware` | Strips `structuredContent` from responses for Claude.ai bridge compatibility | Enabled |
| `LoggingMiddleware` | Logs each tool call with user, parameters, and duration | Enabled |
| `GlobalErrorHandlerMiddleware` | Catches unhandled exceptions and sanitizes sensitive data before it reaches the client | Enabled |
| `ResponseSizeGuardMiddleware` | Estimates token count, warns at 80% of limit, blocks at limit | Enabled (configurable via `MCP_RESPONSE_SIZE_CONFIG`) |
| `ResponseCachingMiddleware` | Caches read-heavy tool responses (in-memory or Redis) | Disabled (enable via `MCP_CACHE_CONFIG`) |
Additional middleware classes (`RateLimitMiddleware`, `FieldPermissionsMiddleware`, `PrivateToolMiddleware`) are implemented in `superset/mcp_service/middleware.py` but are not added to the default pipeline. They are available for operators who want to layer them in via a custom startup path.
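The warn-then-block behavior of the size guard can be pictured with the sketch below. It is not the middleware's code: the function name is hypothetical, and the four-characters-per-token estimate is an assumed heuristic rather than the estimator Superset uses.

```python
def check_response_size(text, token_limit, warn_ratio=0.8, chars_per_token=4):
    """Classify a response as "ok", "warn", or "block" by estimated token count.

    The token count is approximated as len(text) // chars_per_token.
    """
    estimated_tokens = len(text) // chars_per_token
    if estimated_tokens >= token_limit:
        return "block", estimated_tokens
    if estimated_tokens >= warn_ratio * token_limit:
        return "warn", estimated_tokens
    return "ok", estimated_tokens
```

With a limit of 1000 tokens, this sketch starts warning at an estimated 800 tokens and blocks from 1000 upward.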
### Error Sanitization
The `GlobalErrorHandlerMiddleware` automatically redacts sensitive information from all error messages before they reach the LLM client. The following are replaced with generic messages:
- **Database connection strings** — replaced with a generic connection error message
- **API keys and tokens** — redacted from error traces
- **File system paths** — stripped to prevent information disclosure
- **IP addresses** — removed from error context
This ensures that a misconfigured database connection or an unexpected exception never leaks credentials or internal topology to the LLM or its users. All regex patterns used for redaction are bounded to prevent ReDoS attacks.
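The general technique, bounded regular expressions applied in sequence, can be sketched as follows. The patterns and replacement strings are illustrative assumptions and are not the ones used by `GlobalErrorHandlerMiddleware`.

```python
import re

# Every quantifier is bounded so matching time stays predictable.
_REDACTIONS = [
    # Connection strings such as postgresql://user:pass@host:5432/db
    (re.compile(r"\b[a-z][a-z0-9+]{1,30}://[^\s'\"]{1,500}", re.IGNORECASE),
     "[connection redacted]"),
    # key=value or key: value style secrets
    (re.compile(r"\b(api[_-]?key|token|secret|password)\s{0,5}[=:]\s{0,5}[^\s,;]{1,200}",
                re.IGNORECASE),
     r"\1=[redacted]"),
    # IPv4 addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[ip redacted]"),
    # Absolute POSIX paths with at least two segments
    (re.compile(r"(?<![\w/])/(?:[\w.-]{1,100}/){1,20}[\w.-]{1,100}"), "[path redacted]"),
]


def sanitize_error(message):
    """Apply each redaction pattern in order and return the cleaned message."""
    for pattern, replacement in _REDACTIONS:
        message = pattern.sub(replacement, message)
    return message
```

Order matters in this sketch: connection strings are handled first so that the host, credentials, and path inside them are removed as one unit.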
---
## Performance
### Connection Pooling
Each MCP server process maintains its own SQLAlchemy connection pool to the database. For multi-worker deployments, total open connections can reach **workers × (pool size + max overflow)**.
```python
# superset_config.py
SQLALCHEMY_POOL_SIZE = 5
SQLALCHEMY_MAX_OVERFLOW = 10
SQLALCHEMY_POOL_TIMEOUT = 30
SQLALCHEMY_POOL_RECYCLE = 3600 # Recycle connections after 1 hour
```
For a 3-pod Kubernetes deployment with the defaults above, expect up to 3 × (5 + 10) = 45 connections. Size your database's `max_connections` accordingly.
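The same arithmetic as a small helper, useful when sizing `max_connections` for a different number of processes. The function name is illustrative, and it assumes one pool per process.

```python
def max_db_connections(processes, pool_size=5, max_overflow=10):
    """Upper bound on open connections when every process fills its pool and overflow."""
    return processes * (pool_size + max_overflow)


print(max_db_connections(3))  # 3 * (5 + 10) = 45
```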
### Response Caching
Enable response caching for read-heavy workloads (dashboards/datasets that don't change frequently). With the in-memory backend (default when `MCP_STORE_CONFIG` is disabled), caching is per-process. Use Redis-backed caching for consistent cache hits across multiple pods:
Mutating tools (`generate_chart`, `update_chart`, `execute_sql`, `generate_dashboard`) are always excluded from caching regardless of this setting.
---
## Troubleshooting
### Server won't start
- Verify `fastmcp` is installed: `pip install fastmcp`
- Check that `MCP_DEV_USERNAME` is set if auth is disabled -- the server requires a user identity
- Confirm the port is not already in use: `lsof -i :5008`
### 401 Unauthorized
- Verify your JWT token has not expired (`exp` claim)
- Check that `MCP_JWT_ISSUER` and `MCP_JWT_AUDIENCE` match the token's `iss` and `aud` claims exactly
- For RS256 with JWKS: confirm the JWKS URI is reachable from the MCP server
- For RS256 with static key: confirm the public key string includes the `BEGIN`/`END` markers
- For HS256: confirm the secret matches between the token issuer and `MCP_JWT_SECRET`
- Enable `MCP_JWT_DEBUG_ERRORS = True` for detailed server-side logging (errors are never leaked to the client)
### Tool not found
- Ensure the MCP server and Superset share the same `superset_config.py`
- Check server logs at startup -- tool registration errors are logged with the tool name and reason
### Client can't connect
- Verify the MCP server URL is reachable from the client machine
- For Claude Desktop: fully quit the app (not just close the window) and restart after config changes
- For remote access: ensure your firewall and reverse proxy allow traffic to the MCP port
- Confirm the URL path ends with `/mcp` (e.g., `http://localhost:5008/mcp`)
### Permission errors on tool calls
- The MCP server enforces Superset's RBAC permissions -- the authenticated user must have the required roles
- In development mode, ensure `MCP_DEV_USERNAME` maps to a user with appropriate roles (e.g., Admin)
- Check `superset/security/manager.py` for the specific permission tuples required by each tool domain (e.g., `("can_execute_sql_query", "SQLLab")`)
### Response too large
- If a tool call returns an error about exceeding token limits, the response size guard is blocking an oversized result
- Reduce `page_size` or `limit` parameters, use `select_columns` to exclude large fields, or add filters to narrow results
- To adjust the threshold, change `token_limit` in `MCP_RESPONSE_SIZE_CONFIG`
- To disable the guard entirely, set `MCP_RESPONSE_SIZE_CONFIG = {"enabled": False}`
---
## Audit Events
All MCP tool calls are logged to Superset's event logger, the same system used by the web UI (viewable at **Settings → Action Log**). Each event captures:
- **User**: the resolved Superset username from the JWT or dev config
- **Timestamp**: when the operation ran
This means MCP activity is auditable alongside normal user activity. No additional configuration is required — logging is on by default whenever the event logger is enabled in your Superset deployment.
## Tool Pagination
MCP list tools (`list_datasets`, `list_charts`, `list_dashboards`, `list_databases`) use **offset pagination** via `page` (1-based) and `page_size` parameters. Responses include `page`, `page_size`, `total_count`, `total_pages`, `has_previous`, and `has_next`. To iterate through all results:
```python
# Example: fetch all charts across pages
all_charts = []
page = 1
while True:
result = mcp.list_charts(page=page, page_size=50)
all_charts.extend(result["charts"])
if not result.get("has_next"):
break
page += 1
```
## Disabling built-in tools
If you have deployed a custom tool via a Superset extension that supersedes one of the built-in Superset tools, you can suppress the built-in version so AI clients only discover your replacement. Disabled tools are removed from the server at startup and are never advertised during tool discovery.
Set `MCP_DISABLED_TOOLS` in your `superset_config.py` to a set of tool names:
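For instance, a minimal sketch that disables two built-in tools (the choice of tools here is only an example):

```python
# superset_config.py
MCP_DISABLED_TOOLS = {"execute_sql", "list_charts"}
```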
Tool names match the function name used in the `@tool` decorator (e.g., `execute_sql`, `list_charts`, `health_check`). Extension-prefixed tools can also be disabled using their full prefixed name:
Specifying a tool name that does not exist logs a warning at startup and is otherwise ignored — it will not prevent the server from starting.
:::
## Security Best Practices
- **Use TLS** for all production MCP endpoints -- place the server behind a reverse proxy with HTTPS
- **Enable JWT authentication** for any internet-facing deployment
- **RBAC enforcement** -- The MCP server respects Superset's role-based access control. Users can only access data their roles permit
- **Secrets management** -- Store `MCP_JWT_SECRET`, database credentials, and API keys in environment variables or a secrets manager, never in config files committed to version control
- **Scoped tokens** -- Use `MCP_REQUIRED_SCOPES` to limit what operations a token can perform
- **Network isolation** -- In Kubernetes, restrict MCP pod network policies to only allow traffic from your AI client endpoints
- Review the **[Security documentation](/developer-docs/extensions/security)** for additional extension security guidance
---
## Next Steps
- **[Using AI with Superset](/user-docs/using-superset/using-ai-with-superset)** -- What AI can do with Superset and how to get started
- **[MCP Integration](/developer-docs/extensions/mcp)** -- Build custom MCP tools and prompts via Superset extensions
- **[Security](/developer-docs/extensions/security)** -- Security best practices for extensions
- **[Deployment](/developer-docs/extensions/deployment)** -- Package and deploy Superset extensions
Note that Superset bundles [flask-talisman](https://pypi.org/project/talisman/),
self-described as a small Flask extension that handles setting HTTP headers that can help
protect against a few common web application security issues.
## HTML Embedding of Dashboards and Charts
There are two ways to embed a dashboard: Using the [SDK](https://www.npmjs.com/package/@superset-ui/embedded-sdk) or embedding a direct link. Note that in the latter case everybody who knows the link is able to access the dashboard.
### Embedding a Public Direct Link to a Dashboard
This works by first changing the content security policy (CSP) of [flask-talisman](https://github.com/GoogleCloudPlatform/flask-talisman) to allow for certain domains to display Superset content. Then a dashboard can be made publicly accessible, i.e. **bypassing authentication**. Once made public, the dashboard's URL can be added to an iframe in another website's HTML code.
#### Changing flask-talisman CSP
Add to `superset_config.py` the entire `TALISMAN_CONFIG` section from `config.py` and include a `frame-ancestors` section:
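One way to express the change is a small helper that takes the copied configuration and adds the directive, sketched below. The helper name and the example domain are assumptions; `content_security_policy` is the flask-talisman option that holds the CSP directives.

```python
import copy


def with_frame_ancestors(talisman_config, allowed_domains):
    """Return a copy of `talisman_config` whose CSP allows framing by `allowed_domains`."""
    config = copy.deepcopy(talisman_config)
    csp = config.setdefault("content_security_policy", {})
    csp["frame-ancestors"] = list(allowed_domains)
    return config
```

In `superset_config.py` this could be applied to the copied section, for example `TALISMAN_CONFIG = with_frame_ancestors(TALISMAN_CONFIG, ["*.my-domain.com"])`.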
A chart's embed code can be generated by going to a chart's edit view and then clicking at the top right on `...` > `Share` > `Embed code`
### Enabling Embedding via the SDK
Clicking on `...` next to `EDIT DASHBOARD` on the top right of the dashboard's overview page should yield a drop-down menu including the entry "Embed dashboard".
To enable this entry, add the following line to the `.env` file:
```text
SUPERSET_FEATURE_EMBEDDED_SUPERSET=true
```
### Hiding the Logout Button in Embedded Contexts
When Superset is embedded in an application that manages authentication via SSO (OAuth2, SAML, or JWT), the logout button should be hidden since session management is handled by the parent application.
To hide the logout button in embedded contexts, add to `superset_config.py`:
```python
FEATURE_FLAGS = {
"DISABLE_EMBEDDED_SUPERSET_LOGOUT": True,
}
```
This flag only hides the logout button when Superset detects it is running inside an iframe. Users accessing Superset directly (not embedded) will still see the logout button regardless of this setting.
:::note
When embedding with SSO, also set `SESSION_COOKIE_SAMESITE = 'None'` and `SESSION_COOKIE_SECURE = True`. See [Security documentation](/admin-docs/security/securing_superset) for details.
:::
## CSRF settings
Similarly, [flask-wtf](https://flask-wtf.readthedocs.io/en/0.15.x/config/) is used to manage
some CSRF configurations. If you need to exempt endpoints from CSRF (e.g. if you are
running a custom auth postback endpoint), you can add the endpoints to `WTF_CSRF_EXEMPT_LIST`:
## SSH Tunneling
1. Turn on feature flag
- Change [`SSH_TUNNELING`](https://github.com/apache/superset/blob/eb8386e3f0647df6d1bbde8b42073850796cc16f/superset/config.py#L489) to `True`
- If you want to add more security when establishing the tunnel we allow users to overwrite the `SSHTunnelManager` class [here](https://github.com/apache/superset/blob/eb8386e3f0647df6d1bbde8b42073850796cc16f/superset/config.py#L507)
- You can also set the [`SSH_TUNNEL_LOCAL_BIND_ADDRESS`](https://github.com/apache/superset/blob/eb8386e3f0647df6d1bbde8b42073850796cc16f/superset/config.py#L508) this the host address where the tunnel will be accessible on your VPC
2. Create database w/ ssh tunnel enabled
- With the feature flag enabled you should now see ssh tunnel toggle.
- Click the toggle to enable SSH tunneling and add your credentials accordingly.
- Superset allows for two different types of authentication (Basic + Private Key). These credentials should come from your service provider.
3. Verify data is flowing
- Once SSH tunneling has been enabled, go to SQL Lab and write a query to verify data is properly flowing.
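Step 1 above can be sketched in `superset_config.py` as follows. The loopback bind address is an assumed example value, not a recommendation for every network layout.

```python
# superset_config.py (sketch)

FEATURE_FLAGS = {
    # Step 1: turn on the SSH tunneling feature flag.
    "SSH_TUNNELING": True,
}

# Assumed example value: expose the local end of the tunnel on loopback only.
SSH_TUNNEL_LOCAL_BIND_ADDRESS = "127.0.0.1"
```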
## Domain Sharding
:::note
Domain Sharding is deprecated as of Superset 5.0.0 and will be removed in Superset 6.0.0. Please enable HTTP/2 to keep more open connections per domain.
:::
Chrome allows up to 6 open connections per domain at a time. When a dashboard contains more than 6 slices,
fetch requests are often queued while they wait for the next available socket.
[PR 5039](https://github.com/apache/superset/pull/5039) adds domain sharding to Superset,
and this feature will be enabled by configuration only (by default Superset doesn’t allow
cross-domain request).
Add the following setting in your `superset_config.py` file:
- `SUPERSET_WEBSERVER_DOMAINS`: list of allowed hostnames for domain sharding feature.
Please create your domain shards as subdomains of your main domain for authorization to
For a user-focused guide on writing Jinja templates in SQL Lab and virtual datasets, see the [SQL Templating User Guide](/user-docs/using-superset/sql-templating). This page covers administrator configuration options.
:::
## Jinja Templates
SQL Lab and Explore support [Jinja templating](https://jinja.palletsprojects.com/en/2.11.x/) in queries.
To enable templating, the `ENABLE_TEMPLATE_PROCESSING` [feature flag](/admin-docs/configuration/configuring-superset#feature-flags) needs to be enabled in `superset_config.py`.
:::warning[Security Warning]
While powerful, this feature executes template code on the server. Within the Superset security model, this is **intended functionality**, as users with permissions to edit charts and virtual datasets are considered **trusted users**.
If you grant these permissions to untrusted users, this feature can be exploited as a **Server-Side Template Injection (SSTI)** vulnerability. Do not enable `ENABLE_TEMPLATE_PROCESSING` unless you fully understand and accept the associated security risks.
Additionally:
- The `url_param()` macro allows URL parameters to influence the rendered SQL. Always validate or restrict `url_param()` values in your templates rather than interpolating them directly.
- `filter.get('val')` returns raw filter values without escaping. Use the safe helpers described below (`|where_in`, `| replace("'", "''")`) rather than concatenating values directly into SQL strings.
:::
:::tip
`ENABLE_TEMPLATE_PROCESSING` defaults to `False`. Only enable it if your deployment requires Jinja templates and all users with dataset/chart edit access are administrators or fully trusted internal users.
:::
When templating is enabled, Python code can be embedded in virtual datasets and
in Custom SQL in the filter and metric controls in Explore. By default, the following variables are
made available in the Jinja context:
- `columns`: columns to group by in the query
- `filter`: filters applied in the query
- `from_dttm`: start `datetime` value from the selected time range (`None` if undefined). **Note:** Only available in virtual datasets when a time range filter is applied in Explore/Chart views—not available in standalone SQL Lab queries. (deprecated beginning in version 5.0, use `get_time_filter` instead)
- `to_dttm`: end `datetime` value from the selected time range (`None` if undefined). **Note:** Only available in virtual datasets when a time range filter is applied in Explore/Chart views—not available in standalone SQL Lab queries. (deprecated beginning in version 5.0, use `get_time_filter` instead)
- `groupby`: columns to group by in the query (deprecated)
- `metrics`: aggregate expressions in the query
- `row_limit`: row limit of the query
- `row_offset`: row offset of the query
- `table_columns`: columns available in the dataset
- `time_column`: temporal column of the query (`None` if undefined)
- `time_grain`: selected time grain (`None` if undefined)
For example, to add a time range to a virtual dataset, you can write the following:
```sql
SELECT *
FROM tbl
WHERE dttm_col > '{{ from_dttm }}' and dttm_col < '{{ to_dttm }}'
```
You can also use [Jinja's logic](https://jinja.palletsprojects.com/en/2.11.x/templates/#tests)
to make your query robust to clearing the time range filter:
```sql
SELECT *
FROM tbl
WHERE (
{% if from_dttm is not none %}
dttm_col > '{{ from_dttm }}' AND
{% endif %}
{% if to_dttm is not none %}
dttm_col < '{{ to_dttm }}' AND
{% endif %}
1 = 1
)
```
The `1 = 1` at the end ensures a value is present for the `WHERE` clause even when
the time filter is not set. For many database engines, this could be replaced with `true`.
Note that Jinja parameters are referenced within _double_ curly brackets (`{{ }}`) in the query and within
_single_ curly brackets with percent signs (`{% %}`) in the logic blocks.
### Understanding Context Availability
Some Jinja variables like `from_dttm`, `to_dttm`, and `filter` are **only available when a chart or dashboard provides them**. They are populated from:
- Time range filters applied in Explore/Chart views
- Dashboard native filters
- Filter components
**These variables are NOT available in standalone SQL Lab queries** because there's no filter context. If you try to use `{{ from_dttm }}` directly in SQL Lab, you'll get an "undefined parameter" error.
#### Testing Time-Filtered Queries in SQL Lab
To test queries that use time variables in SQL Lab, you have several options:
**Option 1: Use Jinja defaults (recommended)**
```sql
SELECT *
FROM tbl
WHERE dttm_col > '{{ from_dttm | default("2024-01-01", true) }}'
AND dttm_col < '{{ to_dttm | default("2024-12-31", true) }}'
```
**Option 2: Use SQL Lab Parameters**
Set parameters in the SQL Lab UI (Parameters menu):
```json
{
"from_dttm": "2024-01-01",
"to_dttm": "2024-12-31"
}
```
**Option 3: Use `{% set %}` for testing**
```sql
{% set from_dttm = "2024-01-01" %}
{% set to_dttm = "2024-12-31" %}
SELECT *
FROM tbl
WHERE dttm_col > '{{ from_dttm }}' AND dttm_col < '{{ to_dttm }}'
```
:::tip
When you save a SQL Lab query as a virtual dataset and use it in a chart with time filters,
the actual filter values will override any defaults or test values you set.
:::
To add custom functionality to the Jinja context, you need to overload the default Jinja
context in your environment by defining the `JINJA_CONTEXT_ADDONS` in your superset configuration
(`superset_config.py`). Objects referenced in this dictionary are made available for users to use
where the Jinja context is made available.
```python
JINJA_CONTEXT_ADDONS = {
'my_crazy_macro': lambda x: x*2,
}
```
Default values for Jinja templates can be specified via the `Parameters` menu in the SQL Lab user interface.
In the UI you can assign a set of parameters as JSON:
```json
{
"my_table": "foo"
}
```
The parameters become available in your SQL (example: `SELECT * FROM {{ my_table }}` ) by using Jinja templating syntax.
SQL Lab template parameters are stored with the dataset as `TEMPLATE PARAMETERS`.
There is a special ``_filters`` parameter which can be used to test filters used in the jinja template.
```json
{
"_filters": [
{
"col": "action_type",
"op": "IN",
"val": ["sell", "buy"]
}
]
}
```
```sql
SELECT action, count(*) as times
FROM logs
WHERE action in {{ filter_values('action_type')|where_in }}
GROUP BY action
```
Note ``_filters`` is not stored with the dataset. It's only used within the SQL Lab UI.
Besides default Jinja templating, SQL Lab also supports custom template processors, which you configure by setting
`CUSTOM_TEMPLATE_PROCESSORS` in your Superset configuration. The values in this dictionary
override the default Jinja template processor of the specified database engine. The example below
configures a custom Presto template processor that implements its own macro-processing logic
using regex parsing. It uses `$`-style macros instead of the `{{ }}` style used in Jinja
templating.
By configuring it with `CUSTOM_TEMPLATE_PROCESSORS`, a SQL template on a Presto database is
processed by the custom processor rather than the default one.
In this section, we'll walk through the pre-defined Jinja macros in Superset.
### Current Username
The `{{ current_username() }}` macro returns the `username` of the currently logged in user.
If you have caching enabled in your Superset configuration, then by default the `username` value will be used
by Superset when calculating the cache key. A cache key is a unique identifier that determines if there's a
cache hit in the future and Superset can retrieve cached data.
You can disable the inclusion of the `username` value in the calculation of the
cache key by adding the following parameter to your Jinja code:
```python
{{ current_username(add_to_cache_keys=False) }}
```
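To see why the username is part of the cache key by default, consider the following sketch. It is not Superset's cache-key code. `cache_key` is a hypothetical helper that hashes the query template together with any extra values, which is enough to illustrate the effect.

```python
import hashlib
import json


def cache_key(query_template, extra_values=()):
    """Hash the template plus extra values (hypothetical stand-in for a cache key)."""
    payload = json.dumps(
        {"query": query_template, "extra": list(extra_values)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


template = "SELECT * FROM orders WHERE owner = '{{ current_username() }}'"

# Default behaviour: the username is added, so each user gets a separate entry.
key_alice = cache_key(template, ["alice"])
key_bob = cache_key(template, ["bob"])

# With add_to_cache_keys=False nothing is added, so all users share one entry.
key_shared = cache_key(template)

print(key_alice != key_bob)  # True
```

Under this model, disabling the inclusion means every user resolves to the same cache entry, which is only appropriate when the query result does not depend on who is asking.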
### Current User ID
The `{{ current_user_id() }}` macro returns the account ID of the currently logged in user.
If you have caching enabled in your Superset configuration, then by default the account `id` value will be used
by Superset when calculating the cache key. A cache key is a unique identifier that determines if there's a
cache hit in the future and Superset can retrieve cached data.
You can disable the inclusion of the account `id` value in the calculation of the
cache key by adding the following parameter to your Jinja code:
```python
{{ current_user_id(add_to_cache_keys=False) }}
```
### Current User Email
The `{{ current_user_email() }}` macro returns the email address of the currently logged in user.
If you have caching enabled in your Superset configuration, then by default the email address value will be used
by Superset when calculating the cache key. A cache key is a unique identifier that determines if there's a
cache hit in the future and Superset can retrieve cached data.
You can disable the inclusion of the email value in the calculation of the
cache key by adding the following parameter to your Jinja code:
```python
{{ current_user_email(add_to_cache_keys=False) }}
```
### Current User Roles
The `{{ current_user_roles() }}` macro returns an array of roles for the logged in user.
If you have caching enabled in your Superset configuration, then by default the roles value will be used
by Superset when calculating the cache key. A cache key is a unique identifier that determines if there's a
cache hit in the future and Superset can retrieve cached data.
You can disable the inclusion of the roles value in the calculation of the
cache key by adding the following parameter to your Jinja code:
```python
{{ current_user_roles(add_to_cache_keys=False) }}
```
You can json-stringify the array by adding `|tojson` to your Jinja code:
```python
{{ current_user_roles()|tojson }}
```
You can use the `|where_in` filter to use your roles in a SQL statement. For example, if `current_user_roles()` returns `['admin', 'viewer']`, the following template:
```python
SELECT * FROM users WHERE role IN {{ current_user_roles()|where_in }}
```
Will be rendered as:
```sql
SELECT * FROM users WHERE role IN ('admin', 'viewer')
```
### Current User RLS Rules
The `{{ current_user_rls_rules() }}` macro returns an array of RLS rules applied to the current dataset for the logged in user.
If you have caching enabled in your Superset configuration, then the list of RLS Rules will be used
by Superset when calculating the cache key. A cache key is a unique identifier that determines if there's a
cache hit in the future and Superset can retrieve cached data.
### Custom URL Parameters
The `{{ url_param('custom_variable') }}` macro lets you define arbitrary URL
parameters and reference them in your SQL code.
:::warning
Always treat `url_param()` values as untrusted input. Escaping behaviour varies by context and configuration, so do not rely on it. Restrict values to an explicit allowlist before using them in SQL:
```sql
{% set cc = url_param('countrycode') %}
{% if cc not in ('US', 'ES', 'FR') %}{% set cc = 'US' %}{% endif %}
WHERE country_code = '{{ cc }}'
```
:::
Here's a concrete example:
- You write the following query in SQL Lab:
```sql
SELECT count(*)
FROM ORDERS
WHERE country_code = '{{ url_param('countrycode') }}'
```
- You're hosting Superset at the domain www.example.com and you send your
coworker in Spain the following SQL Lab URL `www.example.com/superset/sqllab?countrycode=ES`
and your coworker in the USA the following SQL Lab URL `www.example.com/superset/sqllab?countrycode=US`
- For your coworker in Spain, the SQL Lab query will be rendered as:
```sql
SELECT count(*)
FROM ORDERS
WHERE country_code = 'ES'
```
- For your coworker in the USA, the SQL Lab query will be rendered as:
```sql
SELECT count(*)
FROM ORDERS
WHERE country_code = 'US'
```
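The flow above, combined with the allowlist recommended in the warning, can be sketched in plain Python. This only simulates what the template does. The helper names are hypothetical, and the `https://` scheme is added to the example URLs for illustration.

```python
from urllib.parse import parse_qs, urlparse

ALLOWED_COUNTRY_CODES = ("US", "ES", "FR")
DEFAULT_COUNTRY_CODE = "US"


def country_code_from_url(url):
    """Read `countrycode` from the query string, enforcing the allowlist."""
    query = parse_qs(urlparse(url).query)
    value = query.get("countrycode", [DEFAULT_COUNTRY_CODE])[0]
    # Anything outside the allowlist falls back to the default.
    return value if value in ALLOWED_COUNTRY_CODES else DEFAULT_COUNTRY_CODE


def render_query(url):
    """Build the SQL that the template would produce for this URL."""
    return (
        "SELECT count(*)\n"
        "FROM ORDERS\n"
        f"WHERE country_code = '{country_code_from_url(url)}'"
    )


print(render_query("https://www.example.com/superset/sqllab?countrycode=ES"))
```

A value such as `countrycode=ES%27%20OR%20%271%27%3D%271` is not in the allowlist, so the sketch falls back to `US` instead of interpolating the attacker-controlled string.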
### Explicitly Including Values in Cache Key
The `{{ cache_key_wrapper() }}` function explicitly instructs Superset to add a value to the
accumulated list of values used in the calculation of the cache key.
This function is only needed when you want to wrap your own custom function return values
Note that this function powers the caching of the `user_id` and `username` values
in the `current_user_id()` and `current_username()` function calls (if you have caching enabled).
### Filter Values
You can retrieve the value for a specific filter as a list using `{{ filter_values() }}`.
This is useful if:
- You want to use a filter component to filter a query where the name of the filter component's column doesn't match the one in the select statement
- You want to have the ability to filter inside the main query for performance purposes
Here's a concrete example:
```sql
SELECT action, count(*) as times
FROM logs
WHERE
action in {{ filter_values('action_type')|where_in }}
GROUP BY action
```
The `where_in` filter converts the list of values from `filter_values('action_type')` into a string suitable for an `IN` expression.
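As a rough mental model, the conversion can be approximated in a few lines of Python. This is an illustrative sketch only: it is not Superset's implementation, and the quoting rule (doubling single quotes) is an assumption that will not match every database dialect.

```python
def where_in(values, default_to_none=False):
    """Approximate `where_in`: turn a list into a parenthesized SQL list."""
    if values is None:
        # Null input: "()" by default, or None when default_to_none is set.
        return None if default_to_none else "()"

    def quote(value):
        if isinstance(value, str):
            # Assumed escaping rule: double any embedded single quote.
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    return "(" + ", ".join(quote(value) for value in values) + ")"


print(where_in(["sell", "buy"]))  # ('sell', 'buy')
print(where_in(["O'Brien"]))      # ('O''Brien')
```

The second call shows why passing values through the filter is preferable to concatenating them into the SQL string by hand: the embedded quote is escaped rather than terminating the literal.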
### Filters for a Specific Column
The `{{ get_filters() }}` macro returns the filters applied to a given column. In addition to
returning the values (similar to how `filter_values()` does), the `get_filters()` macro
returns the operator specified in the Explore UI.
This is useful if:
- You want to handle more than the IN operator in your SQL clause
- You want to handle generating custom SQL conditions for a filter
- You want to have the ability to filter inside the main query for speed purposes
:::warning
`filter.get('val')` returns the raw filter value without escaping. For multi-value filters, use the `|where_in` Jinja filter, which handles quoting safely. For single-value operators like `LIKE`, escape single quotes before interpolating:
```sql
{%- if filter.get('op') == 'LIKE' -%}
AND full_name LIKE '{{ filter.get('val') | replace("'", "''") }}'
{%- endif -%}
```
:::
Here's a concrete example:
```sql
WITH RECURSIVE
superiors(employee_id, manager_id, full_name, level, lineage) AS (
SELECT
employee_id,
manager_id,
full_name,
1 as level,
employee_id as lineage
FROM
employees
WHERE
1=1
{# Render a blank line #}
{%- for filter in get_filters('full_name', remove_filter=True) -%}
{%- if filter.get('op') == 'IN' -%}
AND
full_name IN {{ filter.get('val')|where_in }}
{%- endif -%}
{%- if filter.get('op') == 'LIKE' -%}
AND
full_name LIKE '{{ filter.get('val') | replace("'", "''") }}'
```
Assuming we are creating a table chart with a simple `COUNT(*)` as the metric with a time filter `Last week` on the
`dttm` column, this would render the following query on Postgres (note the formatting of the temporal filters, and
the absence of time filters on the outer query):
```
SELECT COUNT(*) AS count
FROM
(SELECT *,
'Last week' AS time_range
FROM public.logs
WHERE 1 = 1
AND dttm >= TO_TIMESTAMP('2024-08-27 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')
AND dttm < TO_TIMESTAMP('2024-09-03 00:00:00.000000', 'YYYY-MM-DD HH24:MI:SS.US')) AS virtual_table
ORDER BY count DESC
LIMIT 1000;
```
When using the `default` parameter, the templated query can be simplified, as the endpoints will always be defined
(to use a fixed time range, you can also use something like `default="2024-08-27 : 2024-09-03"`):
```
{% set time_filter = get_time_filter("dttm", default="Last week", remove_filter=True) %}
SELECT
*,
'{{ time_filter.time_range }}' as time_range
FROM logs
WHERE
dttm >= {{ time_filter.from_expr }}
AND dttm < {{ time_filter.to_expr }}
```
### Datasets
It's possible to query physical and virtual datasets using the `dataset` macro. This is useful if you've defined computed columns and metrics on your datasets, and want to reuse the definition in ad hoc SQL Lab queries.
To use the macro, first you need to find the ID of the dataset. This can be done by going to the view showing all the datasets, hovering over the dataset you're interested in, and looking at its URL. For example, if the URL for a dataset is https://superset.example.org/explore/?dataset_type=table&dataset_id=42 its ID is 42.
Once you have the ID you can query it as if it were a table:
```sql
SELECT * FROM {{ dataset(42) }} LIMIT 10
```
If you want to select the metric definitions as well, in addition to the columns, you need to pass an additional keyword argument:
```sql
SELECT * FROM {{ dataset(42, include_metrics=True) }} LIMIT 10
```
Since metrics are aggregations, the resulting SQL expression will be grouped by all non-metric columns. You can specify a subset of columns to group by instead:
The `{{ metric('metric_key', dataset_id) }}` macro can be used to retrieve the metric SQL syntax from a dataset. This can be useful for different purposes:
- Override the metric label in the chart level
- Combine multiple metrics in a calculation
- Retrieve a metric syntax in SQL Lab
- Re-use metrics across datasets
This macro avoids copy/paste, allowing users to centralize the metric definition in the dataset layer.
The `dataset_id` parameter is optional, and if not provided Superset will use the current dataset from context (for example, when using this macro in the Chart Builder, by default the `metric_key` will be searched in the dataset powering the chart).
The parameter can be used in SQL Lab, or when fetching a metric from another dataset.
## Available Filters
Superset supports [builtin filters from the Jinja2 templating package](https://jinja.palletsprojects.com/en/stable/templates/#builtin-filters). Custom filters have also been implemented:
### Where In
Parses a list into a SQL-compatible statement. This is useful with macros that return an array (for example the `filter_values` macro):
```
Dashboard filter with "First", "Second" and "Third" options selected
```
By default, this filter returns `()` (as a string) in case the value is null. The `default_to_none` parameter can be set to `True` to return null in this case:
Superset now rides on **Ant Design v5's token-based theming**.
Every Antd token works, plus a handful of Superset-specific ones for charts and dashboard chrome.
## Managing Themes via UI
Superset includes a built-in **Theme Management** interface accessible from the admin menu under **Settings > Themes**.
### Creating a New Theme
1. Navigate to **Settings > Themes** in the Superset interface
2. Click **+ Theme** to create a new theme
3. Use the [Ant Design Theme Editor](https://ant.design/theme-editor) to design your theme:
- Design your palette, typography, and component overrides
- Open the `CONFIG` modal and copy the JSON configuration
4. Paste the JSON into the theme definition field in Superset
5. Give your theme a descriptive name and save
You can also extend with Superset-specific tokens (documented in the default theme object) before you import.
### System Theme Administration
When `ENABLE_UI_THEME_ADMINISTRATION = True` is configured, administrators can manage system-wide themes directly from the UI:
#### Setting System Themes
- **System Default Theme**: Click the sun icon on any theme to set it as the system-wide default
- **System Dark Theme**: Click the moon icon on any theme to set it as the system dark mode theme
- **Automatic OS Detection**: When both default and dark themes are set, Superset automatically detects and applies the appropriate theme based on OS preferences
#### Managing System Themes
- System themes are indicated with special badges in the theme list
- Only administrators with write permissions can modify system theme settings
- Removing a system theme designation reverts to configuration file defaults
### Applying Themes to Dashboards
Once created, themes can be applied to individual dashboards:
- Edit any dashboard and select your custom theme from the theme dropdown
- Each dashboard can have its own theme, allowing for branded or context-specific styling
## Configuration Options
### Python Configuration
Configure theme behavior via `superset_config.py`:
```python
# Enable UI-based theme administration for admins
ENABLE_UI_THEME_ADMINISTRATION = True
# Optional: Set initial default themes via configuration
# These can be overridden via the UI when ENABLE_UI_THEME_ADMINISTRATION = True
THEME_DEFAULT = {
"token": {
"colorPrimary": "#2893B3",
"colorSuccess": "#5ac189",
# ... your theme JSON configuration
}
}
# Optional: Dark theme configuration
THEME_DARK = {
"algorithm": "dark",
"token": {
"colorPrimary": "#2893B3",
# ... your dark theme overrides
}
}
# To force a single theme on all users, set THEME_DARK = None
# When both themes are defined (via UI or config):
# - Users can manually switch between themes
# - OS preference detection is automatically enabled
```
### App Branding
The application name shown in the browser title bar and navigation can be
set via the `brandAppName` theme token:
```python
THEME_DEFAULT = {
"token": {
"brandAppName": "Acme Analytics",
# ... other tokens
}
}
```
Or in the theme CRUD UI JSON editor:
```json
{
"token": {
"brandAppName": "Acme Analytics"
}
}
```
The existing `APP_NAME` Python config key continues to work for backward compatibility.
`brandAppName` takes precedence when both are set, and allows different themes to carry different brand names.
Email and alert/report notification subjects are driven by backend settings such as
`EMAIL_REPORTS_SUBJECT_PREFIX` and `APP_NAME`, not by this theme token.
### Migration from Configuration to UI
When `ENABLE_UI_THEME_ADMINISTRATION = True`:
1. System themes set via the UI take precedence over configuration file settings
2. The UI shows which themes are currently set as system defaults
3. Administrators can change system themes without restarting Superset
4. Configuration file themes serve as fallbacks when no UI themes are set
### Theme Validation and Fallback
Superset validates theme JSON when it is saved, either through the UI or via configuration. If a theme contains invalid tokens or an unrecognized structure, Superset logs a warning and falls back to the built-in default theme rather than applying a broken configuration. This prevents a bad theme from rendering the application unusable.
The fallback order is:
1. **UI-configured system theme** (highest priority, if `ENABLE_UI_THEME_ADMINISTRATION = True`)
2. **`THEME_DEFAULT` / `THEME_DARK`** from `superset_config.py`
3. **Built-in Superset default theme** (always present as a safety net)
If you see unexpected styling after a config change, check the Superset server logs for theme validation warnings.
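The fallback order can be pictured as a simple priority chain. The sketch below is hypothetical and not Superset's implementation: `is_valid_theme` is a stand-in for the real validation, and the function and argument names are invented for illustration.

```python
def is_valid_theme(theme):
    """Stand-in validation: a theme is a dict whose "token" entry is a dict."""
    return isinstance(theme, dict) and isinstance(theme.get("token", {}), dict)


def resolve_theme(ui_theme, config_theme, builtin_theme, ui_admin_enabled):
    """Return the first valid candidate in the documented priority order."""
    candidates = []
    if ui_admin_enabled:
        candidates.append(ui_theme)   # 1. UI-configured system theme
    candidates.append(config_theme)   # 2. THEME_DEFAULT / THEME_DARK
    for theme in candidates:
        if theme is not None and is_valid_theme(theme):
            return theme
    return builtin_theme              # 3. built-in safety net


builtin = {"token": {}}
from_config = {"token": {"colorPrimary": "#2893B3"}}
from_ui = {"token": {"colorPrimary": "#1B4D3E"}}

chosen = resolve_theme(from_ui, from_config, builtin, ui_admin_enabled=True)
print(chosen["token"]["colorPrimary"])  # #1B4D3E
```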
### Copying Themes Between Systems
To export a theme for use in configuration files or another instance:
1. Navigate to **Settings > Themes** and click the export icon on your desired theme
2. Extract the JSON configuration from the exported YAML file
3. Use this JSON in your `superset_config.py` or import it into another Superset instance
## Theme Development Workflow
1. **Design**: Use the [Ant Design Theme Editor](https://ant.design/theme-editor) to iterate on your design
2. **Test**: Create themes in Superset's CRUD interface for testing
3. **Apply**: Assign themes to specific dashboards or configure instance-wide
4. **Iterate**: Modify theme JSON directly in the CRUD interface or re-import from the theme editor
## Custom Fonts
Superset supports custom fonts through the theme configuration, allowing you to use branded or custom typefaces without rebuilding the application.
### Default Fonts
By default, Superset uses **Inter** for UI text and **IBM Plex Mono** for code (SQL editors, JSON fields, and other monospace contexts). Both fonts are bundled with the application via `@fontsource` packages and work offline without any external network calls.
:::note
IBM Plex Mono replaced Fira Code as the default code font in Superset 6.1. If you have an existing theme that explicitly sets `fontFamilyCode: "Fira Code, ..."`, you may want to update it.
:::
### Configuring Custom Fonts
To use custom fonts, add font URLs to your theme configuration using the `fontUrls` token:
```python
THEME_DEFAULT = {
"token": {
# Load fonts from external sources (e.g., Google Fonts, Adobe Fonts)
```
Font URLs are validated against a configurable allowlist. By default, fonts from `fonts.googleapis.com`, `fonts.gstatic.com`, and `use.typekit.net` are allowed. Configure `THEME_FONT_URL_ALLOWED_DOMAINS` to customize the allowed domains.
:::
### Font Sources
- **Google Fonts**: Free, CDN-hosted fonts with wide variety
- **Adobe Fonts**: Premium fonts (requires subscription and kit ID)
- **Self-hosted**: Place font files in `/static/assets/fonts/` and reference via CSS
This feature works with the stock Docker image - no custom build required!
## ECharts Configuration Overrides
:::note
Available since Superset 6.0
:::
Superset provides fine-grained control over ECharts visualizations through theme-level configuration overrides. This allows you to customize the appearance and behavior of all ECharts-based charts without modifying individual chart configurations.
### Global ECharts Overrides
Apply settings to all ECharts visualizations using `echartsOptionsOverrides`:
```python
THEME_DEFAULT = {
"token": {
"colorPrimary": "#2893B3",
# ... other Ant Design tokens
},
"echartsOptionsOverrides": {
"grid": {
"left": "10%",
"right": "10%",
"top": "15%",
"bottom": "15%"
},
"tooltip": {
"backgroundColor": "rgba(0, 0, 0, 0.8)",
"borderColor": "#ccc",
"textStyle": {
"color": "#fff"
}
},
"legend": {
"textStyle": {
"fontSize": 14,
"fontWeight": "bold"
}
}
}
}
```
### Chart-Specific Overrides
Target specific chart types using `echartsOptionsOverridesByChartType`:
```python
THEME_DEFAULT = {
"token": {
"colorPrimary": "#2893B3",
# ... other tokens
},
"echartsOptionsOverridesByChartType": {
"echarts_pie": {
"legend": {
"orient": "vertical",
"right": 10,
"top": "center"
}
},
"echarts_timeseries": {
"xAxis": {
"axisLabel": {
"rotate": 45,
"fontSize": 12
}
},
"dataZoom": [{
"type": "slider",
"show": True,
"start": 0,
"end": 100
}]
},
"echarts_bubble": {
"grid": {
"left": "15%",
"bottom": "20%"
}
}
}
}
```
### UI Configuration
You can also configure ECharts overrides through the theme CRUD interface:
```json
{
"token": {
"colorPrimary": "#2893B3"
},
"echartsOptionsOverrides": {
"grid": {
"left": "10%",
"right": "10%"
},
"tooltip": {
"backgroundColor": "rgba(0, 0, 0, 0.8)"
}
},
"echartsOptionsOverridesByChartType": {
"echarts_pie": {
"legend": {
"orient": "vertical",
"right": 10
}
}
}
}
```
### Override Precedence
The system applies overrides in the following order (last wins):
This ensures chart-specific overrides take precedence over global ones.
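A "last wins" merge of this kind can be sketched as a recursive dictionary merge. The code below is an illustration in Python, not Superset's implementation. It assumes the chart's own base options are applied first, and it replaces arrays entirely instead of merging them.

```python
def deep_merge(base, override):
    """Merge nested dicts; non-dict values (including lists) replace the base value."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve_echarts_options(base_options, theme, chart_type):
    """Apply global overrides, then chart-specific ones (last wins)."""
    global_overrides = theme.get("echartsOptionsOverrides", {})
    by_chart_type = theme.get("echartsOptionsOverridesByChartType", {})
    options = deep_merge(base_options, global_overrides)
    return deep_merge(options, by_chart_type.get(chart_type, {}))


base = {"legend": {"show": True}, "color": ["#aaaaaa", "#bbbbbb", "#cccccc"]}
theme = {
    "echartsOptionsOverrides": {
        "legend": {"orient": "horizontal", "textStyle": {"fontSize": 14}},
        "color": ["#111111", "#222222"],
    },
    "echartsOptionsOverridesByChartType": {
        "echarts_pie": {"legend": {"orient": "vertical"}},
    },
}

pie = resolve_echarts_options(base, theme, "echarts_pie")
print(pie["legend"]["orient"])  # vertical
print(pie["color"])             # ['#111111', '#222222']
```

In this sketch the pie chart keeps the global `textStyle`, takes `orient` from its chart-specific entry, and ends up with the two-color palette because the array is replaced as a whole.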
### Common Chart Types
Available chart types for `echartsOptionsOverridesByChartType`:
- `echarts_timeseries` - Time series/line charts
- `echarts_pie` - Pie and donut charts
- `echarts_bubble` - Bubble/scatter charts
- `echarts_funnel` - Funnel charts
- `echarts_gauge` - Gauge charts
- `echarts_radar` - Radar charts
- `echarts_boxplot` - Box plot charts
- `echarts_treemap` - Treemap charts
- `echarts_sunburst` - Sunburst charts
- `echarts_graph` - Network/graph charts
- `echarts_sankey` - Sankey diagrams
- `echarts_heatmap` - Heatmaps
- `echarts_mixed_timeseries` - Mixed time series
### Array Property Overrides
Array properties (such as color palettes) are fully supported in overrides. Arrays are **replaced entirely** rather than merged, so specify the complete array:
```python
THEME_DEFAULT = {
"token": { ... },
"echartsOptionsOverrides": {
# Replace the default color palette for all ECharts visualizations
# Complete corporate theme with ECharts customization
THEME_DEFAULT = {
"token": {
"colorPrimary": "#1B4D3E",
"fontFamily": "Corporate Sans, Arial, sans-serif"
},
"echartsOptionsOverrides": {
"grid": {
"left": "8%",
"right": "8%",
"top": "12%",
"bottom": "12%"
},
"textStyle": {
"fontFamily": "Corporate Sans, Arial, sans-serif"
},
"title": {
"textStyle": {
"color": "#1B4D3E",
"fontSize": 18,
"fontWeight": "bold"
}
}
},
"echartsOptionsOverridesByChartType": {
"echarts_timeseries": {
"xAxis": {
"axisLabel": {
"color": "#666",
"fontSize": 11
}
}
},
"echarts_pie": {
"legend": {
"textStyle": {
"fontSize": 12
},
"itemGap": 20
}
}
}
}
```
This feature provides powerful theming capabilities while maintaining the flexibility of ECharts' extensive configuration options.
## Advanced Features
- **System Themes**: Manage system-wide default and dark themes via UI or configuration
- **Per-Dashboard Theming**: Each dashboard can have its own visual identity
- **JSON Editor**: Edit theme configurations directly within Superset's interface
- **Custom Fonts**: Load external fonts via configuration without rebuilding
- **OS Dark Mode Detection**: Automatically switches themes based on system preferences
- **Theme Import/Export**: Share themes between instances via YAML files
## API Access
For programmatic theme management, Superset provides REST endpoints:
- `GET /api/v1/theme/` - List all themes
- `POST /api/v1/theme/` - Create a new theme
- `PUT /api/v1/theme/{id}` - Update a theme
- `DELETE /api/v1/theme/{id}` - Delete a theme
- `PUT /api/v1/theme/{id}/set_system_default` - Set as system default theme (admin only)
- `PUT /api/v1/theme/{id}/set_system_dark` - Set as system dark theme (admin only)
- `DELETE /api/v1/theme/unset_system_default` - Remove system default designation
- `DELETE /api/v1/theme/unset_system_dark` - Remove system dark designation
- `GET /api/v1/theme/export/` - Export themes as YAML
- `POST /api/v1/theme/import/` - Import themes from YAML
These endpoints require appropriate permissions and are subject to RBAC controls.
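As a sketch of how these endpoints might be called from a script, the following builds requests with Python's standard library. The host, the theme ID `7`, and the bearer-token authentication are assumptions for illustration. Check how API authentication is configured for your instance, since additional headers (for example a CSRF token) may be required.

```python
from urllib.request import Request

BASE_URL = "https://superset.example.org"  # hypothetical host
ACCESS_TOKEN = "<your-access-token>"       # placeholder, not a real token


def theme_request(method, path):
    """Build (but do not send) an authenticated request to the theme API."""
    headers = {"Authorization": f"Bearer {ACCESS_TOKEN}"}
    return Request(f"{BASE_URL}{path}", headers=headers, method=method)


# List all themes.
list_themes = theme_request("GET", "/api/v1/theme/")

# Mark the theme with the hypothetical ID 7 as the system default.
set_default = theme_request("PUT", "/api/v1/theme/7/set_system_default")

print(set_default.get_method(), set_default.full_url)
```

Passing either object to `urllib.request.urlopen` would send the request. The sketch stops short of that so it can be read and adapted without a running instance.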
:::resources
- [Video: Live Demo — Theming Apache Superset](https://www.youtube.com/watch?v=XsZAsO9tC3o)
- [CSS and Theming](https://docs.preset.io/docs/css-and-theming) - Additional theming techniques and CSS customization
- [Blog: Customizing Apache Superset Dashboards with CSS](https://preset.io/blog/customizing-superset-dashboards-with-css/)
- [Blog: Customizing Dashboards with CSS — Tips and Tricks](https://preset.io/blog/customizing-apache-superset-dashboards-with-css-additional-tips-and-tricks/)
:::
There are four distinct timezone components which relate to Apache Superset,
1. The timezone that the underlying data is encoded in.
2. The timezone of the database engine.
3. The timezone of the Apache Superset backend.
4. The timezone of the Apache Superset client.
where if a temporal field (`DATETIME`, `TIME`, `TIMESTAMP`, etc.) does not explicitly define a timezone it defaults to the underlying timezone of the component.
To help make the problem somewhat tractable, given that Apache Superset has no control over either how the data is ingested (1) or the timezone of the client (4), it is highly recommended from a consistency standpoint that both (2) and (3) are configured to use the same timezone, with a strong preference given to [UTC](https://en.wikipedia.org/wiki/Coordinated_Universal_Time) to ensure temporal fields without an explicit timezone are not incorrectly coerced into the wrong timezone. In fact, Apache Superset currently has implicit assumptions that timestamps are in UTC, and thus configuring (3) to a non-UTC timezone could be problematic.
To strive for data consistency (regardless of the timezone of the client) the Apache Superset backend tries to ensure that any timestamp sent to the client has an explicit (or semi-explicit as in the case with [Epoch time](https://en.wikipedia.org/wiki/Unix_time) which is always in reference to UTC) timezone encoded within.
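The difference between these encodings can be illustrated with Python's standard library. This is a small sketch of the three forms a timestamp can take when serialized. It is not Superset's serialization code.

```python
from datetime import datetime, timezone

# A naive timestamp carries no timezone; consumers have to guess one.
naive = datetime(2022, 1, 1, 0, 0, 0)

# An aware timestamp states its timezone explicitly.
aware = naive.replace(tzinfo=timezone.utc)

# Epoch milliseconds are semi-explicit: always relative to UTC.
epoch_ms = int(aware.timestamp() * 1000)

print(naive.isoformat())  # 2022-01-01T00:00:00
print(aware.isoformat())  # 2022-01-01T00:00:00+00:00
print(epoch_ms)           # 1640995200000
```

Only the last two forms can be interpreted identically by every client. The first depends on whatever timezone the reader assumes.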
The challenge however lies with the slew of [database engines](/user-docs/databases#installing-drivers-in-docker) which Apache Superset supports and various inconsistencies between their [Python Database API (DB-API)](https://www.python.org/dev/peps/pep-0249/) implementations combined with the fact that we use [Pandas](https://pandas.pydata.org/) to read SQL into a DataFrame prior to serializing to JSON. Regrettably Pandas ignores the DB-API [type_code](https://www.python.org/dev/peps/pep-0249/#type-objects), relying by default on the underlying Python type returned by the DB-API. Currently only a subset of the supported database engines work correctly with Pandas, i.e., ensuring timestamps without an explicit timezone are serialized to JSON with the server timezone, thus guaranteeing the client will display timestamps in a consistent manner irrespective of the client's timezone.
For example the following is a comparison of MySQL and Presto,
```python
import pandas as pd
from sqlalchemy import create_engine
pd.read_sql_query(
    sql="SELECT TIMESTAMP('2022-01-01 00:00:00') AS ts",
    con=create_engine("mysql://root@localhost:3360"),
).to_json()
pd.read_sql_query(
    sql="SELECT TIMESTAMP '2022-01-01 00:00:00' AS ts",
    con=create_engine("presto://localhost:8080"),
).to_json()
```
which outputs `{"ts":{"0":1640995200000}}` (which infers the UTC timezone per the Epoch time definition) and `{"ts":{"0":"2022-01-01 00:00:00.000"}}` (without an explicit timezone) respectively and thus are treated differently in JavaScript:
```js
new Date(1640995200000)
> Sat Jan 01 2022 13:00:00 GMT+1300 (New Zealand Daylight Time)
new Date("2022-01-01 00:00:00.000")
> Sat Jan 01 2022 00:00:00 GMT+1300 (New Zealand Daylight Time)
```
If you install with Kubernetes or Docker Compose, all of these components will be created.
However, installing from PyPI only creates the application itself. Users installing from PyPI will need to configure a caching layer, worker, and beat on their own if they wish to enable the above features. Configuration of those components for a PyPI install is not currently covered in this documentation.
Here are further details on each component.
### The Superset Application
This is the core application. Superset operates like this:
- A user visits a chart or dashboard
- That triggers a SQL query to the data warehouse holding the underlying dataset
- The resulting data is served up in a data visualization
- The Superset application consists of the Python (Flask) backend application (server), the API layer, the React frontend (built via Webpack), and the static assets needed for the application to work
### Metadata Database
This is where chart and dashboard definitions, user information, logs, etc. are stored. Superset is tested to work with PostgreSQL and MySQL databases as the metadata database (not to be confused with a data source like your data warehouse, which could be any of a much greater variety of options like Snowflake, Redshift, etc.).
Some installation methods like our Quickstart and PyPI come configured by default to use a SQLite on-disk database. And in a Docker Compose installation, the data would be stored in a PostgreSQL container volume. Neither of these cases is recommended for production instances of Superset.
For production, a properly-configured, managed, standalone database is recommended. No matter what database you use, you should plan to back it up regularly.
### Caching Layer
The caching layer serves two main functions:
- Store the results of queries to your data warehouse so that when a chart is loaded twice, it pulls from the cache the second time, speeding up the application and reducing load on your data warehouse.
- Act as a message broker for the worker, enabling the Alerts & Reports, async queries, and thumbnail caching features.
Most people use Redis for their cache, but Superset supports other options too. See the [cache docs](/admin-docs/configuration/cache/) for more.
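A minimal sketch of a Redis-backed cache configuration for `superset_config.py` might look like the following. The Redis URL, key prefixes, and timeouts are placeholder assumptions, and the helper `redis_cache` is hypothetical; only the setting names follow Superset and Flask-Caching conventions.

```python
# superset_config.py (sketch): Redis-backed caches.
# The URL, prefixes, and timeouts below are placeholders, not recommendations.
REDIS_URL = "redis://localhost:6379/0"


def redis_cache(prefix: str, timeout_seconds: int) -> dict:
    """Build a Flask-Caching config dict for a Redis-backed cache."""
    return {
        "CACHE_TYPE": "RedisCache",
        "CACHE_DEFAULT_TIMEOUT": timeout_seconds,
        "CACHE_KEY_PREFIX": prefix,
        "CACHE_REDIS_URL": REDIS_URL,
    }


# Cache for Superset metadata
CACHE_CONFIG = redis_cache("superset_metadata_", 300)
# Cache for chart query results from the data warehouse
DATA_CACHE_CONFIG = redis_cache("superset_data_", 86400)
```

Distinct key prefixes keep the two caches from colliding when they share one Redis database.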
### Worker and Beat
This is one or more workers that execute tasks such as running async queries or taking snapshots of reports and sending emails, plus a "beat" that acts as the scheduler and tells workers when to perform their tasks. Most installations use Celery for these components.
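To make the worker/beat split concrete, here is a hedged sketch of a Celery configuration class for `superset_config.py`. The Redis URLs are placeholders, and the one-minute `timedelta` schedule is an assumption chosen to keep the example dependency-free; a Celery `crontab` schedule would work equally well.

```python
# superset_config.py (sketch): Celery worker and beat settings.
from datetime import timedelta


class CeleryConfig:
    # Message broker and result store (placeholder URLs)
    broker_url = "redis://localhost:6379/0"
    result_backend = "redis://localhost:6379/1"
    # Modules containing the tasks the workers should register
    imports = ("superset.sql_lab", "superset.tasks.scheduler")
    # The beat process reads this table to decide when to enqueue tasks
    beat_schedule = {
        "reports.scheduler": {
            "task": "reports.scheduler",
            "schedule": timedelta(minutes=1),
        },
    }


CELERY_CONFIG = CeleryConfig
```

The workers consume tasks from the broker, while the beat process only enqueues the entries in `beat_schedule` at the configured interval.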
## Other components
Other components can be incorporated into Superset. The best place to learn about additional configurations is the [Configuration page](/admin-docs/configuration/configuring-superset). For instance, you could set up a load balancer or reverse proxy to implement HTTPS in front of your Superset application, or specify a Mapbox URL to enable geospatial charts, etc.
Superset won't even start without certain configuration settings established, so it's essential to review that page.
How should you install Superset? Here's a comparison of the different options. It will help if you've first read the [Architecture](/admin-docs/installation/architecture) page to understand Superset's different components.
The fundamental trade-off is between you needing to do more of the detail work yourself vs. using a more complex deployment route that handles those details.
## Docker Compose
**Summary:** This takes advantage of containerization while remaining simpler than Kubernetes. This is the best way to try out Superset; it's also useful for developing & contributing back to Superset.
If you're not just demoing the software, you'll need a moderate understanding of Docker to customize your deployment and avoid a few risks. Even when fully-optimized this is not as robust a method as Kubernetes when it comes to large-scale production deployments.
You manage a `superset_config.py` file and a `docker-compose.yml` file. Docker Compose brings up all the needed services: the Superset application, a Postgres metadata DB, Redis cache, Celery worker and beat. They are automatically connected to each other.
**Responsibilities**
You will need to back up your metadata DB. That could mean backing up the service running as a Docker container and its volume; ideally you are running Postgres as a service outside of that container and backing up that service.
You will also need to extend the Superset docker image. The default `lean` images do not contain drivers needed to access your metadata database (Postgres or MySQL), nor to access your data warehouse, nor the headless browser needed for Alerts & Reports. You could run a `-dev` image while demoing Superset, which has some of this, but you'll still need to install the driver for your data warehouse. The `-dev` images run as root, which is not recommended for production.
Ideally you will build your own image of Superset that extends `lean`, adding what your deployment needs. See [Building your own production Docker image](/admin-docs/installation/docker-builds/#building-your-own-production-docker-image).
## Kubernetes (K8s)
**Summary:** This is the best-practice way to deploy a production instance of Superset, but has the steepest skill requirement - someone who knows Kubernetes.
You will deploy Superset into a K8s cluster. The most common method is using the community-maintained Helm chart, though work is now underway to implement [SIP-149 - a Kubernetes Operator for Superset](https://github.com/apache/superset/issues/31408).
A K8s deployment can scale up and down based on usage and deploy rolling updates with zero downtime - features that big deployments appreciate.
**Responsibilities**
You will need to build your own Docker image, and back up your metadata DB, both as described in Docker Compose above. You'll also need to customize your Helm chart values and deploy and maintain your Kubernetes cluster.
## [PyPI (Python)](/admin-docs/installation/pypi)
**Summary:** This is the only method that requires no knowledge of containers. It requires the most hands-on work to deploy, connect, and maintain each component.
You install Superset as a Python package and run it that way, providing your own metadata database. Superset has documentation on how to install this way, but it is updated infrequently.
If you want caching, you'll set up Redis or RabbitMQ. If you want Alerts & Reports, you'll set up Celery.
**Responsibilities**
You will need to get the component services running and communicating with each other. You'll need to arrange backups of your metadata database.
When upgrading, you'll need to manage the system environment and packages and ensure all components have functional dependencies.
```bash
superset/superset 0.1.1 1.0 Apache Superset is a modern, enterprise-ready b...
```
3. Configure your setting overrides
Just like any typical Helm chart, you'll need to craft a `values.yaml` file that defines or overrides any of the values exposed in the default [values.yaml](https://github.com/apache/superset/tree/master/helm/superset/values.yaml), or in any of the charts it depends on:
The exact list will depend on some of your specific configuration overrides but you should generally expect:
- N `superset-xxxx-yyyy` and `superset-worker-xxxx-yyyy` pods (depending on your `supersetNode.replicaCount` and `supersetWorker.replicaCount` values)
- 1 `superset-postgresql-0` depending on your postgres settings
- 1 `superset-redis-master-0` depending on your redis settings
- 1 `superset-celerybeat-xxxx-yyyy` pod if you have `supersetCeleryBeat.enabled = true` in your values overrides
1. Access it
The chart will publish appropriate services to expose the Superset UI internally within your k8s cluster. To access it externally you will have to either:
- Configure the Service as a `LoadBalancer` or `NodePort`
- Set up an `Ingress` for it - the chart includes a definition, but will need to be tuned to your needs (hostname, tls, annotations etc...)
- Run `kubectl port-forward superset-xxxx-yyyy :8088` to directly tunnel one pod's port into your localhost
Depending how you configured external access, the URL will vary. Once you've identified the appropriate URL you can log in with:
- user: `admin`
- password: `admin`
## Important settings
### Security settings
Default security settings and passwords are included but you **MUST** update them to run `prod` instances, in particular:
```yaml
postgresql:
  postgresqlPassword: superset
```
Make sure you set a unique, strong, complex alphanumeric string for your `SECRET_KEY`, and use a tool to help you generate
a sufficiently random sequence.
- To generate a good key, you can run `openssl rand -base64 42`
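If `openssl` is not at hand, a roughly equivalent key can be produced with the Python standard library. This is a sketch; the function name `generate_secret_key` is hypothetical, and the 42-byte default simply mirrors the `openssl` command above.

```python
import base64
import secrets


def generate_secret_key(num_bytes: int = 42) -> str:
    """Return `num_bytes` of cryptographically secure randomness, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


# 42 random bytes encode to a 56-character base64 string
print(generate_secret_key())
```

The `secrets` module draws from the operating system's secure random source, unlike the `random` module, which is not suitable for keys.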
:::note
Superset uses [Scarf Gateway](https://about.scarf.sh/scarf-gateway) to collect telemetry data. Knowing the installation counts for different Superset versions informs the project's decisions about patching and long-term support. Scarf purges personally identifiable information (PII) and provides only aggregated statistics.
To opt-out of this data collection in your Helm-based installation, edit the `repository:` line in your `helm/superset/values.yaml` file, replacing `apachesuperset.docker.scarf.sh/apache/superset` with `apache/superset` to pull the image directly from Docker Hub.
:::
### Dependencies
Install additional packages and do any other bootstrap configuration in the bootstrap script.
For production clusters, it's recommended to build your own image with this step done in CI.
:::note
Superset requires a Python DB-API database driver and a SQLAlchemy
dialect to be installed for each datastore you want to connect to.
See [Install Database Drivers](/user-docs/databases#installing-database-drivers) for more information.
It is recommended that you refer to versions listed in
instead of hard-coding them in your bootstrap script, as seen below.
:::
The following example installs the drivers for BigQuery and Elasticsearch, allowing you to connect to these data sources within your Superset setup:
```yaml
bootstrapScript: |
  #!/bin/bash
  uv pip install .[postgres] \
    .[bigquery] \
    .[elasticsearch] &&\
  if [ ! -f ~/bootstrap ]; then echo "Running Superset with uid {{ .Values.runAsUser }}" > ~/bootstrap; fi
```
### superset_config.py
The default `superset_config.py` is fairly minimal and you will very likely need to extend it. This is done by specifying one or more key/value entries in `configOverrides`, e.g.:
```yaml
configOverrides:
  my_override: |
    # This will make sure the redirect_uri is properly computed, even with SSL offloading
    ENABLE_PROXY_FIX = True
    FEATURE_FLAGS = {
        "DYNAMIC_PLUGINS": True
    }
```
Those will be evaluated as Helm templates and therefore will be able to reference other `values.yaml` variables e.g. `{{ .Values.ingress.hosts[0] }}` will resolve to your ingress external domain.
The entire `superset_config.py` will be installed as a secret, so it is safe to pass sensitive parameters directly... however it might be more readable to use secret env variables for that.
Full python files can be provided by running `helm upgrade --install --values my-values.yaml --set-file configOverrides.oauth=set_oauth.py`
### Environment Variables
Those can be passed as key/values either with `extraEnv` or `extraSecretEnv` if they're sensitive. They can then be referenced from `superset_config.py` using e.g. `os.environ.get("VAR")`.
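As an illustration, the sketch below assembles the metadata database URI from environment variables inside `superset_config.py`. The variable names (`DB_USER`, `DB_PASS`, `DB_HOST`, `DB_PORT`, `DB_NAME`) and the helper `build_metadata_db_uri` are assumptions made for this example; use whatever names you pass through `extraEnv` and `extraSecretEnv`.

```python
import os
from urllib.parse import quote_plus


def build_metadata_db_uri(env=os.environ) -> str:
    """Assemble a PostgreSQL SQLAlchemy URI from environment variables.

    The variable names are hypothetical; adapt them to your values.yaml.
    """
    user = env["DB_USER"]
    # Escape characters such as '@' or '/' that would break the URI
    password = quote_plus(env["DB_PASS"])
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = env.get("DB_NAME", "superset")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# In superset_config.py one would then write:
# SQLALCHEMY_DATABASE_URI = build_metadata_db_uri()
print(build_metadata_db_uri({"DB_USER": "superset", "DB_PASS": "p@ss"}))
# postgresql+psycopg2://superset:p%40ss@localhost:5432/superset
```

Indexing with `env["DB_PASS"]` rather than `.get()` makes a missing secret fail at startup instead of producing a URI with an empty password.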
```python
# Will allow user self registration, allowing to create Flask users from Authorized User
AUTH_USER_REGISTRATION = True
# The default user self registration role
AUTH_USER_REGISTRATION_ROLE = "Admin"
```
### Enable Alerts and Reports
For this, as per the [Alerts and Reports doc](/admin-docs/configuration/alerts-reports), you will need to:
#### Install a supported webdriver in the Celery worker
This is done either by using a custom image that has the webdriver pre-installed, or installing at startup time by overriding the `command`. Here's a working example for `chromedriver`:
```yaml
supersetWorker:
  command:
    - /bin/sh
    - -c
    - |
      # Install chrome webdriver
      # See https://github.com/apache/superset/blob/4fa3b6c7185629b87c27fc2c0e5435d458f7b73d/docs/src/pages/admin-docs/installation/email_reports.mdx
# This is required because our process runs as root (in order to install pip packages)
"--no-sandbox",
"--disable-setuid-sandbox",
"--disable-extensions",
]
```
### Load the Examples data and dashboards
If you are trying Superset out and want some data and dashboards to explore, you can load some examples by creating a `my_values.yaml` and deploying it as described above in the **Configure your setting overrides** step of the **Running** section.
To load the examples, add the following to the `my_values.yaml` file:
```yaml
init:
  loadExamples: true
```
:::resources
- [Tutorial: Mastering Data Visualization — Installing Superset on Kubernetes with Helm Chart](https://mahira-technology.medium.com/mastering-data-visualization-installing-superset-on-kubernetes-cluster-using-helm-chart-e4ec99199e1e)
- [Tutorial: Installing Apache Superset in Kubernetes](https://aws.plainenglish.io/installing-apache-superset-in-kubernetes-1aec192ac495)
:::
This page describes how to install Superset using the `apache_superset` package [published on PyPI](https://pypi.org/project/apache_superset/).
## OS Dependencies
Superset stores database connection information in its metadata database. For that purpose, we use
the cryptography Python library to encrypt connection passwords. Unfortunately, this library has OS
level dependencies.
**Debian and Ubuntu**
Ubuntu **24.04** uses Python 3.12 by default, which is currently not supported by Superset. You need to add a second Python installation (3.11) and install the required additional dependencies.
These will now be available when pip installing requirements.
## Python Virtual Environment
We highly recommend installing Superset inside of a virtual environment.
You can create and activate a virtual environment using the following commands. Ensure you are using a compatible version of python. You might have to explicitly use for example `python3.11` instead of `python3`.
```bash
# virtualenv is shipped in Python 3.6+ as venv instead of pyvenv.
# See https://docs.python.org/3.6/library/venv.html
python3 -m venv venv
. venv/bin/activate
```
Or with pyenv-virtualenv:
```bash
# Here we name the virtual env 'superset'
pyenv virtualenv superset
pyenv activate superset
```
Once you have activated your virtual environment, all of the Python packages you install or uninstall
will be confined to this environment. You can exit the environment by running `deactivate` on the
command line.
### Installing and Initializing Superset
First, start by installing `apache_superset`:
```bash
pip install apache_superset
```
Then, define mandatory configurations, SECRET_KEY and FLASK_APP:
```bash
export SUPERSET_SECRET_KEY=YOUR-SECRET-KEY # For production use, make sure this is a strong key, for example generated using `openssl rand -base64 42`. See https://superset.apache.org/admin-docs/configuration/configuring-superset#specifying-a-secret_key
export FLASK_APP=superset
```
Then, you need to initialize the database:
```bash
superset db upgrade
```
Finish installing by running through the following commands:
```bash
# Create an admin user in your metadata database (use `admin` as username to be able to load the examples)
superset fab create-admin
# Load some data to play with
superset load_examples
# Create default roles and permissions
superset init
# To start a development web server on port 8088, use -p to bind to another port
superset run -p 8088 --with-threads --reload --debugger
```
If everything worked, you should be able to navigate to `hostname:port` in your browser (e.g.
locally by default at `localhost:8088`) and log in using the username and password you created.
---
title: Securing Your Superset Installation for Production
sidebar_position: 3
---
> *This guide applies to Apache Superset version 4.0 and later and is an evolving set of best practices that administrators should adapt to their specific deployment architecture.*
The default Apache Superset configuration is optimized for ease of use and development, not for security. For any production deployment, it is **critical** that you review and apply the following security configurations to harden your instance, protect user data, and prevent unauthorized access.
This guide provides a comprehensive checklist of essential security configurations and best practices.
Running Superset without HTTPS (TLS) is not secure. Without it, all network traffic—including user credentials, session tokens, and sensitive data—is sent in cleartext and can be easily intercepted.
* **Use a Reverse Proxy:** Your Superset instance should always be deployed behind a reverse proxy (e.g., Nginx, Traefik) or a load balancer (e.g., AWS ALB, Google Cloud Load Balancer) that is configured to handle HTTPS termination.
* **Enforce Modern TLS:** Configure your proxy to enforce TLS 1.2 or higher with strong, industry-standard cipher suites.
* **Implement HSTS:** Use the HTTP Strict Transport Security (HSTS) header to ensure browsers only connect to your Superset instance over HTTPS. This can be configured in your reverse proxy or within Superset's Talisman settings.
The `SUPERSET_SECRET_KEY` is the most critical security setting for your Superset instance. It is used to sign all session cookies and encrypt sensitive information in the metadata database, such as database connection credentials.
* **Generate a Unique, Strong Key:** A unique key must be generated for every Superset instance. Use a cryptographically secure method to create it.
```bash
# Example using openssl to generate a strong key
openssl rand -base64 42
```
* **Store the Key Securely:** The key must be kept confidential. The recommended approach is to store it as an environment variable or in a secrets management system (e.g., AWS Secrets Manager, HashiCorp Vault). **Do not hardcode the key in `superset_config.py` or commit it to version control.**
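One way to follow this advice is to read the key from the environment in `superset_config.py` and refuse to start when it is absent. The sketch below assumes a hypothetical helper `load_secret_key` and an arbitrary minimum length of 32 characters; neither is part of Superset itself.

```python
import os


def load_secret_key(env=os.environ, min_length: int = 32) -> str:
    """Read SUPERSET_SECRET_KEY from the environment, failing fast if unusable.

    `min_length` is an illustrative threshold, not an official requirement.
    """
    key = env.get("SUPERSET_SECRET_KEY", "")
    if len(key) < min_length:
        raise RuntimeError(
            "SUPERSET_SECRET_KEY is missing or too short; refusing to start"
        )
    return key


# In superset_config.py one would then write:
# SECRET_KEY = load_secret_key()
```

Failing at startup is preferable to silently falling back to a default key, since a default key would be shared with every other installation that made the same mistake.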
> #### ⚠️ Warning: Your `SUPERSET_SECRET_KEY` Must Be Unique
>
> **NEVER** reuse the same `SUPERSET_SECRET_KEY` across different environments (e.g., development, staging, production) or different Superset instances. Reusing a key allows cryptographically signed session cookies to be used across those instances, which can lead to a full authentication bypass if a cookie is compromised. Treat this key like a master password.
### **Session Management Security (CRITICAL)**
Properly configuring user sessions is essential to prevent session hijacking and ensure that sessions are terminated correctly.
#### **Use a Server-Side Session Backend (Strongly Recommended for Production)**
The default stateless cookie-based session handling presents challenges for immediate session invalidation upon logout. For all production deployments, we strongly recommend configuring an optional server-side session backend like Redis, Memcached, or a database. This ensures that session data is stored securely on the server and can be instantly destroyed upon logout, rendering any copied session cookies immediately useless.
**Example `superset_config.py` for Redis:**
```python
# superset_config.py
from redis import Redis
import os
# 1. Enable server-side sessions
SESSION_SERVER_SIDE = True
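# 2. Choose your backend (e.g., 'redis', 'memcached', 'filesystem', 'sqlalchemy')
SESSION_TYPE = 'redis'
# 3. Configure the Redis connection
# (the environment variable names below are illustrative; use your own)
SESSION_REDIS = Redis(
    host=os.environ.get('REDIS_HOST', 'localhost'),
    port=int(os.environ.get('REDIS_PORT', '6379')),
    password=os.environ.get('REDIS_PASSWORD'),
    ssl=True,
    ssl_cert_reqs='required'  # Or another appropriate SSL setting
)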
# 4. Ensure the session cookie is signed for integrity
SESSION_USE_SIGNER = True
```
#### **Configure Session Lifetime and Cookie Security Flags**
This is mandatory for *all* deployments, whether stateless or server-side.
```python
# superset_config.py
from datetime import timedelta
# Set a short absolute session timeout
# The default is 31 days, which is NOT recommended for production.
PERMANENT_SESSION_LIFETIME = timedelta(hours=8)
# Enforce secure cookie flags to prevent browser-based attacks
SESSION_COOKIE_SECURE = True # Transmit cookie only over HTTPS
SESSION_COOKIE_HTTPONLY = True # Prevent client-side JS from accessing the cookie
SESSION_COOKIE_SAMESITE = 'Lax' # Provide protection against CSRF attacks
```
> ##### Note on iFrame Embedding and `SESSION_COOKIE_SAMESITE`
>
> The recommended default setting `'Lax'` provides good CSRF protection for most use cases. However, if you need to embed Superset dashboards into other applications using an iFrame, you will need to change this setting to `'None'`:
>
> `SESSION_COOKIE_SAMESITE = 'None'`
>
> Setting `SameSite` to `'None'` requires that `SESSION_COOKIE_SECURE` is also set to `True`. Be aware that this configuration disables some of the browser's built-in CSRF protections to allow for cross-domain functionality, so it should only be used when iFrame embedding is necessary.
### **Authentication and Authorization**
While Superset's built-in database authentication is convenient, for production it's highly recommended to integrate with an enterprise-grade identity provider (IdP).
* **Use an Enterprise IdP:** Configure authentication via OAuth or LDAP to leverage your organization's existing identity management system. This provides benefits like Single Sign-On (SSO), Multi-Factor Authentication (MFA), and centralized user provisioning/deprovisioning.
* **Principle of Least Privilege:** Assign users to the most restrictive roles necessary for their jobs. Avoid over-provisioning users with Admin or Alpha roles, and ensure row-level security is applied where appropriate.
* **Admin Accounts:** Delete or disable the default admin user after a new administrative account has been configured.
### **Content Security Policy (CSP) and Other Headers**
Superset can use Flask-Talisman to set security headers. However, it must be explicitly enabled.
> #### ⚠️ Important: Talisman is Disabled by Default
>
> In Superset 4.0 and later, Talisman is disabled by default (`TALISMAN_ENABLED = False`). You **must** explicitly enable it in your `superset_config.py` for the security headers defined in `TALISMAN_CONFIG` to take effect.
See the documentation section on how to set up Talisman: https://superset.apache.org/admin-docs/security/#content-security-policy-csp
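As a rough sketch of what enabling Talisman could look like in `superset_config.py`, consider the following. The policy values are illustrative assumptions that will almost certainly need tuning for your deployment (for example, additional `connect-src` or `img-src` origins), so treat this as a starting point rather than a vetted policy.

```python
# superset_config.py (sketch): turn on Flask-Talisman security headers.
TALISMAN_ENABLED = True

TALISMAN_CONFIG = {
    # Illustrative CSP; extend the source lists to match your deployment
    "content_security_policy": {
        "default-src": ["'self'"],
        "img-src": ["'self'", "blob:", "data:"],
        "worker-src": ["'self'", "blob:"],
        "connect-src": ["'self'"],
        "object-src": "'none'",
        "style-src": ["'self'", "'unsafe-inline'"],
        "script-src": ["'self'", "'strict-dynamic'"],
    },
    # Attach a per-request nonce to script tags
    "content_security_policy_nonce_in": ["script-src"],
    # HTTPS is assumed to be enforced at the reverse proxy in this sketch
    "force_https": False,
    "session_cookie_secure": True,
}
```

After changing the policy, check the browser console for CSP violations, since an overly strict policy can block assets that dashboards legitimately load.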
### **Database Security**
> #### ❗ Superset is Not a Database Firewall
>
> It is essential to understand that **Apache Superset is a data visualization and exploration platform, not a database firewall or a comprehensive security solution for your data warehouse.** While Superset provides features to help manage data access, the ultimate responsibility for securing your underlying databases lies with your database administrators (DBAs) and security teams. This includes managing network access, user privileges, and fine-grained permissions directly within the database. The configurations below are an important secondary layer of security but should not be your only line of defense.
* **Use a Dedicated Database User:** The database connection configured in Superset should use a dedicated, limited-privilege database user. This user should only have the minimum required permissions (e.g., `SELECT` on specific schemas) for the data sources it needs to query. It should **not** have `INSERT`, `UPDATE`, `DELETE`, or administrative privileges.
* **Restrict Dangerous SQL Functions:** To mitigate potential SQL injection risks, configure the `DISALLOWED_SQL_FUNCTIONS` list in your `superset_config.py`. Be aware that this is a defense-in-depth measure, not a substitute for proper database permissions.
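A hedged sketch of the `DISALLOWED_SQL_FUNCTIONS` setting follows. It assumes the setting maps an engine name to a set of function names, which should be checked against the `config.py` shipped with your Superset version; the functions listed are examples only, not a complete deny-list.

```python
# superset_config.py (sketch): block selected SQL functions per database engine.
# Assumption: the setting is a dict of engine name -> set of function names.
DISALLOWED_SQL_FUNCTIONS = {
    "postgresql": {
        "version",
        "query_to_xml",
        "inet_server_addr",
        "inet_client_addr",
    },
    "mysql": {"version"},
}
```

Because a deny-list can never be exhaustive, the database user's own privileges remain the primary control.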
### **Additional Security Layers**
* **Web Application Firewall (WAF):** Deploying Superset behind a WAF (e.g., Cloudflare, AWS WAF) is strongly recommended. A WAF with a standard ruleset (like the OWASP Core Rule Set) provides a critical layer of defense against common attacks like SQL Injection, XSS, and remote code execution.
### **Monitoring and Logging**
* **Configure Structured Logging:** Set up a robust logging configuration to capture important security events.
* **Centralize Logs:** Ship logs from all Superset components (frontend, worker, etc.) to a centralized SIEM (Security Information and Event Management) system for analysis and alerting.
* **Monitor Key Events:** Create alerts for suspicious activities, including:
* Multiple failed login attempts for a single user or from a single IP address.
* Changes to user roles or permissions.
* Creation or deletion of high-privilege users.
* Attempts to use disallowed SQL functions.
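The first alert in the list above can be sketched as a sliding-window count over failed-login events. The event shape (a timestamp paired with a user name or IP address), the threshold of 5, and the 10-minute window are all assumptions for illustration; in practice this logic would usually live in your SIEM's rule language.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def find_bruteforce_keys(events, threshold=5, window=timedelta(minutes=10)):
    """Return keys (user or IP) with at least `threshold` failures inside `window`.

    `events` is an iterable of (timestamp, key) tuples for failed logins,
    sorted by timestamp.
    """
    recent = defaultdict(deque)
    flagged = set()
    for ts, key in events:
        failures = recent[key]
        failures.append(ts)
        # Drop failures that have fallen out of the sliding window
        while ts - failures[0] > window:
            failures.popleft()
        if len(failures) >= threshold:
            flagged.add(key)
    return flagged


start = datetime(2024, 1, 1, 12, 0)
events = [(start + timedelta(minutes=i), "alice") for i in range(5)]
events.append((start + timedelta(minutes=2), "bob"))
events.sort()
print(find_bruteforce_keys(events))  # {'alice'}
```

Running the same function once keyed by user name and once keyed by source IP covers both variants of the alert.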
-----
### **Appendix A: Production Deployment Checklist**
#### **Initial Setup:**
- [ ] HTTPS/TLS is configured and enforced via a reverse proxy.
- [ ] A unique, strong `SUPERSET_SECRET_KEY` is generated and secured in an environment variable or secrets vault.
- [ ] Server-side session management is configured (e.g., Redis).
- [ ] `PERMANENT_SESSION_LIFETIME` is set to a short duration (e.g., 8 hours).
- [ ] All session cookie security flags (`Secure`, `HttpOnly`, `SameSite`) are enabled.
- [ ] `DEBUG` mode is set to `False`.
- [ ] Talisman is explicitly enabled and configured with a strict Content Security Policy.
- [ ] Database connections use dedicated, limited-privilege accounts.
- [ ] Authentication is integrated with an enterprise identity provider (OAuth/LDAP).
- [ ] A Web Application Firewall (WAF) is deployed in front of Superset.
- [ ] Logging is configured and logs are shipped to a central monitoring system.
#### **Ongoing Maintenance:**
- [ ] Regularly update to the latest major or minor versions of Superset. Those versions receive up-to-date security patches.
- [ ] Rotate the `SUPERSET_SECRET_KEY` periodically (e.g., quarterly) and after any potential security incident.
- [ ] Conduct quarterly access reviews for all users.
- [ ] Assuming logging and monitoring are in place, review security monitoring alerts weekly.
### **Appendix B: `SECRET_KEY` Rotation and Compromise Response**
**Why and When to Rotate the `SECRET_KEY`**
Rotating the `SUPERSET_SECRET_KEY` is a critical security procedure. It is mandatory after a known or suspected compromise and is a best practice when an employee with access to the key departs. While periodic rotation can limit the window of exposure for an unknown leak, it is a high-impact operation that will invalidate all user sessions and requires careful execution to avoid breaking your instance. The principles behind managing this key align with general best practices for cryptographic storage, which are further detailed in the OWASP Cryptographic Storage Cheat Sheet here: https://cheatsheetseries.owasp.org/cheatsheets/Cryptographic_Storage_Cheat_Sheet.html
**Procedure for Rotating the Key**
The procedure for safely rotating the SECRET_KEY must be followed precisely to avoid locking yourself out of your instance. The official Apache Superset documentation maintains the correct, up-to-date procedure. Please follow the official guide here:
:::resources
- [Blog: Running Apache Superset on the Open Internet](https://preset.io/blog/running-apache-superset-on-the-open-internet-a-report-from-the-fireline/)
- [Blog: How Security Vulnerabilities are Reported & Handled in Apache Superset](https://preset.io/blog/how-security-vulnerabilities-are-reported-and-handled-in-apache-superset/)
:::
Authentication and authorization in Superset are handled by Flask AppBuilder (FAB), an application development framework
built on top of Flask. FAB provides authentication, user management, permissions and roles.
Please read its [Security documentation](https://flask-appbuilder.readthedocs.io/en/latest/security.html).
### Provided Roles
Superset ships with a set of roles that are handled by Superset itself. You can assume
that these roles will stay up-to-date as Superset evolves (and as you update Superset versions).
Even though **Admin** users have the ability, we don't recommend altering the
permissions associated with each role (e.g. by removing or adding permissions to them). The permissions
associated with each role will be re-synchronized to their original values when you run
the **superset init** command (often done between Superset versions).
A table with the permissions for these roles can be found at [/RESOURCES/STANDARD_ROLES.md](https://github.com/apache/superset/blob/master/RESOURCES/STANDARD_ROLES.md).
### Admin
Admins have all possible rights, including granting or revoking rights from other
users and altering other people’s slices and dashboards.
>#### Threat Model and Privilege Boundaries: The Admin Role
>
>Apache Superset is built with a granular permission model where users assigned the Admin role are considered fully trusted. Admins possess complete control over the application's configuration, UI rendering, and access controls.
>
>Consequently, actions performed by an Admin that alter the application's behavior or presentation—such as injecting custom CSS, modifying Jinja templates, or altering security flags—are intended administrative capabilities by design.
>
>In accordance with MITRE CNA Rule 4.1, a vulnerability must represent a violation of an explicit security policy. Because the Admin role is defined as a trusted operational boundary, actions executed with Admin privileges do not cross a security perimeter. Therefore, exploit vectors that strictly require Admin access are not classified as security vulnerabilities and are ineligible for CVE assignment.
### Alpha
Alpha users have access to all data sources, but they cannot grant or revoke access
from other users. They are also limited to altering the objects that they own. Alpha users can add and alter data sources.
### Gamma
Gamma users have limited access. They can only consume data coming from data sources
they have been given access to through another complementary role. They only have access to
view the slices and dashboards made from data sources that they have access to. Currently Gamma
users are not able to alter or add data sources. We assume that they are mostly content consumers, though they can create slices and dashboards.
Also note that when Gamma users look at the dashboards and slices list view, they will
only see the objects that they have access to.
### sql_lab
The **sql_lab** role grants access to SQL Lab. Note that while **Admin** users have access
to all databases by default, both **Alpha** and **Gamma** users need to be given access on a per database basis.
Beyond the base `sql_lab` role, two additional SQL Lab permissions must be explicitly granted for users who need these capabilities:
| Permission | Feature |
|------------|---------|
| `can_estimate_query_cost` on `SQLLab` | Estimate query cost before running |
| `can_format_sql` on `SQLLab` | Format SQL using the database's dialect |
Grant these in **Security → List Roles** by adding the permissions to the relevant role.
### Public
The **Public** role is the most restrictive built-in role, designed specifically for anonymous/unauthenticated
users who need to view dashboards. It provides minimal read-only access for:
- Viewing dashboards and charts
- Using interactive dashboard filters
- Accessing dashboard and chart permalinks
- Reading embedded dashboards
- Viewing annotations on charts
The Public role explicitly excludes:
- Any write permissions on dashboards, charts, or datasets
- SQL Lab access
- Share functionality
- User profile or admin features
- Menu access to most Superset features
Anonymous users are automatically assigned the Public role when `AUTH_ROLE_PUBLIC` is configured
(a Flask-AppBuilder setting). The `PUBLIC_ROLE_LIKE` setting is **optional** and controls what
permissions are synced to the Public role when you run `superset init`:
```python
# Optional: Sync sensible default permissions to the Public role
PUBLIC_ROLE_LIKE = "Public"
# Alternative: Copy permissions from Gamma for broader access
# PUBLIC_ROLE_LIKE = "Gamma"
```
If you prefer to manually configure the Public role's permissions (or use `DASHBOARD_RBAC` to
grant access at the dashboard level), you do not need to set `PUBLIC_ROLE_LIKE`.
**Important notes:**
- **Data access is still required:** The Public role only grants UI/API permissions. You must
also grant access to specific datasets necessary to view a dashboard. As with other roles,
this can be done in two ways:
- **Without `DASHBOARD_RBAC`:** Dashboards only appear in the list and are accessible if
the user has permission to at least one of their datasets. Grant dataset access by editing
the Public role in the Superset UI (Menu → Security → List Roles → Public) and adding the
relevant data sources. All published dashboards using those datasets become visible.
- **With `DASHBOARD_RBAC` enabled:** Anonymous users will only see dashboards where the
"Public" role has been explicitly added in the dashboard's properties. Dataset permissions
are not required—DASHBOARD_RBAC handles the cascading permissions check. This provides
fine-grained control over which dashboards are publicly visible.
- **Role synchronization:** Built-in role permissions (Admin, Alpha, Gamma, sql_lab, and Public
when `PUBLIC_ROLE_LIKE = "Public"`) are synchronized when you run `superset init`. Any manual
permission edits to these roles may be overwritten during upgrades. To customize the Public
role permissions, you can either:
- Edit the Public role directly and avoid setting `PUBLIC_ROLE_LIKE` (permissions won't be
overwritten by `superset init`)
- Copy the Public role via "Copy Role" in the Superset web UI, save it under a different name
(e.g., "Public_Custom"), customize the permissions, then update **both** configs:
`PUBLIC_ROLE_LIKE = "Public_Custom"` and `AUTH_ROLE_PUBLIC = "Public_Custom"`
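The second option above might look like this in `superset_config.py`. This is a minimal sketch that assumes a copied role named `Public_Custom` (an illustrative name) already exists:

```python
# superset_config.py
# "Public_Custom" is an illustrative name for a role created via "Copy Role".

# Role assigned to anonymous (unauthenticated) users (Flask-AppBuilder setting).
AUTH_ROLE_PUBLIC = "Public_Custom"

# Role used as the source of permissions synced by `superset init`.
PUBLIC_ROLE_LIKE = "Public_Custom"
```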
### Managing Data Source Access for Gamma Roles
Here’s how to provide users access to only specific datasets. First make sure the users with
limited access have only the Gamma role assigned to them. Second, create a new role: open Menu → Security → List Roles and click the + sign.
This new window allows you to give this new role a name, attribute it to users and select the
tables in the **Permissions** dropdown. To select the data sources you want to associate with this role, simply click on the dropdown and use the typeahead to search for your table names.
You can then confirm with users assigned to the **Gamma** role that they see the
objects (dashboards and slices) associated with the tables you just granted them access to.
### Dashboard Access Control
Access to dashboards is managed via owners (users that have edit permissions to the dashboard).
Non-owner user access can be managed in two ways. Note that dashboards must be published to be
visible to other users.
#### Dataset-Based Access (Default)
By default, users can view published dashboards if they have access to at least one dataset
used in that dashboard. Grant dataset access by adding the relevant data source permissions
to a role (Menu → Security → List Roles).
This is the simplest approach but provides all-or-nothing access based on dataset permissions—
if a user has access to a dataset, they can see all published dashboards using that dataset.
#### Dashboard-Level Access (DASHBOARD_RBAC)
For fine-grained control over which dashboards specific roles can access, enable the
`DASHBOARD_RBAC` feature flag:
```python
FEATURE_FLAGS = {
"DASHBOARD_RBAC": True,
}
```
With this enabled, you can assign specific roles to each dashboard in its properties. Users
will only see dashboards where their role is explicitly added.
**Important considerations:**
- Dashboard access **bypasses** dataset-level checks—granting a role access to a dashboard
implicitly grants read access to all charts and datasets in that dashboard
- Dashboards without any assigned roles fall back to dataset-based access
- The dashboard must still be published to be visible
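The rules for non-owner access described in this section can be summarized with a small model. This is only an illustrative sketch of the documented behavior, not Superset's actual implementation; the `Dashboard` class and `non_owner_can_view` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Dashboard:
    """Hypothetical, simplified stand-in for a Superset dashboard."""

    published: bool
    datasets: set[str]
    # Roles assigned in the dashboard's properties (only used with DASHBOARD_RBAC).
    roles: set[str] = field(default_factory=set)


def non_owner_can_view(
    dashboard: Dashboard,
    user_roles: set[str],
    user_datasets: set[str],
    rbac_enabled: bool,
) -> bool:
    # Unpublished dashboards are never visible to non-owners.
    if not dashboard.published:
        return False
    # With DASHBOARD_RBAC, assigned roles decide access and dataset checks are bypassed.
    if rbac_enabled and dashboard.roles:
        return bool(dashboard.roles & user_roles)
    # Otherwise fall back to dataset-based access: one accessible dataset is enough.
    return bool(dashboard.datasets & user_datasets)
```

For example, a published dashboard with the `Public` role assigned is visible to a user holding only `Public` when the flag is on, even if that user has no dataset permissions.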
This feature is particularly useful for:
- Making specific dashboards public while keeping others private
- Granting access to dashboards without exposing the underlying datasets for other uses
- Creating dashboard-specific access patterns that don't align with dataset ownership
### SQL Execution Security Considerations
Apache Superset includes features designed to provide safeguards when interacting with connected databases, such as the `DISALLOWED_SQL_FUNCTIONS` configuration setting. This aims to prevent the execution of potentially harmful database functions or system variables directly from Superset interfaces like SQL Lab.
However, it is crucial to understand the following:
**Superset is Not a Database Firewall**: Superset's built-in checks, like `DISALLOWED_SQL_FUNCTIONS`, provide a layer of protection but cannot guarantee complete security against all database-level threats or advanced bypass techniques (like specific comment injection methods). They should be viewed as a supplement to, not a replacement for, robust database security.
**Configuration is Key**: The effectiveness of Superset's safeguards heavily depends on proper configuration by the Superset administrator. This includes maintaining the `DISALLOWED_SQL_FUNCTIONS` list, carefully managing feature flags (like `ENABLE_TEMPLATE_PROCESSING`), and configuring other security settings appropriately.
**Database Security is Paramount**: The ultimate responsibility for securing database access, controlling permissions, and preventing unauthorized function execution lies with the database administrators (DBAs) and security teams managing the underlying database instance.
**Recommended Database Practices**: We strongly recommend implementing security best practices at the database level, including:
* **Least Privilege**: Connecting Superset using dedicated database user accounts with the minimum permissions required for Superset's operation (typically read-only access to necessary schemas/tables).
* **Database Roles & Permissions**: Utilizing database-native roles and permissions to restrict access to sensitive functions, system variables (like `@@hostname`), schemas, or tables.
* **Network Security**: Employing network-level controls like database firewalls or proxies to restrict connections.
* **Auditing**: Enabling database-level auditing to monitor executed queries and access patterns.
By combining Superset's configurable safeguards with strong database-level security practices, you can achieve a more robust and layered security posture.
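As a rough illustration, a `DISALLOWED_SQL_FUNCTIONS` override in `superset_config.py` could look like the sketch below. It assumes the setting maps a database engine name to a set of blocked function names; the entries shown are examples only, so check the structure and defaults in the `config.py` of your Superset version before relying on it.

```python
# superset_config.py
# Assumption: keys are engine names, values are sets of function names to block.
# The specific entries are illustrative, not a recommended or complete list.
DISALLOWED_SQL_FUNCTIONS = {
    "postgresql": {"version", "pg_read_file", "inet_client_addr"},
    "mysql": {"version"},
}
```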
**Dataset Sample Access**: The `get_samples()` endpoint now enforces datasource-level access control. Users can only fetch sample rows from datasets they have been explicitly granted access to — the same permission check applied when running chart queries. This closes a prior gap where unauthenticated or under-privileged access could retrieve sample data.
### REST API for user & role management
Flask-AppBuilder supports a REST API for user CRUD,
but this feature is in beta and is not enabled by default in Superset.
To enable this feature, set the following in your Superset configuration:
```python
FAB_ADD_SECURITY_API = True
```
Once configured, the documentation for additional "Security" endpoints will be visible in Swagger for you to explore.
### API Key Authentication
Superset supports long-lived API keys for service accounts, CI/CD pipelines, and programmatic integrations (including MCP clients).
#### Enabling API Key Authentication
API key authentication is **disabled by default**. To turn it on, set the Flask-AppBuilder config value in `superset_config.py` and also enable the matching feature flag so the management UI is exposed:
```python
FAB_API_KEY_ENABLED = True
FEATURE_FLAGS = {
"FAB_API_KEY_ENABLED": True,
}
```
The config value registers the `ApiKeyApi` blueprint on the backend; the feature flag controls whether the UI for managing keys appears for the user. See the [Feature Flags](/admin-docs/configuration/feature-flags) documentation for more on feature flag configuration.
#### Creating an API Key
Once enabled, each user manages their own keys from their profile page:
1. Open the user menu (top-right) and click **Info** to navigate to the User Info page
2. Expand the **API Keys** section
3. Click **+ API Key**
4. Enter a name and (optionally) an expiration date
5. Copy the generated token — it is shown only once
Only users with the `can_read` and `can_write` permissions on `ApiKey` (granted by default to Admins) can manage API keys.
#### Using an API Key
Pass the key as a Bearer token in the `Authorization` header:
```
Authorization: Bearer <your-api-key>
```
This works for all REST API endpoints and the MCP server. The request is executed with the permissions of the user who created the key.
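A minimal sketch of building such a request with Python's standard library follows. The host name and the key value are placeholders, and `build_request` is a hypothetical helper; in practice the key would be loaded from a secret store rather than hard-coded.

```python
import urllib.request

SUPERSET_URL = "https://superset.example.com"  # placeholder host
API_KEY = "<your-api-key>"  # placeholder; load from a secret store in practice


def build_request(path: str) -> urllib.request.Request:
    """Return a request for `path` carrying the API key as a Bearer token."""
    return urllib.request.Request(
        f"{SUPERSET_URL}{path}",
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Accept": "application/json",
        },
    )


request = build_request("/api/v1/dashboard/")
# To send it against a real deployment: urllib.request.urlopen(request)
```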
#### Use Cases
- **CI/CD pipelines** — automated chart/dashboard exports and imports
- **MCP integrations** — connect AI assistants without interactive login
- **External services** — dashboards embedded in other applications
- **Service accounts** — long-lived credentials that don't expire with session cookies
:::caution
Store API keys securely. Anyone with a valid key can make requests on behalf of the creating user. Revoke keys promptly if they are compromised by deleting them from the **API Keys** section of your User Info page.
:::
### Customizing Permissions
The permissions exposed by FAB are very granular and allow for a great level of
customization. FAB creates many permissions automatically for each model that is
created (can_add, can_delete, can_show, can_edit, …) as well as for each view.
On top of that, Superset can expose more granular permissions like **all_datasource_access**.
**We do not recommend altering the 3 base roles, as Superset is built upon a set of
assumptions about them**. You can, however, create your own roles and union them with existing ones.
### Permissions
Roles are composed of a set of permissions, and Superset has many categories of
permissions. Here are the different categories of permissions:
- Model & Action: models are entities like Dashboard, Slice, or User. Each model has
a fixed set of permissions, like **can_edit**, **can_show**, **can_delete**, **can_list**, **can_add**,
and so on. For example, you can allow a user to delete dashboards by adding **can_delete** on
Dashboard entity to a role and granting this user that role.
- Views: views are individual web pages, like the Explore view or the SQL Lab view.
When granted to a user, they will see that view in their menu items, and be able to load that page.
- Data source: For each data source, a permission is created. If the user does not have the
`all_datasource_access` permission granted, the user will only be able to see Slices or explore the data sources that are granted to them.
- Database: Granting access to a database allows the user to access all
data sources within that database, and will enable the user to query that
database in SQL Lab, provided that the SQL Lab specific permissions have been granted to the user.
### Restricting Access to a Subset of Data Sources
We recommend giving a user the **Gamma** role plus any other roles that would add
access to specific data sources. We recommend that you create individual roles for
each access profile. For example, the users on the Finance team might have access to a set of
databases and data sources; these permissions can be consolidated in a single role.
Users with this profile then need to be assigned the **Gamma** role as a foundation to
the models and views they can access, and that Finance role that is a collection of permissions to data objects.
A user can have multiple roles associated with them. For example, an executive on the Finance
team could be granted **Gamma**, **Finance**, and the **Executive** roles. The **Executive**
role could provide access to a set of data sources and dashboards made available only to executives.
In the **Dashboards** view, a user can only see the ones they have access to
based on the roles and permissions that were attributed.
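Conceptually, a user's effective permissions are the union of the permissions of all assigned roles. The sketch below models that idea with plain sets; the role contents and permission strings are made up for illustration and do not reflect Superset's real permission names.

```python
# Illustrative role definitions; the permission strings are hypothetical.
ROLES = {
    "Gamma": {"can_read on Dashboard", "can_read on Chart"},
    "Finance": {"datasource access on finance.revenue"},
    "Executive": {"datasource access on finance.board_kpis"},
}


def effective_permissions(user_roles: list[str]) -> set[str]:
    """Union the permissions of every role assigned to the user."""
    permissions: set[str] = set()
    for role in user_roles:
        permissions |= ROLES.get(role, set())
    return permissions


analyst = effective_permissions(["Gamma", "Finance"])
executive = effective_permissions(["Gamma", "Finance", "Executive"])
```

Here `executive` contains everything in `analyst` plus the access that only the **Executive** role adds.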
### Row Level Security
Using Row Level Security filters (under the **Security** menu) you can create filters
that are assigned to a particular dataset, as well as a set of roles.
If you want members of the Finance team to only have access to
rows where `department = "finance"`, you could:
- Create a Row Level Security filter with that clause (`department = "finance"`)
- Then assign the clause to the **Finance** role and the dataset it applies to
The **clause** field, which can contain arbitrary text, is then added to the generated
SQL statement's WHERE clause. So you could even do something like create a filter
for the last 30 days and apply it to a specific role, with a clause
like `date_field > DATE_SUB(NOW(), INTERVAL 30 DAY)`. It can also support
multiple conditions: `client_id = 6 AND advertiser = "foo"`, etc.
RLS clauses also support **Jinja templating** when `ENABLE_TEMPLATE_PROCESSING` is enabled, so you can write dynamic filters such as
`user_id = '{{ current_username() }}'` to restrict rows based on the logged-in user.
#### Filter Types
There are two types of RLS filters:
- **Regular** — The filter clause is applied when the querying user belongs to one of the
roles assigned to the filter. Use this to restrict what specific roles can see.
- **Base** — The filter clause is applied to **all** users _except_ those in the assigned
roles. Use this to define a default restriction that privileged roles (e.g. Admin) are
exempt from. For example, a Base filter with clause `1 = 0` and the Admin role would
hide all rows from everyone except Admin — useful as a deny-by-default baseline.
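The difference between the two types comes down to one membership check. The following is a simplified sketch of the behavior described above, not Superset's code; `filter_applies` is a hypothetical helper.

```python
def filter_applies(filter_type: str, filter_roles: set[str], user_roles: set[str]) -> bool:
    """Decide whether an RLS filter's clause is applied for a given user."""
    in_assigned_role = bool(filter_roles & user_roles)
    if filter_type == "Regular":
        # Applied only to users who hold one of the assigned roles.
        return in_assigned_role
    if filter_type == "Base":
        # Applied to everyone except users who hold one of the assigned roles.
        return not in_assigned_role
    raise ValueError(f"unknown filter type: {filter_type!r}")
```

With a Base filter assigned to `Admin`, a user holding only `Gamma` gets the clause applied, while an `Admin` user is exempt.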
#### Group Keys and Filter Combination
All applicable RLS filters are combined before being added to the query. The combination
rules are:
- Filters that share the **same group key** are combined with **OR** (any match within
the group is sufficient).
- Different filter groups (different group keys, or no group key) are combined with
**AND** (all groups must match).
- Filters with **no group key** are each treated as their own group and are always AND'd.
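The combination rules can be sketched as a small function that builds the final predicate. This is an illustration of the rules above under simplifying assumptions (clauses are plain strings, a missing group key is `None`); `combine_rls_clauses` is a hypothetical name, not a Superset API.

```python
from collections import defaultdict
from typing import Optional


def combine_rls_clauses(filters: list[tuple[str, Optional[str]]]) -> str:
    """Combine (clause, group_key) pairs into a single WHERE predicate."""
    grouped: dict[str, list[str]] = defaultdict(list)
    ungrouped: list[str] = []
    for clause, group_key in filters:
        if group_key is None:
            ungrouped.append(clause)  # each ungrouped filter is its own group
        else:
            grouped[group_key].append(clause)

    parts = [f"({clause})" for clause in ungrouped]
    for clauses in grouped.values():
        # Filters sharing a group key are OR'd together.
        parts.append("(" + " OR ".join(f"({c})" for c in clauses) + ")")
    # Separate groups are AND'd together.
    return " AND ".join(parts)


predicate = combine_rls_clauses(
    [
        ("department = 'finance'", "dept"),
        ("department = 'sales'", "dept"),
        ("region = 'EU'", None),
    ]
)
# predicate == "(region = 'EU') AND ((department = 'finance') OR (department = 'sales'))"
```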
Setting `TALISMAN_ENABLED = True` will invoke Talisman's protection with its default arguments,
of which `content_security_policy` is only one. Those can be found in the
[Talisman documentation](https://pypi.org/project/flask-talisman/) under *Options*.
These generally improve security, but administrators should be aware of their existence.
In particular, the option of `force_https = True` (`False` by default) may break Superset's Alerts & Reports
if workers are configured to access charts via a `WEBDRIVER_BASEURL` beginning
with `http://`. As long as a Superset deployment enforces https upstream, e.g.,
through a load balancer or application gateway, it should be acceptable to keep this
option disabled. Otherwise, you may want to enable `force_https` like this:
```python
TALISMAN_CONFIG = {
"force_https": True,
"content_security_policy": { ...
```
#### Configuring Talisman in Superset
Talisman settings in Superset can be modified using `superset_config.py`. If you need to adjust security policies, you can override the default configuration.
Example: Overriding the Talisman configuration in `superset_config.py` to load images from S3 or other external sources.
```python
TALISMAN_CONFIG = {
"content_security_policy": {
"base-uri": ["'self'"],
"default-src": ["'self'"],
"img-src": [
"'self'",
"blob:",
"data:",
"https://apachesuperset.gateway.scarf.sh",
"https://static.scarf.sh/",
# "https://cdn.brandfolder.io", # Uncomment when SLACK_ENABLE_AVATARS is True # noqa: E501
"ows.terrestris.de",
"aws.s3.com", # Add Your Bucket or external data source
The Apache Software Foundation takes a rigorous stance on eliminating security issues in its
software projects. Apache Superset is highly sensitive and responsive to issues pertaining to its
features and functionality.
If you have concerns regarding Superset security, or you discover a vulnerability or potential
threat, don’t hesitate to get in touch with the Apache Security Team by sending an email to
security@apache.org. In the email, specify the project name Superset and describe the
issue or potential threat. You are also urged to include steps to reproduce the
issue. The security team and the Superset community will get back to you after assessing and
analyzing the findings.
PLEASE report the security issue to the security email address before disclosing it
publicly. The ASF Security Team maintains a page describing how vulnerabilities
and potential threats are handled; see [their web page](https://apache.org/security/committers.html)
for more details.