mirror of https://github.com/apache/superset.git synced 2026-06-01 13:49:21 +00:00


Xiao Hanyu b71f551493 Optimize presto SQL Lab query performance. (#5132)

Stop polling once the Presto query has already finished.

When a user queries Presto via SQL Lab, Presto runs the query and can
then return all the data to Superset in one shot.

However, Superset's default implementation polls Presto in order to:

- Get the fancy progress bar
- Get the data back once the query has finished

However, Superset's polling implementation keeps polling longer than
necessary.

I profiled a query against a table of about 1 billion rows. Here is some data:

- Total number of rows: 1.02 billion
- SQL Lab query limit: 1 million
- Output data: 1.5 GB
- Superset memory consumed: about 10-20 GB
- Time: 7 minutes to finish in Presto, plus an additional 15 minutes for
  Superset to fetch and store the data

The problem with the default behavior is that even after Presto has
finished the query (7 minutes in the profiling above), Superset still does
a lot of wasted polling. In the profiling above, Superset sent about 540
polling requests in total, and at least half of them were unnecessary.

Part of a simplified polling response:

```
{
  "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u",
  "id": "20180525_042715_03742_nza9u",
  "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11",
  "stats": {
    "state": "FINISHED",
    "queuedSplits": 21701,
    "progressPercentage": 35.98264191882267,
    "elapsedTimeMillis": 1029,
    "nodes": 116,
    "completedSplits": 15257,
    "scheduled": true,
    "wallTimeMillis": 2571904,
    "peakMemoryBytes": 0,
    "processedBytes": 40825519532,
    "processedRows": 47734066,
    "queuedTimeMillis": 0,
    "queued": false,
    "cpuTimeMillis": 849228,
    "rootStage": {
      "state": "FINISHED",
      "queuedSplits": 0,
      "nodes": 1,
      "totalSplits": 17,
      "processedBytes": 16829644,
      "processedRows": 11495,
      "completedSplits": 17,
      "stageId": "0",
      "done": true,
      "cpuTimeMillis": 69,
      "subStages": [
        {
          "state": "CANCELED",
          "queuedSplits": 21701,
          "nodes": 116,
          "totalSplits": 42384,
          "processedBytes": 40825519532,
          "processedRows": 47734066,
          "completedSplits": 15240,
          "stageId": "1",
          "done": true,
          "cpuTimeMillis": 849159,
          "subStages": [],
          "wallTimeMillis": 2570374,
          "userTimeMillis": 730020,
          "runningSplits": 5443
        }
      ],
      "wallTimeMillis": 1530,
      "userTimeMillis": 50,
      "runningSplits": 0
    },
    "totalSplits": 42401,
    "userTimeMillis": 730070,
    "runningSplits": 5443
  }
}
```

Superset terminates polling only when it finds that `nextUri` becomes
`None`. In fact, `["stats"]["state"] == "FINISHED"` already means that
Presto has finished the query, so Superset can stop polling at that point
and get the data back.
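
The early-exit check described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `FakeCursor` and `wait_for_query` are hypothetical names, and the sketch assumes a cursor whose `poll()` method returns the latest response dict (shaped like the JSON above) or `None` when there is nothing left to poll.

```python
from typing import Any, Dict, Iterator, List, Optional


class FakeCursor:
    """Hypothetical stand-in for a Presto cursor; replays canned poll responses."""

    def __init__(self, responses: List[Dict[str, Any]]) -> None:
        self._responses: Iterator[Dict[str, Any]] = iter(responses)
        self.poll_count = 0

    def poll(self) -> Optional[Dict[str, Any]]:
        # Count every poll so wasted requests are easy to measure.
        self.poll_count += 1
        return next(self._responses, None)


def wait_for_query(cursor: FakeCursor) -> float:
    """Poll until the state is FINISHED; return the last progress percentage seen."""
    progress = 0.0
    polled = cursor.poll()
    while polled:
        stats = polled.get("stats", {})
        # Stop as soon as Presto reports FINISHED, even if nextUri is still set.
        if stats.get("state") == "FINISHED":
            break
        total = stats.get("totalSplits") or 0
        if total:
            # Progress for the progress bar, derived from split counts.
            progress = 100.0 * stats.get("completedSplits", 0) / total
        polled = cursor.poll()
    return progress


if __name__ == "__main__":
    running = {"nextUri": "u", "stats": {"state": "RUNNING", "completedSplits": 10, "totalSplits": 40}}
    finished = {"nextUri": "u", "stats": {"state": "FINISHED", "completedSplits": 17, "totalSplits": 40}}
    cursor = FakeCursor([running, finished, finished, finished])
    wait_for_query(cursor)
    print("polls sent:", cursor.poll_count)
```

With the canned responses in the usage section, the loop stops at the first `FINISHED` response instead of consuming the remaining ones, which is the saving the commit message describes.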

After this simple optimization, we get a 2-5x performance boost for
Presto SQL Lab queries.

2018-06-05 08:56:18 -07:00

| Name | Last commit | Date |
| --- | --- | --- |
| docs | docs: Add new Athena URI scheme awsathena+rest:// (#5112) | 2018-05-31 22:19:07 -07:00 |
| install/helm/superset | Install superset in Kubernetes with helm chart (#4923) | 2018-05-03 17:35:38 -07:00 |
| scripts | [flake8] Adding flake8-coding (#4477) | 2018-02-25 15:06:11 -08:00 |
| superset | Optimize presto SQL Lab query performance. (#5132) | 2018-06-05 08:56:18 -07:00 |
| tests | URL shortner for dashboards (#4760) | 2018-06-02 11:08:43 -07:00 |
| .gitignore | [docs] add entry for Hive in installation.rst (#4942) | 2018-05-07 15:26:19 -07:00 |
| .pylintrc | treating floats like doubles for druid versions lower than 11.0.0 (#5030) | 2018-05-21 11:50:04 -07:00 |
| .travis.yml | [setup] Dropping 3.4 and adding 3.6 (#4835) | 2018-04-17 21:30:12 -07:00 |
| alembic.ini | [WiP] rename project from Caravel to Superset (#1576) | 2016-11-09 23:08:22 -08:00 |
| CHANGELOG.md | CHANGELOG for 0.25.0 (#4948) | 2018-05-08 08:24:54 -07:00 |
| CODE_OF_CONDUCT.md | Create CODE_OF_CONDUCT.md (#3991) | 2017-12-02 14:57:54 -08:00 |
| CONTRIBUTING.md | [docs] minor file name and format fix for the setup document (#4844) | 2018-04-19 11:34:23 -07:00 |
| gen_changelog.sh | CHANGELOG for 0.20.0 (#3545) | 2017-09-28 14:42:57 -07:00 |
| ISSUE_TEMPLATE.md | [WiP] rename project from Caravel to Superset (#1576) | 2016-11-09 23:08:22 -08:00 |
| LICENSE.txt | LICENSE | 2015-07-21 20:54:31 +00:00 |
| MANIFEST.in | Removing files from MANIFEST.in (#4542) | 2018-03-06 09:39:31 -08:00 |
| pypi_push.sh | Fixing pypi_push.sh | 2017-01-24 11:42:49 -08:00 |
| README.md | Add Lime to Superset user list. | 2018-05-30 14:24:03 -07:00 |
| requirements-dev.txt | RFC: add logger that logs into browser console (#4702) | 2018-04-12 21:48:17 -07:00 |
| requirements.txt | Bump celery to 4.1.1 (#5134) | 2018-06-04 14:54:36 -07:00 |
| setup.cfg | [travis/tox] Restructuring configuration (#4552) | 2018-04-10 15:59:44 -07:00 |
| setup.py | Bump celery to 4.1.1 (#5134) | 2018-06-04 14:54:36 -07:00 |
| tox.ini | [pylint] prepping for enabling pylint for non-errors (#4884) | 2018-04-28 20:08:09 -07:00 |
| UPDATING.md | Bump celery to 4.1.1 (#5134) | 2018-06-04 14:54:36 -07:00 |

README.md

Superset

Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

[this project used to be named Caravel, and Panoramix in the past]

Screenshots & Gifs

View Dashboards

Slice & dice your data

Query and visualize your data with SQL Lab

Visualize geospatial data with deck.gl

Choose from a wide array of visualizations

Apache Superset

Apache Superset is a data exploration and visualization web application.

Superset provides:

- An intuitive interface to explore and visualize datasets, and create interactive dashboards.
- A wide array of beautiful visualizations to showcase your data.
- Easy, code-free user flows to drill down and slice and dice the data underlying exposed dashboards. The dashboards and charts act as a starting point for deeper analysis.
- A state-of-the-art SQL editor/IDE exposing a rich metadata browser, and an easy workflow to create visualizations out of any result set.
- An extensible, high-granularity security model allowing intricate rules on who can access which product features and datasets. Integration with major authentication backends (database, OpenID, LDAP, OAuth, REMOTE_USER, ...).
- A lightweight semantic layer that lets you control how data sources are exposed to the user by defining dimensions and metrics.
- Out-of-the-box support for most SQL-speaking databases.
- Deep integration with Druid, which allows Superset to stay blazing fast while slicing and dicing large, real-time datasets.
- Fast-loading dashboards with configurable caching.

Database Support

Superset speaks many SQL dialects through SQLAlchemy, a Python ORM that is compatible with most common databases.

Superset can be used to visualize data out of most databases:

- MySQL
- Postgres
- Vertica
- Oracle
- Microsoft SQL Server
- SQLite
- Greenplum
- Firebird
- MariaDB
- Sybase
- IBM DB2
- Exasol
- MonetDB
- Snowflake
- Redshift
- More! Look for the availability of a SQLAlchemy dialect for your database to find out whether it will work with Superset.
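
One way to check for a dialect is to ask SQLAlchemy to resolve it from a connection URI. The helper below is a small sketch, not part of Superset: `has_sqlalchemy_dialect` is a hypothetical name, and it relies on SQLAlchemy's `make_url` and `URL.get_dialect()`. Note that finding a dialect may not be enough on its own, since the matching database driver may still need to be installed separately.

```python
def has_sqlalchemy_dialect(uri: str) -> bool:
    """Return True if SQLAlchemy can resolve a dialect for the given URI.

    Hypothetical helper for illustration; returns False when SQLAlchemy
    itself is not installed.
    """
    try:
        from sqlalchemy.engine.url import make_url
        from sqlalchemy.exc import ArgumentError, NoSuchModuleError
    except ImportError:
        return False
    try:
        # get_dialect() looks up the dialect named by the URI scheme.
        make_url(uri).get_dialect()
    except (ArgumentError, NoSuchModuleError):
        return False
    return True


if __name__ == "__main__":
    for candidate in ("sqlite://", "postgresql://user:secret@localhost/db", "nosuchdb://host/db"):
        print(candidate, has_sqlalchemy_dialect(candidate))
```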

Druid!

On top of having the ability to query your relational databases, Superset ships with deep integration with Druid (a real-time distributed column store). When querying Druid, Superset can query humongous amounts of data on top of real-time datasets. Note that Superset does not require Druid in any way to function; it is simply another database backend that it can query.

Here's a description of Druid from the http://druid.io website:

Druid is an open-source analytics data store designed for business intelligence (OLAP) queries on event data. Druid provides low latency (real-time) data ingestion, flexible data exploration, and fast data aggregation. Existing Druid deployments have scaled to trillions of events and petabytes of data. Druid is best used to power analytic dashboards and applications.

Installation & Configuration

See in the documentation

Resources

Contributing

Interested in contributing? Casual hacking? Check out Contributing.MD

Who uses Apache Superset (incubating)?

Here's a list of organizations who have taken the time to send a PR to let the world know they are using Superset. Join our growing community!
