* Use py3's f-strings instead of s.format(**locals())
In light of the bug reported here
https://github.com/apache/incubator-superset/issues/6347, which seems
like an odd `.format()` issue in py3, I greped and replaced all
instances of `.format(**locals())` using py3's f-strings
* lint
* fix tests
* [utils] gathering/refactoring into a "utils/" folder
Moving current utils.py into utils/core.py and moving other *util*
modules under this new "utils/" as well.
Following steps include eroding at "utils/core.py" and breaking it down
into smaller modules.
* Improve tests
* Make loading examples in scope for tests
* Remove test class attrs examples_loaded and requires_examples
* Allow user to force refresh metadata
* fix javascript test error
* nit
* fix styling
* allow custom cache timeout configuration on any database
* minor improvement
* nit
* fix test
* nit
* preserve the old endpoint
* Enable Teradata
New DB engine spec for Teradata:
- LimitMethod should be WRAP_SQL since Teradata does not supporting "LIMIT" clause (TOP)
- Timegrains for Teradata is added
* Update formatting to pass flake8 tests
* Replace dataframe label override logic with table column override
* Add mutation to any_date_col
* Linting
* Add mutation to oracle and redshift
* Fine tune how and which labels are mutated
* Implement alias quoting logic for oracle-like databases
* Fix and align column and metric sqla_col methods
* Clean up typos and redundant logic
* Move new attribute to old location
* Linting
* Replace old sqla_col property references with function calls
* Remove redundant calls to mutate_column_label
* Move duplicated logic to common function
* Add db_engine_specs to all sqla_col calls
* Add missing mydb
* Add note about snowflake-sqlalchemy regression
* Make db_engine_spec mandatory in sqla_col
* Small refactoring and cleanup
* Remove db_engine_spec from get_from_clause call
* Make db_engine_spec mandatory in adhoc_metric_to_sa
* Remove redundant mutate_expression_label call
* Add missing db_engine_specs to adhoc_metric_to_sa
* Rename arg label_name to label in get_column_label()
* Rename label function and add docstring
* Remove redundant db_engine_spec args
* Rename col_label to label
* Remove get_column_name wrapper and make direct calls to db_engine_spec
* Remove unneeded db_engine_specs
* Rename sa_ vars to sqla_
* Bug: fixing async syntax for python 3.7
Rename async to async_ so superset installs for python 3.7.
* Addressing PR comments. Use kwargs instead of explicitly specifying async_ so downstream engines (e.g. PyHive) that supports async can choose to use the async_ in pythonwq3.7 and async in <=python3.6
* addressing additional pr comments
* Field names in big query can contain only alphanumeric and underscore
* bad quote
* better place for mutating labels
* lint
* bug fix thanks to mistercrunch
* lint
* lint again
* Add function to fix dataframe column case
* Fix broken handle_nulls method
* Add case sensitivity option to dedup
* Refactor function definition and call location
* Remove added blank line
* Move df column rename logit to db_engine_spec
* Remove redundant variable
* Update comments in db_engine_specs
* Tie df adjustment to db_engine_spec class attribute
* Fix dedup error
* Linting
* Check for db_engine_spec attribute prior to adjustment
* Rename case sensitivity flag
* Linting
* Remove function that was moved to db_engine_specs
* Get metrics names from utils
* Remove double import and rename dedup variable
* [sql lab] simplify the visualize flow
The "visualize flow" linking SQL Lab to the "explore view" has never
worked so great for people, here's a list of issues:
* it's not really clear to users that their query is wrapped as a
subquery, and the explore view runs queries on top of it
* lint + fix tests
* Addressing comments
* Add interim grains
* Refactor and add blacklist
* Change PT30M to PT0.5H
* Linting
* Linting
* Add time grain addons to config.py and refactor engine spec logic
* Remove redundant import and clean up config.py
* Fix bad rebase
* Implement changes proposed by @betodealmeida
* Revert removal of name from Grain
* Linting
* [sql lab] extract Hive error messages
So pyhive returns an exception object with a stringified thrift error
object. This PR uses a regex to extract the errorMessage portion of that
string.
* Unit test
* Fix how the annotation layer interpretes the timestamp string without timezone info; use it as UTC
* [Bug fix] Fixed/Refactored annotation layer code so that non-timeseries annotations are applied based on the updated chart object after adding all data
* [Bug fix] Fixed/Refactored annotation layer code so that non-timeseries annotations are applied based on the updated chart object after adding all data
* Fixed indentation
* Fix the key string value in case series.key is a string
* Fix the key string value in case series.key is a string
* [Bug fix] Divide by 1000.000 in epoch_ms_to_dttm() to not lose precision in Presto
* [Bug fix] Divide by 1000.000 in epoch_ms_to_dttm() to not lose precision in Presto
* Improve database type inference
Python's DBAPI isn't super clear and homogeneous on the
cursor.description specification, and this PR attempts to improve
inferring the datatypes returned in the cursor.
This work started around Presto's TIMESTAMP type being mishandled as
string as the database driver (pyhive) returns it as a string. The work
here fixes this bug and does a better job at inferring MySQL and Presto types.
It also creates a new method in db_engine_specs allowing for other
databases engines to implement and become more precise on type-inference
as needed.
* Fixing tests
* Adressing comments
* Using infer_objects
* Removing faulty line
* Addressing PrestoSpec redundant method comment
* Fix rebase issue
* Fix tests
By stop polling when presto query already finished.
When user make queries to Presto via SQL Lab, presto will run the query
and then it can return all data back to superset in one shot.
However, the default implementation of superset has enabled a default
polling for presto to:
- Get the fancy progress bar
- Get the data back when the query finished.
However, the polling implementation of superset is not right.
I've done a profiling with a table of 1 billion rows, here're some data:
- Total number of rows: 1.02 Billion
- SQL Lab query limit: 1 million
- Output Data: 1.5 GB
- Superset memory consumed: about 10-20 GB
- Time: 7 minutes to finish in Presto, takes additional 15 minutes for
superset to get and store data.
The problems with default issue is, even if presto has finished the
query (7 minutes with above profiling), superset still do lots of wasted
polling, in above profiling, superset sent about 540 polling in total,
and at half of the polling is not necessary.
Part of the simplied polling response:
```
{
"infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u",
"id": "20180525_042715_03742_nza9u",
"nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11",
"stats": {
"state": "FINISHED",
"queuedSplits": 21701,
"progressPercentage": 35.98264191882267,
"elapsedTimeMillis": 1029,
"nodes": 116,
"completedSplits": 15257,
"scheduled": true,
"wallTimeMillis": 2571904,
"peakMemoryBytes": 0,
"processedBytes": 40825519532,
"processedRows": 47734066,
"queuedTimeMillis": 0,
"queued": false,
"cpuTimeMillis": 849228,
"rootStage": {
"state": "FINISHED",
"queuedSplits": 0,
"nodes": 1,
"totalSplits": 17,
"processedBytes": 16829644,
"processedRows": 11495,
"completedSplits": 17,
"stageId": "0",
"done": true,
"cpuTimeMillis": 69,
"subStages": [
{
"state": "CANCELED",
"queuedSplits": 21701,
"nodes": 116,
"totalSplits": 42384,
"processedBytes": 40825519532,
"processedRows": 47734066,
"completedSplits": 15240,
"stageId": "1",
"done": true,
"cpuTimeMillis": 849159,
"subStages": [],
"wallTimeMillis": 2570374,
"userTimeMillis": 730020,
"runningSplits": 5443
}
],
"wallTimeMillis": 1530,
"userTimeMillis": 50,
"runningSplits": 0
},
"totalSplits": 42401,
"userTimeMillis": 730070,
"runningSplits": 5443
}
}
}
```
Superset will terminate the polling when it finds that `nextUri`
becomes none, but actually, when `["stats"]["state"] == "FINISHED"`,
it means that presto has already finished the query and superset can stop
polling and get the data back.
After this simple optimization, we get a 2-5x performance boost for
Presto SQL Lab queries.
* [sql lab] a better approach at limiting queries
Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:
1. use dbapi's `cursor.fetchmany()`
2. wrap the SQL into a limiting subquery
Method 1 isn't great as it can result in the database server storing
larger than required result sets in memory expecting another fetch
command while we know we don't need that.
Method 2 has a positive side of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy
does the work as specified for the dialect. On the downside though
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.
Since most modern DBs use the LIMIT syntax, this adds a regex approach
to modify the query and force a LIMIT clause without using a subquery
for the database that support this syntax and uses method 2 for all
others.
* Fixing build
* Fix lint
* Added more tests
* Fix tests
* Force lowercase column names for Snowflake and Oracle
* Force lowercase column names for Snowflake and Oracle
* Remove lowercasing of DB2 columns
* Remove DB2 lowercasing
* Fix test cases