* Field names in BigQuery can contain only alphanumeric characters and underscores
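A hedged sketch of that BigQuery constraint (the helper name `mutate_label` is illustrative, not necessarily the code's; BigQuery field names may contain only letters, digits, and underscores, and must not start with a digit):

```python
import re

def mutate_label(label):
    """Sanitize a column label for BigQuery: replace any character
    outside [a-zA-Z0-9_] with an underscore, and prefix an underscore
    if the result starts with a digit. Illustrative helper only."""
    label = re.sub(r"\W", "_", label)
    if label and label[0].isdigit():
        label = "_" + label
    return label
```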
* bad quote
* better place for mutating labels
* lint
* bug fix thanks to mistercrunch
* lint
* lint again
* Add function to fix dataframe column case
* Fix broken handle_nulls method
* Add case sensitivity option to dedup
* Refactor function definition and call location
* Remove added blank line
* Move df column rename logic to db_engine_spec
* Remove redundant variable
* Update comments in db_engine_specs
* Tie df adjustment to db_engine_spec class attribute
* Fix dedup error
* Linting
* Check for db_engine_spec attribute prior to adjustment
* Rename case sensitivity flag
* Linting
* Remove function that was moved to db_engine_specs
* Get metrics names from utils
* Remove double import and rename dedup variable
* [sql lab] simplify the visualize flow
The "visualize flow" linking SQL Lab to the "explore view" has never
worked well for people; here's a list of issues:
* it's not really clear to users that their query is wrapped as a
subquery, and that the explore view runs queries on top of it
* lint + fix tests
* Addressing comments
* Add interim grains
* Refactor and add blacklist
* Change PT30M to PT0.5H
* Linting
* Linting
* Add time grain addons to config.py and refactor engine spec logic
* Remove redundant import and clean up config.py
* Fix bad rebase
* Implement changes proposed by @betodealmeida
* Revert removal of name from Grain
* Linting
* [sql lab] extract Hive error messages
So pyhive returns an exception object with a stringified thrift error
object. This PR uses a regex to extract the errorMessage portion of that
string.
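A minimal sketch of that extraction, assuming the message appears as a quoted `errorMessage="…"` field inside the stringified thrift object (the exact format may vary across Hive/pyhive versions):

```python
import re

def extract_hive_error_message(exception_text):
    """Pull the human-readable errorMessage out of the stringified
    thrift error that pyhive embeds in its exceptions. The non-greedy
    match stops at the first unescaped double quote."""
    match = re.search(r'errorMessage="(.*?)(?<!\\)"', exception_text)
    if match:
        return match.group(1)
    return exception_text  # fall back to the raw exception text
```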
* Unit test
* Fix how the annotation layer interprets a timestamp string without timezone info; treat it as UTC
* [Bug fix] Fixed/Refactored annotation layer code so that non-timeseries annotations are applied based on the updated chart object after adding all data
* Fixed indentation
* Fix the key string value in case series.key is a string
* [Bug fix] Divide by 1000.000 in epoch_ms_to_dttm() to not lose precision in Presto
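The precision issue is easy to see: with two integer operands, division truncates, so dividing an epoch-in-milliseconds by `1000` drops the fractional seconds, while a decimal literal like `1000.000` keeps them. Python arithmetic illustrates the same point:

```python
# Epoch timestamp in milliseconds
epoch_ms = 1525299301123

# Integer-by-integer division truncates away the milliseconds:
truncated = epoch_ms // 1000   # 1525299301

# Dividing by a decimal literal such as 1000.000 preserves them:
precise = epoch_ms / 1000.000  # 1525299301.123

print(truncated, precise)
```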
* Improve database type inference
Python's DBAPI isn't super clear and homogeneous on the
cursor.description specification, and this PR attempts to improve
inferring the datatypes returned in the cursor.
This work started around Presto's TIMESTAMP type being mishandled as a
string, since the database driver (pyhive) returns it as a string. The
work here fixes this bug and does a better job of inferring MySQL and
Presto types.
It also creates a new method in db_engine_specs allowing other
database engines to implement it and become more precise about
type inference as needed.
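The per-engine hook could look roughly like this (a sketch under assumptions: method and class names mirror the db_engine_specs pattern, and the MySQL codes shown are a few of the driver's integer type codes, listed here for illustration):

```python
class BaseEngineSpec:
    @classmethod
    def get_datatype(cls, type_code):
        """Default: cursor.description already carries a readable
        type name as a string, so just normalize its case."""
        if isinstance(type_code, str) and type_code:
            return type_code.upper()
        return None

class MySQLEngineSpec(BaseEngineSpec):
    # A few illustrative MySQL protocol type codes (MySQLdb FIELD_TYPE)
    type_code_map = {
        0: "DECIMAL",
        1: "TINY",
        3: "LONG",
        12: "DATETIME",
        253: "VARCHAR",
    }

    @classmethod
    def get_datatype(cls, type_code):
        # MySQL drivers report integer type codes rather than names
        if isinstance(type_code, int):
            return cls.type_code_map.get(type_code)
        return super().get_datatype(type_code)
```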
* Fixing tests
* Addressing comments
* Using infer_objects
* Removing faulty line
* Addressing PrestoSpec redundant method comment
* Fix rebase issue
* Fix tests
Stop polling when the Presto query has already finished.
When a user runs a query against Presto via SQL Lab, Presto runs the
query and can then return all the data back to Superset in one shot.
However, Superset's default implementation enables polling for Presto to:
- Get the fancy progress bar
- Get the data back when the query finishes.
However, Superset's polling implementation is not right.
I profiled a query against a table of 1 billion rows; here's some data:
- Total number of rows: 1.02 billion
- SQL Lab query limit: 1 million
- Output data: 1.5 GB
- Superset memory consumed: about 10-20 GB
- Time: 7 minutes to finish in Presto, plus an additional 15 minutes for
Superset to fetch and store the data.
The problem with the default behavior is that even after Presto has
finished the query (7 minutes in the profiling above), Superset still
does a lot of wasted polling: it sent about 540 polls in total, and
about half of them were unnecessary.
Part of the simplified polling response:
```
{
  "infoUri": "http://10.65.204.39:8000/query.html?20180525_042715_03742_nza9u",
  "id": "20180525_042715_03742_nza9u",
  "nextUri": "http://10.65.204.39:8000/v1/statement/20180525_042715_03742_nza9u/11",
  "stats": {
    "state": "FINISHED",
    "queuedSplits": 21701,
    "progressPercentage": 35.98264191882267,
    "elapsedTimeMillis": 1029,
    "nodes": 116,
    "completedSplits": 15257,
    "scheduled": true,
    "wallTimeMillis": 2571904,
    "peakMemoryBytes": 0,
    "processedBytes": 40825519532,
    "processedRows": 47734066,
    "queuedTimeMillis": 0,
    "queued": false,
    "cpuTimeMillis": 849228,
    "rootStage": {
      "state": "FINISHED",
      "queuedSplits": 0,
      "nodes": 1,
      "totalSplits": 17,
      "processedBytes": 16829644,
      "processedRows": 11495,
      "completedSplits": 17,
      "stageId": "0",
      "done": true,
      "cpuTimeMillis": 69,
      "subStages": [
        {
          "state": "CANCELED",
          "queuedSplits": 21701,
          "nodes": 116,
          "totalSplits": 42384,
          "processedBytes": 40825519532,
          "processedRows": 47734066,
          "completedSplits": 15240,
          "stageId": "1",
          "done": true,
          "cpuTimeMillis": 849159,
          "subStages": [],
          "wallTimeMillis": 2570374,
          "userTimeMillis": 730020,
          "runningSplits": 5443
        }
      ],
      "wallTimeMillis": 1530,
      "userTimeMillis": 50,
      "runningSplits": 0
    },
    "totalSplits": 42401,
    "userTimeMillis": 730070,
    "runningSplits": 5443
  }
}
```
Superset terminates the polling when it finds that `nextUri` is gone,
but in fact, once `["stats"]["state"] == "FINISHED"`, Presto has
already finished the query and Superset can stop polling and fetch
the data.
After this simple optimization, we get a 2-5x performance boost for
Presto SQL Lab queries.
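The optimization can be sketched as a poll loop that keys off `stats.state` instead of waiting for `nextUri` to disappear. A simplified sketch: `cursor.poll()` stands in for pyhive's Presto status call, which returns a status dict like the one above (or `None` when no more status is available), and the real code also updates the progress bar:

```python
import time

def wait_for_presto_query(cursor, poll_interval=1.0):
    """Poll a Presto cursor, stopping as soon as the engine reports
    the query FINISHED rather than continuing until polling dries up."""
    status = cursor.poll()
    while status is not None:
        state = status.get("stats", {}).get("state")
        if state == "FINISHED":
            break  # Presto is done; fetch results without further polling
        time.sleep(poll_interval)
        status = cursor.poll()
    return status
```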
* [sql lab] a better approach at limiting queries
Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:
1. use dbapi's `cursor.fetchmany()`
2. wrap the SQL into a limiting subquery
Method 1 isn't great, as it can result in the database server keeping
a larger-than-required result set in memory while expecting another
fetch command, even though we know we don't need one.
Method 2 has the upside of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or anything else, since SQLAlchemy
does the work as specified for the dialect. On the downside, though,
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.
Since most modern DBs use the LIMIT syntax, this adds a regex-based
approach that modifies the query and forces a LIMIT clause, without
using a subquery, for databases that support this syntax, and uses
method 2 for all others.
* Fixing build
* Fix lint
* Added more tests
* Fix tests
* Force lowercase column names for Snowflake and Oracle
* Remove lowercasing of DB2 columns
* Remove DB2 lowercasing
* Fix test cases
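The Snowflake/Oracle change above can be sketched as a small helper (names and flag are illustrative; both engines fold unquoted identifiers to upper case, so result sets come back with upper-cased column names that the engine spec opts to lowercase):

```python
def fix_column_case(column_names, force_lowercase):
    """Lowercase result-set column names for engines like Snowflake
    and Oracle that fold unquoted identifiers to upper case.
    The flag stands in for a db_engine_spec class attribute."""
    if not force_lowercase:
        return list(column_names)
    return [name.lower() for name in column_names]
```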
* Add ISO duration to time grains
* Use ISO duration
* Remove debugging code
* Add module to yarn.lock
* Remove autolint
* Druid granularity as ISO
* Remove dangling comma