[SIP-3] Scheduled email reports for Slices / Dashboards (#5294)

* [scheduled reports] Add support for scheduled reports

* Scheduled email reports for slice and dashboard visualizations
  (attachment or inline)
* Scheduled email reports for slice data (CSV attachment or inline table)
* Each schedule has a list of recipients (all of whom can receive a single
  shared email, or separate emails)
* All outgoing emails can have a mandatory BCC for audit purposes.
* Each dashboard/slice can have multiple schedules.

In addition, this PR makes a few minor improvements to the Celery
infrastructure:
* Create a common Celery app
* Add more Celery annotations for the tasks
* Introduce Celery beat
* Update docs about concurrency / pools

* [scheduled reports] - Debug mode for scheduled emails

* [scheduled reports] - Ability to send test mails

* [scheduled reports] - Test email functionality - minor improvements

* [scheduled reports] - Rebase with master. Minor fixes

* [scheduled reports] - Add warning messages

* [scheduled reports] - flake8

* [scheduled reports] - fix rebase

* [scheduled reports] - fix rebase

* [scheduled reports] - fix flake8

* [scheduled reports] Rebase in prep for merge

* Fixed alembic tree after rebase
* Updated requirements to latest version of packages (and tested)
* Removed py2 stuff

* [scheduled reports] - fix flake8

* [scheduled reports] - address review comments

* [scheduled reports] - rebase with master
Mahendra M
2018-12-10 22:29:29 -08:00
committed by Maxime Beauchemin
parent f366bbe735
commit 808622414c
23 changed files with 1569 additions and 40 deletions

@@ -88,7 +88,7 @@ It's easy: use the ``Filter Box`` widget, build a slice, and add it to your
dashboard.
The ``Filter Box`` widget allows you to define a query to populate dropdowns
that can be use for filtering. To build the list of distinct values, we
that can be used for filtering. To build the list of distinct values, we
run a query, and sort the result by the metric you provide, sorting
descending.

@@ -603,14 +603,12 @@ Upgrading should be as straightforward as running::
superset db upgrade
superset init
SQL Lab
-------
SQL Lab is a powerful SQL IDE that works with all SQLAlchemy compatible
databases. By default, queries are executed in the scope of a web
request so they
may eventually timeout as queries exceed the maximum duration of a web
request in your environment, whether it'd be a reverse proxy or the Superset
server itself.
Celery Tasks
------------
On large analytic databases, it's common to run background jobs, reports
and/or queries that execute for minutes or hours. In certain cases, we need
to support long running tasks that execute beyond the typical web request's
timeout (30-60 seconds).
On large analytic databases, it's common to run queries that
execute for minutes or hours.
@@ -634,15 +632,41 @@ have the same configuration.
class CeleryConfig(object):
BROKER_URL = 'redis://localhost:6379/0'
CELERY_IMPORTS = ('superset.sql_lab', )
CELERY_IMPORTS = (
'superset.sql_lab',
'superset.tasks',
)
CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
CELERY_ANNOTATIONS = {'tasks.add': {'rate_limit': '10/s'}}
CELERYD_LOG_LEVEL = 'DEBUG'
CELERYD_PREFETCH_MULTIPLIER = 10
CELERY_ACKS_LATE = True
CELERY_ANNOTATIONS = {
'sql_lab.get_sql_results': {
'rate_limit': '100/s',
},
'email_reports.send': {
'rate_limit': '1/s',
'time_limit': 120,
'soft_time_limit': 150,
'ignore_result': True,
},
}
CELERYBEAT_SCHEDULE = {
'email_reports.schedule_hourly': {
'task': 'email_reports.schedule_hourly',
'schedule': crontab(minute=1, hour='*'),
},
}
CELERY_CONFIG = CeleryConfig
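Note that the ``crontab`` used in ``CELERYBEAT_SCHEDULE`` comes from ``celery.schedules``, which the fragment above does not show being imported. A minimal self-contained sketch of the same configuration, as it would appear in ``superset_config.py`` (assuming Redis on ``localhost:6379``):

```python
# superset_config.py -- minimal sketch of the Celery config shown above.
# Assumes a Redis broker/result backend running on localhost:6379.
from celery.schedules import crontab


class CeleryConfig(object):
    BROKER_URL = 'redis://localhost:6379/0'
    CELERY_IMPORTS = (
        'superset.sql_lab',
        'superset.tasks',
    )
    CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'
    CELERY_ANNOTATIONS = {
        'sql_lab.get_sql_results': {
            'rate_limit': '100/s',
        },
        'email_reports.send': {
            'rate_limit': '1/s',
            'time_limit': 120,
            'soft_time_limit': 150,
            'ignore_result': True,
        },
    }
    CELERYBEAT_SCHEDULE = {
        # Fire the hourly report scheduler at minute 1 of every hour.
        'email_reports.schedule_hourly': {
            'task': 'email_reports.schedule_hourly',
            'schedule': crontab(minute=1, hour='*'),
        },
    }


CELERY_CONFIG = CeleryConfig
```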
To start a Celery worker to leverage the configuration run: ::
* To start a Celery worker to leverage the configuration run: ::
celery worker --app=superset.sql_lab:celery_app --pool=gevent -Ofair
celery worker --app=superset.tasks.celery_app:app --pool=prefork -Ofair -c 4
* To start a job which schedules periodic background jobs, run ::
celery beat --app=superset.tasks.celery_app:app
To set up a result backend, you need to pass an instance of a derivative
of ``werkzeug.contrib.cache.BaseCache`` to the ``RESULTS_BACKEND``
@@ -665,11 +689,65 @@ look something like:
RESULTS_BACKEND = RedisCache(
host='localhost', port=6379, key_prefix='superset_results')
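For completeness, the ``RedisCache`` used above needs an import; a sketch assuming Werkzeug < 1.0, where the ``werkzeug.contrib.cache`` module referenced in the text still exists:

```python
# Sketch: RESULTS_BACKEND wiring for superset_config.py.
# Assumes Werkzeug < 1.0 (werkzeug.contrib.cache was removed in
# later Werkzeug releases) and Redis on localhost:6379.
from werkzeug.contrib.cache import RedisCache

RESULTS_BACKEND = RedisCache(
    host='localhost', port=6379, key_prefix='superset_results')
```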
Note that it's important that all the worker nodes and web servers in
the Superset cluster share a common metadata database.
This means that SQLite will not work in this context since it has
limited support for concurrency and
typically lives on the local file system.
**Important notes**
* It is important that all the worker nodes and web servers in
the Superset cluster share a common metadata database.
This means that SQLite will not work in this context since it has
limited support for concurrency and
typically lives on the local file system.
* There should only be one instance of ``celery beat`` running in your
entire setup. Otherwise, background jobs can get scheduled multiple times,
resulting in weird behaviors like duplicate delivery of reports and
higher-than-expected load / traffic.
Email Reports
-------------
Email reports allow users to schedule reports for
* slice and dashboard visualizations (attachment or inline)
* slice data (CSV attachment or inline table)
Schedules are defined in crontab format and each schedule
can have a list of recipients (all of whom can receive a single
shared email, or separate emails). For audit purposes, all outgoing
emails can have a mandatory BCC.
**Requirements**
* A selenium compatible driver & headless browser
* `geckodriver <https://github.com/mozilla/geckodriver>`_ and Firefox are preferred
* `chromedriver <http://chromedriver.chromium.org/>`_ is a good option too
* Run `celery worker` and `celery beat` as follows ::
celery worker --app=superset.tasks.celery_app:app --pool=prefork -Ofair -c 4
celery beat --app=superset.tasks.celery_app:app
**Important notes**
* Be mindful of the concurrency setting for celery (using ``-c 4``).
Selenium/webdriver instances can consume a lot of CPU / memory on your servers.
* In some cases, if you notice a lot of leaked ``geckodriver`` processes, try running
your celery processes with ::
celery worker --pool=prefork --max-tasks-per-child=128 ...
* It is recommended to run separate workers for the ``sql_lab`` and
``email_reports`` tasks. This can be done by using the ``queue`` field in
``CELERY_ANNOTATIONS``.
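The ``queue`` routing mentioned above can be sketched as follows; the queue names here are illustrative, not prescribed by Superset:

```python
# Illustrative sketch: route each task family to its own queue via
# CELERY_ANNOTATIONS, so dedicated workers can consume each queue.
# The queue names ('sql_lab_queue', 'email_reports_queue') are hypothetical.
CELERY_ANNOTATIONS = {
    'sql_lab.get_sql_results': {'queue': 'sql_lab_queue'},
    'email_reports.send': {'queue': 'email_reports_queue'},
    'email_reports.schedule_hourly': {'queue': 'email_reports_queue'},
}
```

Each worker is then pinned to one queue with Celery's ``-Q`` flag, e.g. ``celery worker --app=superset.tasks.celery_app:app -Q email_reports_queue``.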
SQL Lab
-------
SQL Lab is a powerful SQL IDE that works with all SQLAlchemy compatible
databases. By default, queries are executed in the scope of a web
request so they may eventually time out as queries exceed the maximum
duration of a web request in your environment, whether it be a reverse
proxy or the Superset server itself. In such cases, it is preferable to use
``celery`` to run the queries
in the background. Please follow the examples/notes mentioned above to get your
celery setup working.
Also note that SQL Lab supports Jinja templating in queries and that it's
possible to overload
@@ -684,6 +762,8 @@ in this dictionary are made available for users to use in their SQL.
}
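The dictionary referred to here is ``JINJA_CONTEXT_ADDONS`` in ``superset_config.py``; a minimal sketch, with a hypothetical macro name:

```python
# Sketch: exposing an extra callable to Jinja templating in SQL Lab.
# 'my_macro' is a hypothetical name; any callable placed in this dict
# becomes available inside users' SQL, e.g. {{ my_macro(5) }}.
JINJA_CONTEXT_ADDONS = {
    'my_macro': lambda x: x * 2,
}
```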
Celery Flower
-------------
Flower is a web-based tool for monitoring the Celery cluster, which you can
install from pip: ::
@@ -691,7 +771,7 @@ install from pip: ::
and run via: ::
celery flower --app=superset.sql_lab:celery_app
celery flower --app=superset.tasks.celery_app:app
Building from source
---------------------