mirror of
https://github.com/apache/superset.git
synced 2026-04-17 07:05:04 +00:00
feat(docs): Docusaurus multi-versioning, Developer Portal starter kit (#34271)
Co-authored-by: Claude <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,328 @@
|
||||
---
|
||||
title: Exploring Data in Superset
|
||||
hide_title: true
|
||||
sidebar_position: 2
|
||||
version: 1
|
||||
---
|
||||
|
||||
import useBaseUrl from "@docusaurus/useBaseUrl";
|
||||
|
||||
## Exploring Data in Superset
|
||||
|
||||
In this tutorial, we will introduce key concepts in Apache Superset through the exploration of a
|
||||
real dataset which contains the flights made by employees of a UK-based organization in 2011. The
|
||||
following information about each flight is given:
|
||||
|
||||
- The traveler’s department. For the purposes of this tutorial the departments have been renamed
|
||||
Orange, Yellow and Purple.
|
||||
- The cost of the ticket.
|
||||
- The travel class (Economy, Premium Economy, Business and First Class).
|
||||
- Whether the ticket was a single or return.
|
||||
- The date of travel.
|
||||
- Information about the origin and destination.
|
||||
- The distance between the origin and destination, in kilometers (km).
|
||||
|
||||
### Enabling Data Upload Functionality
|
||||
|
||||
You may need to enable the functionality to upload a CSV or Excel file to your database. The following section
|
||||
explains how to enable this functionality for the examples database.
|
||||
|
||||
In the top menu, select **Settings ‣ Data ‣ Database Connections**. Find the **examples** database in the list and
|
||||
select the **Edit** button.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/edit-record.png" )} />
|
||||
|
||||
In the resulting modal window, switch to the **Advanced** tab and open **Security** section.
|
||||
Then, tick the checkbox for **Allow file uploads to database**. End by clicking the **Finish** button.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/allow-file-uploads.png" )} />
|
||||
|
||||
### Loading CSV Data
|
||||
|
||||
Download the CSV dataset to your computer from
|
||||
[GitHub](https://raw.githubusercontent.com/apache-superset/examples-data/master/tutorial_flights.csv).
|
||||
In the top menu, select **Settings ‣ Data ‣ Database Connections**. Then, **Upload file to database ‣ Upload CSV**.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/upload_a_csv.png" )} />
|
||||
|
||||
Then, select select the CSV file from your computer, select **Database** and **Schema**, and enter the **Table Name**
|
||||
as _tutorial_flights_.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/csv_to_database_configuration.png" )} />
|
||||
|
||||
Next enter the text _Travel Date_ into the **File settings ‣ Columns to be parsed as dates** field.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/parse_dates_column.png" )} />
|
||||
|
||||
Leaving all the other options in their default settings, select **Upload** at the bottom of the page.
|
||||
|
||||
### Table Visualization
|
||||
|
||||
You should now see _tutorial_flights_ as a dataset in the **Datasets** tab. Click on the entry to
|
||||
launch an Explore workflow using this dataset.
|
||||
|
||||
In this section, we'll create a table visualization
|
||||
to show the number of flights and cost per travel class.
|
||||
|
||||
By default, Apache Superset only shows the last week of data. In our example, we want to visualize all
|
||||
of the data in the dataset. Click the **Time ‣ Time Range** section and change
|
||||
the **Range Type** to **No Filter**.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/no_filter_on_time_filter.png" )} />
|
||||
|
||||
Click **Apply** to save.
|
||||
|
||||
Now, we want to specify the rows in our table by using the **Group by** option. Since in this
|
||||
example, we want to understand different Travel Classes, we select **Travel Class** in this menu.
|
||||
|
||||
Next, we can specify the metrics we would like to see in our table with the **Metrics** option.
|
||||
|
||||
- `COUNT(*)`, which represents the number of rows in the table
|
||||
(in this case, quantity of flights in each Travel Class)
|
||||
- `SUM(Cost)`, which represents the total cost spent by each Travel Class
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/sum_cost_column.png" )} />
|
||||
|
||||
Finally, select **Run Query** to see the results of the table.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/tutorial_table.png" )} />
|
||||
|
||||
To save the visualization, click on **Save** in the top left of the screen. In the following modal,
|
||||
|
||||
- Select the **Save as**
|
||||
option and enter the chart name as Tutorial Table (you will be able to find it again through the
|
||||
**Charts** screen, accessible in the top menu).
|
||||
- Select **Add To Dashboard** and enter
|
||||
Tutorial Dashboard. Finally, select **Save & Go To Dashboard**.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/save_tutorial_table.png" )} />
|
||||
|
||||
### Dashboard Basics
|
||||
|
||||
Next, we are going to explore the dashboard interface. If you’ve followed the previous section, you
|
||||
should already have the dashboard open. Otherwise, you can navigate to the dashboard by selecting
|
||||
Dashboards on the top menu, then Tutorial dashboard from the list of dashboards.
|
||||
|
||||
On this dashboard you should see the table you created in the previous section. Select **Edit
|
||||
dashboard** and then hover over the table. By selecting the bottom right hand corner of the table
|
||||
(the cursor will change too), you can resize it by dragging and dropping.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/resize_tutorial_table_on_dashboard.png" )} />
|
||||
|
||||
Finally, save your changes by selecting Save changes in the top right.
|
||||
|
||||
### Pivot Table
|
||||
|
||||
In this section, we will extend our analysis using a more complex visualization, Pivot Table. By the
|
||||
end of this section, you will have created a table that shows the monthly spend on flights for the
|
||||
first six months, by department, by travel class.
|
||||
|
||||
Create a new chart by selecting **+ ‣ Chart** from the top right corner. Choose
|
||||
tutorial_flights again as a datasource, then click on the visualization type to get to the
|
||||
visualization menu. Select the **Pivot Table** visualization (you can filter by entering text in the
|
||||
search box) and then **Create New Chart**.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/create_pivot.png" )} />
|
||||
|
||||
In the **Time** section, keep the Time Column as Travel Date (this is selected automatically as we
|
||||
only have one time column in our dataset). Then select Time Grain to be month as having daily data
|
||||
would be too granular to see patterns from. Then select the time range to be the first six months of
|
||||
2011 by click on Last week in the Time Range section, then in Custom selecting a Start / end of 1st
|
||||
January 2011 and 30th June 2011 respectively by either entering directly the dates or using the
|
||||
calendar widget (by selecting the month name and then the year, you can move more quickly to far
|
||||
away dates).
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/select_dates_pivot_table.png" )} />
|
||||
|
||||
Next, within the **Query** section, remove the default COUNT(\*) and add Cost, keeping the default
|
||||
SUM aggregate. Note that Apache Superset will indicate the type of the metric by the symbol on the
|
||||
left hand column of the list (ABC for string, # for number, a clock face for time, etc.).
|
||||
|
||||
In **Group by**, select **Time**: this will automatically use the Time Column and Time Grain
|
||||
selections we defined in the Time section.
|
||||
|
||||
Within **Columns**, first select Department and then Travel Class. All set – let’s **Run Query** to
|
||||
see some data!
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/tutorial_pivot_table.png" )} />
|
||||
|
||||
You should see months in the rows and Department and Travel Class in the columns. Publish this chart
|
||||
to your existing Tutorial Dashboard you created earlier.
|
||||
|
||||
### Line Chart
|
||||
|
||||
In this section, we are going to create a line chart to understand the average price of a ticket by
|
||||
month across the entire dataset.
|
||||
|
||||
In the Time section, as before, keep the Time Column as Travel Date and Time Grain as month but this
|
||||
time for the Time range select No filter as we want to look at entire dataset.
|
||||
|
||||
Within Metrics, remove the default `COUNT(*)` metric and instead add `AVG(Cost)`, to show the mean value.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/average_aggregate_for_cost.png" )} />
|
||||
|
||||
Next, select **Run Query** to show the data on the chart.
|
||||
|
||||
How does this look? Well, we can see that the average cost goes up in December. However, perhaps it
|
||||
doesn’t make sense to combine both single and return tickets, but rather show two separate lines for
|
||||
each ticket type.
|
||||
|
||||
Let’s do this by selecting Ticket Single or Return in the Group by box, and the selecting **Run
|
||||
Query** again. Nice! We can see that on average single tickets are cheaper than returns and that the
|
||||
big spike in December is caused by return tickets.
|
||||
|
||||
Our chart is looking pretty good already, but let’s customize some more by going to the Customize
|
||||
tab on the left hand pane. Within this pane, try changing the Color Scheme, removing the range
|
||||
filter by selecting No in the Show Range Filter drop down and adding some labels using X Axis Label
|
||||
and Y Axis Label.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/tutorial_line_chart.png" )} />
|
||||
|
||||
Once you’re done, publish the chart in your Tutorial Dashboard.
|
||||
|
||||
### Markup
|
||||
|
||||
In this section, we will add some text to our dashboard. If you’re there already, you can navigate
|
||||
to the dashboard by selecting Dashboards on the top menu, then Tutorial dashboard from the list of
|
||||
dashboards. Got into edit mode by selecting **Edit dashboard**.
|
||||
|
||||
Within the Insert components pane, drag and drop a Markdown box on the dashboard. Look for the blue
|
||||
lines which indicate the anchor where the box will go.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/blue_bar_insert_component.png" )} />
|
||||
|
||||
Now, to edit the text, select the box. You can enter text, in markdown format (see
|
||||
[this Markdown Cheatsheet](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) for
|
||||
more information about this format). You can toggle between Edit and Preview using the menu on the
|
||||
top of the box.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/markdown.png" )} />
|
||||
|
||||
To exit, select any other part of the dashboard. Finally, don’t forget to keep your changes using
|
||||
**Save changes**.
|
||||
|
||||
### Publishing Your Dashboard
|
||||
|
||||
If you have followed all of the steps outlined in the previous section, you should have a dashboard
|
||||
that looks like the below. If you would like, you can rearrange the elements of the dashboard by
|
||||
selecting **Edit dashboard** and dragging and dropping.
|
||||
|
||||
If you would like to make your dashboard available to other users, simply select Draft next to the
|
||||
title of your dashboard on the top left to change your dashboard to be in Published state. You can
|
||||
also favorite this dashboard by selecting the star.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/publish_dashboard.png" )} />
|
||||
|
||||
### Annotations
|
||||
|
||||
Annotations allow you to add additional context to your chart. In this section, we will add an
|
||||
annotation to the Tutorial Line Chart we made in a previous section. Specifically, we will add the
|
||||
dates when some flights were cancelled by the UK’s Civil Aviation Authority in response to the
|
||||
eruption of the Grímsvötn volcano in Iceland (23-25 May 2011).
|
||||
|
||||
First, add an annotation layer by navigating to Manage ‣ Annotation Layers. Add a new annotation
|
||||
layer by selecting the green plus sign to add a new record. Enter the name Volcanic Eruptions and
|
||||
save. We can use this layer to refer to a number of different annotations.
|
||||
|
||||
Next, add an annotation by navigating to Manage ‣ Annotations and then create a new annotation by
|
||||
selecting the green plus sign. Then, select the Volcanic Eruptions layer, add a short description
|
||||
Grímsvötn and the eruption dates (23-25 May 2011) before finally saving.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/edit_annotation.png" )} />
|
||||
|
||||
Then, navigate to the line chart by going to Charts then selecting Tutorial Line Chart from the
|
||||
list. Next, go to the Annotations and Layers section and select Add Annotation Layer. Within this
|
||||
dialogue:
|
||||
|
||||
- Name the layer as Volcanic Eruptions
|
||||
- Change the Annotation Layer Type to Event
|
||||
- Set the Annotation Source as Superset annotation
|
||||
- Specify the Annotation Layer as Volcanic Eruptions
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/annotation_settings.png" )} />
|
||||
|
||||
Select **Apply** to see your annotation shown on the chart.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/annotation.png" )} />
|
||||
|
||||
If you wish, you can change how your annotation looks by changing the settings in the Display
|
||||
configuration section. Otherwise, select **OK** and finally **Save** to save your chart. If you keep
|
||||
the default selection to overwrite the chart, your annotation will be saved to the chart and also
|
||||
appear automatically in the Tutorial Dashboard.
|
||||
|
||||
### Advanced Analytics
|
||||
|
||||
In this section, we are going to explore the Advanced Analytics feature of Apache Superset that
|
||||
allows you to apply additional transformations to your data. The three types of transformation are:
|
||||
|
||||
**Setting up the base chart**
|
||||
|
||||
In this section, we’re going to set up a base chart which we can then apply the different **Advanced
|
||||
Analytics** features to. Start off by creating a new chart using the same _tutorial_flights_
|
||||
datasource and the **Line Chart** visualization type. Within the Time section, set the Time Range as
|
||||
1st October 2011 and 31st October 2011.
|
||||
|
||||
Next, in the query section, change the Metrics to the sum of Cost. Select **Run Query** to show the
|
||||
chart. You should see the total cost per day for each month in October 2011.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/advanced_analytics_base.png" )} />
|
||||
|
||||
Finally, save the visualization as Tutorial Advanced Analytics Base, adding it to the Tutorial
|
||||
Dashboard.
|
||||
|
||||
### Rolling Mean
|
||||
|
||||
There is quite a lot of variation in the data, which makes it difficult to identify any trend. One
|
||||
approach we can take is to show instead a rolling average of the time series. To do this, in the
|
||||
**Moving Average** subsection of **Advanced Analytics**, select mean in the **Rolling** box and
|
||||
enter 7 into both Periods and Min Periods. The period is the length of the rolling period expressed
|
||||
as a multiple of the Time Grain. In our example, the Time Grain is day, so the rolling period is 7
|
||||
days, such that on the 7th October 2011 the value shown would correspond to the first seven days of
|
||||
October 2011. Lastly, by specifying Min Periods as 7, we ensure that our mean is always calculated
|
||||
on 7 days and we avoid any ramp up period.
|
||||
|
||||
After displaying the chart by selecting **Run Query** you will see that the data is less variable
|
||||
and that the series starts later as the ramp up period is excluded.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/rolling_mean.png" )} />
|
||||
|
||||
Save the chart as Tutorial Rolling Mean and add it to the Tutorial Dashboard.
|
||||
|
||||
### Time Comparison
|
||||
|
||||
In this section, we will compare values in our time series to the value a week before. Start off by
|
||||
opening the Tutorial Advanced Analytics Base chart, by going to **Charts** in the top menu and then
|
||||
selecting the visualization name in the list (alternatively, find the chart in the Tutorial
|
||||
Dashboard and select Explore chart from the menu for that visualization).
|
||||
|
||||
Next, in the Time Comparison subsection of **Advanced Analytics**, enter the Time Shift by typing in
|
||||
“minus 1 week” (note this box accepts input in natural language). Run Query to see the new chart,
|
||||
which has an additional series with the same values, shifted a week back in time.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/time_comparison_two_series.png" )} />
|
||||
|
||||
Then, change the **Calculation type** to Absolute difference and select **Run Query**. We can now
|
||||
see only one series again, this time showing the difference between the two series we saw
|
||||
previously.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/time_comparison_absolute_difference.png" )} />
|
||||
|
||||
Save the chart as Tutorial Time Comparison and add it to the Tutorial Dashboard.
|
||||
|
||||
### Resampling the data
|
||||
|
||||
In this section, we’ll resample the data so that rather than having daily data we have weekly data.
|
||||
As in the previous section, reopen the Tutorial Advanced Analytics Base chart.
|
||||
|
||||
Next, in the Python Functions subsection of **Advanced Analytics**, enter 7D, corresponding to seven
|
||||
days, in the Rule and median as the Method and show the chart by selecting **Run Query**.
|
||||
|
||||
<img src={useBaseUrl("/img/tutorial/resample.png" )} />
|
||||
|
||||
Note that now we have a single data point every 7 days. In our case, the value showed corresponds to
|
||||
the median value within the seven daily data points. For more information on the meaning of the
|
||||
various options in this section, refer to the
|
||||
[Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html).
|
||||
|
||||
Lastly, save your chart as Tutorial Resample and add it to the Tutorial Dashboard. Go to the
|
||||
tutorial dashboard to see the four charts side by side and compare the different outputs.
|
||||
Reference in New Issue
Block a user