February 13, 2025

Levelling up Tableau for Big Data: with Apache Spark + Hyper API


Introduction →

We at Games24x7 have witnessed explosive growth in a variety of skill-based interactive games over the last few years, including Rummy, Fantasy, Ludo and Poker. Our drive to continually enhance the user experience, with data serving as the primary driver of all organizational business decisions, is what has allowed us to build and attract users at the pace that we do.

At Games24x7, data in its rawest form can be consumed directly by running Apache Spark SQL queries against our AWS S3 based data lake.

Business teams rely upon Tableau dashboards as the preferred method of consuming data, and these dashboards are considered the authoritative source for presentations across all verticals.

Everything sounds good… right?

Not exactly. Over the years the size of our data lake has grown considerably (into petabytes as of now), increasing the volume of data that needs to be scanned and aggregated before it is eventually presented on a Tableau dashboard. This has resulted in increased costs and severe scaling issues for Tableau data extracts.

Additionally, the creation of a Tableau dashboard itself is a manual and time-consuming process. The following is a simplified view of how data flows within the organisation before it reaches a Tableau dashboard.

Problem statement →

Let’s look at the core issues here:

  • Scalability of Tableau data extract refresh
    • Time consuming: The Tableau Server backgrounder has a limit on how many extracts can be refreshed at once, with anything beyond that limit being queued. With the volume of data that needs to be scanned each time running into GBs, the process can be very slow.
    • Frequent failures: Each refresh request has a default timeout of 7,200 seconds, inclusive of queue time, which often causes critical dashboard refreshes to fail. Increasing the timeout is not feasible as it would breach our SLAs.
    • High cost: Overall costs were much higher because the SQL endpoint used to refresh the extracts remained active far longer than planned due to queued refreshes.
  • Building your Tableau dashboard
    • Time consuming: Business teams generally don’t have in-depth expertise in SQL or the associated data marts, so they depend on data analysts for the creation of each of their dashboards, even though the requirements are often repetitive.
    • Cumbersome: There are multiple cases where changing filters triggers a full refresh of the extract, which can take up to an hour. Due to the high data volume and the flexibility that extracts offer, moving to a live connection is not feasible.
    • Error prone: At each stage, a dashboard needs to be manually saved to ensure progress is persisted.

Overall, the experience of regularly accessing data through a dashboard can be cumbersome and time consuming, impacting the overall efficiency of business teams.

Making dashboards great again →

To address these problems, we brainstormed multiple solutions targeting not just the technical difficulties in the current flow of refreshing a dashboard, but also the significant time stakeholders spend getting a dashboard built. Let’s look at how the pain points were addressed -

Scalability of Tableau data extracts -

Programmatic creation of Tableau extracts -

We created a Python library that leverages Tableau’s Hyper API to create Hyper files, which act as the data engine for Tableau extracts.

To make data source creation scalable, we leveraged the open-source library hyperleaup, which uses Apache Spark to create Hyper files.

Essentially, our library was built on top of the following key libraries (a minimal usage sketch follows the list):

  • hyperleaup: To create Hyper files using Apache Spark
  • tableau-server-client: To authenticate and publish the Hyper file to the required location
  • pyspark: To create the required tables which will be converted to a Hyper file
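
As a rough illustration (a minimal sketch rather than our exact internal library code), building a Hyper extract from a Spark SQL query with hyperleaup and pushing it to Tableau Server looks roughly like this. The table, server URL, project and credential values below are placeholders, and the parameter names follow hyperleaup’s documented interface, which may differ by version.

from hyperleaup import HyperFile

# Hypothetical query against a lake table; replace with a real datamart table
query = """
SELECT user_id, game, SUM(deposit_amount) AS total_deposits
FROM analytics.daily_user_metrics
GROUP BY user_id, game
"""

# Build the Hyper file on the Spark cluster (a Databricks job cluster in our case)
hf = HyperFile(name="daily_user_metrics", sql=query, is_dbfs_enabled=True)

# Publish the extract as a data source on Tableau Server
hf.publish(
    tableau_server_url="https://tableau.example.com",  # placeholder server URL
    username="svc_tableau",                             # placeholder credentials
    password="********",
    site_id="analytics",
    project_name="Automated Extracts",
    datasource_name="daily_user_metrics",
)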

Key Features:

  • A prerequisite for this setup is that workbooks and data sources now have to be published separately, as a .twbx file and a .hyper file respectively.
  • The tool can be used by simply importing a Python package.
  • Creation or refresh of an extract can be initiated by passing a SQL query to a function as a parameter, along with other server details.
  • Tableau’s own backgrounder is completely bypassed: all data preparation now happens on Databricks, after which the extract is simply pushed to the Tableau server.
  • Individual Databricks job clusters are used to publish each extract.
  • The data source is available as a .tdsx file on the Tableau server.

Advantages:

  • Cost efficient: With Databricks job clusters, data source creation is significantly more efficient as clusters auto-terminate after job completion, saving around $4,000 each month.
  • Scalable: Each data source preparation can now be powered by the right amount of compute by configuring a suitable job cluster, increasing concurrency by 10x, with extracts of size >= 70 GB being refreshed on a daily basis.
  • Zero queuing: With no involvement of Tableau’s backgrounder, all the pain points associated with it have been eliminated.

Overall, this tool resolved all the pain points associated with the scalability of daily Tableau data extract refreshes on the server.

Building your Tableau dashboard:

Developing a dashboard involves multiple steps that are repetitive across dashboards; for example, in experiment analysis the metrics analysed are often standard. Ideally, a single large data extract would have been the go-to solution, but due to high data volume and excessive dashboard load times this is not feasible, leading to individual extracts for each dashboard.

Dynamic Linkage

We created a framework that populates data in the campaign dashboard based on UTM parameters defined within the dashboard link. The link is created dynamically by our in-house EDS (event detection system) platform, which is used to run campaigns.

A sample link would look like this -

http://1X.5XX.2XX.1XX/#/views/C/Retention?:iid=1&exp_end_date=2023-05-31&exp_start_date=2023-05-09&expname=RCRet_ss_Seed_Value_2

Where,

            expname: Experiment name, which is essentially an alias for a group of campaigns,

            exp_start_date: The start date of the experiment,

            exp_end_date: The end date of the experiment.
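
As an illustration only, a link of this shape could be assembled programmatically; the host placeholder and parameter values below simply mirror the sample link above.

from urllib.parse import urlencode

base_url = "http://<tableau-host>/#/views/C/Retention"
params = {
    "expname": "RCRet_ss_Seed_Value_2",
    "exp_start_date": "2023-05-09",
    "exp_end_date": "2023-05-31",
}

# ':iid=1' is appended as-is to match the sample link above
dashboard_link = f"{base_url}?:iid=1&{urlencode(params)}"
print(dashboard_link)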

Key Features:

  • UTM parameters: A new dashboard can be accessed directly by clicking on this link
  • Live connection via SQL endpoint: Instead of a data source persisted on the Tableau server, this method relies on data being queried dynamically when the link is opened.
  • Dynamic: For each campaign, the link can be accessed directly by selecting the specific campaign details.

Advantages:

  • Zero creation time for any dashboard.
  • User friendly: a fresh dashboard can now effectively be created within the browser instead of being manually set up over RDP (Remote Desktop Protocol), saving roughly 3-4 hours per campaign in dashboard creation.

Disadvantages:

  • High load time: Due to a live connection being used here, full data refresh occurs each time the dashboard is loaded.
  • No portability: Due to reliance on a live connection, the final output is never persisted, so the view is fixed to just one window.
  • Cost surge: Data being refreshed each time the dashboard is loaded results in increased cost.

Overall, the interface for accessing dashboards improved significantly, but it came at the cost of higher spend and a poorer user experience due to longer dashboard load times.

BLITZ

The previous solution, even though it simplified the process of creating your own view of the dashboard on demand, resulted in a significant downgrade in terms of load time and portability.

To address this, we created a new feature, BLITZ, within our in-house platform BIMaze. Here we can configure dashboards covering all the possible domain-based use cases, similar to the previous solution, while also leveraging the Hyper API, as seen earlier, to address the issues related to high load times and portability.

A user can submit their dashboard request, go for a coffee break, and by the time they are back they will have received an email notification with their dashboard link.

Let's walk through the new journey to create a Tableau dashboard →

  1. Any user across the organization can log onto our platform and configure a job responsible for creating, as well as refreshing, the dashboard at a user-chosen interval.
  2. At the chosen interval, a Databricks job is triggered.
  3. The job runs pre-configured, parameterized Spark SQL queries to create final tables containing all the data points requested in the job.
  4. The tables are then converted to Hyper files using the Hyper API and published to the Tableau server using Tableau Server’s REST APIs, where they are available as .tdsx data sources (a sketch of this publishing step follows the list).
  5. Published data sources are attached to a copy of a predefined Tableau workbook (.twb file) to generate the final dashboard.
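
As a rough sketch of steps 4 and 5 (not the exact production code), publishing the Hyper file and the rewritten workbook with the Tableau Server Client library could look like the following; the server URL, token, project name and file paths are placeholders.

import tableauserverclient as TSC

server = TSC.Server("https://tableau.example.com", use_server_version=True)  # placeholder URL
auth = TSC.PersonalAccessTokenAuth("blitz-token", "<secret>", site_id="analytics")  # placeholder token

with server.auth.sign_in(auth):
    # Find the target project on the server (placeholder project name)
    project = next(p for p in TSC.Pager(server.projects) if p.name == "BLITZ")

    # Step 4: publish the Hyper file as a standalone data source
    ds_item = TSC.DatasourceItem(project_id=project.id, name="general_ds_query")
    server.datasources.publish(ds_item, "/tmp/general_ds_query.hyper", mode="Overwrite")

    # Step 5: publish the workbook template whose data source references were rewritten
    wb_item = TSC.WorkbookItem(project_id=project.id, name="campaign_dashboard")
    server.workbooks.publish(wb_item, "/tmp/campaign_dashboard.twb", mode="Overwrite")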

Let’s understand how a Tableau workbook is attached to the published data sources →

A Tableau workbook (.twb) is essentially a set of XML instructions, so our hypothesis was that it should be possible to swap data sources by altering the values of the available keys.

Following this hypothesis, we experimented by altering multiple attributes to see which one would let us swap the data source successfully. Eventually, we found that replacing the ‘dbname’ key with the name of the published data source attaches the published data source to the workbook. The ‘dbname’ key of a data source can be identified by finding the element based on the ‘sqlproxy’ name of that particular connection.

Code blocks like the following can be observed throughout any Tableau workbook (.twb); since the templates are fixed, we can track these sqlproxy names and use them to swap data sources repeatedly.

<datasource caption='general_ds_query' inline='true' name='sqlproxy.0yyz6tk1vbzwrvhjvnvrrv5g04x1gks' version='18.1'>
      <repository-location derived-from='http://1XX.5XX.5XX.5XXX/datasources/general_ds_query_ad131?rev=2.2' id='general_ds_query_adityam_STAGE131' path='/datasources' revision='5.0' />
      <connection channel='https' class='sqlproxy' dataserver-permissions='true' dbname='general_ds_query_vi755' directory='/dataserver' port='443' saved-credentials-viewerid='4' server='stage.tableau.g24x7.com' server-ds-friendly-name='general_ds_query' server-oauth='' username='' workgroup-auth-mode='prompt'>

To swap the data source name within a workbook based on a fixed sqlproxy, we can use the following code block -

from typing import Dict, List
import xml.etree.ElementTree as ET


def update_dbname_in_twb(twb_file_path: str, ds_mapping: List[Dict[str, str]]):
    """
    Swap the data sources of a workbook that is about to be published, using the
    'sqlproxy' name of the connection between the workbook and the data source.

    Args:
        twb_file_path: Local file path of the downloaded Tableau workbook
        ds_mapping: A list of dictionaries, each containing a data source's name
                    ('extract_name') and its 'sqlProxy'
    """
    tree = ET.parse(twb_file_path)
    root = tree.getroot()

    for ds in ds_mapping:
        extract_name = ds["extract_name"]
        sqlProxy = ds["sqlProxy"]

        # Locate the <connection> element of the datasource whose name matches the sqlproxy
        connection_element = root.find(f".//datasource[@name='{sqlProxy}']/connection")

        if connection_element is not None:
            # Point the connection at the newly published data source
            connection_element.set("dbname", extract_name)
        else:
            raise Exception(f"Extract's proxy not found: {extract_name}")

    # Persist the rewritten workbook XML in place
    tree.write(twb_file_path, encoding="utf-8", xml_declaration=True)
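
For example, using the sqlproxy name and dbname visible in the XML snippet above, a call would look like this (the local workbook path is a placeholder):

ds_mapping = [
    {
        "extract_name": "general_ds_query_vi755",
        "sqlProxy": "sqlproxy.0yyz6tk1vbzwrvhjvnvrrv5g04x1gks",
    }
]
update_dbname_in_twb("/tmp/campaign_dashboard.twb", ds_mapping)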

Eventually, publishing this altered workbook gives us the final dashboard we aimed for.

Key Features:

  • Platform: Users can log onto the platform and submit their requirement; once done, they receive a mail with the dashboard link.
  • Daily refresh: Users can set up auto-refresh of their dashboards on the platform itself.
  • Programmatic extract creation: BLITZ leverages our earlier extract-creation library to publish all the extracts.
  • Zero human involvement: Once a request has been placed, no manual intervention is needed at any point.
  • Dashboards need to be configured beforehand: A template workbook needs to be configured which can be replicated for creating new dashboards based on requirements. This is a one-time activity.

Overall, we were able to completely address all the technical pain points faced earlier, while also reducing manual input to a one-time effort when setting up a new dashboard.

Advantages:

  1. Zero manual effort in creating a dashboard
  2. Turnaround time to create a dashboard has gone down from over a week to an average of just one hour, saving at least 5 working days per campaign.
  3. The new cycle has significantly improved the productivity of data analysts and analytics teams by automating dashboard requirements end-to-end. Over 1,000 dashboards have already been created on this platform.

Note -

  • The Tableau Server referenced here is a single-node server, version 2023.2.0.

Contributors:

  • Atul Sawant, Director - Business Intelligence at Games24x7 (LinkedIn)
  • Anup Tiwari, Engineering Manager at Games24x7 (LinkedIn)
  • Pratik Khedkar, SDE-3 at Games24x7 (LinkedIn)
  • Abhishek Mishra, SDE-1 at Games24x7 (LinkedIn)
  • Devansh Mehta, SDE-3 at Games24x7 (LinkedIn)

Authors:

Abhishek Mishra works as an SDE-1 at Games24x7 on the Data Engineering team. You can reach out to Abhishek on LinkedIn.

Pratik Khedkar works with the Data engineering team as an SDE-3 at Games24x7. You can reach out to Pratik on LinkedIn