Last updated: May 15 16:00

Spark on z/OS Challenge


Level 1: Getting Started

Level 1.1: Reading materials

Follow these links to learn about IBM DB2 for z/OS, Virtual Storage Access Method (VSAM), IBM z/OS Platform for Apache Spark, Jupyter Notebook, and Scala.

  1. IBM DB2 for z/OS
  2. VSAM
  3. IBM z/OS Platform for Apache Spark
  4. Jupyter Notebook
  5. Scala

Get points: If you are participating in the hackZone challenge and would like to earn points, please return to the Advocate Hub and answer questions based on this level.

Level 1.2: Sign up for the challenge

  1. Obtain a Spark instance from the Apache Spark on z/OS trial if you have not done so already.
  2. Check for an email from zcloud-admin that contains information and credentials needed for you to continue.
  3. Download the following prerequisite files from https://github.com/cloud4z/spark.
    • DB2 data file – sppaytb1.data
    • DB2 DDL – sppaytb1.ddl
    • Spark Demo JAR file – ClientJoinVSAM.jar

Level 2: Self-service Dashboard

Level 2.1: Overview

This level demonstrates running an analytics application with Spark Submit, using the IBM z Systems Community Cloud self-service portal and dashboard.

The exercises in this challenge use data stored in DB2 and VSAM tables, and a machine learning application written in Scala. The exercises use fictitious customer information and credit card transaction data to evaluate customer retention.
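
If you want to experiment beyond the dashboard buttons, the Scala sketches in this guide assume a Spark 2.x SparkSession named ‘spark’. This is an assumption for illustration: in the Jupyter Notebook used in Level 3, Apache Toree provides an equivalent session for you, while a standalone program creates its own, along these lines:

    import org.apache.spark.sql.SparkSession

    // Minimal session setup; the application name is a placeholder.
    val spark = SparkSession.builder()
      .appName("zOSChallengeSketch")
      .getOrCreate()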

Level 2.2: Working with the dashboard

The Mozilla Firefox browser is recommended for these exercises.

  1. Open a web browser and enter the URL to access the z Systems Community Cloud self-service portal.

    Image: Portal

    Enter your Portal User ID and password. 
    Click ‘Sign In’.
  2. You will see the home page for the z Systems Community Cloud self-service portal.
    Click ‘Try Analytics Service’.

    Image: Analytics trial

  3. You will now see a dashboard, which shows the status of your Apache Spark on z/OS instance.

    Image: Analytics trial dashboard

    • At the top of the screen, notice the ‘z/OS Status’ indicator, which should show the status of your instance as ‘OK’.
    • In the middle of the screen, the ‘Spark Instance’, ‘Status’, ‘Data management’, and ‘Operations’ sections will be displayed. The ‘Spark Instance’ section contains your individual Spark Instance Username and IP address.
    • Below the field headings, you will see buttons for functions that can be applied to your instance.

    The following table lists the operation for each function.

    Function         Operation
    ---------------  ---------------------------------------------------------
    Change Password  Click to change your Spark password
    Start            Click to start your individual Spark cluster
    Stop             Click to stop your individual Spark cluster
    Upload Data      Click to select and load your DDL and data file into DB2
    Spark Submit     Click to select and run your Spark program
    Spark UI         Click to launch your individual Spark worker output GUI
    Jupyter          Click to launch your individual Jupyter Notebook GUI
  4. When logging in for the first time, you must set a new Spark instance password.
    Click ‘Change Password’ in the 'Spark Instance' section.

    Image: Change initial password

    Enter a new password for your Spark instance.
    Repeat the new password for your Spark instance.
    Click ‘Change Password’.

    Image: Change initial password

  5. Confirm that your instance is Active.
    If it is ‘Stopped’, click ‘Start’ to start it. 

    Image: Start Spark

  6. Next, load the DB2 data you downloaded in Level 1 from your local system. This will create the appropriate DB2 table for analysis.
    Click ‘Upload Data’. 

    Image: Upload data

    Select the DB2 DDL file. 
    Select the DB2 data file. 
    Click 'Upload'.

    Image: Upload DB2

    You will see the status change from 'Transferring' to 'Loading' to 'Upload Success'. The VSAM data for this exercise has already been loaded for you; no further action is required.

  7. Submit a prepared Scala program to analyze the data.
    Click ‘Spark Submit’. 

    Image: Spark submit

    Select the ClientJoinVSAM JAR file you downloaded.
    Specify the main class name ‘com.ibm.scalademo.ClientJoinVSAM’.
    Enter the arguments: ‘Spark Instance Username’ ‘Spark Instance Password’.
    Click ‘Submit’.

    Image: Spark submit

    A status of “JOB Submitted” will appear when the program is complete. This Scala program will access DB2 and VSAM data, perform transformations on the data, join the two tables in a Spark dataframe, and store the result back to DB2. (A hedged sketch of this flow appears at the end of this level.)

  8. Launch your individual Spark worker output GUI to view the job you just submitted.
    Click ‘Spark UI’. 

    Image: Spark web UI

    Authenticate with your Spark instance username and Spark password. 

    Image: Authenticate

    Click on the ‘Worker ID’ for your program in the ‘Completed Drivers’ section.

    Image: Spark web UI

    Authenticate with your Spark instance username and Spark password. 
    Click on ‘stdout’ for your program in the ‘Finished Drivers’ section to view your results.

    Image: Spark web UI

    • The first table shows the top 20 rows of the VSAM data (customer information).
    • The second table shows the top 20 rows of the DB2 data (transaction data).
    • The third table shows the top 20 rows of the result ‘client_join’ table.

    Question: What is the age of the customer with ID 1009550400?

    Question: What is the annual income of the customer with ID 1009532800?

    Question: What is the activity level of the customer with ID 1009520420?
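
The ClientJoinVSAM program you submitted in step 7 is a prebuilt JAR, but a rough sketch of the flow it implements might look like the following. Every connection detail here is an illustrative assumption: the DB2 JDBC URL, table names, and join column are placeholders, and VSAM data on z/OS is typically surfaced to Spark through the Mainframe Data Service JDBC driver, whose driver class and URL scheme below are likewise assumptions rather than the demo's actual values.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("ClientJoinVSAMSketch").getOrCreate()

    // Your Spark Instance credentials (placeholders).
    val user = "<Spark Instance Username>"
    val pass = "<Spark Instance Password>"

    // Load the DB2 transaction data over JDBC; the URL, location, and
    // table name are hypothetical placeholders.
    val db2Df = spark.read.format("jdbc")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("url", "jdbc:db2://<Spark Instance IP>:<port>/<location>")
      .option("user", user)
      .option("password", pass)
      .option("dbtable", "SPPAYTB1")
      .load()

    // Load the VSAM customer data. On z/OS this is typically exposed as a
    // virtual table through the Mainframe Data Service JDBC driver; the
    // driver class, URL scheme, and table name here are assumptions.
    val vsamDf = spark.read.format("jdbc")
      .option("driver", "com.rs.jdbc.dv.DvDriver")
      .option("url", "jdbc:rs:dv://<Spark Instance IP>:<port>")
      .option("user", user)
      .option("password", pass)
      .option("dbtable", "CLIENT_INFO")
      .load()

    // Join customers to transactions on a shared customer ID column
    // (hypothetical name), then store the result back to DB2.
    val clientJoin = vsamDf.join(db2Df, "CONT_ID")
    clientJoin.write.format("jdbc")
      .option("driver", "com.ibm.db2.jcc.DB2Driver")
      .option("url", "jdbc:db2://<Spark Instance IP>:<port>/<location>")
      .option("user", user)
      .option("password", pass)
      .option("dbtable", "CLIENT_JOIN")
      .save()

In Level 3 you will step through this same flow interactively, split across notebook cells.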

Get points: If you are participating in the hackZone challenge and would like to earn points, please return to the Advocate Hub and answer questions based on this level.


Level 3: Jupyter Notebook

Level 3.1: Overview

In this section, you will use the Jupyter Notebook tool, launched from the dashboard. This tool allows you to write and submit Scala code to your Spark instance and view the output within a web GUI.

Level 3.2: Working with Jupyter Notebook

  1. Launch the Jupyter Notebook service in your browser from your dashboard.
    Click 'Jupyter'.

    Image: Jupyter Notebook

    You will see the Jupyter home page.

    Image: Jupyter Home

  2. The prepared Scala program in this level will access DB2 and VSAM data, perform transformations on the data, join these two tables in a Spark dataframe, and store the result back to DB2. It will also perform a logistic regression analysis and plot the output.
    Click on the Demo.ipynb file.

    The Jupyter Notebook will connect to your Spark on z/OS instance automatically and will be in the ready state when the Apache Toree – Scala indicator in the top right-hand corner of the screen is clear.

    Image: Jupyter ready

    The Jupyter Notebook environment is divided into input cells labeled ‘In [#]:’.

  3. Run cell #1 - The Scala code in the first cell loads the VSAM data (customer information) into Spark and performs a data transformation.
    Click on the first ‘In [ ]:’ cell.

    The left border will change to blue when a cell is in command mode.

    Click in the cell to edit. 

    The left border will change to green when a cell is in edit mode.

    Change the value of zOS_IP to your Spark Instance IP Address.
    Change the value of zOS_USERNAME to your Spark Instance Username.
    Change the value of zOS_PASSWORD to your Spark Instance Password.

    Image: Edit cell #1

    Execute the Scala code in the first cell. Jupyter Notebook will check the Scala code for syntax errors and compile the code for you.

    Click the run cell button as shown below.

    Image: Run cell #1

    The Jupyter Notebook connection to your Spark instance is in the busy state when the Apache Toree – Scala indicator in the top right-hand corner of the screen is grey.

    Image: Spark busy status

    When the indicator turns clear again, the cell run has completed and the connection has returned to the ready state. The first 20 records of the result are displayed.

    Image: cell #1 VSAM results

    Question: What is the youngest age in these top 20 results?

  4. Run cell #2

    The Scala code in the second cell loads the DB2 data (transaction data) into Spark and performs a data transformation.

    Click on the next ‘In [ ]:’ cell to select it.
    Click the run cell button.

    Question: What is the name of the first merchant in the top 20 results?

  5. Run cell #3

    The Scala code in the third cell joins the VSAM and DB2 data into a new ‘client_join’ dataframe in Spark.

    Click on the next ‘In [ ]:’ cell to select it.
    Click the run cell button.

    Question: How many customers in the top 20 results have an annual income greater than 100000?

  6. Run cell #4

    The Scala code in the fourth cell performs a logistic regression to evaluate the probability of customer churn as a function of customer activity level. It also creates the ‘result_df’ dataframe, which is used to plot the results on a line graph. (A hedged sketch of this kind of regression appears after this list.)

    Click on the next ‘In [ ]:’ cell to select it.
    Click the run cell button.
  7. Run cell #5

    The Scala code in the fifth cell plots the ‘plot_df’ dataframe.

    Click on the next ‘In [ ]:’ cell to select it.
    Click the run cell button.

    Question: What inferences can you make based on the shape of the plot?

  8. Challenge yourself!

    In a new input cell, write Scala code to determine the following (a starter sketch appears after this list):

    • The number of rows in the input VSAM dataset.
    • The number of rows in the input DB2 dataset.
    • The number of rows in the joined dataset.
    • The probability of customer churn as a function of customer age.
    • Question: What inferences can you make based on the shape of the plot?
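
Cell #4 is provided for you, but if you are curious how such a regression can be expressed with Spark's ML library, here is a minimal sketch. It assumes the joined dataframe from the Level 2 sketch, a numeric 0/1 churn indicator, and the column names ‘ACTIVITY_LEVEL’ and ‘CHURN’; all of these are illustrative assumptions, not necessarily the notebook's actual names.

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.sql.functions.col

    // Pack the single feature into a vector column; 'ACTIVITY_LEVEL' and
    // the 0/1 churn column 'CHURN' are hypothetical names.
    val assembler = new VectorAssembler()
      .setInputCols(Array("ACTIVITY_LEVEL"))
      .setOutputCol("features")
    val training = assembler.transform(clientJoin)
      .withColumn("label", col("CHURN").cast("double"))

    // Fit a logistic regression: P(churn) as a function of activity level.
    val model = new LogisticRegression().fit(training)

    // Score the data; the 'probability' column holds the fitted P(churn).
    val resultDf = model.transform(training)
      .select("ACTIVITY_LEVEL", "probability")
    resultDf.show(20)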
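
For the row-count parts of the challenge, the DataFrame API's count() is all you need. The dataframe names below follow the earlier sketches and are assumptions; substitute whatever names your notebook actually uses.

    // Row counts for the challenge; the dataframe names follow the
    // sketches above and may differ in your notebook.
    println(s"VSAM rows:   ${vsamDf.count()}")
    println(s"DB2 rows:    ${db2Df.count()}")
    println(s"Joined rows: ${clientJoin.count()}")

    // For churn as a function of age, repeat the cell #4 pattern using a
    // customer-age column (hypothetical name 'AGE_YEARS') as the feature.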

Get points: If you are participating in the hackZone challenge and would like to earn points, please return to the Advocate Hub and answer questions based on this level.