Building Azure Databricks Cluster installing desired packages & with a demo run (Time stone from Python Verse)

Today, I’ll be showing how to prepare a cluster in Azure Databricks from command prompt & will demonstrate any sample csv file process using Pyspark. This can be useful, especially when you want to customize your environment & need to install specific packages inside the clusters with more options.

This is not like any of my earlier posts, where my primary attention is on the Python side. At the end of this post, I’ll showcase one use of Pyspark script & how we can execute them inside Azure Data bricks.

Let’s roll the dice!

Step -1:

Type Azure Databricks in your search folder inside the Azure portal.

0. Azure Search

As shown in the red box, you have to click these options. And, it will take the application to new data bricks sign-in page.

Step -2:

Next step would be clicking the “Add” button. For the first time, the application will ask you to create a storage account associated with this brick.

1. Create Storage

After creation, the screen should look like this –

2.5. Azure-Data-Bricks Options

Now, click the Azure command-line & chose bash as your work environment –

2. After Creation

For security reason, I’ve masked the details.

After successful creation, this page should look like this –

3. Azure Databricks

Once, you click the launch workspace, it will take you to this next page –

4. Detailed Bricks

As you can see that, there are no notebook or python scripts there under Recents tab.

Step -3:

Let’s verify it from the command line shell environment.

5. Python-Env

As you can see, by default python version in bricks is 3.5.2.

Step -4:

Now, we’ll prepare one environment by creating a local directory under the cloud.

The directory that we’ll be creating is – “rndBricks.”

6. Creating Directory

Step -5:

Let’s create the virtual environment here –

Using “virtualenv” function, we’ll be creating the virtual environment & it should look like this –

7. Creating Python-VM

As you can see, that – this will create the first python virtual environment along with the pip & wheel, which is essential for your python environment.

After creating the VM, you need to update Azure CLI, which is shown in the next screenshot given below –

8. Installing Databricks CLI in Python-VM

Before you create the cluster, you need to first generate the token, which will be used for the cluster –

9.1. Generating Token

As shown in the above screen, the “red” marked area is our primary interest. The “green” box, which represents the account image that you need to click & then you have to click “User Settings” marked in blue. Once you click that, you can see the “purple” area, where you need to click the Generate new token button in case if you are doing it for the first time.

Now, we’ll be using this newly generated token to configure data bricks are as follows –

9.2. Configuring with Token

Make sure, you need to mention the correct zone, i.e. westus2/westus or any region as per your geography & convenience.

Once, that is done. You can check the cluster list by the following command (In case, if you already created any clusters in your subscription) –

10. Checking Clusters List

Since we’re building it from scratch. There is no cluster information showing here.

Step -6:

Let’s create the clusters –

11. Creating-Clusters-From-Command

Please find the command that you will be using are as follows –

databricks clusters create –json ‘{ “autoscale”: {“min_workers”: 2, “max_workers”: 8}, “cluster_name”: “pyRnd”, “spark_version”: “5.3.x-scala2.11”, “spark_conf”: {}, “node_type_id”: “Standard_DS3_v2”, “driver_node_type_id”: “Standard_DS3_v2”, “ssh_public_keys”: [], “custom_tags”: {}, “spark_env_vars”: {“PYSPARK_PYTHON”: “/databricks/python3/bin/python3”}, “autotermination_minutes”: 20, “enable_elastic_disk”: true, “cluster_source”: “UI”, “init_scripts”: [] }’

As you can see, you need to pass the information in JSON format. For your better understanding, please find the JSON in a proper format –

11.5. JSON

And, the raw version –

{
  "autoscale": {
    "min_workers": 2,
    "max_workers": 8
  },
  "cluster_name": "pyRnd",
  "spark_version": "5.3.x-scala2.11",
  "spark_conf": {},
  "node_type_id": "Standard_DS3_v2",
  "driver_node_type_id": "Standard_DS3_v2",
  "ssh_public_keys": [],
  "custom_tags": {},
  "spark_env_vars": {
    "PYSPARK_PYTHON": "/databricks/python3/bin/python3"
  },
  "autotermination_minutes": 20,
  "enable_elastic_disk": true,
  "cluster_source": "UI",
  "init_scripts": []
}

Initially, the cluster status will show from the GUI are as follows –

12. Cluster-Status-In-Progress

After a few minutes, this will show the running state –

13. Cluster-Running Status

Let’s check the detailed configuration once the cluster created –

14. Initial Cluster Details

Step -7:

We need to check the library section. This is important as we might need to install many dependant python package to run your application on Azure data bricks. And, the initial Libraries will look like this –

15. Libraries

You can install libraries into an existing cluster either through GUI or through shell command prompt as well. Let’s explore the GUI option.

GUI Option:

First, click the Libraries tab under your newly created clusters, as shown in the above picture. Then you need to click “Install New” button. This will pop-up the following windows –

16. Installing Libraries

As you can see, you have many options along with the possibilities for your python (marked in red) application as well.

Case 1 (Installing PyPi packages):

19. Installing through GUI

Note: You can either mention the specific version or just simply name the package name.

Case 2 (Installing Wheel packages):

16.5. Installing Wheel Libraries

As you can see, from the upload options, you can upload your local libraries & then click the install button to install the same.

UI Option:

Here is another way, you can install your python libraries using the command line as shown in the below screenshots –

17. Running & Installing Libraries - Alternate Options

Few things to notice. The first command shows the current running cluster list. Second, command updating your pip packages. And, the third command, install your desired pypi packages.

Please find the raw commands –

databricks clusters list

pip install -U pip

databricks libraries install –cluster-id “XXXX-XXXXX-leech896” –pypi-package “pandas” –pypi-repo “https://pypi.org/project/pandas/”

After installing, the GUI page under the libraries section will look like this –

18. Installed Libraries

Note that, for any failed case, you can check the log in this way –

20. Installation-In-progress

If you click on the marked red area, it will pop-up the detailed error details, which is as follows –

19.5. Error Details

So, we’re done with our initial set-up.

Let’s upload one sample file into this environment & try to parse the data.

Step -8:

You can upload your sample file as follows –

23.1. First Step

First, click the “data” & then click the “add data” marked in the red box.

You can import this entire csv data as tables as shown in the next screenshot –

23.2. Uploading Data Files

Also, you can create a local directory here based on your requirements are explained as –

24. Creating Local Directory For Process

Step -9:

Let’s run the code.

Please find the following snippet in PySpark for our test –

1. DBFromFile.py (This script will call the Bricks script & process the data to create an SQL like a table for our task.)

###########################################
#### Written By: SATYAKI DE        ########
#### Written On: 10-Feb-2019       ########
####                               ########
#### Objective: Pyspark File to    ########
#### parse the uploaded csv file.  ########
###########################################

# File location and type
file_location = "/FileStore/tables/src_file/customer_addr_20180112.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "true"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

display(df)

# Create a view or table

temp_table_name = "customer_addr_20180112_csv"

df.createOrReplaceTempView(temp_table_name)

%sql

/* Query the created temp table in a SQL cell */

select * from `customer_addr_20180112_csv`

From the above sample snippet, one can see that the application is trying to parse the source data by providing all the parsing details & then use that csv as a table in SQL.

Let’s check step by step execution.

25. Working With Uploaded File

So, until this step, you can see that the application has successfully parsed the csv data.

And, finally, you can view the data –

25.1. Second Option

As the highlighted blue box shows that the application is using this csv file as a table. So, you have many options to analyze the information flexibly if you are familiar with SQL.

After your job run, make sure you terminate your cluster. Otherwise, you’ll receive a large & expensive usage bill, which you might not want!

So, finally, we’ve done it.

Let me know what do you think.

Till then, Happy Avenging! 😀

Note: All the data posted here are representational data & available over the internet & for educational purpose only.