Why Apache Spark & Azure Databricks are the Ideal Combo for Analytics Workloads


As a data scientist or a manager looking to maximize analytics productivity, you’ve probably heard—or even said—this before: “It should be done soon, I’m just waiting for my experiment to finish running.” But why do you have to wait around for experiments to finish? In the age of automation and instant gratification, why can’t you speed this process up?

The likely problem: you’re operating on a single machine. Single-machine performance has improved only incrementally over the past decade, so if you’re still running experiments on one machine, you’re bound to wait a while. The good news? With a couple of helpful tools, you won’t have to stay stuck in these old ways.


What Can Be Done?

To cut computation time, consider distributing your analytics workloads across multiple machines. You might ask, “What about GPUs?” While graphics processing units (GPUs) do speed up model training, a distributed cluster of GPU-equipped machines is faster still. So, for the single-machine enthusiasts out there, it’s time to learn how to process analytics workloads on a distributed cluster.
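Conceptually, distributing a workload means splitting the data into partitions, processing each partition independently, and then combining the partial results. Below is a minimal pure-Python sketch of that map/reduce pattern for intuition only — the function names are illustrative, not Spark APIs; Spark automates all of this across a cluster for you:

```python
# A sketch of the idea behind distributed analytics: split the data into
# partitions, process each partition independently (a cluster would ship each
# one to a different machine), then combine the partial results.
def partition(data, n_parts):
    """Split data into n_parts roughly equal chunks."""
    size = (len(data) + n_parts - 1) // n_parts
    return [data[i:i + size] for i in range(0, len(data), size)]

def partial_stats(chunk):
    """Work done on each partition -- here, a partial sum and count."""
    return sum(chunk), len(chunk)

def combine(results):
    """Reduce step: merge the per-partition results into a global answer."""
    total = sum(s for s, _ in results)
    count = sum(c for _, c in results)
    return total / count

data = list(range(1, 101))                                 # pretend this is a huge dataset
partials = [partial_stats(p) for p in partition(data, 4)]  # on a cluster, these run in parallel
mean = combine(partials)
print(mean)  # 50.5 -- same answer as on a single machine, but each chunk can run elsewhere
```

The key property is that the per-partition work never needs to see the whole dataset, which is what makes it safe to spread across machines.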


What Tools Should I Use?

While there are a few options out there, by far the largest and fastest-growing distributed analytics community is built around the open source tool Apache Spark (Spark).  Spark has been developed in the open since 2010 and has consistently made great strides in abstracting away technical overhead to deliver top-notch analytics.

Spark is designed to unify analytics, so you only have to manage one tool for just about every use case—like batch, streaming and graph analysis—instead of using a zoo of specialized tools.  Plus, you can combine Spark with additional features from Databricks and Azure to alleviate the need for low-level programming and cluster management, allowing you to focus on what matters: delivering state-of-the-art analytics products.


How Do I Get Started?

First things first, you’ll need an environment to code in. For some, this can feel like a daunting process (downloading libraries, provisioning a cluster, etc.), but luckily, Databricks and Azure make it effortless by managing the whole process for you. After a few clicks through the registration process, you can be up and running in about five minutes.

To get started, you’ll need to create a trial Azure account and provision a Databricks cluster (for instructions, check out the Azure Databricks Getting Started Guide). Simply provide Azure Databricks with some basic cluster configuration, and it will handle everything from provisioning resources to managing communication for you. Once your cluster is running, you can attach an Azure Databricks notebook to it and start working in a Jupyter-like coding environment.  And, with additional features provided by Azure, you can seamlessly connect to existing Azure resources and manage permissions.

There’s also great Azure Databricks documentation available to help you start coding.  To jump-start your learning, below is a simple script you can copy, paste and execute in cells of an Azure Databricks notebook.  If you’re familiar with the scikit-learn library and Pandas DataFrames, using Spark in your model training workloads will be a breeze.


# import the required Spark ML classes
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator


# get the breast cancer detection dataset from UCI
# (run this in its own notebook cell; %sh is the Databricks magic for shell commands)
%sh wget https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data


# process the raw file into a Spark DataFrame (similar to a Pandas DataFrame)
data = []

with open('wdbc.data') as f:
    for line in f:
        attributes = line.rstrip('\n').split(',')
        label = attributes[1]  # column 1 is the diagnosis (M = malignant, B = benign)
        features = Vectors.dense([float(x) for x in attributes[2:]])  # columns 2+ hold the numeric features
        data.append((label, features))

inputDF = spark.createDataFrame(data, ['label', 'features'])



# prepare the label column for training by encoding it as a numeric index
stringIndexer = StringIndexer(inputCol='label', outputCol='labelIndexed')
stringIndexer = stringIndexer.fit(inputDF)
df = stringIndexer.transform(inputDF)



# perform a train/test split on the data (to avoid overfitting)
train, test = df.randomSplit([0.70, 0.30], seed=42)


# fit a gradient-boosted tree model on the training set
model = GBTClassifier(labelCol='labelIndexed')
model = model.fit(train)


# make predictions with the trained model on the test set
predictions = model.transform(test)



# evaluate the model's test set predictions
f1_scorer = MulticlassClassificationEvaluator(labelCol='labelIndexed', predictionCol='prediction', metricName='f1')
precision_scorer = MulticlassClassificationEvaluator(labelCol='labelIndexed', predictionCol='prediction', metricName='weightedPrecision')
recall_scorer = MulticlassClassificationEvaluator(labelCol='labelIndexed', predictionCol='prediction', metricName='weightedRecall')

f1 = f1_scorer.evaluate(predictions)
precision = precision_scorer.evaluate(predictions)
recall = recall_scorer.evaluate(predictions)

print(precision, recall, f1)
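If you’re curious what StringIndexer is doing in the script above: it maps each distinct string label to a numeric index, with the most frequent label assigned 0.0 (its default ordering). Here is a rough pure-Python picture of that behavior, for intuition only — this is not Spark’s actual implementation, and `fit_string_indexer` is a made-up helper name:

```python
from collections import Counter

def fit_string_indexer(labels):
    """Map each distinct label to a float index, most frequent label first
    (index 0.0), roughly mirroring StringIndexer's default ordering."""
    freq = Counter(labels)
    ordered = sorted(freq, key=lambda lbl: (-freq[lbl], lbl))  # break ties alphabetically
    return {lbl: float(i) for i, lbl in enumerate(ordered)}

labels = ['B', 'M', 'B', 'B', 'M']      # WDBC-style diagnoses: benign / malignant
mapping = fit_string_indexer(labels)
indexed = [mapping[lbl] for lbl in labels]
print(mapping)   # {'B': 0.0, 'M': 1.0}
print(indexed)   # [0.0, 1.0, 0.0, 0.0, 1.0]
```

Because benign cases outnumber malignant ones in this dataset, ‘B’ ends up encoded as 0.0 and ‘M’ as 1.0 in the `labelIndexed` column.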

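The weighted metrics printed at the end can also be sanity-checked by hand: each per-class precision, recall, and F1 is weighted by that class’s share of the true labels, then summed. The pure-Python example below is illustrative only (not Spark’s implementation), and `weighted_metrics` is a made-up helper name:

```python
def weighted_metrics(y_true, y_pred):
    """Compute weighted precision, recall, and F1, weighting each class
    by its share of the true labels."""
    classes = sorted(set(y_true))
    n = len(y_true)
    precision = recall = f1 = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec_c = tp / (tp + fp) if tp + fp else 0.0
        rec_c = tp / (tp + fn) if tp + fn else 0.0
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c) if prec_c + rec_c else 0.0
        weight = sum(1 for t in y_true if t == c) / n   # class share of true labels
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    return precision, recall, f1

# two classes, four examples: one benign (0.0) case is misclassified as malignant (1.0)
p, r, f = weighted_metrics([0.0, 0.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0])
print(round(p, 4), round(r, 4), round(f, 4))  # 0.8333 0.75 0.7333
```

Working through a tiny example like this makes it much easier to interpret the three numbers your notebook prints for the full test set.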

Want to learn more data science tips and best practices? Peruse our other blogs so you can stay on top of the latest and greatest on AI, machine learning and more.
