Scaling Data Science: How We Use CRISP-DM and Agile

Most mature software teams acknowledge that guiding methodologies are powerful and, if followed, can help steer a team toward excellent results. In the software development world, agile is widely used and appreciated. But what most people don’t recognize is that data science, which often looks more like R&D than software development, also benefits from a guiding methodology. At AgileThought, we use the CRISP-DM methodology to build great data science products for our clients.

To explain how CRISP-DM applies, we’re going to use a fictional company called “Mega Subscription Co.,” which recently lost $100,000 in revenue due to customer attrition. Using CRISP-DM, the company wants to come up with a strategy to recover its lost revenue.

So, let’s talk about how—and why—the CRISP-DM methodology works:

But First, What is CRISP-DM?

The CRISP-DM methodology, which stands for Cross-Industry Standard Process for Data Mining, is a product of a European Union-funded project to codify the data mining process. Just as the agile mindset informs an iterative software development process, CRISP-DM conceptualizes data science as a cyclical process. And, while the cycle focuses on iterative progress, the process doesn’t always flow in a single direction. In fact, each step may send the process back to any previous step, and the steps can often run in parallel.

The cycle consists of the following phases:

Business Understanding: During this phase, the business objective(s) should be framed as one or more data science objectives, and there should be a clear understanding of how potential data science tasks fit within a broader system or process.

So, as we mentioned earlier, Mega Subscription Co. has a business objective to recover $100,000 of lost revenue caused by customer attrition. They might frame their data science objectives as (1) predict the probability that a given customer will leave and (2) predict the value of a given customer. Knowing this, we would score customers by their likelihood of leaving and by the potential revenue loss that would result.

Now that we’ve framed our data science objectives, we need to consider how they fit within the overall system. Once we know which customers are both likely to leave and of significant value to us, what do we do about it? Mega Subscription Co. could plan to contact the top 100 scoring customers on a monthly basis with an incentive to stay. That plan shapes the data science solution: in addition to classifying customers, it should rank them, and its target variable should be defined as the likelihood of a customer leaving in the next month.
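To make the ranking idea concrete, here is a minimal Python sketch. It assumes the two predictions are combined by multiplying them into an "expected loss"; that combination rule, the `CustomerScore` class, and the sample values are illustrative assumptions rather than anything Mega Subscription Co. prescribes.

```python
from dataclasses import dataclass


@dataclass
class CustomerScore:
    customer_id: str
    churn_probability: float  # predicted probability of leaving next month
    monthly_value: float      # predicted customer value, in dollars

    @property
    def expected_loss(self) -> float:
        # Assumption: combine the two predictions by multiplying them.
        return self.churn_probability * self.monthly_value


def top_customers_to_contact(scores, limit=100):
    """Return the `limit` customers with the highest expected revenue loss."""
    return sorted(scores, key=lambda s: s.expected_loss, reverse=True)[:limit]


# Hypothetical scores, for illustration only.
scores = [
    CustomerScore("A", churn_probability=0.90, monthly_value=20.0),
    CustomerScore("B", churn_probability=0.30, monthly_value=200.0),
    CustomerScore("C", churn_probability=0.05, monthly_value=500.0),
]

contact_list = top_customers_to_contact(scores, limit=2)
print([s.customer_id for s in contact_list])
```

Note that customer A is the most likely to leave but does not make the list, because the revenue at stake is small. Ranking on probability alone would have produced a different contact list.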

Data Understanding: Now that we’ve defined the data science objectives, do we know what data we need, and what it takes to acquire that data (required credentials, maintenance of custom ETL scripts, etc.)? During this phase, we should consider the type, quantity and quality of the data that we have on hand today, as well as the data that’s worth investing in for the future. For Mega Subscription Co., we may discover historical billing data that tells us which customers stayed versus left, as well as the dollar value of each customer. However, we may realize we don’t have data that suggests what incentive we could offer customers to convince them to stay. If that happens, we could A/B test various incentives to see which one produces a better customer response.

Data Preparation: During this phase, we transform our raw data (which can be images, text or tabular data) into small floating-point numbers that can be used by our predictive modeling algorithms. For Mega Subscription Co., this means centering (making the mean of a continuous variable 0), scaling (making the standard deviation of a continuous variable 1), one-hot encoding, discretizing, windowing or otherwise transforming the historical billing data.
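As a rough illustration of three of these transformations, the sketch below centers and scales a continuous column and one-hot encodes a categorical one using only the Python standard library. The column names and values are made up for the example; a real project would more likely use a library such as scikit-learn and would compute the mean and standard deviation on training data only.

```python
from statistics import mean, pstdev


def standardize(values):
    """Center to mean 0 and scale to standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode each value as a 0/1 vector with one slot per distinct category."""
    categories = sorted(set(values))
    encoded = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return encoded, categories


# Hypothetical billing columns, for illustration only.
monthly_bill = [10.0, 20.0, 30.0]
plan = ["basic", "premium", "basic"]

scaled_bill = standardize(monthly_bill)
encoded_plan, plan_categories = one_hot(plan)

print(scaled_bill)
print(plan_categories, encoded_plan)
```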

Modeling: Once our data is prepped, we can run experiments to see which modeling techniques work and which ones don’t. Using an experimental approach, we should pare down the number of candidate models applied over several iterations so that we arrive at the best alternative.

Evaluation: In this phase, we’ll measure the performance of our models and compare it with our performance goals to see how we did. For Mega Subscription Co., we’ll use the historical billing data to train our algorithm, then evaluate it on held-out test data and record a metric, like accuracy. To see how much better or worse our model is, we should compare our recorded metric with the same metric from a baseline model, such as one that predicts every customer will stay. This is also a good time to inspect some of the errors the model is making, since they give us clues as to how we could improve it. In addition to measuring the model on its own, we should measure our system’s performance by running it through testing scenarios to see if it could have saved some of the lost revenue.
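The baseline comparison can be sketched in a few lines of Python. The labels below are invented for illustration (1 means the customer left, 0 means the customer stayed), and the "model" predictions are simply typed in rather than produced by a trained algorithm.

```python
def accuracy(actual, predicted):
    """Fraction of predictions that match the actual outcomes."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical test data: 1 = customer left, 0 = customer stayed.
actual = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
model_predicted = [0, 0, 0, 1, 0, 0, 0, 1, 1, 0]

# Baseline model: predict that every customer will stay.
baseline_predicted = [0] * len(actual)

model_accuracy = accuracy(actual, model_predicted)
baseline_accuracy = accuracy(actual, baseline_predicted)

print(f"model: {model_accuracy:.2f}, baseline: {baseline_accuracy:.2f}")
```

Here the baseline already reaches 0.70 accuracy without identifying a single departing customer, so the model’s 0.80 is a much smaller gain than it first appears. This is one reason the comparison matters.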

Deployment: In this phase, the models produced in the preceding phases should be ready for production use. Keep in mind, there is more to deployment than simply making predictions with the model. Aside from the obvious DevOps and engineering work to scale the prediction engine, we also need to consider collecting production metrics. For Mega Subscription Co., we need to evaluate performance on ‘live’ data, since our performance on test data is just our best proxy for live results. It’s also useful to slice and dice performance across various segments. This may reveal that our model is great for long-standing customers but terrible for relatively new customers.
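One simple way to slice performance by segment is sketched below. The segment names and records are hypothetical; in production these would come from logged predictions joined with the outcomes observed later.

```python
from collections import defaultdict


def accuracy_by_segment(records):
    """Compute accuracy per segment from (segment, actual, predicted) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, actual, predicted in records:
        total[segment] += 1
        correct[segment] += int(actual == predicted)
    return {segment: correct[segment] / total[segment] for segment in total}


# Hypothetical live records: (segment, actual, predicted), 1 = left, 0 = stayed.
live_records = [
    ("long-standing", 0, 0),
    ("long-standing", 1, 1),
    ("long-standing", 0, 0),
    ("long-standing", 0, 0),
    ("new", 1, 0),
    ("new", 0, 1),
    ("new", 1, 1),
    ("new", 1, 0),
]

segment_accuracy = accuracy_by_segment(live_records)
print(segment_accuracy)
```

Overall accuracy on these records is 0.625, which hides the fact that the model is right every time for long-standing customers and only a quarter of the time for new ones.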

Why it Works

Generally, data science projects go hand-in-hand with traditional software development. However, while it’s tempting to completely integrate the two, I urge you to take a few things into consideration. For starters, data science is much more experimental in nature than software development, so you’ll spend significant time and effort experimenting with new ideas. This may seem wasteful in software development, but it is essential for a successful data science project. It’s often advantageous to delay (or completely avoid, if possible) committing to a single data science technique. Instead, data science-related components should be built around experimentation and designed for flexibility in case a better-performing alternative becomes available. This also implies that the data science components should be decoupled from traditional software components and linked via web service APIs that fulfill a ‘contract’ of what is to be sent and received. With these considerations in mind, the data science components can be interchanged (in real time, if necessary) so long as the ‘contract’ between them and the software components is met.
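The ‘contract’ idea can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the `ChurnModel` interface, the request and response fields, and the two toy models. The function `handle_score_request` stands in for a web service endpoint; a real system would expose it over HTTP.

```python
from typing import Dict, Protocol


class ChurnModel(Protocol):
    """The contract: any model must accept features and return a probability."""

    def predict_probability(self, features: Dict[str, float]) -> float: ...


class AlwaysStaysModel:
    """Toy baseline: every customer is predicted to stay."""

    def predict_probability(self, features: Dict[str, float]) -> float:
        return 0.0


class TenureRuleModel:
    """Toy alternative: newer customers are assumed more likely to leave."""

    def predict_probability(self, features: Dict[str, float]) -> float:
        return 0.8 if features["tenure_months"] < 6 else 0.1


def handle_score_request(model: ChurnModel, request: dict) -> dict:
    """Stand-in for a web service endpoint; callers see only this shape."""
    probability = model.predict_probability(request["features"])
    return {"customer_id": request["customer_id"], "churn_probability": probability}


request = {"customer_id": "A", "features": {"tenure_months": 3.0}}

# Swapping the model changes the score but not the shape of the response.
response_v1 = handle_score_request(AlwaysStaysModel(), request)
response_v2 = handle_score_request(TenureRuleModel(), request)
print(response_v1, response_v2)
```

Because the calling software depends only on the request and response shapes, the model behind the endpoint can be replaced without touching that software.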

How We Use it

At AgileThought, we blend the ideas of CRISP-DM with those of agile. Our data science team members participate in the usual Scrum ceremonies (grooming, refinement, sprint planning, etc.). While the original user story may look familiar, the backlog items for data science end up looking significantly different from those of traditional software. For example, a user story for customer attrition might be “As a customer service representative, I want a monthly list of customers to contact with an incentive.” In the Mega Subscription Co. example, we narrowed the data science objectives down to predicting likelihood of leaving and predicting customer value; each of these would need to be further broken down into testable experiments. For example, we may refine ‘predicting likelihood’ to ‘predict likelihood using logistic regression, random forest, or gradient-boosted tree algorithms, with a target F1 score of 0.75.’ To mark this story as done, the data scientist must evaluate these algorithms and record which one produces the best performance score, in this case F1. Then, the result is binarized to one of two outcomes: success or failure. Assuming the best F1 from the experiment was 0.56 using logistic regression, which falls short of the 0.75 target, we may decide to prioritize a story during the next Sprint to work on feature engineering so that we can improve our performance.
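The experiment described above can be sketched as follows. The confusion counts are hypothetical and were chosen so that logistic regression reproduces the 0.56 from the example; the numbers for the other two algorithms are invented purely for illustration.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 is the harmonic mean of precision and recall."""
    denominator = 2 * true_positives + false_positives + false_negatives
    if denominator == 0:
        return 0.0
    return 2 * true_positives / denominator


TARGET_F1 = 0.75  # the target written into the backlog item

# Hypothetical (true positives, false positives, false negatives) per algorithm.
experiments = {
    "logistic regression": (28, 22, 22),
    "random forest": (25, 25, 25),
    "gradient-boosted trees": (26, 24, 24),
}

results = {name: f1_score(*counts) for name, counts in experiments.items()}
best_algorithm = max(results, key=results.get)
best_f1 = results[best_algorithm]

# Binarize the experiment's result against the target.
outcome = "success" if best_f1 >= TARGET_F1 else "failure"
print(best_algorithm, round(best_f1, 2), outcome)
```

A "failure" here does not mean the work was wasted. It records that none of the three algorithms met the target with the current features, which is what justifies the follow-up feature engineering story.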

Want to learn how you can effectively apply the CRISP-DM methodology to your projects? Contact us to learn how our data scientists can help.

Interested in seeing a real-life example? Read our Predictive Analytics Discovery client story with Valpak.
