Amchi Mumbai Vs Dilwalon ki Delhi
Whose Dream will come true?

Since its beginning in 2008, the IPL has attracted viewers from all around the world. Immense uncertainty, last-moment nail-biters, and Super Overs have compelled fans to watch the matches. In a very short time, the IPL has become the highest revenue-generating cricket league.

Analytics has been a part of the sports industry for a long time. Data scientists accompany the team to give direct inputs to the coaches in specific departments, decipher the opposition bowlers' and batsmen's weaknesses, and build match-winning strategies against key players.

During the matches, you must also have come across several instances where a particular team's chance of winning is predicted through real-time online polls.

In this blog, we will be:

  1. Predicting the score for the team batting first
  2. Predicting the winner of the finals: right after the toss, after completion of the first innings, and after completion of the powerplay in the second innings.


The data science cycle involves a series of steps that ultimately lead to the 'perfect' insight that a data analyst strives for. So, let's decipher each and every step and see how it has been practically applied in analyzing and visualizing our beloved IPL's data set.

1. Business understanding:

Did you expect a direct jump into data analysis and stuff? Well, hold your horses; it is not as easy and quick as it seems.

Before you dive straight into the numbers and start analyzing them, knowing and understanding the business/project objective is of utmost importance. It is critical to comprehend the problem that you are trying to solve. But the question is, how do you really go about it? Broadly, we typically use data science to answer five types of questions:

  • How much or how many? (regression)
  • Which category? (classification)
  • Which group? (clustering)
  • Is this weird? (anomaly detection)
  • Which option should be taken? (recommendation)

The variables to be predicted should be identified correctly, and then the focus should shift to developing a detailed problem statement and analyzing its effect on the targeted client/customer.

In our case, the objective is to analyze the past 12 years of IPL data in order to attain some insights for the current season. Yes, you read that right: we will be using the past 12 years of IPL data to draw inferences about season 13.

We will also analyze the data to determine the highest run-scorer, the highest wicket-taker, the team scoring the most boundaries at a particular stadium, and the probability of a team winning a match if it wins the toss.
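As a taste of the kind of insight we are after, the toss-versus-win probability can be computed in a few lines of pandas. The column names and rows below are toy stand-ins, not the real match data:

```python
import pandas as pd

# Toy match table (hypothetical column names; real IPL match data
# carries similar fields such as toss_winner and winner).
matches = pd.DataFrame({
    "toss_winner": ["MI", "CSK", "MI", "DC", "CSK"],
    "winner":      ["MI", "MI",  "MI", "DC", "CSK"],
})

# Fraction of matches in which the toss winner also won the match
p = (matches["toss_winner"] == matches["winner"]).mean()
print(f"P(win | won toss) = {p:.2f}")  # → P(win | won toss) = 0.80
```

On the real data set the same one-liner runs over all 12 seasons of matches.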

2. Data mining:

Just as typical mining involves digging the necessary materials out of a source, data mining works along similar lines.

After setting up the detailed problem statement, it is time to gather data from various viable sources. But in this day and age of 'information overload', how do you find the right data? That too is a mammoth task, isn't it? But don't worry, we have your back. The following questions need to be answered while collecting data:

  • What data do I need for my project?
  • Where does it live?
  • How can I obtain it?
  • What is the most efficient way to store and access all of it?

Ask yourself these questions and mine what you need!

With respect to the data set we had, we used tools like pivot tables to determine the part of the data that would be of use in arriving at a solution to our problem statement. Apart from this, the technique of aggregation was used to collate data points pertaining to the same category. This step really helps to get an overall idea of the data set.
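The same pivot-table style aggregation Excel provides can be sketched in pandas; the column names and numbers below are made up purely for illustration:

```python
import pandas as pd

# Hypothetical ball-by-ball slice; the real data set has many more columns.
deliveries = pd.DataFrame({
    "batting_team": ["MI", "MI", "CSK", "CSK", "MI"],
    "batsman":      ["Rohit", "Rohit", "Dhoni", "Dhoni", "Pollard"],
    "runs":         [4, 6, 1, 4, 2],
})

# Pivot-table aggregation: total runs per team and batsman
pivot = deliveries.pivot_table(index="batting_team", columns="batsman",
                               values="runs", aggfunc="sum", fill_value=0)
print(pivot)
```

Swapping `aggfunc` for `"count"` or `"mean"` gives the other summaries a spreadsheet pivot table would.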

3. Data cleaning:

If you were thinking that mining out the data is the end of the game, well, it is not; more is yet to come. In fact, now comes the most time-consuming step of all: data cleaning. This process (also referred to as 'data janitor work') can often take 50 to 80 percent of a data scientist's time.

Is it safe to say that the data science cycle is as obsessed with cleanliness as our mothers are? Anyone? No? Let us explain.

It's because data cleaning is a very important step. Data in its raw form cannot be used directly; it is riddled with problems like missing values and outliers, which totally skew the judgment when included in decision making.

In our analysis of the IPL dataset, mean imputation has been used to fill in the missing values. When a particular value is missing from a column, the mean of the rest of the values in that column is computed and used as a filler for the vacant position.
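A minimal sketch of mean imputation, using toy numbers rather than the real data set:

```python
import pandas as pd

# One column with a missing value; fill it with the column mean.
scores = pd.Series([180, None, 160, 200])
filled = scores.fillna(scores.mean())  # mean of 180, 160, 200 is 180
print(filled.tolist())  # → [180.0, 180.0, 160.0, 200.0]
```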

Further, to remove outliers, the interquartile range (IQR = Q3 - Q1) is multiplied by 1.5, and all values above Q3 + 1.5 * IQR or below Q1 - 1.5 * IQR are removed from the data, as they are fit to be called outliers.
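The IQR rule can be sketched like this; the totals are illustrative, and on the real data the thresholds come out of our own columns:

```python
import pandas as pd

totals = pd.Series([150, 160, 155, 165, 158, 320])  # 320 is an obvious outlier

q1, q3 = totals.quantile(0.25), totals.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values inside the [Q1 - 1.5*IQR, Q3 + 1.5*IQR] fence
clean = totals[(totals >= lower) & (totals <= upper)]
print(clean.tolist())  # → [150, 160, 155, 165, 158]
```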

4. Data exploration and feature engineering:

Now, once the data is clean, it is ready for use. This is the stage at which the analysis of data starts. The patterns, biases, and relations between the variables are deciphered and understood. It can involve the creation of graphs, studying the outliers, and much more. The hypotheses of the problem statement are established and are rejected or not rejected based on our data exploration.

The data by now is totally fit for usage. It is complete as it has no missing values and is capable of providing right and meaningful insights.

Here, we separate the required data from the pool of cleaned data so that its analysis can be carried out and the required insights can be obtained; this step is called data extraction.

Data Extraction in Excel

When there is a lot of data, we can use Find (Ctrl+F) to locate all the cells that contain a certain value, but the results end up scattered all over the place, and manually finding each column can be very tedious. If you want to extract all the finds to a target worksheet, preserving value and column format, macros are the answer.

Let’s say you have some data in a range like this.

We can write macros to get data from the rows. If you need more background on macros, plenty of good reference material is available.

By creating macros in Excel using VBA, data extraction tasks can be automated. In order to tabulate the performance of a team, information was derived from the ball-by-ball data for each innings in a match.

We divided an innings into phases: powerplay (overs 1–6), middle (overs 7–15), and death (overs 16–20). Apart from this, the runs scored till the fall of the 2nd wicket, the number of extras, and the number of boundaries scored were also calculated for each innings. Using a loop over all the IPL matches of the last 12 seasons, every innings' data was compiled.
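The per-phase compilation the macros perform can be sketched in pandas, assuming hypothetical `over`/`runs`/`wicket` columns in the ball-by-ball data:

```python
import pandas as pd

# Hypothetical ball-by-ball data for one innings (overs numbered 1-20).
balls = pd.DataFrame({
    "over":   [1, 3, 6, 8, 12, 15, 17, 20],
    "runs":   [6, 4, 1, 2, 4,  6,  1,  4],
    "wicket": [0, 0, 1, 0, 1,  0,  0,  1],
})

# Phase boundaries used in the post: powerplay 1-6, middle 7-15, death 16-20
phases = {"powerplay": (1, 6), "middle": (7, 15), "death": (16, 20)}
summary = {}
for name, (lo, hi) in phases.items():
    phase = balls[balls["over"].between(lo, hi)]
    summary[name] = (int(phase["runs"].sum()), int(phase["wicket"].sum()))
print(summary)  # → {'powerplay': (11, 1), 'middle': (12, 1), 'death': (5, 1)}
```

Running the same loop per innings, per match, over 12 seasons yields the compiled table the macros produce.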

Macros Used:

1) Powerplay

The total number of runs scored and the number of wickets lost, during the overs 1–6 in an inning.

2) Middle overs (7–15)

The total number of runs scored and the number of wickets lost, during the overs 7–15 in an inning.

3) Death overs (16–20)

The total number of runs scored and the number of wickets lost, during the overs 16–20 in an inning.

4) Runs scored till the fall of 2nd wicket

The total number of runs scored till the fall of 2nd wicket in an inning.

5) Extras

The total number of runs accumulated due to extras in an inning.

6) Boundaries

The total number of boundaries (fours and sixes) scored in an inning.


For instance, the macro below is used to find the total runs scored till the 2nd wicket falls.

Here the column runs_top3 denotes the result of the macro.
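The macro itself appears as a screenshot in the original post; as a rough equivalent, the same `runs_top3` logic can be sketched in pandas (hypothetical column names):

```python
import pandas as pd

# Hypothetical ball-by-ball data for one innings
balls = pd.DataFrame({
    "runs":   [4, 1, 0, 6, 2, 4, 1],
    "wicket": [0, 0, 1, 0, 1, 0, 0],
})

# Wickets fallen before each delivery; keep every delivery bowled
# up to and including the one on which the 2nd wicket fell.
fallen_before = balls["wicket"].cumsum().shift(fill_value=0)
runs_top3 = int(balls.loc[fallen_before < 2, "runs"].sum())
print(runs_top3)  # → 13
```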

Similarly, we can find the number of runs and wickets in the powerplay using the macro below.

Now let's go a step ahead…

Data analytics is not only about studying the existing data; there is more to it than that.

5. Predictive modeling:

As the name suggests, predictive modeling attempts to answer the question, "What might possibly happen in the future?", using a mathematical process. This is the stage where machine learning finally comes into the frame. It is basically the process of using already known results to create, train, and validate a model that can be used to forecast future outcomes.

Models for IPL Prediction

  • Predicting the score for the team batting first

Artificial neural networks (ANNs) predict an output value as a function of the input parameters. To predict the score of the team batting first, we have used an ANN regression model, which takes input parameters such as the chasing team, the team batting first, the venue, wickets in the powerplay, and runs in the powerplay. The output variable is the total number of runs scored. We are currently getting a mean absolute error of roughly 22, which means the predicted score has an error margin of about ±22 runs.
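Our actual model was trained on the encoded IPL data; as a minimal stand-in, the same shape of pipeline with sklearn's `MLPRegressor` on synthetic features looks like this (all numbers are toy values, not our real results):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-ins: 5 random features play the role of the encoded
# inputs (teams, venue, powerplay runs and wickets).
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = 120 + 60 * X[:, 3] + 10 * rng.standard_normal(200)  # toy first-innings total

model = MLPRegressor(hidden_layer_sizes=(16,), learning_rate_init=0.01,
                     max_iter=2000, random_state=0)
model.fit(X[:150], y[:150])          # train on the first 150 "matches"
mae = mean_absolute_error(y[150:], model.predict(X[150:]))
print(f"mean absolute error: {mae:.1f}")
```

On the real data the held-out MAE is what gives the ±22-run figure quoted above.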

  • Predicting the Match winner after the toss, first innings and powerplay of second innings

Here we use classification models to predict the winning team. The input variables are the team batting first, the chasing team, the venue, the team winning the toss, runs scored by the batting team in the powerplay, wickets in the powerplay, total runs, and total wickets. The output variable is the winning team. We tried several classifiers: logistic regression from sklearn, a decision tree classifier, an SVM, and a random forest classifier, and we picked the best model based on accuracy.
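The model-selection loop can be sketched with sklearn, again on synthetic stand-in data rather than the real encoded match features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 8 random features play the role of the encoded
# teams, venue, toss winner, and powerplay/total runs and wickets.
rng = np.random.default_rng(0)
X = rng.random((300, 8))
y = (X[:, 0] + X[:, 4] > 1.0).astype(int)  # toy "winning team" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree":   DecisionTreeClassifier(random_state=0),
    "svm":    SVC(),
    "rf":     RandomForestClassifier(random_state=0),
}
# Fit each model and score it on the held-out split
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Running the same comparison three times, with the features available after the toss, after the first innings, and after the second-innings powerplay, gives the three accuracies listed below.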

Best accuracies for each of the three models:

  1. Prediction after the toss: Accuracy of the Random Forest Classifier 0.6364
  2. Prediction after 1st innings: Accuracy of the Logistic Regression Classifier: 0.9000
  3. Prediction after powerplay of 2nd innings: Accuracy of the Logistic Regression Classifier: 0.9000

6. Data visualization:

Phew. Finally comes the most interesting part, the part where huge chunks of data are converted into easily comprehensible pieces of information, just for people like you and me.

But is it as simple as it looks? Perhaps not. Data visualization is a tricky field, because it involves not only statistics and mathematics but also psychology, communication, and art. Visual elements such as maps, charts, and graphs make it easy for us to understand trends, outliers, and patterns in data.

Here we used the data visualization software Tableau, which makes it possible to depict all the insights drawn in the form of graphs. This makes the insights easier to understand and visually appealing as well. Here is a snippet of one of the graphs we created using Tableau.
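We built our charts in Tableau, but a similar chart can be reproduced quickly in Python with matplotlib; the team names and numbers below are illustrative stand-ins, not output from our analysis:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Illustrative numbers only
teams = ["MI", "CSK", "KKR", "SRH"]
titles = [5, 3, 2, 1]

fig, ax = plt.subplots()
ax.bar(teams, titles, color="steelblue")
ax.set_xlabel("Team")
ax.set_ylabel("IPL titles")
ax.set_title("IPL titles by team (illustrative)")
fig.savefig("titles.png")
```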

Summing it up

Analytics is a huge part of sports today; to see how far its applications go, do check out the sports-analytics documentaries available on Netflix and Amazon Prime Video. If you want to dive deep into the backend code, do get in touch with us. Follow our social media handles to get real-time predictions of the match. Stay tuned for more interesting blogs from us in the future! :)

#DataAnalytics #DataScience #Prediction #MachineLearning #Cricket #IPL2020 #Dream11




The official data science society of Goa Institute of Management

Parigyan - The Data Science Society of GIM

