Computing for Data Sciences

Welcome to the 2017 edition of the course

Post Graduate Diploma in Business Analytics (PGDBA) - jointly offered by IIM Calcutta, IIT Kharagpur, and ISI Kolkata - aims to help shape the emerging profession of Business Analytics by delivering a cutting edge inter-disciplinary educational experience to its graduates.

Computing for Data Sciences (CDS), aka BAISI-4, is one of the five courses offered at ISI Kolkata during the First Semester of the PGDBA program. The Fall 2017 edition of the course -- CDS 2017 -- is taught by Sourav Sen Gupta from R C Bose Centre, Indian Statistical Institute, Kolkata.


Reach Sourav   :     sg.sourav@gmail.com   |     +91 94323 44852   |     Room 404, Deshmukh Building

  Information

Course : BAISI-4 (aka CDS)

Wednesday & Friday   @   11:00 - 13:00

Assignments = 30%     (mixed form)
Mid-Sem Eval. = 20%     (hackathon)
End-Sem Eval. = 50%     (incl. project)


Format for evaluations to be decided.

  Lectures

We have time for about 26 two-hour lectures during Fall 2017 -- that's 52 hours in total! We will try to distribute this time carefully between Classroom Lectures (about 36-40 hours), Invited Talks (about 8 hours), and Interactive Sessions (about 4-8 hours) -- as required for the course. The schedule of the Classroom Lectures, and all relevant references and resources are posted as follows.

To prevent unwanted execution, the Python files (sometimes other code files too) linked below will be downloaded as *.py.txt. Please remove the .txt extension, and use the files normally.

Lec Date Topic Reference Resources
01 26-07-2017 Introduction to Computing   Demaine and Devadas : Lec. 1   Binary Search
02 28-07-2017 Introduction to Complexity   Cormen et al. (Algo) : Ch. 1, 2, 3   Asymptotic Notation
28-07-2017 Introduction to Python   Python Class by Google   pythonIntro.py
03 02-08-2017 Introduction to Convergence   Demaine and Devadas : Lec. 11, 12   Newton's Method
04 04-08-2017 Finding Roots and Minima   Solomon (Numerical) : Ch. 8.1   Video 1   |   Video 2
05 09-08-2017 Linear Regression and Optimization   Andrew Ng : Lec. 2   Gradient Descent
06 10-08-2017 Linear Regression and Estimation   James et al. (ISLR) : Ch. 3.1
  Hastie and Tibshirani : Ch. 3 Lectures
  IntroLinReg.R
  Advertising.csv
07 16-08-2017 Linear Regression Fundamentals   James et al. (ISLR) : Ch. 3.1, 3.2
  Hastie and Tibshirani : Ch. 3 Lectures
  LinRegBasics.R
  Advertising.csv
08 17-08-2017 Linear Regression and Prediction   James et al. (ISLR) : Ch. 3.1, 3.2
  Hastie and Tibshirani : Ch. 3 Lectures
  LinRegPred.R
  The Intervals
09 23-08-2017 Linear Regression and Projection   Gilbert Strang : Lec. 10
  Andrew Ng : Lec. 2
  The Sub-Spaces
  Normal Equation
10 25-08-2017 Bias-Variance and Cross-Validation   James et al. (ISLR) : Ch. 5.1
  Hastie and Tibshirani : Ch. 5 Lectures
  CrossVal.R
  Bias-Variance
11 30-08-2017 Linear Regression Laboratory   James et al. (ISLR) : Ch. 3.6
  Hastie and Tibshirani : Ch. 3 Lectures
  LinRegLab.R
  Python Starter
12 01-09-2017 Model Selection and Regularization   James et al. (ISLR) : Ch. 6.1, 6.2
  Hastie and Tibshirani : Ch. 6 Lectures
  ModelSelect.R
  Regression by UW
13 06-09-2017 Classification and Logistic Regression   James et al. (ISLR) : Ch. 4.1 to 4.3
  Hastie and Tibshirani : Ch. 4 Lectures
  LogRegBasics.R
  Maximum Likelihood
19-09-2017 Cricket Analytics : Dr. Srinivas Bhogle   Slides on T20   |     Article on DL   Thesis on DL in T20
20-09-2017 Mid-Sem Examination Open-Resources Data Analytics Hackathon Note : Group Test
14 22-09-2017 Logistic Regression and Prediction   James et al. (ISLR) : Ch. 4.4, 4.5
  Hastie and Tibshirani : Ch. 4 Lectures
  LogRegBasics.R
  Intro to ROC Analysis
15 11-10-2017 Partition, Information, and Tree Models   James et al. (ISLR) : Ch. 8.1
  Hastie and Tibshirani : Ch. 8 Lectures
  Classification by UW : Week 3 Lectures
  DecisionTree.R
  Advertising.csv
  CarEvaluation.csv
16 13-10-2017 Bootstrap, Bagging and Random Forest   James et al. (ISLR) : Ch. 8.2
  Hastie and Tibshirani : Ch. 8 Lectures
  Classification by UW : Week 3 Lectures
  RandomForest.R
  Advertising.csv
  CarEvaluation.csv
17 17-10-2017 Tuning and Gradient Boosted Models   James et al. (ISLR) : Ch. 8.2
  Hastie and Tibshirani : Ch. 8 Lectures
  Classification by UW : Week 3 Lectures
  TuningBoosting.R
  Boosted Trees
  Boosting Visualized
18 18-10-2017 Hard & Soft Margins for Classification   James et al. (ISLR) : Ch. 9.1, 9.2
  Hastie and Tibshirani : Ch. 9 Lectures
  SupportVectors.R
  Margin Formulation
19 20-10-2017 Support Vector Machines and Kernels   James et al. (ISLR) : Ch. 9.3, 9.4
  Hastie and Tibshirani : Ch. 9 Lectures
  SupVecMachine.R
  SVMachineLab.R
20 25-10-2017 Unsupervised Learning and Clustering   James et al. (ISLR) : Ch. 10.3
  Hastie and Tibshirani : Ch. 10 Lectures
  Clustering.R
  k-Means Visualized
21 26-10-2017 Distances and Hierarchical Clustering   James et al. (ISLR) : Ch. 10.3
  Hastie and Tibshirani : Ch. 10 Lectures
  Clustering.R
  Hierarchical Clustering
22 27-10-2017 Unsupervised Learning and Embedding   James et al. (ISLR) : Ch. 10.2
  Hastie and Tibshirani : Ch. 10 Lectures
  Tutorial on PCA by Jonathon Shlens
  DimReduction.R
  joker.jpg
  t-SNE Resources
31-10-2017 Survival Analysis : Dr. Biswabrata Pradhan   Slides on Survival Analysis   Springate (2014)
01-11-2017 Recommendation Systems : Arkosnato Neogy   Koren Paper   |     Hinton Paper   Pointers to Resources
01-11-2017 Time Series Analysis : Dr. Diganta Mukherjee   Slides on Time Series Analysis   Shumway and Stoffer
03-11-2017 Survival Analysis : Dr. Biswabrata Pradhan   Slides on Survival Analysis   Springate (2014)
23 07-11-2017 Matrix Factorization and Recommendation   Leskovec et al. (MMDS) : Ch. 9.1, 9.4, 9.5
  Recommender Systems (Coursera)
  Recommender.R
  jesterRatings.csv
  Paper by Koren et al.
08-11-2017 Expectation-Maximization : Dr. Anil Ghosh   Theory and Use of the EM Algorithm   Mixture Models
08-11-2017 Time Series Analysis : Dr. Diganta Mukherjee   Slides on Time Series Analysis   Shumway and Stoffer
13-11-2017 Probability Distributions : Dr. Arnab Chakraborty   Slides on Probability Distributions   Seeing Theory
24 13-11-2017 EM-Clust, PageRank, Scraping, and Monte-Carlo   EM Algorithm for Gaussian Mixtures
  Leskovec et al. (MMDS) : Ch. 5.1, 5.2
  Tutorial on Data Scraping with rvest
  Tricks with Monte-Carlo Simulations
  mclust Tutorial
  PageRank.R
  DataScraping.R
  mcSimulation.R
29-11-2017 End-Sem Examination Closed-Book Standard Written Test Note : Individual Test
01-12-2017 Project Presentations Schedule to be Finalized and Posted Note : Group Talks
02-12-2017 Project Presentations Schedule to be Finalized and Posted Note : Group Talks

  Books and Compilations

Textbooks and Recommended References

There is no single textbook for CDS. The following books and compiled lecture notes treat, quite comprehensively, the topics that CDS broadly tries to cover -- highly recommended.

  • Chambers

    Software for Data Analysis: Programming with R

    John Chambers Springer (Second Printing), 2009
  • CormenEtal

    Introduction to Algorithms

    Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein The MIT Press (Third Edition), 2009
  • Solomon

    Numerical Algorithms

    Justin Solomon Available online at this link
  • ISL

    An Introduction to Statistical Learning

    Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani Available online at this link
  • ESL

    The Elements of Statistical Learning

    Trevor Hastie, Robert Tibshirani and Jerome Friedman Available online at this link
  • MMD

    Mining of Massive Datasets

    Jure Leskovec, Anand Rajaraman and Jeff Ullman Available online at this link

The books listed above are the basic references for CDS. During the course, we will refer to these books from time-to-time, as and when required, but we may not see through any of these books cover-to-cover.

  Lecture Videos and Notes

Adds a completely new dimension to Reading

There are several lecture notes and video lectures that perfectly complement the material that CDS plans to cover. The students are encouraged to follow these amazing resources.


The courses and video lectures listed above are closely related to the topics covered in CDS. During the course, we will refer to these lectures from time-to-time, as and when required, but we may not need to go through any of these courses end-to-end. In addition to the above, please feel free to explore Coursera.

  R Programming Resources

The main tool for computing used during CDS is R. The following resources may help you in getting yourself acquainted with the basics of R. Get comfortable, and get your hands dirty!

  Python Programming Resources

The most versatile language for Scientific Computation is Python. The following resources may help you in getting yourself acquainted with the basics of Python. Get used to it!

  Assignments

Worth 30% of the total grade

Assignments will be posted on a regular basis. Each group (sometimes individual) has to submit 5 assignments during the course of the semester, out of which the 0-th one will not be graded. All assignments are worth equal, that is, 7.5% of the total grade, each.

Assignment Posted Deadline Remarks
  Assignment 0 28-07-2017 09-08-2017 Not graded
  Assignment 1 07-08-2017 15-08-2017 Individual
  Assignment 2
  wineTrain.csv
31-08-2017 18-09-2017 Group
  Assignment 3
  smsdata.txt
20-10-2017 30-10-2017 Group
  Assignment 4
  cuisine.json
05-11-2017 15-11-2017 Group

  Projects

Worth 30% of the total grade

About 30% weightage (out of 50%) will be reserved in the End-of-Semeseter evaluation for the Term Project. Each group is supposed to deliver a Project Presentation (30 mins per group), including a Q&A session (5 mins per group), and a Project Report (theory/code).

Each group is at the liberty of choosing the topic for their Term Project. However, the term project chosen by each group must be approved by Sourav before they may be executed. The last date for finalizing the topics for the term projects is Thursday, 7 September 2017.


One may take part in a competition on Kaggle, DrivenData, CrowdAnalytix, InnoCentive, etc, or explore the open datasets made available by UNICEF, WHO, AWS, Google, EU, and US Govt.

Term Projects may also be inspired by earlier Stanford CS229 projects available at 2016, 2016 Spring, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, or by the diverse list of Business Case Studies posted by the Sloan School, MIT.

Potential topics for Term Projects may also include substantial extension (upon mutual discussion with Sourav) of previous years' projects available at CDS 2016, CDS 2015.