Computing for Data Sciences
Welcome to the 2017 edition of the course
Computing for Data Sciences (CDS), aka BAISI4, is one of the five courses offered at ISI Kolkata during the First Semester of the PGDBA program. The Fall 2017 edition of the course  CDS 2017  is taught by Sourav Sen Gupta from R C Bose Centre, Indian Statistical Institute, Kolkata.
Information
Course : BAISI4 (aka CDS)
Wednesday & Friday @ 11:00  13:00
Assignments = 30% (mixed form)
MidSem Eval. = 20% (hackathon)
EndSem Eval. = 50% (incl. project)
Lectures
We have time for about 26 twohour lectures during Fall 2017  that's 52 hours in total! We will try to distribute this time carefully between Classroom Lectures (about 3640 hours), Invited Talks (about 8 hours), and Interactive Sessions (about 48 hours)  as required for the course. The schedule of the Classroom Lectures, and all relevant references and resources are posted as follows.
To prevent unwanted execution, the Python files (sometimes other code files too) linked below will be downloaded as
Lec  Date  Topic  Reference  Resources 

01  26072017  Introduction to Computing  Demaine and Devadas : Lec. 1  Binary Search 
02  28072017  Introduction to Complexity  Cormen et al. (Algo) : Ch. 1, 2, 3  Asymptotic Notation 
28072017  Introduction to Python  Python Class by Google  pythonIntro.py  
03  02082017  Introduction to Convergence  Demaine and Devadas : Lec. 11, 12  Newton's Method 
04  04082017  Finding Roots and Minima  Solomon (Numerical) : Ch. 8.1  Video 1  Video 2 
05  09082017  Linear Regression and Optimization  Andrew Ng : Lec. 2  Gradient Descent 
06  10082017  Linear Regression and Estimation  James et al. (ISLR) : Ch. 3.1
Hastie and Tibshirani : Ch. 3 Lectures 
IntroLinReg.R
Advertising.csv 
07  16082017  Linear Regression Fundamentals  James et al. (ISLR) : Ch. 3.1, 3.2
Hastie and Tibshirani : Ch. 3 Lectures 
LinRegBasics.R
Advertising.csv 
08  17082017  Linear Regression and Prediction  James et al. (ISLR) : Ch. 3.1, 3.2
Hastie and Tibshirani : Ch. 3 Lectures 
LinRegPred.R
The Intervals 
09  23082017  Linear Regression and Projection 
Gilbert Strang : Lec. 10
Andrew Ng : Lec. 2 
The SubSpaces
Normal Equation 
10  25082017  BiasVariance and CrossValidation  James et al. (ISLR) : Ch. 5.1
Hastie and Tibshirani : Ch. 5 Lectures 
CrossVal.R
BiasVariance 
11  30082017  Linear Regression Laboratory  James et al. (ISLR) : Ch. 3.6
Hastie and Tibshirani : Ch. 3 Lectures 
LinRegLab.R
Python Starter 
12  01092017  Model Selection and Regularization  James et al. (ISLR) : Ch. 6.1, 6.2
Hastie and Tibshirani : Ch. 6 Lectures 
ModelSelect.R
Regression by UW 
13  06092017  Classification and Logistic Regression  James et al. (ISLR) : Ch. 4.1 to 4.3
Hastie and Tibshirani : Ch. 4 Lectures 
LogRegBasics.R
Maximum Likelihood 
19092017  Cricket Analytics : Dr. Srinivas Bhogle  Slides on T20  Article on DL  Thesis on DL in T20  
20092017  MidSem Examination  OpenResources Data Analytics Hackathon  Note : Group Test  
14  22092017  Logistic Regression and Prediction  James et al. (ISLR) : Ch. 4.4, 4.5
Hastie and Tibshirani : Ch. 4 Lectures 
LogRegBasics.R
Intro to ROC Analysis 
15  11102017  Partition, Information, and Tree Models  James et al. (ISLR) : Ch. 8.1
Hastie and Tibshirani : Ch. 8 Lectures Classification by UW : Week 3 Lectures 
DecisionTree.R
Advertising.csv CarEvaluation.csv 
16  13102017  Bootstrap, Bagging and Random Forest  James et al. (ISLR) : Ch. 8.2
Hastie and Tibshirani : Ch. 8 Lectures Classification by UW : Week 3 Lectures 
RandomForest.R
Advertising.csv CarEvaluation.csv 
17  17102017  Tuning and Gradient Boosted Models  James et al. (ISLR) : Ch. 8.2
Hastie and Tibshirani : Ch. 8 Lectures Classification by UW : Week 3 Lectures 
TuningBoosting.R
Boosted Trees Boosting Visualized 
18  18102017  Hard & Soft Margins for Classification  James et al. (ISLR) : Ch. 9.1, 9.2
Hastie and Tibshirani : Ch. 9 Lectures 
SupportVectors.R
Margin Formulation 
19  20102017  Support Vector Machines and Kernels  James et al. (ISLR) : Ch. 9.3, 9.4
Hastie and Tibshirani : Ch. 9 Lectures 
SupVecMachine.R
SVMachineLab.R 
20  25102017  Unsupervised Learning and Clustering  James et al. (ISLR) : Ch. 10.3
Hastie and Tibshirani : Ch. 10 Lectures 
Clustering.R
kMeans Visualized 
21  26102017  Distances and Hierarchical Clustering  James et al. (ISLR) : Ch. 10.3
Hastie and Tibshirani : Ch. 10 Lectures 
Clustering.R
Hierarchical Clustering 
22  27102017  Unsupervised Learning and Embedding  James et al. (ISLR) : Ch. 10.2
Hastie and Tibshirani : Ch. 10 Lectures Tutorial on PCA by Jonathon Shlens 
DimReduction.R
joker.jpg tSNE Resources 
31102017  Survival Analysis : Dr. Biswabrata Pradhan  Slides on Survival Analysis  Springate (2014)  
01112017  Recommendation Systems : Arkosnato Neogy  Koren Paper  Hinton Paper  Pointers to Resources  
01112017  Time Series Analysis : Dr. Diganta Mukherjee  Slides on Time Series Analysis  Shumway and Stoffer  
03112017  Survival Analysis : Dr. Biswabrata Pradhan  Slides on Survival Analysis  Springate (2014)  
23  07112017  Matrix Factorization and Recommendation  Leskovec et al. (MMDS) : Ch. 9.1, 9.4, 9.5
Recommender Systems (Coursera) 
Recommender.R
jesterRatings.csv Paper by Koren et al. 
08112017  ExpectationMaximization : Dr. Anil Ghosh  Theory and Use of the EM Algorithm  Mixture Models  
08112017  Time Series Analysis : Dr. Diganta Mukherjee  Slides on Time Series Analysis  Shumway and Stoffer  
13112017  Probability Distributions : Dr. Arnab Chakraborty  Slides on Probability Distributions  Seeing Theory  
24  13112017  EMClust, PageRank, Scraping, and MonteCarlo  EM Algorithm for Gaussian Mixtures
Leskovec et al. (MMDS) : Ch. 5.1, 5.2 Tutorial on Data Scraping with rvest Tricks with MonteCarlo Simulations 
mclust Tutorial
PageRank.R DataScraping.R mcSimulation.R 
29112017  EndSem Examination  ClosedBook Standard Written Test  Note : Individual Test  
01122017  Project Presentations  Schedule decided by Project Groups  Note : Group Talks  
02122017  Project Presentations  Schedule decided by Project Groups  Note : Group Talks 
Books and Compilations
Textbooks and Recommended References
There is no single textbook for CDS. The following books and compiled lecture notes treat, quite comprehensively, the topics that CDS broadly tries to cover  highly recommended.

Software for Data Analysis: Programming with R
John Chambers Springer (Second Printing), 2009 
Introduction to Algorithms
Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein The MIT Press (Third Edition), 2009 
Numerical Algorithms
Justin Solomon Available online at this link 
An Introduction to Statistical Learning
Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani Available online at this link 
The Elements of Statistical Learning
Trevor Hastie, Robert Tibshirani and Jerome Friedman Available online at this link 
Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman and Jeff Ullman Available online at this link
The books listed above are the basic references for CDS. During the course, we will refer to these books from timetotime, as and when required, but we may not see through any of these books covertocover.
Lecture Videos and Notes
Adds a completely new dimension to Reading
There are several lecture notes and video lectures that perfectly complement the material that CDS plans to cover. The students are encouraged to follow these amazing resources.

Algorithms  Demaine and Devadas
Lecture videos available online at this link 
Numerical Algorithms  Justin Solomon
Lecture videos available online at this link 
Linear Algebra  Gilbert Strang
Lecture videos available online at this link 
Statistical Learning  Hastie and Tibshirani
Lecture videos available online at this link 
Machine Learning  Univ. of Washington
Coursera course available online at this link 
Machine Learning  Andrew Ng
Lecture videos available online at this link 
Machine Learning  Nando de Freitas
Lecture videos available online at this link
The courses and video lectures listed above are closely related to the topics covered in CDS. During the course, we will refer to these lectures from timetotime, as and when required, but we may not need to go through any of these courses endtoend. In addition to the above, please feel free to explore Coursera.
R Programming Resources
The main tool for computing used during CDS is R. The following resources may help you in getting yourself acquainted with the basics of R. Get comfortable, and get your hands dirty!
 Introduction to R by DataCamp  Quite nice introduction, and even better, free!
 R by Example  By Ajay Shah  Quick illustrative examples of features.

Advanced topics in R and Understanding Memory in R  By Hadley Wickham
Really good articles on the advanced technical nittygritties of R. 
R and Data Mining  By Yanchang Zhao
Wonderful book with a number of cool applications of Data Mining.  Introducing
Monte Carlo Methods with R  By C.P. Robert and G. Casella
Lovely introduction to Monte Carlo simulations with handson applications in R. 
Monte Carlo Simulations in R  By Will Kurt
Nice blogpost introducing Monte Carlo simulations using R.
Python Programming Resources
The most versatile language for Scientific Computation is Python. The following resources may help you in getting yourself acquainted with the basics of Python. Get used to it!
 Python Class by Google  Decent online class to learn the basics of Python.
 Think Python  By Allen B. Downey  Teaches CS through Python.
 Introduction to Python for Data Science by DataCamp  Really nice free course.

scikitlearn  Machine Learning in Python
Undoubtedly the best Python package for Data Analytics and Machine Learning.
Visual Map for ML Algorithms  Video Tutorials from Data School 
Building Machine Learning Systems with Python
Wonderful book with a number of cool applications of Machine Learning.
By Willi Richert and Luis Pedro Coelho  Source Codes from GitHub  A Programmer's Guide to Data Mining  By Ron Zacharski  Really nice!
Assignments
Worth 30% of the total grade
Assignments will be posted on a regular basis. Each group (sometimes individual) has to submit 5 assignments during the course of the semester, out of which the 0th one will not be graded. All assignments are worth equal, that is, 7.5% of the total grade, each.
Assignment  Posted  Deadline  Remarks 

Assignment 0  28072017  09082017  Not graded 
Assignment 1  07082017  15082017  Individual 
Assignment 2
wineTrain.csv 
31082017  18092017  Group 
Assignment 3
smsdata.txt 
20102017  30102017  Group 
Assignment 4
cuisine.json 
05112017  15112017  Group 
Projects
Worth 30% of the total grade
About 30% weightage (out of 50%) will be reserved in the EndofSemeseter evaluation for the Term Project. Each group is supposed to deliver a Project Presentation (30 mins per group), including a Q&A session (5 mins per group), and a Project Report (theory/code).
Each group is at the liberty of choosing the topic for their Term Project. However, the term project chosen by each group must be approved by Sourav before they may be executed. The last date for finalizing the topics for the term projects is Thursday, 7 September 2017.
One may take part in a competition on Kaggle, DrivenData, CrowdAnalytix, InnoCentive, etc, or explore the open datasets made available by UNICEF, WHO, AWS, Google, EU, and US Govt.
Term Projects may also be inspired by earlier Stanford CS229 projects available at 2016, 2016 Spring, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, or by the diverse list of Business Case Studies posted by the Sloan School, MIT.
Potential topics for Term Projects may also include substantial extension (upon mutual discussion with Sourav) of previous years' projects available at CDS 2016, CDS 2015.