## Computing for Data Sciences

### Welcome to the Fall 2016 edition of the course

The Post Graduate Diploma in Business Analytics (PGDBA) - jointly offered by IIM Calcutta, IIT Kharagpur, and ISI Kolkata - aims to help shape the emerging profession of Business Analytics by delivering a cutting-edge, interdisciplinary educational experience to its graduates.

Computing for Data Sciences (CDS), aka BAISI-4, is one of the five courses offered at ISI Kolkata during the First Semester of the PGDBA program. The Fall 2016 edition of the course -- CDS 2016 -- is taught by Sourav Sen Gupta from R C Bose Centre, Indian Statistical Institute, Kolkata.

Reach Sourav   :     sg.sourav@gmail.com   |     +91 94323 44852   |     Room 404, Deshmukh Building

## Lectures

We have time for about 28 two-hour lectures during Fall 2016 -- that's a whopping 56 hours! We will try to distribute this time carefully between Classroom Lectures (about 36-40 hours), Invited Talks (about 8 hours), and Interactive Sessions (about 8-12 hours) -- as required for the course.

The basic outline of the Classroom Lectures, and all relevant references and resources will be posted regularly on this website. The corresponding lecture notes will be authored and posted by the students taking the course -- in the form of blog articles -- at the CDS 2016 Lecture Notes blog.

### First Half (Pre Mid-Sem) : 12 Lectures

We discussed the format for the course, and the administrative issues thereof. The group format was specified for assignments and the term project, and the resources residing on the CDS website were pointed out. The computing platform and requirements for the course were defined as well, and installation of R (link), RStudio (link) and Python 2.7.x (link) was recommended.

We discussed the primary goal of the course -- to understand the fundamental notions of Regression, Classification and Clustering on big-data -- with support from techniques like dimensionality reduction. The role of efficient algorithm design was mentioned in this regard, and we started basic discussion on time complexity of an algorithm in terms of different classes of functions and their mutual comparison. Preliminary notions of $\Theta$, $O$, $\Omega$, $o$ and $\omega$ in terms of complexity were informally introduced as analogues to 'equality' ($=$), 'less than or equal to' ($\leq$), 'greater than or equal to' ($\geq$), 'strictly less than' ($<$) and 'strictly greater than' ($>$), respectively, in terms of such functions.

Homework : Plot $f(n)$ against $n$ for various functions $1, 2, 1000$; $\log(n), 2\log(n), 1000\log(n)$; $n, 2n, 1000n$; $n^2, 2n^2, 1000n^2$; $n^3, 2n^3, 1000n^3$; $2^n, 3^n$; etc., and try to observe patterns of clustering, if any, for large values of $n$.
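As a quick way to eyeball these growth patterns without a plot, one can tabulate $f(n)$ for a few values of $n$ and compare; a small sketch in Python (function names and the choice of sample functions are purely illustrative -- the same comparison can of course be plotted in R):

```python
import math

# Growth of f(n) for large n: constant multipliers matter far less than the
# functional form, so the curves "cluster" by complexity class.
def growth_table(ns):
    funcs = {
        "1000": lambda n: 1000,
        "1000*log n": lambda n: 1000 * math.log(n),
        "n": lambda n: n,
        "n*log n": lambda n: n * math.log(n),
        "n^2": lambda n: n ** 2,
    }
    return {name: [f(n) for n in ns] for name, f in funcs.items()}

table = growth_table([10, 100, 1000])
# Even with a factor of 1000, the logarithm is eventually dwarfed by n^2:
print(table["1000*log n"][2] < table["n^2"][2])  # True
```

At $n = 1000$, $1000\log(n) \approx 6908$ while $n^2 = 10^6$ -- the constant-factor "head start" is already irrelevant.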

We continued our discussion on the basic notions of computational complexity by classifying $f(n)$, the function denoting the time complexity of an algorithm with input size $n$, in various categories -- constant or $O(1)$; logarithmic or $O(\log(n))$; linear or $O(n)$; linearithmic or $O(n\log(n))$; quadratic or $O(n^2)$; polynomial or $O(n^c)$; exponential or $O(c^n)$; factorial or $O(n!)$.

We took a few examples -- addition and multiplication of two numbers (where $n$ is the maximum bitsize of the input numbers), finding the maximum or minimum in a list of numbers (where $n$ is the size of the list), sorting a bunch of playing cards (where $n$ is the number of cards), multiplication of a matrix by a vector (where $m \times n$ are the dimensions of the matrix) -- to illustrate various time complexities.

Reading : Chapter 3 of "Introduction to Algorithms" (by Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest and Clifford Stein)

We motivated the vector-space modelling of data, taking the classic example of representing books as term-frequency vectors on a high-dimensional space, where each term/word represents a new dimension and the values are frequencies of those terms/words in the book. In this context, we learned the notion of distance -- in the sense of norm -- and discussed the three major norms, namely, $l_1$, $l_2$ and $l_{\infty}$.
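A minimal sketch of the three norms, applied to the difference of two toy term-frequency vectors (the vocabulary and counts here are made up for illustration):

```python
import math

# The three major norms on a vector v:
def l1(v):   return sum(abs(x) for x in v)            # l_1: sum of absolute values
def l2(v):   return math.sqrt(sum(x * x for x in v))  # l_2: Euclidean length
def linf(v): return max(abs(x) for x in v)            # l_inf: largest component

# Two toy "books" as term-frequency vectors over the same 4-word vocabulary
book_a = [3, 0, 1, 2]
book_b = [1, 1, 0, 2]
diff = [a - b for a, b in zip(book_a, book_b)]        # [2, -1, 1, 0]
print(l1(diff), l2(diff), linf(diff))                 # 4, sqrt(6) ~ 2.449, 2
```

Note how the three norms rank the same pair of books at different "distances" -- the choice of norm is part of the modelling decision.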

We tried to analyze the action of a matrix on a vector during multiplication, and figured that an $m \times n$ matrix defined over the reals ($\mathbb{R}$) generally rotates, scales or reflects a vector. We discussed the effect of scaling in terms of the growth of Fibonacci numbers by writing the recurrence relation $F_n = F_{n-1} + F_{n-2}$ in terms of a matrix, and figured that the eigenvalues of this matrix control the scaling.
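The Fibonacci example can be sketched as follows -- repeated application of the matrix $\begin{pmatrix}1 & 1\\ 1 & 0\end{pmatrix}$ generates the sequence, and the growth ratio approaches its dominant eigenvalue, the golden ratio $(1+\sqrt{5})/2 \approx 1.618$ (pure-Python $2 \times 2$ helpers, written just for this illustration):

```python
# Fibonacci via the matrix [[1,1],[1,0]]: its n-th power contains F_{n+1}, F_n.
def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_pow(A, n):
    R = [[1, 0], [0, 1]]          # 2x2 identity
    while n:
        if n & 1:
            R = mat_mul(R, A)
        A = mat_mul(A, A)
        n >>= 1
    return R

F = [[1, 1], [1, 0]]
P = mat_pow(F, 10)                 # [[F_11, F_10], [F_10, F_9]]
print(P[0][1])                     # F_10 = 55
print(P[0][0] / P[0][1])           # F_11 / F_10 ~ 1.618, the dominant eigenvalue
```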

In fact, we observed that if we decompose a square matrix into its eigenvalues and eigenvectors using the eigen() function in R, then the original matrix can be written as $M = V \Lambda V^{-1}$, where $V$ is the matrix with the eigenvectors of $M$ as its columns, and $\Lambda$ is a diagonal matrix with the eigenvalues of $M$ as its diagonal elements. This gave us the nice relation $M^kv = (V \Lambda^k V^{-1}) v$ for any vector $v$.
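The same decomposition can be checked numerically; a sketch using numpy's `np.linalg.eig` as an analogue of R's `eigen()` (the small symmetric matrix here is just an example):

```python
import numpy as np

# Verify M = V L V^{-1} and M^k v = V L^k V^{-1} v for a small matrix.
M = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, V = np.linalg.eig(M)          # eigenvalues and eigenvector matrix
Lam = np.diag(vals)

# Reconstruct M from its eigen-decomposition
assert np.allclose(M, V @ Lam @ np.linalg.inv(V))

# Powers of M act through powers of the eigenvalues alone
v, k = np.array([1.0, 0.0]), 5
lhs = np.linalg.matrix_power(M, k) @ v
rhs = V @ np.diag(vals ** k) @ np.linalg.inv(V) @ v
print(np.allclose(lhs, rhs))        # True
```

This is exactly why $M^k v$ is cheap once the decomposition is known: only the diagonal entries need to be raised to the $k$-th power.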

We noted that the action of $M$ on a vector $v$ can be naturally decomposed as $Mv = (V \Lambda V^{-1}) v$, a three-step process where $V^{-1}$ and $V$ perform changes of basis (for a symmetric $M$ the eigenvectors are orthonormal, so these two steps are pure rotations/reflections), and $\Lambda$ takes care of the entire scaling. We hinted that an analogous decomposition, with orthonormal factors on both sides, exists for any $m \times n$ matrix; it is called the Singular Value Decomposition, computed using the svd() function in R.

Reading : Lectures 1, 2, 3 of "Linear Algebra" by Gilbert Strang (link)   |   "Geometric Review of Linear Algebra" by Simoncelli (link)

In this lecture, we tried to view an $m \times n$ matrix as a linear operator (analogous to a function) from $\mathbb{R}^n$ to $\mathbb{R}^m$, and attempted an analysis of matrices in terms of linear functions. We learned the notions of linear dependence and independence of a set of vectors, and tried to view, geometrically, the span of a collection of vectors as the space generated by taking their linear combinations. Using the notion of span, we tried to identify the range of the matrix as a subspace of $\mathbb{R}^m$, and figured that the range of an $m \times n$ matrix as a linear operator is essentially its column space, that is, the span of its column vectors. Evidently, $\dim(\textrm{col}(A_{m \times n})) \leq m$.

It seemed quite natural to look at the $n \times m$ matrix $A^T$ as a linear operator from $\mathbb{R}^m$ to $\mathbb{R}^n$ (it is NOT the inverse operator of $A$). It was immediately obvious that the row space of the matrix $A$ is identical to the column space of $A^T$, that is, $\textrm{row}(A) = \textrm{col}(A^T)$. Thus, we concluded that $\dim(\textrm{row}(A)) = \dim(\textrm{col}(A^T)) \leq n$, and that $\textrm{row}(A)$ is a subspace of $\mathbb{R}^n$.

Next, we viewed the operation of the matrix $A$ on a vector $v \in \mathbb{R}^n$ as a collection of dot products of the rows of $A$ with $v$, and concluded that the solutions of $Av = 0$ must all be orthogonal to every row of $A$, and hence, must be orthogonal to $\textrm{row}(A)$. The collection of all such vectors is called the null space of $A$, and the above analysis illustrated that $\textrm{null}(A) \perp \textrm{row}(A)$. We also observed that $\textrm{null}(A)$ is non-empty as it always contains the zero-vector ($0$) in $\mathbb{R}^n$, and that $\dim(\textrm{null}(A)) \leq n$. Similarly, we identified the null space of $A^T$ as a subspace of $\mathbb{R}^m$, and observed that $\textrm{null}(A^T) \perp \textrm{row}(A^T)$, that is, $\textrm{null}(A^T) \perp \textrm{col}(A)$.

In effect, we identified four subspaces for the $m \times n$ matrix $A$ in this lecture -- the row space of $A$ (same as the column space of $A^T$), the column space of $A$ (same as the row space of $A^T$), the null space of $A$, and the null space of $A^T$, as shown in the figure below.
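The dimensions of these four subspaces are all governed by the rank $r$ of $A$; a numpy sketch (the rank-1 matrix here is a made-up example) verifying the counts $\dim(\textrm{row}(A)) = \dim(\textrm{col}(A)) = r$, $\dim(\textrm{null}(A)) = n - r$, and $\dim(\textrm{null}(A^T)) = m - r$:

```python
import numpy as np

# A 2x3 matrix whose second row is twice the first, so rank(A) = 1.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
m, n = A.shape
r = np.linalg.matrix_rank(A)

# dim row(A) = dim col(A) = r; dim null(A) = n - r; dim null(A^T) = m - r
print(r, n - r, m - r)              # 1 2 1
```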

Reading : Paper 1 and Paper 2 by Gilbert Strang   |   "Geometric Review of Linear Algebra" by Simoncelli (link)
Homework : Determine the dimensions of all the four subspaces identified in this lecture, and find their relationship with $m$ and $n$.

Introduced linear regression from the notion of solving linear equations. Detailed abstract and further resources to come.
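One concrete way to see linear regression as solving linear equations is via the normal equations, $(X^T X)\beta = X^T y$, for an overdetermined system $X\beta \approx y$; a sketch with made-up noiseless data:

```python
import numpy as np

# Fit y ~ a*x + b by solving the normal equations of the least-squares problem.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])   # exactly y = 2x + 1
X = np.column_stack([x, np.ones_like(x)])  # design matrix: [x, intercept]

beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)                          # [2. 1.] -- slope and intercept recovered
```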

Introduced the gradient descent algorithm to solve the problem of linear regression. Detailed abstract and further resources to come.
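A minimal gradient-descent sketch for the same least-squares objective (learning rate and iteration count are chosen arbitrarily for this toy problem):

```python
import numpy as np

# Batch gradient descent for least squares: step against the gradient of the
# mean squared error, which is (2/m) X^T (X beta - y).
def gradient_descent(X, y, lr=0.05, iters=5000):
    beta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(iters):
        beta -= lr * (2.0 / m) * X.T @ (X @ beta - y)
    return beta

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
X = np.column_stack([x, np.ones_like(x)])
print(gradient_descent(X, y))        # converges to ~ [2, 1]
```

Iterative descent reaches (approximately) the same solution as the closed-form normal equations, but scales to settings where forming and inverting $X^T X$ is impractical.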

Discussed the batch gradient descent and stochastic gradient descent algorithms to solve the problem of linear regression. Detailed abstract and further resources to come.

Discussed the hypothesis testing and parameter estimation point-of-view of linear regression, in terms of estimating the underlying linear model in data. Detailed abstract and further resources to come.

Discussed the singular value decomposition and principal component analysis in connection with explaining variance in the data. Detailed abstract and further resources to come.
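The "explained variance" view can be sketched via the SVD of the centered data matrix, whose squared singular values measure the variance captured by each principal component (synthetic data, constructed so that one column is redundant):

```python
import numpy as np

# PCA via SVD: center the data, decompose, and normalize the squared
# singular values to get the fraction of variance each component explains.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2.0 * X[:, 0]              # third column is a multiple of the first
Xc = X - X.mean(axis=0)              # center each column

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / np.sum(s ** 2)
print(explained)                     # third component explains ~0 variance
```

Because the third column is linearly dependent on the first, the centered matrix has rank 2, and two principal components suffice to explain (essentially) all the variance.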

Introduced the notion of classification as another supervised learning problem, and discussed decision trees as a model for solving the problem. Detailed abstract and further resources to come.
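At its core, a decision tree is built from threshold splits; a depth-one "stump" on a single feature, scored by misclassification count, can be sketched as follows (toy data; real tree learners use impurity measures like Gini or entropy and recurse on each side of the split):

```python
# A minimal decision stump: choose the threshold t on one feature that best
# separates two classes, predicting 1 for x >= t and 0 otherwise.
def best_split(xs, ys):
    best = (None, float("inf"))
    for t in sorted(set(xs)):
        errors = sum((x >= t) != y for x, y in zip(xs, ys))
        if errors < best[1]:
            best = (t, errors)
    return best                      # (threshold, misclassifications)

xs = [1, 2, 3, 10, 11, 12]
ys = [0, 0, 0, 1, 1, 1]
print(best_split(xs, ys))            # (10, 0): threshold 10 separates perfectly
```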

Introduced the notion of random forests as an ensemble model based on decision trees, in connection with solving the problem of classification. Detailed abstract and further resources to come.

This was a Lab Session, involving hands-on application of linear models and tree-based models to a dataset. Discussed the variable importance measures obtained from random forests. Detailed abstract and further resources to come.

### Mid-Semester Examination

On 22 September 2016, we had our Mid-Semester Examination for the course. It was a (roughly) six-hour Hackathon, targeted towards the comprehensive EDA and Multiple Regression Analysis of a given dataset (private, non-shareable). It was a Kaggle-like group competition.

### Second Half (Post Mid-Sem)

Discussed (recap) various notions of distance, and introduced the concept of nearest neighbor search in the context of text documents. Introduced TF-IDF for documents, and the kd-Tree as a data structure for efficient neighborhood search. Detailed abstract and further resources to come.
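One common TF-IDF weighting (several variants exist) multiplies a term's frequency in a document by $\log(N/\textrm{df})$, so terms appearing in every document get weight zero; a sketch on two toy documents:

```python
import math

# Plain TF-IDF: term frequency times log(N / document frequency).
def tf_idf(docs):
    N = len(docs)
    vocab = sorted({w for d in docs for w in d})
    df = {w: sum(w in d for d in docs) for w in vocab}   # document frequency
    return [{w: d.count(w) * math.log(N / df[w]) for w in vocab} for d in docs]

docs = [["data", "science", "data"],
        ["data", "mining"]]
vecs = tf_idf(docs)
print(vecs[0]["data"])               # 0.0: "data" occurs in every document
print(vecs[0]["science"])            # log(2) ~ 0.693: distinctive for doc 0
```

The resulting weight vectors are exactly the kind of high-dimensional representations on which the nearest-neighbor search above operates.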

Introduced the concept of unsupervised learning and discussed various algorithms for clustering data points. Specifically, discussed k-Means and hierarchical clustering algorithms and their usage. Detailed abstract and further resources to come.
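Lloyd's algorithm for k-Means can be sketched as follows -- alternate between assigning points to the nearest centroid and recomputing centroids as cluster means (toy data; no handling of empty clusters or convergence checks):

```python
import numpy as np

# Bare-bones k-Means (Lloyd's algorithm).
def kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(iters):
        # distance of every point to every center, then nearest-center labels
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels, centers = kmeans(X, 2)
print(labels)    # first two points share one cluster, last two the other
```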

We had a wonderful guest lecture by Bodhisattwa, Robin, Ayan and Jayanta (from the PGDBA senior batch), introducing the concepts of neural networks and deep learning in the context of their participation and experience in the Data Science Game 2016. Detailed abstract and further resources to come.

Discussed the concept of the bias-variance tradeoff in the context of overfitting and underfitting a model to a given dataset. Introduced regularized models like ridge regression and the lasso to illustrate the optimization of bias and variance. Detailed abstract and further resources to come.
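Ridge regression has a closed form, $\beta = (X^T X + \lambda I)^{-1} X^T y$, which makes the shrinkage effect easy to see (toy data; the lasso has no closed form and is typically fit by coordinate descent instead):

```python
import numpy as np

# Ridge regression: larger lambda shrinks the coefficients toward zero,
# trading increased bias for reduced variance.
def ridge(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2 * x + 1
X = np.column_stack([x, np.ones_like(x)])

b0 = ridge(X, y, 0.0)                # lambda = 0: ordinary least squares, ~[2, 1]
b1 = ridge(X, y, 10.0)               # heavy penalty: coefficients shrunk
print(np.linalg.norm(b1) < np.linalg.norm(b0))   # True
```

(In practice the intercept is usually left unpenalized; it is penalized here only to keep the sketch short.)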

## Projects

Adequate weightage will be reserved in the End-of-Semester evaluation (50%) for the Term Project. Each group is supposed to deliver a Project Presentation (30 mins per group), including a Q&A session (10 mins per group), and a Project Report (theory/code) in the form of a blog-article.

Each group is at liberty to choose the topic for its Term Project. However, the Project chosen by each group must be approved by Sourav before it may be executed. Potential choices for the term projects may be one from this list of suggested topics, a substantial extension of one of the projects from CDS 2015, or any other practical and/or theoretical project relevant to the course, upon mutual agreement with Sourav.

The last date for finalizing the topic for your project is 7 October 2016. Project presentations will be scheduled on 2 and 3 December 2016.

## Information

### Course : BAISI-4 (aka CDS)

Tuesday & Thursday   @   11:00 - 13:00

Assignments + Lecture Scribes = 20%
Mid-Sem Exam = 30%   |   Hackathon
End-Sem Exam = 50%   |   Term Project

## Assignments

Assignments constitute 20% of the total marks, including group scribing for lecture notes (approx. 5%).

### Assignment 1

To be posted. May be on Programming.

### Assignment 2

To be posted. May be a Competition.