Big Data Algorithms course 

This page contains syllabus, reading material, and exam for the course "Big Data Algorithms".

For any questions or comments regarding the lecture or this web site, please contact  Rajendra Akerkar.

Course Syllabus

In many modern applications, the size of the input data is extremely large to fit within memory. In this course, we discuss algorithms and analysis techniques aimed at addressing such massive data situations. Standard algorithmic solutions do not address such challenges, and new algorithmic techniques are needed.  This course looks at a number of algorithmic responses to problems, namely, algorithms with linear or sub-linear running time, algorithms where the data arrives in a stream, and computational models where memory is organized hierarchically .

Learning Outcomes

By the end of the course, the student must be able to:

Project Topics

The project topics can be found here.

Presentation Topics

The student presentation topics:

  1. Online Bipartite Matching
  2. MapReduce Programming Model
  3. Communication Complexity
  4. Least squares regression
  5. Dimensionality Reduction for k-Means Clustering
  6. Cell Probe Model
  7. Minimum Spanning Tree Algorithms

Lectures

This will be 2 weeks course. Each class session will be of 1 hour 30 min. duration. Classes will comprise of lecture discussion etc. Students will be encouraged to participate in class-discussion and will make at least one presentation during the course.

Each lecture has associated readings and reference material. Basic reading is required material for the course.

Streaming algorithms.
These algorithms extract only a small amount of information about the dataset, which approximately preserves its key properties. Such algorithms are typically allowed to make only one pass over the data or limited passes. The aim of algorithm design is to minimize the number of passes and space, while achieving the best possible approximation. [Notes]

Sub-Linear Time Algorithms
We will discuss algorithms that have access to the whole dataset. But, the size of the data is excessively large so that we can only make a small number of carefully chosen queries to it. The aim of algorithm design is to approximate parameters of the dataset and study its properties while minimizing the number of queries and running time. [Notes]

Linear Algebraic Models
We will focus on large scale linear algebra. It also covers a variety of algorithms for large matrices and vectors, subspace embeddings and compressed sensing. approximation matrix multiplication. [Notes]

Other Models
We will discuss a broader spectrum of computational models such as cell-probe model, MapReduce programming model, Crowd-sourcing and some aspects of computational complexity. [Notes]

Reading Material

Books
Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman, 2nd edition.

Lecture notes will be provided for each session.

Materials from the following related courses might be useful in various parts of the course:
  1. Algorithms for Big Data 2013 Jelani Nelson (Harvard)
  2. Sublinear Algorithms for Big Data: Qin Zhang (University of Indiana Bloomington)
  3. Sublinear algorithms: Piotr Indyk, Ronitt Rubinfeld (MIT).

Exam

Details about the end of semester exam will be available here.



This website is licensed under a Creative Commons Attribution-Share Alike 3.0 License.