This page contains syllabus, reading
material, and exam for the course "Big Data Algorithms".

For any questions or comments regarding the lecture or this web site, please contact Rajendra Akerkar.

In many modern applications, the
size of the input data is extremely large to fit within memory. In
this course, we discuss algorithms and analysis techniques aimed
at addressing such massive data situations. Standard algorithmic
solutions do not address such challenges, and new algorithmic
techniques are needed. This course looks at a number of
algorithmic responses to problems, namely, algorithms with linear
or sub-linear running time, algorithms where the data arrives in a
stream, and computational models where memory is organized
hierarchically .

**Learning Outcomes **

- Design efficient algorithms for variations of problems discussed in class
- Analyse space/time/communication complexity of randomized algorithms
- Prove space/time/communication lower bounds for variations of
problems

- Choose an appropriate algorithmic tool for big data problem at hand

The project topics can be found here.

The student presentation topics:

- Online Bipartite Matching
- MapReduce Programming Model
- Communication Complexity
- Least squares regression
- Dimensionality Reduction for k-Means Clustering
- Cell Probe Model
- Minimum Spanning Tree Algorithms

This will be 2 weeks course. Each class session will be of 1 hour 30 min. duration. Classes will comprise of lecture discussion etc. Students will be encouraged to participate in class-discussion and will make at least one presentation during the course.

Each lecture has
associated readings and reference material. Basic reading is
required material for the course.

Streaming algorithms.

These algorithms extract only a small amount of information about the dataset, which approximately preserves its key properties. Such algorithms are typically allowed to make only one pass over the data or limited passes. The aim of algorithm design is to minimize the number of passes and space, while achieving the best possible approximation. [Notes]

Sub-Linear Time Algorithms

We will discuss algorithms that have access to the whole dataset. But, the size of the data is excessively large so that we can only make a small number of carefully chosen queries to it. The aim of algorithm design is to approximate parameters of the dataset and study its properties while minimizing the number of queries and running time. [Notes]

Linear Algebraic Models

We will focus on large scale linear algebra. It also covers a variety of algorithms for large matrices and vectors, subspace embeddings and compressed sensing. approximation matrix multiplication. [Notes]

Other Models

We will discuss a broader spectrum of computational models such as cell-probe model, MapReduce programming model, Crowd-sourcing and some aspects of computational complexity. [Notes]

Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffrey D. Ullman, 2nd edition.

Lecture notes will be provided for each session.

Materials from the following related courses might be useful in various parts of the course:

- Algorithms for Big Data 2013 Jelani Nelson (Harvard)
- Sublinear Algorithms for Big Data: Qin Zhang (University of Indiana Bloomington)
- Sublinear algorithms: Piotr Indyk, Ronitt Rubinfeld (MIT).

Details about the end of semester exam will be available here.

This website is licensed under a Creative Commons Attribution-Share Alike 3.0 License.