Seung Woo Son

EECE.5540: Data Intensive Computing

Credits: 3

Contact Hours: 3 hours of lecture

Instructor: Seung Woo Son

Textbook: No textbook is required for this course. Readings will be papers published in conferences and journals. PDF files will be available through a learning management system such as Blackboard.

Other supplemental materials: All supplemental materials can be found on the UMass Lowell Blackboard portal (https://www.uml.edu/blackboard/). These materials include lecture slides, handouts, recordings, assignments, quizzes, and other documentation.

Course Catalog Description:

The past decade has been a period of remarkable growth in datasets, offering enormous opportunities for important discoveries not only in science and engineering (such as nanoscience, combustion, astrophysics, cosmology, fusion, climate prediction, and biology) but also in commercial domains (such as social media analytics, machine learning, and deep learning). Data-intensive computing is a class of parallel computing paradigms that apply a data-parallel approach to process “big data”, a term popularly used for describing datasets so large or complex that traditional data processing applications are inadequate to deal with them. Many new insights from this big data are often hard to realize because of a lack of scalable programming languages, tools, and applications, as well as inadequate performance in storage, I/O, available interfaces, analysis capabilities, and runtime systems.

This course covers in-depth research topics in data-intensive computing. We first explore state-of-the-art techniques for building parallel systems and applications for scalable data analysis on massive and complex datasets, such as those from scientific and engineering problems. We will learn how to write efficient parallel applications and study various research topics in distributed data-intensive computing. We will also discuss several key design choices for building large-scale computing systems that enable data-intensive computing.

Prerequisites: EECE.2160 ECE Application Programming (formerly 16.216), EECE.4810 Operating Systems (formerly 16.481) or EECE.4820 Computer Architecture and Design (formerly 16.482).

Grading: Attendance and scribe notes (10%), Assignments (40%), Course project (50%)

Required or elective? This course may be used as a technical elective for Computer Engineering majors.

Course Outcomes:

By the end of this course, students will be able to:

Recognize data-intensive problems and challenges in various disciplines.
Assess the scale of computing and data requirements for data-intensive applications and computing platforms.
Understand various data-parallel computing frameworks.
Understand scalable file/storage systems and data models for data-intensive computing.

Course Topics

Basics of data-intensive computing: definitions of data-intensive computing, data science, and big data; the 5 Vs.
Data-intensive computing platforms: notion of parallel and distributed computing systems used in both HPC and commercial domains; similarities and differences in hardware and software stacks.
Data parallel frameworks: runtime and high-level programming frameworks in data-intensive computing; MapReduce/Hadoop, MPI/OpenMP/OpenCL, Spark, etc.
File and storage systems: parallel I/O and storage systems for handling large amounts of data efficiently; GFS/HDFS vs. Lustre/GPFS/PVFS.
NoSQL data models: key-value store (Dynamo), column store (BigTable, HBase, Cassandra), document-based (MongoDB, CouchDB), object store, graph-based, etc.
Virtualization: data-intensive computing in virtualized environments; OpenStack vs. Docker.

Schedule

The schedule below is tentative and based on 14 meetings.

Week	Topic	Reading	Assignment
1	Syllabus; Course overview; Introduction to data-intensive computing	MGI report	Getting an account on MGHPCC
2	Getting started with data-intensive computing systems; Discussion of project ideas	MGHPCC tutorial	Lab 1 out
3	Data parallel frameworks 1/2	MapReduce/Hadoop	Project proposal
4	Data parallel frameworks 2/2	MPI/OpenMP/OpenCL	Lab 1 due; Lab 2 out
5	High-level data parallel frameworks	Spark/R/Pig
6	Distributed storage	HDFS/GFS	Lab 2 due
7	Parallel storage	Lustre/GPFS/PVFS	Project intermediate report due; Lab 3 out
8	Large graphs 1/2	GPS/Pregel
9	Large graphs 2/2	Giraph/GraphLab	Lab 3 due
10	NoSQL 1/3	Key-value store (Dynamo)
11	NoSQL 2/3	Column store (BigTable, HBase)
12	NoSQL 3/3	Document store (MongoDB)	Lab 4 due
13	Virtualization	OpenStack/Docker
14	Project presentation		Lab 5 due