Credits: 3
Contact Hours: 3 hours of lecture
Instructor: Seung Woo Son
Textbook: No textbook is required for this course. We will read papers published in conferences and journals. PDF files will be available through a learning management system such as Blackboard.
Other supplemental materials: All supplemental materials can be found on the UMass Lowell Blackboard portal (https://www.uml.edu/blackboard/). These materials include lecture slides, handouts, recordings, assignments, quizzes, and other documentation.
Course Catalog Description:
The past decade has been a period of remarkable growth in data sets, offering enormous opportunities for important discoveries not only in science and engineering (such as nanoscience, combustion, astrophysics, cosmology, fusion, climate prediction, and biology) but also in commercial domains (such as social media analytics, machine learning, and deep learning). Data-intensive computing is a class of parallel computing paradigms that apply a data-parallel approach to process “big data”, a term popularly used to describe datasets so large or complex that traditional data processing applications are inadequate to deal with them. Many new insights from this big data are hard to realize because of a lack of scalable programming languages, tools, and applications, as well as inadequate performance in storage, I/O, available interfaces, analysis capabilities, and runtime systems.
This course covers in-depth research topics in data-intensive computing. We first explore state-of-the-art techniques for building parallel systems and applications for scalable analysis of massive and complex datasets, such as those arising from scientific and engineering problems. We will learn how to write efficient parallel applications and study various research topics in distributed data-intensive computing. We will also discuss several key design choices for building large-scale computing systems that enable data-intensive computing.
Prerequisites: EECE.2160 ECE Application Programming (formerly 16.216), EECE.4810 Operating Systems (formerly 16.481) or EECE.4820 Computer Architecture and Design (formerly 16.482).
Grading: Attendance and scribble note (10%), Assignments (40%), Course project (50%)
Required or elective? This course may be used as a technical elective for Computer Engineering majors.
Course Outcomes:
By the end of this course, students will understand and be able to use all of the following:
Course Topics
Schedule
The schedule below is tentative and assumes 14 meetings.
Week | Topic | Reading | Assignment |
---|---|---|---|
1 | Syllabus; Course overview; Introduction to data-intensive computing | MGI report | Getting an account on MGHPCC |
2 | Getting started with data-intensive computing systems; Discussion of project ideas | MGHPCC tutorial | Lab 1 out |
3 | Data parallel frameworks 1/2 | MapReduce/Hadoop | Project proposal |
4 | Data parallel frameworks 2/2 | MPI/OpenMP/OpenCL | Lab 1 due; Lab 2 out |
5 | High-level data parallel frameworks | Spark/R/Pig | |
6 | Distributed storage | HDFS/GFS | Lab 2 due |
7 | Parallel storage | Lustre/GPFS/PVFS | Project intermediate report due; Lab 3 out |
8 | Large graphs 1/2 | GPS/Pregel | |
9 | Large graphs 2/2 | Giraph/GraphLab | Lab 3 due |
10 | NoSQL 1/3 | Key-value store (Dynamo) | |
11 | NoSQL 2/3 | Column store (BigTable, HBase) | |
12 | NoSQL 3/3 | Document store (MongoDB) | Lab 4 due |
13 | Virtualization | OpenStack/Docker | |
14 | Project presentation | | Lab 5 due |