Stat 5526 Fall 2020 HPC/R Topics
OK, 2020 threw us a bunch of curveballs, but one thing remains true: we are in the age of the data scientist. Check out these facts:
- 1.7MB of data is created every second by every person during 2020.
- In the last two years alone, an astonishing 90% of the world’s data has been created.
- 2.5 quintillion bytes of data are produced by humans every day.
- 463 exabytes of data will be generated each day by humans as of 2025.
- 95 million photos and videos are shared every day on Instagram.
- By the end of 2020, 44 zettabytes will make up the entire digital universe.
- Every day, 306.4 billion emails are sent, and 5 million Tweets are made.
As the article points out, there are 18 zeros in a quintillion. Sooner or later, you will find a data set too large to process on your local computer or laptop. At that point, you have three options:
- Buy a larger computer
- Scale via Cloud
- Scale via HPC
I would argue these last two platforms/methods are merging. Here, we discuss computing on the high-performance computing (HPC) clusters managed by Advanced Research Computing at Virginia Tech. Broad topic areas are listed below:
Topics
- HPC
  - Cluster organization and access (Lecture 1A)
  - Interacting with a scheduler (Lecture 1B)
  - Software (Lecture 1B)
  - Storage (Lecture 1B)
  - Containers (Lecture 1C)
- R (Lecture 1C)
- Parallelization (Lecture 1D)
  - MKL/OpenBLAS vectorization
  - OpenMP
  - MPI
- Monte Carlo (Lecture 2A)
- Neural Networks (Lecture 2B)
- R and Keras (Lecture 2C)
- Homework