Stat 5526 Fall 2020 HPC/R Topics

OK, 2020 threw us a bunch of curveballs, but one thing remains true: we are in the age of the data scientist. Consider these facts:

  • In 2020, every person creates 1.7 MB of data every second.
  • In the last two years alone, an astonishing 90% of the world’s data has been created.
  • 2.5 quintillion bytes of data are produced by humans every day.
  • By 2025, humans will generate 463 exabytes of data each day.
  • 95 million photos and videos are shared every day on Instagram.
  • By the end of 2020, 44 zettabytes will make up the entire digital universe.
  • Every day, 306.4 billion emails are sent, and 5 million Tweets are made.

As the article points out, a quintillion has 18 zeros. Sooner or later, you will encounter a data set too large to process on your local computer or laptop. At that point, you have three options:

  1. Buy a larger computer
  2. Scale via Cloud
  3. Scale via HPC

I would argue that the last two options are converging. Here, we discuss computing on the high-performance computing (HPC) clusters managed by Advanced Research Computing at Virginia Tech. Broad topic areas are listed below:

Topics

  • HPC
    • Cluster organization and access (Lecture 1A)
    • Interacting with a scheduler (Lecture 1B)
    • Software (Lecture 1B)
    • Storage (Lecture 1B)
  • Containers (Lecture 1C)
  • R (Lecture 1C)
  • Parallelization (Lecture 1D)
    • MKL/OpenBLAS vectorization
    • OpenMP
    • MPI
  • Monte Carlo (Lecture 2A)
  • Neural Networks (Lecture 2B)
  • R and Keras (Lecture 2C)
  • Homework