What is High Performance Computing (HPC)?

This article covers what HPC is, which markets and applications use it, the process used to solve big data problems, the tools available to solve them, and the formats the data comes in. Finally, we’ll discuss the role accelerators play in the process.

We’re all familiar with user devices such as PCs, tablets, and phones. They will remain an important part of Intel’s business, and Intel will continue to invest in maximizing returns from them. These devices are welcoming billions and billions of things to the Internet: by 2020, 50 billion devices and 212 billion sensors are expected to join the Internet, at which point 47 percent of total devices and connections will be machine-to-machine. Truly the rise of the machines. These things will generate tremendous amounts of data. In 2020, the average Internet user is expected to generate approximately 1.5 gigabytes of traffic per day, up from 650 megabytes in 2015. That’s a huge amount of data until you consider the machines: a smart hospital will generate 3,000 gigabytes per day, a self-driving car will generate over 4,000 gigabytes per day, a connected plane will generate 40,000 gigabytes per day, and a connected factory will generate 1 million gigabytes per day.

This data needs to be analyzed and interpreted in real time, and that’s where the focus is. One of the best examples of Intel’s AI strategy is automated driving. The data produced by autonomous vehicles is immense: roughly 4 terabytes per vehicle per day. The compute required for autonomous vehicles is even more astounding. One car, in one hour of driving, will require on the order of 5 exaflops of computing to safely keep itself on the road, and supporting just 20,000 automated vehicles for one day is estimated to require an exaflop of sustained compute (a billion billion floating-point operations per second). As we can see, big data’s problem is that there’s so much data.

How are you going to process it? Do you have enough compute power? Enough storage to hold it? Or the infrastructure to move it around between the compute and the storage devices?

In short, high performance computing (HPC) is simply leveraging distributed compute resources to solve complex problems with large data sets. By large data sets we mean terabytes, petabytes, even zettabytes of data that need to be processed, and in many cases processed as close to real time as possible, certainly in minutes to hours, not days, weeks, or months. Typically, a user submits a job to the cluster manager, the cluster manager runs the workload across distributed resources such as CPUs, FPGAs, GPUs, and disk drives, all interconnected by a network, and the user gets the results back and analyzes them to make decisions. Typical workloads today span many vertical markets such as life sciences, astrophysics, genomics, bioinformatics, molecular dynamics, and weather and climate prediction, while artificial intelligence crosses many of these industries and is one of the hottest topics.
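To make that submit-distribute-gather pattern concrete, here is a minimal sketch in Python, using a local process pool as a stand-in for cluster nodes. The `analyze_chunk` function, the synthetic data, and the chunk count are illustrative assumptions; a real cluster would sit behind a workload manager or framework rather than a single executor.

```python
# Minimal single-machine analogue of the HPC pattern described above:
# split a large dataset into chunks, process the chunks in parallel
# "workers" (standing in for cluster nodes), then gather the results.
# analyze_chunk and the synthetic data are illustrative assumptions.
from concurrent.futures import ProcessPoolExecutor

def analyze_chunk(chunk):
    # Placeholder "workload": sum of squares over one chunk of the data.
    return sum(x * x for x in chunk)

def split(data, n_chunks):
    # Break the big problem into smaller, independent pieces.
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))          # stand-in for a large dataset
    chunks = split(data, n_chunks=8)

    # The executor plays the role of the cluster manager: it schedules
    # each chunk onto an available worker and collects the results.
    with ProcessPoolExecutor(max_workers=8) as pool:
        partial_results = list(pool.map(analyze_chunk, chunks))

    total = sum(partial_results)           # aggregate and analyze
    print(total)
```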

Today, cybersecurity and financial analysis are also becoming more and more popular, and the rest fall under the broad big data analytics umbrella, which really means a lot of things to a lot of people. Let’s go through the process of solving a big data analytics problem like the ones discussed above. Step one is to define the question or problem being examined: for example, what products sell best at Christmas to men ages 18 to 25, what’s the optimal traffic pattern for vehicles around the downtown area of a major city, or how does this chemical react with another? Step two is to ingest and store the data that will be used to answer that question.

For example, the ImageNet data set can be used for training deep learning topologies, the KITTI Vision Benchmark Suite can be used for autonomous driving, the USDA food composition database for nutrition work, and many, many more. Data is king in high performance computing: while there are many publicly available datasets for a variety of markets, many organizations spend years accumulating data for their private use to answer those questions. Step three is to take the data, clean it, prepare it, and transform it into a format that can be used and processed by the HPC workload. This could mean resizing images to be processed or formatting tables to be queried; data can also be reduced to make it more manageable and better organized. Step four is to actually perform the analysis on the data, step five is to review the results in their context, and step six, the last step in the chain, is to make decisions or take actions based on those results. The data in these data sets comes in many forms. It could be structured data, like a spreadsheet or a relational database, organized into easily addressable rows and columns.
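As a concrete illustration of step three, the sketch below resizes a folder of images to a uniform resolution with the Pillow library, the kind of preparation a training job on a dataset like ImageNet typically needs. The directory names and target size are assumptions, not part of any particular pipeline.

```python
# Hypothetical step-three preparation: normalize a folder of raw images
# to a fixed size so a downstream training job can consume them.
# Directory names and the target size are assumptions for illustration.
from pathlib import Path
from PIL import Image  # pip install Pillow

RAW_DIR = Path("data/raw_images")        # incoming, unprocessed images
OUT_DIR = Path("data/prepared_images")   # cleaned, uniformly sized images
TARGET_SIZE = (224, 224)                 # common input size for CNNs

OUT_DIR.mkdir(parents=True, exist_ok=True)

for path in RAW_DIR.glob("*.jpg"):
    with Image.open(path) as img:
        img = img.convert("RGB")          # drop alpha/grayscale quirks
        img = img.resize(TARGET_SIZE)     # uniform resolution
        img.save(OUT_DIR / path.name, "JPEG")
```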

There’s also semi-structured data, where the data doesn’t follow a rigid format but still carries associated information, such as key-value pairs, that makes it more amenable to processing than plain raw data. And finally there’s unstructured raw data that doesn’t map well to mainstream relational databases, such as text documents, web pages, audio and video files, and many others.
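A small, hypothetical Python example shows the difference between the three shapes of data; the values are invented purely for illustration.

```python
# Illustrative only: the rows, the JSON record, and the raw sentence are made up.
import csv, json, io

# Structured: rows and columns with a fixed schema, like a spreadsheet table.
structured = io.StringIO("id,product,price\n1,router,99.50\n2,switch,42.00\n")
rows = list(csv.DictReader(structured))
print(rows[0]["product"])          # -> "router"

# Semi-structured: no fixed table, but self-describing key-value pairs.
record = json.loads('{"sensor": "temp-01", "reading": 21.7, "unit": "C"}')
print(record["reading"])           # -> 21.7

# Unstructured: raw text (or audio, video, web pages) with no inherent schema.
raw = "Patient reported mild discomfort after the procedure; no follow-up needed."
print(len(raw.split()))            # little can be queried without further processing
```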

There’s also a little bit of terminology around where the data gets stored. The first concept is the data lake, which takes in raw data from a variety of sources in its native format; it uses a flat-file architecture to store the data, and it’s not limited to unstructured data. The data warehouse is a storage repository that stores structured data in a tabular format using files and hierarchical folder structures. Raw data from the data lake may need to be extracted, transformed, and then loaded into a data warehouse. There are many open-source tools available today that make processing the data easier, providing tools and infrastructure for handling both structured and unstructured data. Apache Hadoop is one of the most common HPC frameworks, with Apache Spark becoming equally popular. There are also systems like Apache Cassandra, a NoSQL database management system; PostgreSQL, an object-relational database management system; and SAP HANA, an in-memory, column-oriented relational database management system, with many other systems available. We’re not going into detail on what these are; there will be follow-on training, and of course you can always search to find out more about each one.
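One common way these pieces fit together is an extract-transform-load (ETL) step that pulls raw files out of the data lake, cleans them, and writes them into the tabular form a data warehouse expects. The sketch below uses Apache Spark’s Python API under assumed paths and column names; it illustrates the pattern rather than prescribing a pipeline.

```python
# Hypothetical ETL sketch with PySpark: extract raw CSV files from a data
# lake, apply a light transformation, and load the result as Parquet tables
# suitable for a data warehouse. Paths and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-to-warehouse").getOrCreate()

# Extract: read raw, loosely formatted files straight from the data lake.
raw = spark.read.csv("s3a://example-lake/sales/*.csv", header=True, inferSchema=True)

# Transform: clean and reshape into an analysis-friendly table.
cleaned = (
    raw.dropna(subset=["order_id", "amount"])
       .withColumn("amount", F.col("amount").cast("double"))
       .groupBy("region")
       .agg(F.sum("amount").alias("total_sales"))
)

# Load: write a columnar, queryable table into the warehouse layer.
cleaned.write.mode("overwrite").parquet("s3a://example-warehouse/sales_by_region")

spark.stop()
```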

One of the most useful and common functions used in preparing the data is called MapReduce. It breaks a massive task down into smaller pieces and allows them to be processed in parallel. An example would be something as simple as a word counter: the input data comes in, the first step splits that data into individual rows (effectively smaller pieces), the data is then mapped to keys and values, shuffled into groups, and finally reduced into a key-value table that’s easily queried. In this case, the table represents the number of times the words dog, cat, car, and house are present in the initial data set; a minimal sketch of the same idea appears at the end of this section.

Given that in HPC large problems are broken down into smaller ones in order to run more work in parallel and get better aggregate performance, accelerators can be used in many stages of the process. For relational databases such as PostgreSQL, data access, compression, filtering, replication, memory mapping, and data caching are areas where accelerators can offload the host processor and provide better system throughput. For NoSQL systems like Cassandra, and for I/O acceleration within the system, accelerators can be paired with the networking interconnects to perform processing inline; data compression, encryption and decryption, data encoding and decoding, and hashing are all areas where accelerators like FPGAs can provide higher bandwidth and lower latency.

In summary, the larger and larger datasets being developed and acquired require distributed computing resources to solve these problems in a timely manner. It’s not realistic to do this on a single dedicated device; the work needs to be distributed across data center or cloud environments for scale. Several open-source HPC frameworks are available today to do this, and they’re evolving constantly, providing easier and faster methods to organize and process these large data sets. And with these methods of distributing the processing across compute nodes, breaking the problems down into smaller sub-problems and farming them out, there are many areas where accelerators can be used to improve performance.
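Here is that minimal word-count sketch: a single-process Python illustration of the split, map, shuffle, and reduce phases described above. It shows the MapReduce idea only; a framework like Hadoop distributes the same phases across many nodes, and the input text here is invented.

```python
# Minimal, single-process sketch of the MapReduce word count described
# above: split the input, map each word to (word, 1), shuffle by key,
# then reduce each group to a count. The input text is made up.
from collections import defaultdict

text = "dog cat car house dog cat dog"

# Split: break the input into smaller pieces (here, individual words).
words = text.split()

# Map: emit a (key, value) pair for each word.
mapped = [(word, 1) for word in words]

# Shuffle: group all values that share a key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: collapse each group into a single count, giving a queryable table.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # e.g. {'dog': 3, 'cat': 2, 'car': 1, 'house': 1}
```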
