
Can Programs Be Made Faster?

Short answer: no. But they can be made more efficient.

A happy new year to all of you! This is the first post in 2014, and it is a (not so) short post about a topic which follows me all the time in discussions about high performance computing. In discussions and in projects I get asked how programs can be written to run faster. The problem is that this mind set is misleading. It always takes me some minutes to explain the correct mind set: programs cannot run faster, but they can run more efficiently to save time.

If we neglect that we can scale vertically by using faster CPUs, faster memory and faster disks, the speed of a computer is constant (also neglecting CPUs which change their speed to save power). All programs always run at the same speed, and we cannot do anything to speed them up by just changing the programming. What we can do is use the hardware we have as efficiently as possible. The effect is: we get more done in less time. This reduces the program run time, and the software seems to run faster. That is what people mean, but looking at efficiency brings the mind set to find the correct levers for decreasing run time.

As soon as a program returns the correct results it is effective, but there is also the efficiency to be looked at. Have a look at my post about effectiveness and efficiency for more details on the difference between the two. To gain efficiency, we can do the following:

Use all hardware available

All cores of a multi-core CPU can be utilized, and all CPUs of the system if we have more than one. GPUs or physics accelerator cards can be used for calculations if present.

Especially in brown field projects, where the original code comes from single core systems (before 2005 or so) or from systems which did not have appropriate GPUs (before 2009), developers did not pay attention to multi-threaded, heterogeneous programming. These programs have a lot of potential for performance gains.

Look out for:

CPU utilization

Introduce multi-threaded programming into your software. Check the CPU utilization during an actual run and look for CPU idle times. If there are any, check whether your software can do something at the time the idle times occur.
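As a minimal sketch of this idea (names like parallel_sum and the fixed thread count are mine, not from any particular project), a loop over a large array can be split into slices which are summed by POSIX threads in parallel, one slice per core:

```c
#include <pthread.h>

#define N 1000000
#define NUM_THREADS 4

static double data[N];
static double partial[NUM_THREADS];  /* one result slot per thread, no locking needed */

/* Each thread sums one contiguous slice of the array. */
static void *sum_slice(void *arg) {
    long t = (long)arg;
    long begin = t * (N / NUM_THREADS);
    long end = (t == NUM_THREADS - 1) ? N : begin + N / NUM_THREADS;
    double s = 0.0;
    for (long i = begin; i < end; i++)
        s += data[i];
    partial[t] = s;
    return NULL;
}

/* Starts NUM_THREADS workers and combines their partial sums. */
double parallel_sum(void) {
    pthread_t threads[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, sum_slice, (void *)t);
    double total = 0.0;
    for (long t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partial[t];
    }
    return total;
}
```

Writing each partial result into its own slot avoids locks entirely; the join at the end is the only synchronization point.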

GPU utilization

Introduce OpenCL or CUDA into your software to utilize the GPU board or physics accelerator cards if present. Check the utilization of the cards during calculation and look for optimizations.

Data partitioning for optimal hardware utilization

If a calculation does not need too much data, everything should be loaded into memory to have the data present for efficient access. Data can also be organized for access in different modes for the sake of efficiency. But if a calculation involves amounts of data which do not fit into memory, a good strategy is needed to avoid performing the calculations on disk.

The data should be partitioned into smaller pieces. These pieces should fit into memory, and the calculations on these pieces should run completely in memory. The bandwidth from CPU to memory is about 100 to 1,000 times higher than from CPU to disk. Once you have done this, check with tools for cache misses and see whether you can optimize those, too.
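The pattern can be sketched roughly like this (CHUNK and chunked_sum_squares are illustrative names; in a real project the chunk size would be tuned to the memory and cache sizes at hand): the data is walked in fixed-size partitions, and each partition is processed completely before the next one is touched.

```c
#include <stddef.h>

#define CHUNK 4096  /* elements per partition; tune to fit memory/cache */

/* Processes n doubles in CHUNK-sized pieces; each piece is handled
 * completely in memory before the next one is started. */
double chunked_sum_squares(const double *src, size_t n) {
    double total = 0.0;
    for (size_t offset = 0; offset < n; offset += CHUNK) {
        size_t len = (n - offset < CHUNK) ? n - offset : CHUNK;
        double block = 0.0;
        for (size_t i = 0; i < len; i++)   /* stays inside one partition */
            block += src[offset + i] * src[offset + i];
        total += block;
    }
    return total;
}
```

With real out-of-core data, the inner loop would operate on a buffer just read from disk instead of a slice of one big array, but the partitioning logic stays the same.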

Intelligent, parallel data loading

The bottlenecks for calculations are the CPU and/or GPU. They need to be utilized, because only they bring relevant results; all other hardware consists of facilities around them. So, do everything to keep the CPUs and/or GPUs busy. It is not a good idea to load all data into memory (and let the CPU/GPU idle), then start a calculation (everything is busy) and store the results afterwards (and have the CPU/GPU idle again). Develop your software with dynamic data loading. While calculations run, new data can be fetched from disk to prepare the next calculations, and the next calculations can run while the former results are written to disk. This may keep one CPU core busy with IO, but the other cores do meaningful work and the overall utilization increases.
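A classic shape for this is double buffering; here is a sketch (the in-memory source array stands in for data on disk, and all names are illustrative): while the main thread computes on one buffer, a loader thread fills the other buffer with the next block.

```c
#include <pthread.h>
#include <string.h>

#define BLOCK 1024
#define BLOCKS 8

static double source[BLOCKS][BLOCK];   /* stands in for data on disk */

void init_source(void) {
    for (int b = 0; b < BLOCKS; b++)
        for (int i = 0; i < BLOCK; i++)
            source[b][i] = 1.0;
}

struct load_job { int block; double *dest; };

/* Loader thread: "reads" the next block while the main thread computes. */
static void *load_block(void *arg) {
    struct load_job *job = arg;
    memcpy(job->dest, source[job->block], sizeof source[job->block]);
    return NULL;
}

double pipelined_sum(void) {
    double buf[2][BLOCK];
    double total = 0.0;
    memcpy(buf[0], source[0], sizeof buf[0]);   /* load the first block up front */
    for (int b = 0; b < BLOCKS; b++) {
        pthread_t loader;
        struct load_job job = { b + 1, buf[(b + 1) % 2] };
        int prefetching = b + 1 < BLOCKS;
        if (prefetching)                         /* fetch next block in parallel */
            pthread_create(&loader, NULL, load_block, &job);
        for (int i = 0; i < BLOCK; i++)          /* compute on the current block */
            total += buf[b % 2][i];
        if (prefetching)
            pthread_join(loader, NULL);          /* next block is ready now */
    }
    return total;
}
```

The two buffers alternate roles each iteration, so the compute loop never touches the buffer the loader is writing; the join is the only synchronization needed.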

Do not do unnecessary things

Have a look at my post about the seven muda to get an impression of the wastes. All of these wastes can be found in software, and they lead to inefficiency. Everything which does not directly contribute to the expected results of the software needs to be questioned. Everything which uses CPU power, memory bandwidth or disk bandwidth, but is not directly connected to the requested calculation, may be treated as potential waste.

As a starting point, look for, check and optimize the following:

Decide early

Decide early when to abort loops, what calculations to do and how to proceed. Some decisions are made at a certain position in the code, but sometimes these checks can be done earlier or before loops, because the information is already present. This is something to check: during refactorings there might be other, more efficient positions for these checks. Look out for them.
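A hypothetical example of what such a moved check looks like: the mode flag below never changes inside the loop, so the decision can be made once before the loop instead of once per iteration.

```c
#include <stddef.h>

/* Late decision: the mode check runs on every iteration,
 * although use_abs never changes inside the loop. */
double sum_slow(const double *v, size_t n, int use_abs) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (use_abs)
            s += v[i] < 0 ? -v[i] : v[i];
        else
            s += v[i];
    }
    return s;
}

/* Early decision: the check is done once, before the loop. */
double sum_fast(const double *v, size_t n, int use_abs) {
    double s = 0.0;
    if (use_abs) {
        for (size_t i = 0; i < n; i++)
            s += v[i] < 0 ? -v[i] : v[i];
    } else {
        for (size_t i = 0; i < n; i++)
            s += v[i];
    }
    return s;
}
```

Modern compilers sometimes hoist such checks themselves, but decisions based on data the compiler cannot see (configuration, model properties) have to be moved by hand.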

Validate economically

Do not check the validity of parameters inside every function. Check the model parameters once at the beginning of the calculation, and do it thoroughly. If these checks are sufficient, there should be no illegal state afterwards related to the input data, so it does not need to be checked permanently.
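Sketched in code (analyze and mean are made-up names for the pattern): one entry point validates the whole input thoroughly, and the inner calculation functions trust the data and do no checking of their own.

```c
#include <stddef.h>
#include <math.h>

/* Inner kernel: assumes its input was already validated,
 * so it carries no checks of its own. */
static double mean(const double *v, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s / n;
}

/* Entry point: validates the model once and thoroughly.
 * Returns 0 on success, -1 on invalid input. */
int analyze(const double *v, size_t n, double *result) {
    if (v == NULL || n == 0)
        return -1;
    for (size_t i = 0; i < n; i++)
        if (isnan(v[i]) || isinf(v[i]))
            return -1;
    *result = mean(v, n);   /* inner code trusts the data from here on */
    return 0;
}
```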

Let it crash

Check the input parameters of functions or methods only if a failure would be fatal (like returning wrong results). Let there be a NullPointerException, IllegalArgumentException or whatsoever if something happens. This is OK; exceptions are meant for situations like that. The calculation can be aborted that way, and the exception can be caught in a higher function to abort the software or the calculation gracefully, but the cost of checking everything permanently is high. On the other side: what will you do when a negative value comes into a square root function with double output, or the matrix dimensions do not fit in a matrix multiplication? There is no meaningful way to proceed but to abort the calculation. Check the input model and everything is fine.

Crash early

Include sanity checks in your calculations. As soon as the calculation is not bringing more precision, runs into a wrong result, produces the first NaN or Inf values or behaves strangely in any way, abort the calculation and let the computer compute something more meaningful. It is a total waste of resources to let a program run which does not do anything meaningful anymore. It is also very social to let other people calculate their stuff in the meantime.
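As an illustration (a simple Newton iteration, not taken from any real project), a sanity check inside the loop stops the calculation the moment it produces NaN or Inf, and a convergence check stops it as soon as no more precision is gained:

```c
#include <math.h>

/* Newton iteration for sqrt(a) with crash-early sanity checks. */
double checked_sqrt(double a, int max_iter) {
    double x = a > 1.0 ? a / 2.0 : 1.0;
    for (int i = 0; i < max_iter; i++) {
        double next = 0.5 * (x + a / x);
        if (isnan(next) || isinf(next))  /* crash early: result is garbage */
            return NAN;
        if (fabs(next - x) < 1e-12)      /* converged: no more precision to gain */
            return next;
        x = next;
    }
    return x;
}
```

For a negative input the iteration diverges immediately, the Inf check fires, and the caller gets a NaN instead of a machine grinding on a meaningless computation.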

Organize data for efficient access

I have seen software which looks up data in arrays element-wise by scanning from the first element to the position where the data is found. This leads to linear time behavior O(n) for the search. On sorted data this can be done with binary search instead, which brings logarithmic time behavior O(log n). Sometimes it is also possible to hold data in memory in a denormalized way to have access to it in different ways: sometimes a mapping is needed from index to data and sometimes the other way around. If memory is not an issue, think about keeping the data in memory twice for optimized access.
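The two search variants side by side (illustrative code; the C standard library's bsearch does the same job as the handwritten binary search):

```c
#include <stddef.h>

/* Linear scan: O(n), works on unsorted data. */
long find_linear(const int *v, size_t n, int key) {
    for (size_t i = 0; i < n; i++)
        if (v[i] == key)
            return (long)i;
    return -1;
}

/* Binary search: O(log n), requires sorted data. */
long find_binary(const int *v, size_t n, int key) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (v[mid] == key)
            return (long)mid;
        if (v[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return -1;
}
```

For a million elements the difference is roughly 500,000 comparisons on average versus about 20, so keeping the data sorted (or indexed twice, as described above) pays off quickly.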


I hope I could show how the focus on efficiency brings the right insights on how to reduce software run times. The correct mind set helps to identify the weak points in software, and the points above should give some directions in which to look for inefficiencies. A starting point is presented, but the way to go is different for every project.

Thoughts on High Performance Computing

During my work as a consultant, I was asked about high performance computing (HPC) and how to implement it. As always, one of the strongest constraints is a tight budget.

In the last years, techniques for HPC changed as the hardware changed. A few years ago, HPC was only possible on computers made of special HPC processors like NEC's vector CPUs, or on a large mainframe installed with thousands of standard CPUs working together at an astonishing speed; sometimes combinations of both were installed. The complexity of programming such machines is massive, and special knowledge about the programming paradigms and the hardware is needed to get optimal results.

Today the situation is a little different due to several factors:

  1. Standard CPUs will not get significantly faster. The physical constraints are reached, and downsizing the chips is not that easy anymore or even impossible. In some dimensions, production specifications are in the range of atoms. As long as we do not want to split atoms, we cannot reduce these dimensions.
  2. Due to the constraints in the point above, CPU architectures change. The most significant change are the multi core processors. Moore's law on speed is extended by multiplying the number of cores in a processor.
  3. The gaming and graphics processing industries have led the computer industry into the development of high performance graphics cards. As it turns out, with some minor constraints, these cards are very well suited for HPC. Even on my "old" nVidia GeForce 8600 GTS, I found 4 multi core processors with 8 cores per processor.

Possibilities for HPC

I do not want to write about special computer hardware and specially designed machines. The standard PC technologies are presented here for customers with small budgets, where the purchase of an HPC server with thousands of cores is not an option.

Therefore, the following possibilities for HPC are available today:

  1. Even if it is an older approach, cluster computing with PVM or MPI is still a valid possibility. In cluster computing, several PCs or servers are interconnected with a standard Ethernet network. The big drawbacks are the latency and the limited bandwidth of the network. If large computations can be run in parallel, where the time consumed by latency and data transfer is much smaller than the computation time, this approach can and should be used. A very prominent example is movie rendering: the scenery information is sent to a client, and the calculation is performed there. Hundreds of clients can share the work and speed up the whole process dramatically.
  2. Multi core and multi processor parallelization on a single machine is a common choice today. The current number of cores in a standard PC is limited to between 2 and 8. Multi core processors with more cores can be expected within the next years, that is for sure. The total speed up of a software is therefore limited by the number of available cores. Even if no HPC is done, the parallelization of software should be a topic, because customers want their machines running as fast as possible and the investment should be used efficiently. For HPC itself it is not a real option, because standard software should use it, too, so there is nothing especially high performance about it.
  3. Real HPC can be done with GPU programming. One constraint of GPUs is the limitation to single precision floating point operations. That is quite OK for the calculation of 3D graphics, but for some scientific calculations it is not good enough. nVidia has met this demand by creating the so-called Tesla cards. These cards contain up to 448 cores with 6 GB RAM and operate in double precision mode. Programmed with nVidia's CUDA framework or in the OpenCL language, high speed ups can be achieved. This is a real low budget HPC solution for many customers.


For a small test with OpenCL, I programmed a small C program which has to perform a simple matrix multiplication. In C, a classical sequential matrix multiplication looks like:
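A sketch of such a classical triple-loop multiplication, assuming quadratic double matrices of size MATRIX_SIZE stored row-major as flat arrays (the exact original listing is not preserved here):

```c
#include <stdlib.h>

#define MATRIX_SIZE 1024

/* c = a * b for quadratic MATRIX_SIZE x MATRIX_SIZE matrices,
 * stored row-major as flat arrays. */
void matrix_multiply(const double *a, const double *b, double *c) {
    for (int i = 0; i < MATRIX_SIZE; i++) {
        for (int j = 0; j < MATRIX_SIZE; j++) {
            double r = 0.0;   /* accumulate the dot product of row i and column j */
            for (int k = 0; k < MATRIX_SIZE; k++)
                r += a[i * MATRIX_SIZE + k] * b[k * MATRIX_SIZE + j];
            c[i * MATRIX_SIZE + j] = r;
        }
    }
}
```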

I assumed here that we have quadratic matrices with a size of MATRIX_SIZE in each direction. For a size of 1024, this algorithm needs about 51.9 seconds on my AMD Opteron 2600.

The same algorithm was implemented in OpenCL. The kernel code looks like:
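A kernel of that shape could be sketched as follows (the kernel name and parameter names are illustrative; single precision float is used, since the GeForce 8600 GTS does not support double precision). Each work item computes one element of the result matrix and accumulates it in the private variable r:

```c
__kernel void matrix_mul(__global const float *a,
                         __global const float *b,
                         __global float *c,
                         const int size) {
    int i = get_global_id(0);   /* row of the result matrix */
    int j = get_global_id(1);   /* column of the result matrix */
    float r = 0.0f;             /* private memory: fast read/write */
    for (int k = 0; k < size; k++)
        r += a[i * size + k] * b[k * size + j];
    c[i * size + j] = r;        /* single write to global memory */
}
```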

The kernel is started on my nVidia GeForce 8600 GTS, after copying the needed matrix data into the graphics card's RAM, with:
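A host-side sketch of such a launch, assuming the queue, kernel and buffers have already been set up with the usual OpenCL boilerplate; a 1024 x 1024 global work size yields the 1,048,576 work items:

```c
/* one work item per element of the result matrix */
size_t global_work_size[2] = { 1024, 1024 };
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    2,                 /* two-dimensional index space */
                                    NULL,              /* no global offset */
                                    global_work_size,
                                    NULL,              /* let the runtime pick the local size */
                                    0, NULL, NULL);
clFinish(queue);                                       /* wait for the computation to finish */
```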

This leads to a start of 1,048,576 threads on 32 cores. The whole operation is finished in roughly 3.3 seconds, which is a total speed up of 15.7.

One of the specialties to be taken into account is that GPU processors are not cached and that, therefore, no cache coherence is to be expected. All processors write directly into RAM, and the host process has to take care of concurrency and avoid it. In the example above, the two index variables for the result matrix are independent, and so are their calculations, so we could create independent threads for these two variables. The third variable is dependent and cannot be parallelized without additional locking mechanisms.

The situation on graphics cards gets much more interesting as soon as we take the different memories into account which exist on a graphics card. In the example above, I used the global memory, which is accessible for reading and writing by all processors, and the private memory, which is private to each core. The private variable r was used due to the fast read and write capabilities of the private memory: it is faster to sum up the result in private memory first and to set the result in global memory later on. We also have a read only memory for constants on the graphics board (read only for the GPU processors, but writable by the host), texture memory and some more…


As shown above, massively parallel GPU programming with OpenCL is a big chance for HPC on a small budget. Taking into account that my graphics card is not state of the art anymore, and considering the performance of the nVidia Tesla cards, HPC is possible for science and research institutes and organizations with strong budget constraints.