
Java: Performance issues with FileInputStream


During my studies of database design with the accompanying toy project DuctileDB, I found some disturbing performance issues.

While implementing the needed log-structured storage, I found that the performance was terribly slow, only about 1/100 of what I expected. A simple check of the source code of FileInputStream revealed the issue: there is absolutely no optimization included.

As is widely known (at least I expect it to be widely known), file systems are based on blocks, and all files are stored cut into these blocks. That is why a hard disk is also called a block device. Block devices are normally read buffered. To get good performance, the read buffer size needs to be at least the block size of the file system.

Surprisingly, at least for me, FileInputStream reads every byte as a single byte in its read() method. After this finding, the poor performance was no surprise anymore.

The first test was to simply decorate the FileInputStream with a BufferedInputStream. The default buffer size of BufferedInputStream is 8192 bytes, or 8 kB. I checked, and for my systems the block size is only 512 bytes. The performance increase was astonishing.
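As a minimal sketch (the file name here is just a placeholder), the decoration looks like this:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadExample {

    public static void main(String[] args) throws IOException {
        // Wrapping the FileInputStream is all that is needed; the second constructor
        // argument sets the buffer size explicitly (the default is 8192 bytes).
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"), 8192)) {
            int b;
            while ((b = in.read()) != -1) {
                // process the single byte b ...
            }
        }
    }
}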

As a trained physicist, I wanted to know it more precisely. So I started some measurements…

Measurement Results

The measurements took place on the systems I had at hand and were performed with a simple JUnit test. The test is part of PureSol Technologies’ Streaming Library; the class is called SequentialReadPerformanceTest.

The test creates a simple 1 MB file and reads it byte by byte. The measurement was performed ten times, and simple statistics were calculated which return a minimum, a maximum and, of course, a mean throughput. For the reading, the buffer size was varied from 0 bytes (a pure FileInputStream) up to 256 kB. In the charts, the 0 bytes buffer is drawn as 0.1 kB, because with the logarithmic scale a 0 could not be shown.
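The following is a rough sketch of such a measurement (not the original SequentialReadPerformanceTest; the file name and the buffer sizes are just examples):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class SequentialReadBenchmark {

    // Reads the whole file byte by byte and returns the throughput in MB/s.
    static double measureThroughput(String file, int bufferSize) throws IOException {
        long start = System.nanoTime();
        long bytes = 0;
        try (InputStream in = bufferSize > 0
                ? new BufferedInputStream(new FileInputStream(file), bufferSize)
                : new FileInputStream(file)) {
            while (in.read() != -1) {
                bytes++;
            }
        }
        double seconds = (System.nanoTime() - start) / 1.0e9;
        return bytes / (1024.0 * 1024.0) / seconds;
    }

    public static void main(String[] args) throws IOException {
        int[] bufferSizes = { 0, 1024, 2048, 4096, 8192, 16384, 32768, 65536 };
        for (int bufferSize : bufferSizes) {
            double sum = 0.0;
            for (int run = 0; run < 10; run++) {
                sum += measureThroughput("testfile-1mb.bin", bufferSize);
            }
            System.out.printf("buffer %6d bytes: mean %.1f MB/s%n", bufferSize, sum / 10);
        }
    }
}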

I had three systems at hand:

Windows 10 on SSD:

BufferSize [kB] minThroughput maxThroughput meanThroughput sigmaThroughput
0 0.478313481960232 0.560977316112137 0.53931803772942 0.026823943405206
1 20.7921956295732 159.661168284835 106.530929541767 41.5452866636502
2 135.890172930229 199.804763831118 159.301353868548 21.9993284291508
4 114.080929031247 185.829790831254 162.480645695998 20.4335777478615
8 119.472762811058 207.246561853822 173.82688161485 22.5238524242766
12 152.112263334608 212.048153787975 174.07568558316 18.3455536745854
16 160.007641966543 206.408535265054 177.837585133614 15.110967892046
24 122.760600929588 200.196611853492 174.685818018924 22.1811746856282
32 130.191143000353 205.116448207889 175.602014380743 23.4884718567785
40 189.830287548959 238.60181506275 217.714568123906 14.6494451455874
48 179.301964860046 257.090941811432 211.133906654611 24.3405587930466
56 180.374229186133 253.724040842617 220.05957106526 22.9934585645487
64 169.576206649008 252.181203292139 214.61583945847 22.28789046186
128 172.901240725885 242.260583721361 214.784464231356 22.0707625968981
256 158.073193513394 235.072835911773 198.832874007776 28.4353501136663

This is my development laptop with Windows 10 installed on an SSD. The file system is NTFS with a block size of 512 bytes.

VirtualBox on SSD:

BufferSize [kB] minThroughput maxThroughput meanThroughput sigmaThroughput
0 3.30897090085418 3.4685246511367 3.37156905146268 0.058287703014056
1 45.8442138958278 204.489511433226 149.196065250552 54.546756256114
2 169.028569641574 213.338498388938 191.078173076755 16.1087752022564
4 175.832342723626 220.551930904206 204.223397588209 14.7572527618542
8 167.40312316305 212.853555391402 200.302253142487 14.2208722932322
12 174.043085529545 221.548785267493 200.891618472135 15.2731165631726
16 178.626872017884 223.56519816547 204.200373947012 15.2758384736513
24 181.218954353416 225.566526761738 206.516876975593 13.137885697191
32 223.296922708024 265.152151154852 245.022329558141 13.661324942683
40 216.037271505894 256.377367956888 240.406641413718 13.4568412728096
48 194.605412997549 258.278920524087 238.66016565359 18.617845389739
56 218.857124728457 268.919769748526 241.338112649669 16.2068776298014
64 215.220983192018 262.003958883976 240.601463050768 14.6838561989163
128 207.162624999802 258.216081521909 237.674988725484 14.2341770113763
256 202.084528616352 262.459082128095 234.696119688756 19.2935202838983

This VirtualBox instance runs on the Windows 10 system mentioned above. As the guest system, Ubuntu 17.10 is installed. The file system is btrfs with a block size of 512 bytes.

KVM on magnetic disks:

BufferSize [kB] minThroughput maxThroughput meanThroughput sigmaThroughput
0 3.50369275983678 3.64641950704795 3.59516953215679 0.042635735226103
1 41.0512347657029 256.14289221404 210.779922949005 78.5887297940257
2 232.373524058612 285.055321462553 263.672741286731 12.7495562706978
4 255.566173924001 272.880340559129 264.701869296095 5.75897219156574
8 256.293085158907 278.836772514138 268.961324482104 6.65276931366453
12 255.882054365566 276.077129595719 267.372542695553 7.72968634110117
16 258.613027035372 275.235002420108 269.684282122623 5.18769137033597
24 171.856825532156 276.072041550241 251.919547286073 27.995974992746
32 252.29570040848 270.142097067889 264.173763360173 5.0979184809125
40 289.992165109599 304.345453637775 295.727761674292 3.72938830994006
48 290.484623991654 299.758266839562 295.794604575198 2.85277669207739
56 291.98249401528 305.188024108252 297.294492472613 3.86242932963834
64 292.089367018439 299.632351890619 294.804240737189 2.2082451746197
128 255.519590730416 302.032878151745 294.733138187711 13.2476788765596
256 285.858694354284 302.92348363624 295.621763218591 5.44402902250886

This is an Ubuntu 16.04 guest inside KVM running on another Ubuntu 16.04 host. The host has magnetic disks running in a RAID1 soft raid with a btrfs file system with a 512 bytes block size. The guest has an XFS file system with a block size of 512 bytes.

Summary and Recommendations

There are several findings in the measurements above:

  1. A FileInputStream without any additional buffer provides terrible performance. Astonishingly, the difference between the performance of a pure FileInputStream and the maximum performance is almost two orders of magnitude. At least for me, this is surprising.
  2. In the virtual environments, the minimum performance of FileInputStream is not as bad as on the ‘bare metal’ Windows 10. I guess it is related to additional buffering between the host and the guest environment. More investigation would help here; I am not sure whether I can do that in the future.
  3. The maximum performance is reached for buffers of 32 kB or 40 kB. As the block size is only 512 bytes, this is surprising for me as well. At the moment, I do not have a concise explanation for that.
  4. The default buffer size of BufferedInputStream of 8 kB already provides a good performance gain.

These observations lead to some recommendations of mine:

  1. As soon as you read from a file in Java, always use a BufferedInputStream to dramatically speed up reading.
  2. As long as the best buffer size is not known, the default buffer size of BufferedInputStream can be used.
  3. I did not do measurements here, but I know from experience that for every FileOutputStream a BufferedOutputStream also provides much better performance. Without better knowledge, we can assume that the throughput speed-ups are similar. Therefore, for every FileOutputStream, put a BufferedOutputStream decorator around it.

If I find some time, I will also try to find out how FileOutputStream actually behaves.

Based on the findings and recommendations, I will develop a special FileOutputStream in PureSol Technologies’ Streaming Library. The buffer size will be configurable via a system property. Additionally, I am also thinking about a mechanism to automatically find the best buffer size during static initialization to maximize performance.
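A possible approach for the system property part could look like the sketch below (the property name and the default value are assumptions of mine, not the final library API):

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ConfigurableBufferedStreams {

    // Hypothetical property name; the final library may use a different one.
    private static final String BUFFER_SIZE_PROPERTY = "streams.buffer.size";
    private static final int DEFAULT_BUFFER_SIZE = 8192;

    static int configuredBufferSize() {
        // Integer.getInteger() reads and parses a system property with a fallback default.
        return Integer.getInteger(BUFFER_SIZE_PROPERTY, DEFAULT_BUFFER_SIZE);
    }

    public static OutputStream createOutputStream(String file) throws IOException {
        return new BufferedOutputStream(new FileOutputStream(file), configuredBufferSize());
    }
}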

So, have a look from time to time at the library and this blog to find more information on this topic and news on the progress of the library.

Hadoop Client in WildFly – A Difficult Marriage

(This article was triggered by a question “Hadoop Jersey conflicts with Wildfly resteasy” on StackOverflow, because I hit the same wall…)

For a current project, I am evaluating the usage of Hadoop 2.7.1 for handling data. The current idea is to use Hadoop’s HDFS, HBase and Spark to handle a bigger amount of data (1 TB range, no real Big Data).

The current demonstrator implementation uses Cassandra and Titan as databases. Due to some developments around Cassandra and Titan (Aurelius was acquired by DataStax), the stack does not seem to be future-proof. An alternative is to be evaluated.

The first goal is to use the Hadoop client (the org.apache.hadoop:hadoop-client Maven artifact) in WildFly 9.0.1. (The content of this article should also be valid for WildFly >= 8.1.0.) HDFS is to be used first to store and retrieve raw files.

Setting up Hadoop in pseudo-distributed mode as described in “Hadoop: The Definitive Guide” was a breeze. I was full of hope, added the dependency above to an EJB Maven module and wanted to use the client to connect to HDFS to store and retrieve single files. This is where the problems started…

I cannot provide all stack traces and error messages anymore, but roughly, this is what happened (one after another; when I removed one obstacle, the next came up):

  • Duplicate providers for Weld were brought up as errors due to multiple providers in the Hadoop client. Several JARs are loaded as implicit bean archives, because Java EE annotations are included. I did not expect that, and it seems strange to have this in a client library which is mainly used in a Java SE context.
  • The client dependency is not self-contained. At compile time, issues arose due to missing libraries.
  • The client libraries contain dependencies which provide web applications. These applications are also loaded, and WildFly tries to initialize them, but fails due to missing libraries which are set to provided, but not included in WildFly (but maybe in Geronimo?). Again, I am puzzled why something like that is packaged in a client library.
  • Due to providers delivered in sub-dependencies of the Hadoop client, the JSON provider was switched from Jackson 2 (the default since WildFly 8.1.0) back to Jackson 1, causing infinite recursions in the trees I need to marshal into JSON, because the com.fasterxml.jackson.* annotations were not recognized anymore and the org.codehaus.jackson.* annotations were not provided.

The issues are manifold and are caused by a very strange, not to say messy, packaging of the Hadoop client.

The following are the solutions so far:

Broken implicit bean archives

Several JARs contain Java EE annotations, which leads to implicit bean archive loading (see: http://weld.cdi-spec.org/documentation/#4). Implicit bean archive support needs to be switched off.

For WildFly, the Weld settings in the standalone configuration need to be changed so that bean descriptors are required (the Weld subsystem provides a require-bean-descriptor attribute for this). This switches the implicit bean archive handling off. All libraries used for CDI need to have a /META-INF/beans.xml file now. (After switching off the implicit archives, I found a lot of libraries with missing beans.xml files.)

Missing dependencies

I added additional dependencies to the module to fix the compile/linkage issues.

Services provided, but not working

After switching off the implicit bean archives and adding new dependencies to get the project compiled, I ran into issues during deployment. Mostly, the issues were missing runtime dependencies due to missing injection providers.

The first goal was to shut off all the (hopefully) not needed stuff which was disturbing. I excluded the Hadoop MapReduce Client App and JobClient artifacts (no idea what these are for). Additionally, I excluded the Jackson dependencies, because they are already provided in the WildFly container.

Broken JSON marshalling in RestEasy

After all the fixes above, the project compiled and deployed successfully. During testing, I found that JSON marshalling was broken due to infinite recursions during the marshalling of my file trees. It drove me crazy to find the cause. I was almost sure that WildFly 9 had switched the default Jackson implementation back to Jackson 1, but I did not find any release note for that. After a long while and some good luck, I found the YarnJacksonJaxbJsonProvider class, which forces the container to use Jackson 1 instead of Jackson 2, messing up my application…

That was the final point where I decided (maybe too late) that I need a kind of galvanic isolation: the Hadoop client and WildFly need to talk through a proxy of some kind, not sharing any dependencies except one common interface.

Current Solution

I have now created one Hadoop connector EAR archive which contains the above-mentioned and fixed Hadoop client dependencies. Additionally, I created a remote EJB and added it to the EAR; it provides the proxy to use Hadoop. The proxy implements a remote interface which is also used by the client. The client performs a lookup on the remote interface of the EJB. That setup seems to work so far…
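A rough sketch of this setup is shown below; the interface, the bean and the HDFS URI are placeholders of mine and not the actual project code:

import java.io.IOException;
import java.net.URI;
import javax.ejb.Remote;
import javax.ejb.Stateless;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// In the real setup, this interface lives in a separate, shared API module,
// so that clients do not need any Hadoop dependencies.
@Remote
interface HadoopConnector {
    void storeFile(String path, byte[] content) throws IOException;
    byte[] retrieveFile(String path) throws IOException;
}

// Remote EJB inside the Hadoop connector EAR; it is the only place touching the Hadoop client.
@Stateless
public class HadoopConnectorBean implements HadoopConnector {

    private static final URI HDFS_URI = URI.create("hdfs://localhost:9000"); // assumed address

    @Override
    public void storeFile(String path, byte[] content) throws IOException {
        try (FileSystem fs = FileSystem.get(HDFS_URI, new Configuration());
             FSDataOutputStream out = fs.create(new Path(path))) {
            out.write(content);
        }
    }

    @Override
    public byte[] retrieveFile(String path) throws IOException {
        try (FileSystem fs = FileSystem.get(HDFS_URI, new Configuration());
             FSDataInputStream in = fs.open(new Path(path))) {
            byte[] content = new byte[(int) fs.getFileStatus(new Path(path)).getLen()];
            in.readFully(0, content);
            return content;
        }
    }
}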

The only drawback in this scenario at the moment is that I cannot use streams through the EJB, because streams cannot be serialized. I am thinking about creating a REST interface for Hadoop, but I have no idea about the performance. Additionally, will the integration of HBase be as difficult as this!?

For the next versions, maybe a fix will be in place. I found the Jira ticket HDFS-2261 “AOP unit tests are not getting compiled or run”.

JavaEE: Arquillian Tests support Multi-WAR-EARs

Before version 1.0.2, Arquillian did not support EAR deployments which contained multiple WAR files. The issue was the selection of the WAR into which the Arquillian artifacts are to be placed for testing. The trick until then was to remove all WAR files which were not needed and to leave only one WAR within the EAR.

Starting from version 1.0.2, Arquillian does support multi-WAR EAR deployments, as described in http://arquillian.org/blog/2012/07/25/arquillian-core-1-0-2-Final.

If you have an EAR which is used as the test deployment, you can select the WAR to be tested as shown in the sketch below.
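A minimal sketch (archive names are placeholders of mine; the relevant call is Testable.archiveToTest, which marks the WAR to be enriched):

import org.jboss.arquillian.container.test.api.Deployment;
import org.jboss.arquillian.container.test.api.Testable;
import org.jboss.shrinkwrap.api.ShrinkWrap;
import org.jboss.shrinkwrap.api.spec.EnterpriseArchive;
import org.jboss.shrinkwrap.api.spec.WebArchive;

public class MultiWarDeploymentExample {

    @Deployment
    public static EnterpriseArchive createDeployment() {
        WebArchive warUnderTest = ShrinkWrap.create(WebArchive.class, "ui.war");
        WebArchive otherWar = ShrinkWrap.create(WebArchive.class, "admin.war");

        return ShrinkWrap.create(EnterpriseArchive.class, "application.ear")
                // Testable.archiveToTest() marks the WAR into which Arquillian
                // places its test enrichment.
                .addAsModule(Testable.archiveToTest(warUnderTest))
                .addAsModule(otherWar);
    }
}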

That’s it. After this selection Arquillian stops complaining about multiple WAR files within the EAR and the selected WAR is enriched and tested.

JavaEE WebSockets and Periodic Message Delivery


For a project, I needed to implement monitoring functionality based on HTML5 and WebSockets. It is quite trivial with Java EE 7, as I will explain below.

Let us assume the easy requirement of a simple monitoring view which sends periodic status information to web clients. The web client shall show the information on a web page (inside a <div>…</div>, for instance). For that scenario, the technical details are shown below…

The JavaScript code is quite easy and can be taken from JavaScript WebSocket books and tutorials. (A good introduction is, for instance, Java WebSocket Programming by Oracle Press.) A simple client needs only a handful of lines.

The handlers for onopen, onclose and onerror are neglected here, because we want to focus on Java EE. The important part on the client side is: we connect with new WebSocket to the URL which shall provide the periodic updates, and in onmessage we put the received data somewhere into our web page. That’s it from the client side.

For Java EE, there is a lot of documentation which shows how to create @ServerEndpoint classes.

But how can such an endpoint be made to send periodic messages easily? After some testing on WildFly 8.2, I came to a simple solution.
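The following is a minimal sketch of that solution (endpoint path, class name and schedule interval are placeholders of mine):

import java.io.IOException;
import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.websocket.OnClose;
import javax.websocket.OnOpen;
import javax.websocket.Session;
import javax.websocket.server.ServerEndpoint;

@Singleton
@ServerEndpoint("/monitoring")
public class MonitoringEndpoint {

    private Session session = null;

    @OnOpen
    public void onOpen(Session session) {
        this.session = session;
    }

    @OnClose
    public void onClose(Session session) {
        this.session = null;
    }

    // Runs every ten seconds on the very same instance that holds the session.
    @Schedule(hour = "*", minute = "*", second = "*/10", persistent = false)
    public void sendStatus() {
        if (session != null && session.isOpen()) {
            try {
                session.getBasicRemote().sendText("current status ...");
            } catch (IOException e) {
                // Log and ignore; the next scheduled run will try again.
            }
        }
    }
}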

The trick is to make the @ServerEndpoint class also an EJB @Singleton. The @Singleton assures that only one instance is living at a time, and this instance can also keep the session provided during @OnOpen. In other words: the actual server endpoint instance is exactly the same one the scheduler is running on. If it were not a @Singleton, multiple instances would or might exist, the session field would not be set in the @Schedule method, and this might lead to a NullPointerException if not checked for.

Can Programs Be Made Faster?

Short answer: no. But they can be made more efficient.

A happy new year to all of you! This is the first post in 2014, and it is a (not so) short post about a topic which follows me all the time during discussions about high performance computing. During discussions and in projects, I get asked how programs can be written to run faster. The problem is that this mindset is misleading. It always takes me some minutes to explain the correct mindset: programs cannot run faster, but they can run more efficiently and thereby save time.

If we neglect that we can scale vertically by using faster CPUs, faster memory and faster disks, the speed of a computer is constant (also neglecting CPUs which change their speed to save power). All programs always run at the same speed, and we cannot do anything to speed them up just by changing the programming. What we can do is use the hardware we have as efficiently as possible. The effect is: we get more done in less time. This reduces the program run time, and the software seems to run faster. That is what people mean, but looking at efficiency brings the mindset needed to find the correct leverage points for decreasing run time.

As soon as a program returns the correct results, it is effective, but there is also the efficiency to be looked at. Have a look at my post about effectiveness and efficiency for more details about the difference. To gain efficiency, we can do the following:

Use all hardware available

All cores of a multi-core CPU can be utilized, as well as all CPUs of the system if there is more than one CPU in the system. GPUs or physics accelerator cards can be used for calculations if present.

Especially in brownfield projects, where the original code comes from single-core systems (before 2005 or so) or from systems which did not have appropriate GPUs (before 2009), developers did not pay attention to multi-threaded, heterogeneous programming. These programs have a lot of potential for performance gains.

Look out for:

CPU utilization

Introduce multi-threaded programming into your software. Check the CPU utilization during an actual run and look for CPU idle times. If there are any, check whether your software can do something useful at the times the idle periods occur.

GPU utilization

Introduce OpenCL or CUDA into your software to utilize the GPU board or physics accelerator cards if present. Check the utilization of the cards during calculation and look for optimizations.

Data partitioning for optimal hardware utilization

If a calculation does not need too much data, everything should be loaded into memory to have the data present for efficient access. Data can also be organized to allow access in different modes for the sake of efficiency. But if there are calculations with amounts of data which do not fit into memory, a good strategy is needed to avoid performing calculations on disk.

The data should be partitioned into smaller pieces. These pieces should fit into memory, and the calculations on these pieces should run completely in memory. The bandwidth from CPU to memory is about 100 to 1000 times higher than from CPU to disk. Once you have done this, check with tools for cache misses and see whether you can optimize those as well.

Intelligent, parallel data loading

The bottleneck for calculations is the CPU and/or GPU. They need to be utilized, because only they produce relevant results; all other hardware is just facilities around them. So, do everything to keep the CPUs and/or GPUs busy. It is not a good idea to load all data into memory (and let the CPU/GPU idle), then start a calculation (everything is busy), and store the results afterwards (with the CPU/GPU idle again). Develop your software with dynamic data loading. While calculations run, new data can be fetched from disk to prepare the next calculations, and the next calculations can run while the former results are written to disk. This may keep one CPU core busy with IO, but the other cores do meaningful work and the overall utilization increases. A sketch of this idea is shown below.
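A minimal sketch of this kind of pipelining (the chunk count, queue capacity and data type are arbitrary choices of mine):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PipelinedProcessing {

    private static final int CHUNKS = 100;

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<double[]> loadedChunks = new ArrayBlockingQueue<>(4);
        ExecutorService io = Executors.newSingleThreadExecutor();

        // One thread is busy with IO: it keeps pre-loading chunks while the others calculate.
        io.submit(() -> {
            try {
                for (int chunk = 0; chunk < CHUNKS; chunk++) {
                    loadedChunks.put(loadChunkFromDisk(chunk)); // blocks when the queue is full
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // The calculation thread always finds the next chunk already waiting in memory.
        for (int chunk = 0; chunk < CHUNKS; chunk++) {
            calculate(loadedChunks.take());
        }
        io.shutdown();
    }

    private static double[] loadChunkFromDisk(int chunk) {
        return new double[1024]; // placeholder for the real IO
    }

    private static void calculate(double[] data) {
        // placeholder for the real, CPU-heavy work
    }
}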

Do not do unnecessary things

Have a look at my post about the seven muda to get an impression of the different wastes. All these wastes can be found in software, and they lead to inefficiency. Everything which does not directly contribute to the expected results of the software needs to be questioned. Everything which uses CPU power, memory bandwidth or disk bandwidth, but is not directly connected to the requested calculation, may be treated as potential waste.

As a starter, look for, check and optimize the following:

Decide early

Decide early when to abort loops, which calculations to do and how to proceed. Some decisions are made at a certain position in the code, but sometimes these checks can be done earlier or before loops, because the information is already present. This is something to be checked. During refactorings, there might be other, more efficient positions for these checks. Look out for them.

Validate economically

Do not check the validity of your parameters in every function. Check the model parameters once at the beginning of the calculations, and do it thoroughly. If these checks are sufficient, there should be no illegal state related to the input data afterwards, so it does not need to be checked permanently.

Let it crash

Only check the input parameters of functions or methods if a failure would be fatal (like returning wrong results). Let there be a NullPointerException, an IllegalArgumentException or whatever else if something happens. This is OK; exceptions are meant for situations like that. The calculation can be aborted that way, and the exception can be caught in a higher function to abort the software or the calculation gracefully, but the cost of checking everything permanently is high. On the other side: what will you do when a negative value comes into a square root function with double output, or the matrix dimensions do not fit in a matrix multiplication? There is no meaningful way to proceed but to abort the calculation. Check the input model and everything is fine.

Crash early

Include sanity checks in your calculations. As soon as the calculation is not gaining more precision, runs towards a wrong result, produces the first NaN or Inf values or behaves strangely in any way, abort the calculation and let the computer compute something more meaningful. It is a total waste of resources to let a program run which does not do anything meaningful anymore. It is also very social to let other people calculate their stuff in the meantime.

Organize data for efficient access

I have seen software which looks up data in arrays element-wise by scanning from the first element to the position where the data is found. This leads to linear time behavior O(n) for the search. With a binary search on sorted data, for instance, the same lookup has logarithmic time behavior O(log(n)). Sometimes it is also possible to hold data in memory in a non-normalized way to access it in different ways. Sometimes a mapping is needed from index to data and sometimes the other way around. If memory is not an issue, think about keeping the data in memory twice for optimized access.
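A small Java illustration of the difference (the array content is arbitrary):

import java.util.Arrays;

public class LookupExample {

    public static void main(String[] args) {
        int[] sortedIds = { 3, 7, 12, 25, 31, 44, 58, 60 };

        // Linear scan: O(n) comparisons in the worst case.
        int linearIndex = -1;
        for (int i = 0; i < sortedIds.length; i++) {
            if (sortedIds[i] == 44) {
                linearIndex = i;
                break;
            }
        }

        // Binary search on the sorted array: O(log n) comparisons.
        int binaryIndex = Arrays.binarySearch(sortedIds, 44);

        System.out.println(linearIndex + " == " + binaryIndex);
    }
}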

Conclusion

I hope I could show how a focus on efficiency can bring the right insights on how to reduce software run times. The correct mindset helps to identify the weak points in software, and the points above should give some directions where to look in software to find inefficiencies. A starting point is presented, but the way to go is different for every project.

Software Development and Licensing Issues: The License Maven Plugin

There are some common pitfalls in software development. Most can be overcome by the developers themselves, but when it comes to software licensing, they are completely lost (at least all developers I have met). For me, it is the same: I tried to find a lawyer who would know about this, but I have not found one yet. I still have some questions which I cannot get answered professionally, but some websites and books give at least some information and hints…

One good source is the book “Understanding Open Source and Free Software Licensing” (http://shop.oreilly.com/product/9780596005818.do). Here, information is presented on which licenses can be used to produce open source software and proprietary software and which are to be avoided. Terms like copyright and copyleft are explained. But do you always know which licenses are used by the libraries you use and link against? Are you sure that the licenses of these libraries are declared correctly? Could it be that a library licensed under the Apache License uses a GPL licensed library? In that case, the library is effectively copyleft as well… 🙁

At least for Java in combination with Maven, there is a possibility to check this automatically, and it is advised to do so regularly within the build system. Maven artifacts should contain a license hint in their POMs about the used licenses, and their dependencies should contain the same hint. So it is possible to check the licenses transitively for all dependencies in your project.

For an easy implementation, have a look at https://github.com/RickRainerLudwig/license-maven-plugin and http://oss.puresol-technologies.com/license-maven-plugin.

Quality Assurance in Software Development from its Root

I looked at the V-Model of software development again a couple of days ago during my holidays. (For example, have a look at Wikipedia: http://en.wikipedia.org/wiki/V-Model_%28software_development%29.) The model in general is very helpful to explain the different levels of testing and their meaning. It makes a difference whether I use unit testing to check the different functions of the software or integration testing to check the system as a whole and whether the different parts work together. From a functional point of view, the V-Model is a good model, but it lacks the most basic building block of software: the source code itself.

Let us take car manufacturing as an example. The V-Model tells us to specify the basic requirements, like: the car needs to be able to go from A to B with four passengers and a big trunk. We then specify the system design, like: it is a passenger car with a certain size and shape, and we go down via the engine design to details like the different parts of the engine. We take the requirements and design everything from top to bottom by separating out more and smaller building blocks, which are separated again until we end up with the most basic building blocks like screws, nuts and bolts. In production, testing goes the other way around. First, the screws, nuts and bolts are checked for their correct size before they are put into an engine, for example. The engine is tested before it is put into the car, and so on.

But when we look at car manufacturing, there is something still missing: the part where engineers think about the maintainability of the car. They already think about maintainability during the design phase of new car models. It would be a catastrophe if a mechanic had to disassemble the whole car just to change a front light bulb. (In my car, I need to work my way to the front light starting from the front wheel! So they did not do a good job here…)

In software development, it is the same. We think about the correct behavior of the code, but we forget that we need to maintain the code in the future and that we also need to develop it further. So maintainability is a big deal, too. Here we need to pay more attention to software architecture, design and source code quality in order not to block the development of further functionality. The need becomes even more obvious when we consider that the source code of the current product is the base for all following products.

I18n and L10n in current software developments

When I have a look at current software developments and keep in mind the ongoing process of a world becoming smaller and smaller, I start to wonder why there is not more attention drawn to internationalization (I18n) and localization (L10n).

Why should software be internationalized?

As long as there are only engineers and managers using my software, it might be OK to assume that my potential users speak English. On the contrary, my experience is that especially in Asian countries managers and engineers often do not speak English very well. Some of these countries have not been open for global business long enough, so their ‘older’ managers and engineers had no chance to learn and practice English over a long time. Young managers and engineers can be expected to be better at speaking English, but as soon as one knows the significant differences between eastern and western languages, one knows that there will still be a significant number of people not speaking English very well. I know from experience what it means to learn such a language, having learned Vietnamese myself.

It is even worse for software used by operators and fabrication staff. Imagine an MES system not localized to the language of the country. Even in Germany, operators are not well educated in English. Should I prefer an English-speaking MES system or a German-speaking one? If the price, the functionality and the performance are almost equal, I prefer a German-speaking system. If I were purchasing the software for an internationally operating company, I would invest in multilingual software.

Marketing requirements

If I develop software for a global market, I need to pay attention to meeting the demands of as many people as possible to increase the number of potential customers and therefore my sales. One of the easiest things to do is to present cutting-edge software to customers in their mother tongue, with physical units and time presented in the way the customers are used to. If I worked in Vietnam, I would like to be able to switch my computer from Vietnamese to English (or even better: German). I would have a chance to use my computer in a more efficient way, to make fewer mistakes and to deal with the software in a more reliable way. I guess this is also valid for a lot of other people.

Quality Management

“Usability” is also a quality requirement. Have a look at the PureSol Technologies website for some more information. The positively influenced characteristics are “understandability”, “learnability” and “operability”. For quality-focused software projects, this is also a factor to take into account.

Simple Conclusion: Do it

In current software development, we should always start projects with an additional focus on I18n and L10n. The additional time and work needed to enable software to change the language and the output units is not that high. Using good frameworks, it is very easy to achieve.

A simple framework example: I18n4Java

Let’s have a look at I18n4Java. It is a small framework I created for my own applications. It is stable and usable, but not very user-friendly yet. The L10n functionality is taken from the Java API, which is very good and reliable. Only the suggestion for the I18n implementation I did not like. The Sun recommendation can be found at http://download.oracle.com/javase/tutorial/i18n. The basic idea is: put all messages into property files and switch between these files depending on the language to be displayed.

This approach has some issues:

  1. The message to be displayed is separated from the source code. I need to look into additional files to find out what is actually displayed, and when I need to change messages, I need to open additional files.
  2. The situation is worse as soon as I need messages with parameters. Without seeing the actual message, I do not know in which order I need to put the parameters into the message.
  3. I need extra code for opening the property files, additional error handling for loading errors and extra code to read the messages.

These limitations are overcome by I18n4Java. Additionally, it is very easy to use in code.

The only thing to do is to add a translator to each class which needs I18n; in each class, one just needs to create a Translator object. Afterwards, within the class, I only need to pass every message to be displayed to this translator, which returns the translated message. The message is then automatically translated into the selected language, provided a translation file is found for that language.
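A sketch of how this usage pattern typically looks is shown below; the method names are my assumptions and may differ from the actual I18n4Java API:

public class ReportGenerator {

    // One translator per class; the actual factory method in I18n4Java may be named differently.
    private static final Translator translator = Translator.getTranslator(ReportGenerator.class);

    public void printSummary(int count) {
        // The original English message stays visible in the source code; the translator
        // returns the localized text (with the parameter filled in) for the configured locale.
        System.out.println(translator.i18n("Processed {0} records.", count));
    }
}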

That is all that needs to be done to enable I18n in software programmed in Java. The rest is a simple configuration file to set the needed language. I18n4Java automatically sets itself to the current OS locale and tries to translate everything into the language the OS is set to.

Thoughts on High Performance Computing

During my work as a consultant, I was asked about high performance computing (HPC) and how to implement it. As always, one of the strongest constraints is a tight budget.

In the last years, the techniques for HPC changed as the hardware changed. Several years ago, HPC was only possible on computers made of special HPC processors like NEC’s vector CPUs, or a large mainframe was installed with thousands of standard CPUs working together at an astonishing speed. Sometimes, combinations of both were installed. The complexity of programming such machines is massive, and special knowledge about the programming paradigms and the hardware is needed to get optimal results.

Today the situation is a little different due to several factors:

  1. Standard CPUs will not get significantly faster. The physical constraints are reached, and shrinking the chips is not that easy anymore, or even impossible. In some dimensions, production specifications are in the range of a few atoms. As long as we do not want to split atoms, we cannot reduce those dimensions further.
  2. Due to the constraints in the point above, CPU architectures change. The most significant change is the multi-core processor. Moore’s law on speed is extended by multiplying the number of cores in a processor.
  3. The gaming industry and the graphics processing industry have led the computer industry into the development of high performance graphics cards. As it turns out, with some minor constraints, these cards are very well suited for HPC. Even on my “old” nVidia GeForce 8600 GTS, I found 4 multi-core processors with 8 cores per processor.

Possibilities for HPC

I do not want to write about special computer hardware and specially designed machines. The standard PC technologies are presented here for customers with small budgets, where the purchase of an HPC server with thousands of cores is not an option.

Therefore, the following possibilities for HPC are available today:

  1. Even if it is an older approach, cluster computing with PVM or MPI is still a valid possibility. In cluster computing, several PCs or servers are interconnected with a standard Ethernet network. The big drawbacks are the latency and the speed of the network. If large computations can be run in parallel such that the time spent on latency and bandwidth is much smaller than the computation time, this approach can and should be used. A very prominent example is movie rendering. The scenery information is sent to a client, and the calculation is performed on the client. Hundreds of clients can share the work and speed up the whole process dramatically.
  2. Multi-core and multi-processor parallelization is a common choice today. The current number of cores in a standard PC is limited to 2 to 8. Multi-core processors with more cores can be expected within the next years, that is for sure. The total speed-up of a piece of software is therefore limited by the number of available cores. Even if no HPC is done, the parallelization of software should be a topic, because customers want their machines to run as fast as possible and the investment should be used efficiently. For HPC itself it is not a real option, because standard software should use it, too; so there is nothing especially high performance about it.
  3. Real HPC can be done with GPU programming. One constraint of many GPUs is the limitation to single precision floating point operations. That is quite OK for the calculation of 3D graphics, but for some scientific calculations it is not good enough. nVidia has met this demand by creating the so-called Tesla cards. These cards contain up to 448 cores with 6 GB RAM and operate in double precision mode. Programmed with nVidia’s CUDA framework or the OpenCL language, high speed-ups can be achieved. This is a real low-budget HPC solution for many customers.

Example

For a small test with OpenCL, I programmed a small C program which has to perform a simple matrix multiplication. In C, a classical sequential matrix multiplication looks like the following.
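A sketch of such a classical triple loop (a reconstruction, not the original listing; single precision, square matrices of size MATRIX_SIZE):

#include <stddef.h>

#define MATRIX_SIZE 1024

/* Classical sequential matrix multiplication: c = a * b for square matrices. */
void matrix_multiply(const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < MATRIX_SIZE; ++i) {
        for (size_t j = 0; j < MATRIX_SIZE; ++j) {
            float r = 0.0f;
            for (size_t k = 0; k < MATRIX_SIZE; ++k) {
                r += a[i * MATRIX_SIZE + k] * b[k * MATRIX_SIZE + j];
            }
            c[i * MATRIX_SIZE + j] = r;
        }
    }
}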

I assumed here that we have square matrices with a size of MATRIX_SIZE in each direction. For a size of 1024, this algorithm needs about 51.9 seconds on my AMD Opteron 2600.

The same algorithm was implemented in OpenCL. The kernel code looks like the following.
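A sketch of such a kernel (a reconstruction, not the original code; each work item computes one element of the result matrix and sums it up in the private variable r first):

__kernel void matrix_multiply(__global const float *a,
                              __global const float *b,
                              __global float *c,
                              const int size)
{
    const int i = get_global_id(0); /* row index of the result element */
    const int j = get_global_id(1); /* column index of the result element */

    /* Sum in private memory first, write to global memory only once at the end. */
    float r = 0.0f;
    for (int k = 0; k < size; ++k) {
        r += a[i * size + k] * b[k * size + j];
    }
    c[i * size + j] = r;
}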

The kernel is started on my nVidia GeForce 8600 GTS, after copying the needed matrix data into the graphics card RAM, with a call like the one below.
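A sketch of the relevant host call (buffer setup and error handling are omitted, and the variable names are mine):

size_t global_work_size[2] = { MATRIX_SIZE, MATRIX_SIZE };
clEnqueueNDRangeKernel(command_queue, kernel, 2, NULL,
                       global_work_size, NULL, 0, NULL, NULL);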

This leads to a start of 1,048,576 threads which are executed on the 32 cores. The whole operation is finished in roughly 3.3 seconds. This is a total speed-up of 15.7.

One of the specialties to be taken into account is that GPU processors are not cached and that, therefore, no cache coherence is to be expected. All processors write directly into RAM. The host process has to take care of concurrency and avoid it. In the example above, the two index variables for the result matrix are independent, and so is the calculation itself. So we could create independent threads for these two variables. The third variable is dependent and cannot be parallelized without additional locking mechanisms.

The situation on graphics cards becomes much more interesting as soon as we take into account the different kinds of memory which exist on a graphics card. In the example above, I used the global memory, which is accessible for reading and writing by all processors, and the private memory, which is private to each core. The private variable r was used because of the fast read and write capabilities of the private memory. It is faster to sum up the result in private memory first and to set the result in global memory later on. We also have a read-only memory for constants on the graphics boards (read-only for the GPU processors, but writable by the host), texture memory and some more…

Conclusion

As shown above, massively parallel GPU programming with OpenCL is a big chance for HPC on a small budget. Taking into account that my graphics card is not state of the art anymore, and considering the performance of the nVidia Tesla cards, HPC is possible for science and research institutes and organizations with strong budget constraints.

The Curse of GOTO

When I started to learn programming professionally in 1991 at the Schülerrechenzentrum Dresden, I was always told: “Do not use GOTO statements! They are never needed, because other possibilities are available. GOTOs disturb the program flow, the understanding of the program becomes more difficult, and the program is not well structured anymore.” I had started to program some years before that on a C64, where there was practically no other way than using GOTOs, and therefore this advice sounded strange to me at first.

At the Schülerrechenzentrum Dresden, I learned to program Turbo Pascal and was taught never ever to use GOTOs, but to think about a good program flow and to implement decisions and jumps with IF statements and calls of functions and procedures. Over the years I always followed this advice. From time to time I started thinking about their usage again, but I never found a situation where a GOTO statement would be a better choice than the other possibilities of implementation.

Following are some facts and thoughts on why not to use GOTOs.

Complexity

GOTOs increase a source code’s complexity, or at least seem to increase it. If there is a label in the source, one never knows exactly which GOTOs jump there, how many of them there are, under which circumstances they are triggered and where they are located. If there is a GOTO around, the search for the corresponding label starts. The flow is not obvious anymore, and a label is not always distinct.

In the name of complexity, GOTOs should be replaced by IF statements and function calls. These are easy to understand, and with the right indentation and naming it is quite obvious what the program does and what the intentions are.

Maintainable Code and Human Understanding

Try to understand code which is written with GOTOs. You always have to find out where the fitting label is, under which circumstances the GOTO is invoked and what the current state of all variables is. It becomes even more difficult in languages where a GOTO is allowed to jump out of the current function into another one, or where the GOTO is allowed to jump into loops from outside. It is very difficult to see and understand what the initial values are after the jump and how the program will proceed. A GOSUB with a RETURN is the same mess.

“Any fool can write code that a computer can understand. Good programmers write code that humans can understand.”
– Martin Fowler –

For maintainable code, source code without such difficulties is needed. Clean source code exposes all its conditions and its program flow. A GOTO messes this up and should therefore never be used.

Refactoring

In Martin Fowler’s book “Refactoring – Improving the Design of Existing Code”, a lot of techniques and patterns are described for improving and developing good code. Refactoring, i.e. improving the code, can only be done if the program flow is obvious and the behavior can be foreseen for any change. If a GOTO is within the area of the refactored code, changing or moving the corresponding label can dramatically change the program’s behavior, and that is not obvious in the first place.

In the name of maintainable code, refactoring should be possible. GOTOs lead to source code which is not easily refactorable, and they should therefore be avoided under any circumstances.

For further information about clean code, have a look at the book Clean Code by Robert C. Martin.