
Java: Performance issues with FileInputStream


While studying database design with my accompanying toy project DuctileDB, I found some disturbing performance issues.

During the implementation of the log-structured storage I needed, I found that the read performance was terribly slow, only about 1/100 of what I expected. A quick check of the source code of FileInputStream revealed the issue: there is absolutely no optimization included.

As is widely known (at least I expect it to be widely known), file systems are based on blocks, and all files are stored split into these blocks. That is why a hard disk is also called a block device. Block devices are normally read buffered. To get good performance, the read buffer size needs to be at least the block size of the file system.

Surprisingly, at least for me, FileInputStream really reads every byte as a single byte in its read() method. After this finding, the poor performance was no surprise anymore.

The first test was to simply decorate the FileInputStream with a BufferedInputStream. The default buffer size of BufferedInputStream is 8192 bytes, or 8 kB. I checked, and for my systems the block size is only 512 bytes. The performance increase was astonishing.
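To illustrate the fix, here is a minimal sketch of the decoration (file name and buffer size are arbitrary examples, not taken from the library):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferedReadExample {

    public static long countBytes(String fileName) throws IOException {
        // Decorate the FileInputStream; the second argument is the buffer size in bytes.
        try (InputStream in = new BufferedInputStream(new FileInputStream(fileName), 8192)) {
            long count = 0;
            while (in.read() >= 0) { // still reading byte by byte, but now from the buffer
                count++;
            }
            return count;
        }
    }
}
```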

As a trained physicist, I wanted to know it more precisely, so I started some measurements…

Measurement Results

The measurements took place on the systems I had at hand and were performed with a simple JUnit test. The test is part of PureSol Technologies’ Streaming Library; the class is called SequentialReadPerformanceTest.

The test creates a simple 1 MB file and reads it byte by byte. The measurement was performed ten times, and simple statistics were calculated: a minimum, a maximum and, of course, a mean throughput. For the reading, the buffer size was varied from 0 bytes (a pure FileInputStream) up to 256 kB. In the charts, the 0 byte buffer is drawn as 0.1 kB, because with the logarithmic scale a 0 could not be shown.
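The actual test class is part of the library; the rough idea behind such a measurement is sketched below (this is not the real SequentialReadPerformanceTest, and the file handling, buffer sizes and statistics are simplified assumptions):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class SequentialReadSketch {

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("sequential-read", ".bin");
        Files.write(file, new byte[1024 * 1024]); // simple 1 MB test file
        int[] bufferSizesKB = { 0, 1, 2, 4, 8, 16, 32, 64, 128, 256 };
        for (int kb : bufferSizesKB) {
            double meanThroughput = 0.0;
            for (int run = 0; run < 10; run++) {
                long start = System.nanoTime();
                try (InputStream in = kb == 0
                        ? new FileInputStream(file.toFile())
                        : new BufferedInputStream(new FileInputStream(file.toFile()), kb * 1024)) {
                    while (in.read() >= 0) {
                        // read byte by byte, as in the measurements above
                    }
                }
                double seconds = (System.nanoTime() - start) / 1e9;
                meanThroughput += (1.0 / seconds) / 10; // MB/s for a 1 MB file, averaged over 10 runs
            }
            System.out.printf("%3d kB buffer: %.1f MB/s%n", kb, meanThroughput);
        }
        Files.delete(file);
    }
}
```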

I had three systems at hand:

Windows 10 on SSD:

BufferSize [kB] minThroughput maxThroughput meanThroughput sigmaThroughput
0 0.478313481960232 0.560977316112137 0.53931803772942 0.026823943405206
1 20.7921956295732 159.661168284835 106.530929541767 41.5452866636502
2 135.890172930229 199.804763831118 159.301353868548 21.9993284291508
4 114.080929031247 185.829790831254 162.480645695998 20.4335777478615
8 119.472762811058 207.246561853822 173.82688161485 22.5238524242766
12 152.112263334608 212.048153787975 174.07568558316 18.3455536745854
16 160.007641966543 206.408535265054 177.837585133614 15.110967892046
24 122.760600929588 200.196611853492 174.685818018924 22.1811746856282
32 130.191143000353 205.116448207889 175.602014380743 23.4884718567785
40 189.830287548959 238.60181506275 217.714568123906 14.6494451455874
48 179.301964860046 257.090941811432 211.133906654611 24.3405587930466
56 180.374229186133 253.724040842617 220.05957106526 22.9934585645487
64 169.576206649008 252.181203292139 214.61583945847 22.28789046186
128 172.901240725885 242.260583721361 214.784464231356 22.0707625968981
256 158.073193513394 235.072835911773 198.832874007776 28.4353501136663

This is my development laptop with Windows 10 installed on an SSD. The file system is NTFS with a block size of 512 bytes.

VirtualBox on SSD:

BufferSize [kB] minThroughput maxThroughput meanThroughput sigmaThroughput
0 3.30897090085418 3.4685246511367 3.37156905146268 0.058287703014056
1 45.8442138958278 204.489511433226 149.196065250552 54.546756256114
2 169.028569641574 213.338498388938 191.078173076755 16.1087752022564
4 175.832342723626 220.551930904206 204.223397588209 14.7572527618542
8 167.40312316305 212.853555391402 200.302253142487 14.2208722932322
12 174.043085529545 221.548785267493 200.891618472135 15.2731165631726
16 178.626872017884 223.56519816547 204.200373947012 15.2758384736513
24 181.218954353416 225.566526761738 206.516876975593 13.137885697191
32 223.296922708024 265.152151154852 245.022329558141 13.661324942683
40 216.037271505894 256.377367956888 240.406641413718 13.4568412728096
48 194.605412997549 258.278920524087 238.66016565359 18.617845389739
56 218.857124728457 268.919769748526 241.338112649669 16.2068776298014
64 215.220983192018 262.003958883976 240.601463050768 14.6838561989163
128 207.162624999802 258.216081521909 237.674988725484 14.2341770113763
256 202.084528616352 262.459082128095 234.696119688756 19.2935202838983

This VirtualBox runs on the Windows 10 system mentioned above. The guest system is Ubuntu 17.10. The file system is btrfs with a block size of 512 bytes.

KVM on magnetic disks:

BufferSize [kB] minThroughput maxThroughput meanThroughput sigmaThroughput
0 3.50369275983678 3.64641950704795 3.59516953215679 0.042635735226103
1 41.0512347657029 256.14289221404 210.779922949005 78.5887297940257
2 232.373524058612 285.055321462553 263.672741286731 12.7495562706978
4 255.566173924001 272.880340559129 264.701869296095 5.75897219156574
8 256.293085158907 278.836772514138 268.961324482104 6.65276931366453
12 255.882054365566 276.077129595719 267.372542695553 7.72968634110117
16 258.613027035372 275.235002420108 269.684282122623 5.18769137033597
24 171.856825532156 276.072041550241 251.919547286073 27.995974992746
32 252.29570040848 270.142097067889 264.173763360173 5.0979184809125
40 289.992165109599 304.345453637775 295.727761674292 3.72938830994006
48 290.484623991654 299.758266839562 295.794604575198 2.85277669207739
56 291.98249401528 305.188024108252 297.294492472613 3.86242932963834
64 292.089367018439 299.632351890619 294.804240737189 2.2082451746197
128 255.519590730416 302.032878151745 294.733138187711 13.2476788765596
256 285.858694354284 302.92348363624 295.621763218591 5.44402902250886

This is an Ubuntu 16.04 guest inside KVM running on another Ubuntu 16.04 host. The host has magnetic disks running in a software RAID1 with a btrfs file system with a 512 byte block size. The guest has an XFS file system, also with a block size of 512 bytes.

Summary and Recommendations

There are several findings in the measurements above:

  1. A FileInputStream without any additional buffer provides terrible performance. Astonishingly, the difference between the performance of a pure FileInputStream and the maximum performance is almost two orders of magnitude. At least for me, this was surprising.
  2. On the virtual environments, the minimum performance of FileInputStream is not as bad as on the ‘bare metal’ Windows 10. I guess it is related to additional buffering between the host and guest environment. More investigation would help here; I am not sure whether I can do that in the future.
  3. The maximum performance is reached with buffers of 32 kB or 40 kB. As the block size is only 512 bytes, this is surprising to me as well. At the moment, I do not have a concise explanation for it.
  4. The default buffer size of BufferedInputStream, 8 kB, already provides a good performance gain.

These observations lead me to some recommendations:

  1. As soon as you read from a file in Java, always wrap the FileInputStream into a BufferedInputStream to dramatically speed up reading.
  2. As long as the best buffer size is not known, the default buffer size of BufferedInputStream can be used.
  3. I did not do measurements here, but I know from experience that a BufferedOutputStream also provides much better performance for every FileOutputStream. Without better knowledge, we can assume that the throughput speed-ups are similar. Therefore, put a BufferedOutputStream decorator around every FileOutputStream, as sketched below.
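Here is the write-side counterpart as a minimal sketch (again, file name and buffer size are arbitrary examples):

```java
import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class BufferedWriteExample {

    public static void writeBytes(String fileName, byte[] data) throws IOException {
        // Decorate the FileOutputStream just like the FileInputStream on the read side.
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream(fileName), 8192)) {
            for (byte b : data) {
                out.write(b); // single-byte writes are collected in the buffer
            }
        }
    }
}
```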

If I find some time, I will also try to find out how FileOutputStream actually behaves.

Based on the findings and recommendations, I will develop a special FileOutputStream in PureSol Technologies’ Streaming Library. The buffer size will be configurable via a system property. Additionally, I am also thinking about a mechanism that automatically finds the best buffer size during static initialization to maximize performance.
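The idea could look roughly like the following sketch; the class and property names are placeholders of mine, not the actual API of the Streaming Library:

```java
import java.io.BufferedOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;

public class ConfigurableBufferedFileOutputStream extends BufferedOutputStream {

    // Hypothetical system property; read once when the class is initialized.
    private static final int BUFFER_SIZE =
            Integer.getInteger("puresol.streaming.buffer.size", 8192);

    public ConfigurableBufferedFileOutputStream(String fileName) throws FileNotFoundException {
        super(new FileOutputStream(fileName), BUFFER_SIZE);
    }
}
```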

So, have a look at the library and this blog from time to time to find more information on this topic and news on the progress of the library.

Can Programs Be Made Faster?

Short answer: No. But they can be made more efficient.

A happy new year to all of you! This is the first post in 2014, and it is a (not so) short post about a topic which follows me all the time in discussions about high performance computing. In discussions and in projects, I get asked how programs can be written to run faster. The problem is that this mindset is misleading. It always takes me some minutes to explain the correct mindset: Programs cannot run faster, but they can run more efficiently and thereby save time.

If we neglect that we can scale vertically by using faster CPUs, faster memory and faster disks, the speed of a computer is constant (also neglecting CPUs which change their speed to save power). All programs always run at the same speed, and we cannot do anything to speed them up just by changing the programming. What we can do is use the hardware we have as efficiently as possible. The effect is: we get more done in less time. This reduces the program run time, and the software seems to run faster. That is what people mean, but looking at efficiency brings the mindset to find the right leverage points to decrease run time.

As soon as a program returns the correct results, it is effective, but there is also the efficiency to be looked at. Have a look at my post about effectiveness and efficiency for more details about the difference between the two. To gain efficiency, we can do the following:

Use all hardware available

All cores of a multi-core CPU can be utilized, and all CPUs of the system if we have more than one. GPUs or physics accelerator cards can be used for calculations if present.

Especially in brown field projects, where the original code comes from single-core systems (before 2005 or so) or systems which did not have appropriate GPUs (before 2009), developers did not pay attention to multi-threaded, heterogeneous programming. These programs have a lot of potential for performance gains.

Look out for:

CPU utilization

Introduce multi-threaded programming into your software. Check the CPU utilization during an actual run and look for CPU idle times. If there are any, check whether your software can do something useful at the times the idle times occur.
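As a minimal Java sketch of spreading independent work units over all available cores (the actual calculation is just a placeholder):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCalculation {

    public static double sumOfResults(List<double[]> workUnits) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService executor = Executors.newFixedThreadPool(cores);
        try {
            // Submit one task per independent work unit to keep all cores busy.
            List<Future<Double>> futures = new ArrayList<>();
            for (double[] unit : workUnits) {
                futures.add(executor.submit(() -> {
                    double sum = 0.0;
                    for (double v : unit) {
                        sum += v * v; // placeholder calculation
                    }
                    return sum;
                }));
            }
            double total = 0.0;
            for (Future<Double> future : futures) {
                total += future.get();
            }
            return total;
        } finally {
            executor.shutdown();
        }
    }
}
```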

GPU utilization

Introduce OpenCL or CUDA into your software to utilize the GPU or physics accelerator cards if present. Check the utilization of the cards during the calculation and look for optimizations.

Data partitioning for optimal hardware utilization

If a calculation does not need too much data, everything should be loaded into memory to have the data present for efficient access. Data can also be organized to allow access in different modes for the sake of efficiency. But if there are calculations with amounts of data which do not fit into memory, a good strategy is needed to avoid performing the calculations on disk.

The data should be partitioned into smaller pieces. These pieces should fit into memory, and the calculations on these pieces should run completely in memory. The bandwidth from CPU to memory is about 100 to 1,000 times higher than from CPU to disk. Once you have done this, check with tools for cache misses and see whether you can optimize those as well.
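A minimal sketch of such partitioning, assuming the pieces are simply fixed-size chunks of a file (the chunk size is an arbitrary choice):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkedProcessing {

    private static final int CHUNK_SIZE = 64 * 1024 * 1024; // 64 MB pieces that fit into memory

    public static void process(String fileName) throws IOException {
        byte[] chunk = new byte[CHUNK_SIZE];
        try (InputStream in = new BufferedInputStream(new FileInputStream(fileName))) {
            int length;
            while ((length = in.read(chunk)) != -1) {
                processInMemory(chunk, length); // all calculations on this piece run in memory
            }
        }
    }

    private static void processInMemory(byte[] data, int length) {
        // placeholder for the actual calculation on the loaded piece
    }
}
```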

Intelligent, parallel data loading

The bottlenecks for calculations are the CPU and/or GPU. They need to be utilized, because only they produce relevant results; all other hardware is just the facilities around them. So, do everything to keep the CPUs and/or GPUs busy. It is not a good idea to first load all data into memory (and let CPU/GPU idle), then start a calculation (everything is busy), and store the results afterwards (with the CPU/GPU idle again). Develop your software with dynamic data loading. While calculations run, new data can be fetched from disk to prepare the next calculations, and the next calculations can run while the former results are written to disk. This may keep one CPU core busy with IO, but the other cores do meaningful work and the overall utilization increases.
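One way to sketch this overlap in Java is a small producer/consumer pipeline, where a dedicated IO thread prefetches the next piece of data while the worker calculates (queue capacity, piece count and data types are arbitrary assumptions):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedLoading {

    private static final double[] POISON_PILL = new double[0];

    public static void run() throws InterruptedException {
        BlockingQueue<double[]> queue = new ArrayBlockingQueue<>(4);

        // IO thread: loads the next pieces of data while calculations are running.
        Thread loader = new Thread(() -> {
            try {
                for (int piece = 0; piece < 100; piece++) {
                    queue.put(loadPieceFromDisk(piece));
                }
                queue.put(POISON_PILL); // signal end of data
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        loader.start();

        // Worker: keeps the CPU busy with calculations on already loaded pieces.
        double[] piece;
        while ((piece = queue.take()) != POISON_PILL) {
            calculate(piece);
        }
        loader.join();
    }

    private static double[] loadPieceFromDisk(int piece) {
        return new double[1024]; // placeholder for real IO
    }

    private static void calculate(double[] piece) {
        // placeholder for the real calculation
    }
}
```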

Do not do unnecessary things

Have a look at my post about the seven muda to get an impression of the different kinds of waste. All of these wastes can be found in software, and they lead to inefficiency. Everything which does not directly contribute to the expected results of the software needs to be questioned. Everything which uses CPU power, memory bandwidth or disk bandwidth, but is not directly connected to the requested calculation, may be treated as potential waste.

As a starter, look for, check and optimize the following:

Decide early

Decide early when to abort loops, what calculations to do and how to proceed. Some decisions are made at a certain position in the code, but sometimes these checks can be done earlier in the code or before loops, because the information is already present. This is something to check for. During refactorings, there might be other, more efficient positions for these checks; look out for them.
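A trivial example: a decision that does not change inside a loop can be made once in front of it.

```java
public class DecideEarly {

    // Before: the mode is checked again for every single element.
    static double sumSlow(double[] values, boolean useAbsoluteValues) {
        double sum = 0.0;
        for (double v : values) {
            if (useAbsoluteValues) {
                sum += Math.abs(v);
            } else {
                sum += v;
            }
        }
        return sum;
    }

    // After: the decision is made once, before the loop starts.
    static double sumFast(double[] values, boolean useAbsoluteValues) {
        double sum = 0.0;
        if (useAbsoluteValues) {
            for (double v : values) {
                sum += Math.abs(v);
            }
        } else {
            for (double v : values) {
                sum += v;
            }
        }
        return sum;
    }
}
```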

Validate economically

Do not check the validity of your parameters in every function. Check the model parameters at the beginning of the calculation, once and thoroughly. If these checks are sufficient, there should be no illegal state related to the input data afterwards, so it does not need to be checked permanently.
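A small sketch of this idea: the input model is validated once at the entry point, and the inner calculation functions rely on it.

```java
public class SquareRootCalculation {

    // Validated once at the entry point of the calculation.
    public static double[] sqrtAll(double[] model) {
        for (double value : model) {
            if (value < 0.0) {
                throw new IllegalArgumentException("Model contains negative value: " + value);
            }
        }
        double[] results = new double[model.length];
        for (int i = 0; i < model.length; i++) {
            results[i] = sqrtUnchecked(model[i]);
        }
        return results;
    }

    // No repeated validation inside the hot calculation path.
    private static double sqrtUnchecked(double value) {
        return Math.sqrt(value);
    }
}
```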

Let it crash

Check input parameters of functions or methods only if a failure there would be fatal (like returning wrong results). Let there be a NullPointerException, IllegalArgumentException or whatever if something happens. This is OK; exceptions are meant for situations like that. The calculation can be aborted that way, and the exception can be caught in a higher function to abort the software or the calculation gracefully, but the cost of checking everything permanently is high. On the other side: what will you do when a negative value comes into a square root function with double output, or the matrix dimensions do not fit in a matrix multiplication? There is no meaningful way to proceed other than aborting the calculation. Check the input model, and everything is fine.

Crash early

Include sanity checks in your calculations. As soon as the calculation is not gaining more precision, runs into a wrong result, produces the first NaN or Inf values, or behaves strangely in any other way, abort the calculation and let the computer compute something more meaningful. It is a total waste of resources to keep a program running which does not do anything meaningful anymore. It is also very social to let other people calculate their stuff in the meantime.
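A minimal sanity check of this kind could look like this:

```java
public class SanityChecks {

    static void checkFinite(double value, int iteration) {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            // Abort immediately instead of burning CPU time on meaningless numbers.
            throw new ArithmeticException(
                    "Calculation diverged in iteration " + iteration + ": " + value);
        }
    }
}
```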

Organize data for efficient access

I have seen software which looks up data in arrays element-wise by scanning from the first element to the position where the data is found. This leads to linear time behavior, O(n), for the search. With binary search, for instance, the same lookup has logarithmic time behavior, O(log n). Sometimes it is also possible to hold data in memory in a denormalized way to have access to it in different modes: sometimes a mapping from index to data is needed, and sometimes the other way around. If memory is not an issue, think about keeping the data in memory twice for optimized access.
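A small sketch of the last point: trading memory for access time by keeping both lookup directions in memory.

```java
import java.util.HashMap;
import java.util.Map;

public class BidirectionalLookup {

    private final Map<Integer, String> dataByIndex = new HashMap<>();
    private final Map<String, Integer> indexByData = new HashMap<>();

    public void put(int index, String data) {
        // The data is intentionally held twice to get O(1) access in both directions.
        dataByIndex.put(index, data);
        indexByData.put(data, index);
    }

    public String dataAt(int index) {
        return dataByIndex.get(index);
    }

    public Integer indexOf(String data) {
        return indexByData.get(data);
    }
}
```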

Conclusion

I hope I could show how a focus on efficiency brings the right insights on how to reduce software run times. The correct mindset helps to identify the weak points in software, and the points above should give some directions for looking into software to find inefficiencies. A starting point is presented, but the way to go is different for every project.