Category Archives: Software Engineering

This posts are about software engineering and computer science topics.

Local Indices on Partitioned Tables

I wrote an article lately about a simple model to understand indices in databases. It may help to read the other article first.

In a data processing project I worked on, the data is put into an Oracle database. The data is queried differently based on questions to be answered. Even worth, the queries are not known upfront, so a quick enough access is to be provided for all kind of usage scenarios.

During the design phase some decisions were made based on the information at the time. As the main point of view onto the data was time series based, the decision was made to use partitioned tables with local indices based on the time stamps. In the aftermath, that was a mistake, which was corrected.

Let’s assume, we want to store two years of data and the partition key is based on the year and week. For instance week 50 in 2017 has  the partition key 1750. Let’s assume we store data now for 2016 and 2017, we need about 104 partitions with partition keys 1601, 1602, …, 1652, 1701, 1702, …1751 and 1752. For the sake of simplicity for the next calculations, we assume we have in total 100 partitions. Each partition has local indices on different fields.

What is wrong with this scenario now? As long as we can directly access the table with partition keys, there is nothing wrong. We can create selects like:

Oracle knows in which partition to look at and which index to use to find the name. But, what happens in case we want to look up the name for the whole time? The query could be something like:

In this case we get terribly slow. But why?

Let’s calculate the time characteristics in Big O Notation:

Let’s assume we have a total number of N samples in the whole time series. In trivial, unpartitioned case, without any additional index, we have the time characteristics:

t(N) = O(N)

If we put an index onto the table which can be utilized, we get a much better time characteristics of:

t(N) = O(log2(N))

That’s actually, why indices are used. The speedup S is :

S(N) = O(N) / O(log2(N)) ~ N / log2(N)

If N is only 1, there is not speedup at all, but for let’s say 1000 samples, the speedup is already about 100. That’s great.

But, what happens in case of partitioning? Let’s assume we have in total P partitions. If we can use the partition key only without any additional index, Oracle just accesses the one partition and we get a behavior of

t(N) = O(N/P)

with a total speedup of

S(N) = O(N) / O (N/P) ~ P

In our example above, this is a speed up of a factor of 100 resulting in a 1/100 of query time, because we only need to access 1/100th of the data. Not bad at all.

The same characteristics we can calculate with an additional index. In this case we get:

t(N) = O(log(N/P))

What happens now in case we search for name over the whole time series? Because, we need to read the whole table again over all partitions with a local index only, this comes down to:

t(N) = O(P * log(N/P))

The problem is, we have to look P times into each partition and can use the local index only. This is P time utilizing an index. Compared with an index on unpartitioned table or with a global index, the speed up is is much smaller that one. In our example above with 100 partitions, this leads to:

S(N) = O(log(N)) / O(100*log( N/100) ~ 1/ 100

The reason is that the partition number increases linearly, but the gain of the performance due to the index is increasing much slower.

This calculations show, that in cases of random access on partitioned tables, global indices are recommend to not slow down the queries.

Simple Model for Understanding Database Indices

I had a lot of discussions in the past about databases and indices. The type of the database is not  important, as the general design of indices are always the same. The following model can be applied to relational databases, document stores, big table databases, columnar stores and also to graph databases.

The following model is used to explain what indices are and why they work so well and especially why there are certain restrictions to them:

  1. Multiple single value indices do not help efficiently in where clauses with AND conditions.
  2. For combined indices, the where clauses need to have certain AND clauses to get fully utilized. Missing some of the clauses may lead into not using the index.
  3. The order of values in combined indices are important, when time series or continuous data is involved. It is shown what the correct order is to get good performance.

These three points, I have to discuss over and over again, as most people are not database specialists and the inner works of databases are not well known. The following simple model shall provide some insights.

Storing Data in General

Storing data itself is normally quite simple, just put the stuff to disk. I know, it is not meaningful, but as starting point, let’s assume it for now. It is also not important whether the data is put onto disk, into a storage like Hadoop or something else.

For the different databases we can assume we have basic data structures like

  1. Relational Database: Row = Map<ColumnName, Value>
  2. Big Table Database: ColumnGroup = Map<Key, Value>
  3. Document Store: Document = <any data structure like JSON>
  4. Graph Database: Vertex, Edge = <any structure>

Important for now is, we have different kinds of data structures, but it is not important for now, how they look like. We just put them onto the disk as a simple binary file.

If we do this, there is only one thing missing: We can’t distinguish them, that’s why we define a data store with Object IDs:

DataStore = Tuple<ID, Object>

We have an object ID which is unique defining a certain Object. We can read now the store Object by Object and can find the correct one by comparing the ID.

Primary Index

With the store above, we can store any data and can retrieve it via the unique ID. The issue is, to find the related Object, we need to read the whole store. This leads to a time characteristics in Big O Notation with a total number of N samples of:


This is a poor characteristics. This can be improved dramatically by adding a primary index. This primary index maps every ID to the storage position of the Object:

PrimaryIndex = Map<ID, DataFileAndPosition>

Do not mistake this primary index with a primary key from a relational database. Here we talk about a technical index to retrieve data from disk efficiently. The primary key in a relational database is logical construct on the database content for data organization and modelling.

As soon as we have a primary index in place, the time characteristics changes to:

t(N) = O(log(N))

We assume here, that the index is ordered by the IDs and searchable with a binary search.

The speedup S is:

S(N) = O(N) / O(log(N)) ~ N / log(N)

For N = 1000 is around 100 and for N = 1,000,000 it is around 50,000.

For a simple key values store, this would be good enough. But, what do we do if we look for a specific Object without knowing its ID? We still have to search the whole data store.

Single Value Secondary Indices

To find objects with certain attributes, values or properties (the name depends on the type of database), indices are used. For this to work, we need to have the possibility to extract this Values out of the stored Object:

Value = Object(Name)

The Name in a relational database is for instance the column name and for a graph database a property name. To find the value, we need to have an index like:

Index = Map<Value, Set<ID>>

What about the time characteristics? It was

t(N) = O(N)

before. Now, we can do an index lookup first, which has a time characteristics like the primary index:

t(N) = O(log(N))

This is pretty good and with the additional time characteristics of the primary index, we get an overall time characteristic of:

t(N) = O(log(N)) + O(log(N))= 2 * O(log(N)) = O(log(N))

That’s great, isn’t it? Again, we assume, we can access the Value with a binary search.

Definition of Cardinality and Selectivity

For a better understanding of the next chapters and for the sake of simplicity, we need to explain two terms: Cardinality and Selectivity.

Cardinality C, is the number of unique values. That means, C <= N for a single attribute. In case the Cardinality is known (an index should keep track of it),the number of returned results can be estimated as:

NumerOfResults = N / C

The simple idea is, that the number of each unique value is equal. The factor 1 / C can be defined as Selectivity like:

Selectivity = 1 / C

Single value indices in AND clauses

With the explanations above in mind, we can explain why multiple single value indices cannot be combined. Let’s assume a SQL query like

Additionally, we assume to have single value indices on both columns. What can the database do now?

The first implementation would normally find that both columns have a single value index and uses the index with the better Selectivity (which filters most and returns least number of results). It takes the IDs out of it, reads the Objects and filters the results again with the second conditions. In this case we have the same behavior as with one index only.

A second implementation could check both available indices and takes the ID sets out of both. From both sets the intersection is calculated and with the result the Objects are read. Depending on the amount of effected IDs and the overall number of Objects, this might be less efficient, than the first approach.

A real solution is only a combined index.

Combined Secondary Indices

A combined index is just an extension of a single value index with multiple values:

CombinedIndex = Map<Attribute1Value, Map<Attribute2Value, … Map<AttributeNValue, Set<ID>>…>>

Combined Indices and Where Clause

With the definition of the combined index above, we see why it is important to add the correct where clause to a query. Let us define a combined index over three attributes:

CombinedIndex = Map<Attribute1Value, Map<Attribute2Value, Map<AttributeNValue, Set<ID>>>>

As long as the AND clauses in the query are added from attribute1 to attribute3 in the correct order (the order in the query statement itself should not matter), the index can be utilized. For instance, the quere

is optimal. It runs the query directly on the index and finds all relevant rows.

The same holds true for the queries:

The only difference is, that only the first two or the first map can used to find the set. The remaining maps are scanned. All relevant rows are found with the index and the Objects can be read.

But, what about the following queries?

The index cannot be used efficiently, because the first Map needs to be scanned completely. Some benefit may be gained from the other maps, but it is not said, that the database can do it.

That’s why it is important to put some thoughts into the correct definition is combined indices. If the last queries are dominant, the order of the attributes should be changed. Maybe, an additional index might make sense. It heavily depends on the use cases expected.

Time Series and Continuous Data

For time series data and continuous data, a special look is needed. For this kind of data, a check for equality makes sense only in rare cases, but creating an index make sense anyways.

Continuous data (also time series) is queried with inequality to get defined ranges of the data. In case an index was created, a range scan can take place. With the same binary search algorithm the start of the range can be found and afterwards all values before the end are used to retrieve the IDs. We assume here again, that the Values are ordered in the index and a binary search can take place.

Continuous Data in Combined Secondary Indices

Let’s assume we have a table with two attributes Time and Name. What is the best order for the combined index? The query could look like:

For this example let’s assume the selectivity of name is 10% and also the time range is 10%.

If we create the index in the order

Index = Map<NameValue, Map<TimeValue, Set<ID>>>

the database can narrow the search with the name already to 10% of the index and can perform a range scan over the second map. This results into an index read of 1% of the size only.

If we create the index in the other order

Index = Map<TimeValue, Map<NameValue, Set<ID>>>

the database can perform a range scan over the first map. But, because time is a continuous value, the name will be most likely unique and for each time value the name needs to be checked again. This leads into a less efficient retrieval of data and a read of 10% of the index.

In case continuous data is to be put into an index, it is to be expected that it is most efficient if it is put at last position.

Short Summary

The explanations above are maybe not the most accurate, but for me they served quite well to understand some effects I experienced.

Another article about partitioned tables is in preparation. With these some other effects are to be considered.

Java: Performance issues with FileInputStream

During the studies of database design with the accompanying toy project DuctileDB, I found some disturbing performance issues.

During the implementation of a needed log structured storage, I found that the performance is terribly slow and only about 1/100 of the performance expected. A simple check of the provided source code of FileInputStream reveals the issue: There is absolutely no optimization included.

As widely known (at least I expect it to be widely known), file systems are based on blocks and all files are stored cut into this blocks. That’s why a hard disk is also called a block device. Block devices are normally read buffered. To get a good performance, the buffer size needs to be at least the block size of the file system.

Surprisingly, at least for me, the FileInputStream is just reading every byte as single byte in the read() method. A poor performance was not a surprise anymore after the finding.

The first test was, to just decorate the FileInputStream with a BufferedInputStream. The default buffer size of BufferedInputStream is 8192 bytes or 8kB. I checked and for my systems, the block size is 512 bytes only. The performance increase was astonishing.

As a studied physicist, I wanted to know it better. So I started some measurements…

Measurement Results

The measurements took place on the systems I had at hand. The measurements were performed with a simple JUnit test. The test is part of PureSol Technologies’ Steaming Library. The class is called SequentialReadPerformanceTest.

The test creates a simple 1MB file and reads the files byte by byte. The measurement was performed ten times and a simple statistics was calculated which returns a minimum, maximum and of course a mean throughput. For the reading, the buffer size is varied from 0 bytes (a pure FileInputStream) up to 256kB. In the charts the 0 bytes buffer is drawn as 0.1kB, because of the logarithmic scale, a 0 was not shown.

I had three systems at hand:

Windows 10 on SSD:

BufferSize minThroughput maxThroughput meanThroughput sigmaThroughput
0 0.478313481960232 0.560977316112137 0.53931803772942 0.026823943405206
1 20.7921956295732 159.661168284835 106.530929541767 41.5452866636502
2 135.890172930229 199.804763831118 159.301353868548 21.9993284291508
4 114.080929031247 185.829790831254 162.480645695998 20.4335777478615
8 119.472762811058 207.246561853822 173.82688161485 22.5238524242766
12 152.112263334608 212.048153787975 174.07568558316 18.3455536745854
16 160.007641966543 206.408535265054 177.837585133614 15.110967892046
24 122.760600929588 200.196611853492 174.685818018924 22.1811746856282
32 130.191143000353 205.116448207889 175.602014380743 23.4884718567785
40 189.830287548959 238.60181506275 217.714568123906 14.6494451455874
48 179.301964860046 257.090941811432 211.133906654611 24.3405587930466
56 180.374229186133 253.724040842617 220.05957106526 22.9934585645487
64 169.576206649008 252.181203292139 214.61583945847 22.28789046186
128 172.901240725885 242.260583721361 214.784464231356 22.0707625968981
256 158.073193513394 235.072835911773 198.832874007776 28.4353501136663

This is my development laptop with a WIndows 10 installed on a SSD. Filesystem is NTFS with a block size of 512 bytes.

VirtualBox on SSD:

BufferSize minThroughput maxThroughput meanThroughput sigmaThroughput
0 3.30897090085418 3.4685246511367 3.37156905146268 0.058287703014056
1 45.8442138958278 204.489511433226 149.196065250552 54.546756256114
2 169.028569641574 213.338498388938 191.078173076755 16.1087752022564
4 175.832342723626 220.551930904206 204.223397588209 14.7572527618542
8 167.40312316305 212.853555391402 200.302253142487 14.2208722932322
12 174.043085529545 221.548785267493 200.891618472135 15.2731165631726
16 178.626872017884 223.56519816547 204.200373947012 15.2758384736513
24 181.218954353416 225.566526761738 206.516876975593 13.137885697191
32 223.296922708024 265.152151154852 245.022329558141 13.661324942683
40 216.037271505894 256.377367956888 240.406641413718 13.4568412728096
48 194.605412997549 258.278920524087 238.66016565359 18.617845389739
56 218.857124728457 268.919769748526 241.338112649669 16.2068776298014
64 215.220983192018 262.003958883976 240.601463050768 14.6838561989163
128 207.162624999802 258.216081521909 237.674988725484 14.2341770113763
256 202.084528616352 262.459082128095 234.696119688756 19.2935202838983

This VirtualBox runs on the Windows 10 system mentioned above. As guest system an Ubuntu 17.10 is installed. The filesystem is btrfs with a block size of 512 bytes.

KVM on magnetic disks:

BufferSize minThroughput maxThroughput meanThroughput sigmaThroughput
0 3.50369275983678 3.64641950704795 3.59516953215679 0.042635735226103
1 41.0512347657029 256.14289221404 210.779922949005 78.5887297940257
2 232.373524058612 285.055321462553 263.672741286731 12.7495562706978
4 255.566173924001 272.880340559129 264.701869296095 5.75897219156574
8 256.293085158907 278.836772514138 268.961324482104 6.65276931366453
12 255.882054365566 276.077129595719 267.372542695553 7.72968634110117
16 258.613027035372 275.235002420108 269.684282122623 5.18769137033597
24 171.856825532156 276.072041550241 251.919547286073 27.995974992746
32 252.29570040848 270.142097067889 264.173763360173 5.0979184809125
40 289.992165109599 304.345453637775 295.727761674292 3.72938830994006
48 290.484623991654 299.758266839562 295.794604575198 2.85277669207739
56 291.98249401528 305.188024108252 297.294492472613 3.86242932963834
64 292.089367018439 299.632351890619 294.804240737189 2.2082451746197
128 255.519590730416 302.032878151745 294.733138187711 13.2476788765596
256 285.858694354284 302.92348363624 295.621763218591 5.44402902250886

This is an Ubuntu 16.04 guest inside of  KVM running on another Ubuntu 16.04. The host has magnetic disks running in RAID1 soft raid with a btrf filesystem with 512 bytes block size. The guest has a XFS file system with a block size of 512 bytes.

Summary and Recommendations

There several are some findings in the graphs above:

  1. A FileInputStream without any additional buffer provides a terrible performance. Astonishingly, the difference of the performance of a pure FileInputStream and the maximum performance is almost two orders of magnitude. At least, for me it is surprising.
  2. On the virtual environments, the minimum performance of FileInputStream is not as bad as on the ‘bare metal’ Windows 10. I guess, it is related to additional buffering between the host and guest environment. More investigations would help here. Not sure, whether I can do that in future.
  3. The maximum performance is for buffers of 32kB or 40kB. As the block size is only 512 bytes, this is surprising for me as well. At the moment, I do not have a concise explanation for that.
  4. The default buffer size of BufferedInputStream of 8 kB, provides already a good performance gain.

These observations lead to some recommendations of mine:

  1. As soon as you read from file in Java, always use a BufferedInputStream to dramatically speed up reading.
  2. As long as the best buffer size is not known, the default buffer size of BufferedInputStream can be used.
  3. I did not do measurements here, but I know from experience, that for every FileOutputStream a BufferedOutputStream also provides a much better performance. Without a better knowledge, we can assume that the throughput speed ups are similar. Therefore, for every FileOutputStream, put a BufferedOutputStream decorator around it.

If I find some time, I will also try to find out what the correct behavior of FileOutputStream is.

Based on the findings and recommendations, I will develop a special FileOutputStream in PureSol Technologies’ Streaming Library. The buffer size will be configurable via a system property. Additionally, I also think about a mechanism to automatically find the best performance during static initialization to maximize performance.

So, have a look from time to time at the library and this blog to find more information on this topic and news on the progress of the library.

Thoughts on the Different Kinds of Test

General Remarks

Testing shall assure the quality of a product and that development does not break the already implemented functionality. The quality is defined in ISO 9126 for example (see: Testing is a proof of quality and tests are accepted delivery criteria.

Tests can be separated roughly into different kinds of tests:

  1. Functional tests
  2. GUI tests
  3. Non-functional tests
  4. Penetration tests

Functional Tests

Functional tests are all intended to run without any GUI interaction. The checks are on functional level and use data to check all kinds of transformation, calculation and handling. These tests assure the correct behavior and results of the application. There are three different kinds of functional tests which are explained below:

  1. Unit tests
  2. Component tests
  3. Integration tests

Unit Tests

Unit tests are intended to test single methods and classes in total isolation on a functional level. Here the implementation of single methods and functions are checked like correct calculations, and correct error handling for invalid parameters. Tests should be performed for valid parameters with a check on the returned results, invalid parameters and the correct error behavior and for corner cases which are to be checked for correct behavior, too.

These tests should be provided by the developers. Unit tests can be created during development as simple ways to check the development code. Test Driven Development (TDD) would be best. Except of file IO no other external infrastructure must be used. Mocking is allowed, if needed, but as it is about single methods and classes, it will not be necessary in most cases.

Since in an optimal setup each method should have at least three unit tests (valid, invalid and corner case parameters), a lot of unit tests are to be expected. Because of this high number of tests, a time constraint should be kept of 0.1s for instance. 10.000 tests (which is around 3000 method, which is 500 classes) would take about 20min. This is a lot of time for a continuous integration test, which uses unit tests.

Component Tests

Component tests test several classes in relationship to each other. In component tests no environment should be involved. The environment is simulated with simple mock-ups. Only the filesystem should be involved in testing to load sample data for instance. For mocking the Mockito framework is used in a lot of projects at PureSol Technologies.

Ideally, these tests would be provided by the developers, too. The scope is also functionality, but with the interference of other classes. Component tests would also be a good check during development.

Component tests can and should be run in a continuous integration way, too. That’s why component tests should also have  a time constraint of 0.1s, too.

Integration Tests

Integration tests check the functionality of the product with environment interaction. These tests also include databases and other server systems which will be in place during normal product usage. The environment for integration tests should be as close as possible to the target environment.

Integration tests tend to run a longer time than component tests and unit tests. These tests should run after the normal continuous integration build separately as a single test stage. Integration tests do not have a fixed time constraint.

GUI Tests

The tests following in this section are tests on GUI level. The tests are either pure UI tests or end-product tests.

Pure UI tests  just check the correct appearance and functioning of the UI itself. These tests should be quick. Together with the integration test of interfaces like REST, the functionality of the final product is quite likely, but not completely proven.

That’s why, product tests can be performed via UI. These tests check the correct functionality and are intended to test the final product like the customer is going to experience it. The test environment setup and configuration shall be as close as possible to the customer installation.

There are just two possibilities for GUI tests:

  1. Automated GUI tests
  2. Manual GUI tests

Automated GUI Tests

Automated GUI tests check the standard behavior of the GUI and are a good tool for end-product testing.Here the most common use cases shall be tests and not only the happy tests, but also the error behavior.

The tests’ maintainability is quite expensive due to a changing GUI. The tests are easily broken and the repair needs to be done carefully. Therefore, these tests should only check the GUI and not the functionality. The functionality is checked with non-GUI Tests.

Manual GUI Tests

These tests are manual test. These tests are very expensive due to the binding of a lot of people for the task. These tests should be minimized to check the aesthetics of the product like colors, graphs, prints and layout. These tests can not be automated.

Non-Functional Tests

Strictly speaking, GUI tests are also non-functional, but here we need to test everything which is not related to the use cases of the product itself.

The possible tests are:

  1. Endurance tests
  2. Performance tests
  3. Resource tests

Endurance Tests (Iteration Tests)

Endurance tests are tests which test one functionality for a large number of times. For example a certain calculation can be performed 1000 times one after another. You can see several effects here which might influence the quality:

  1. Memory Leaks: After a large number of iterations one can see memory leaks. The function should be tested at least as often as it is expected to be run in a single application startup on client customer site.
  2. Stability: One can see how stable the function under test is. This can be a mathematical stability as long as the formula relys on some environmental data or the stability in a communication module and so on.

Endurance tests can be coupled with performance tests to save time. During the endurance test the time can be measured for function execution and a statistics can be provided.

Performance Tests

Performance tests test the speed and answer time of applications. Some applications are time critical or time is a measure of service quality. Performance tests assure the correct performance behavior.

Load Tests

Load tests test software applications on heavy load. These tests are important to check how stable an application is on load due to memory consumption, concurrency effects and bottle necks within the applications. Some of these issues would be undetectable without such tests.

Penetration Tests

These tests are done mostly by humans and not robots due to its creative nature. Penetration tests try to break a product through use cases, to show the stability and reliability. The goal is to break the application and to provide information about the breaking mechanism to stabilize and harden the software application.

Other Test-like Tasks

These kind of tests are explained below. Additionally, there are two other kind of tests: Static Source Code Analysis and Code Reviews. Both can be performed by Developers and/or QA and help to keep the maintainability high.

How to handle objects with a life cycle

In C++ and other programming languages, constructors are paired with destructors. These help to close file handles, database connections and the like in case the object is destroyed. Handling these objects in clear life cycles is a good idea anyways, but the need to do it is reduced, because there is a clearly defined way to clean up states.

In Java, destructors are not provided. Whether this is good or bad is a matter of debate and personal taste. The provided fallback with a finalize() method is discouraged as described in “Effective Java”.

But, what can be done now? We create objects with a clear life cycle!

Creating a life cycle is first a matter of API design. It needs to be clear (and documented) how the object is instantiated, initialized, started, opened, closed, stopped, and destroyed. Sometimes, other states  might be needed. Not all states are needed in one object. In case, there are all of these present, the class’s design needs to be questioned. In most cases the life cycle is handled by methods with the same name like the life cycle state (or with a verb of the name).

As example, let’s take this:

The two object we need to take care of are the returned Response and the InputStream.

Rule 1: The creator handles the life cycle

In the case above, it is trivial. The objects are created and closed in the same method or code block.

In case the object is returned from the method, the caller gets the responsibility to close the object It was is original request and gets therefore the responsibility delegated.

What about the case above in which two objects need to be closed? The response object’s life cycle is bound to the InputStream and we need to handle the life cycle in case of an issue in the method:

Rule 2: Methods must not manage life cycle of parameters

The sample above shows another issue: The input stream may be used as parameter for another method (for instance a parser). In cases like that, the close() method must not be called by this method.

Exception for Rule 2: Decorators can manage the life cycle

An input stream which is put into a decorator can be closed by the decorator. This can be found in the implementation of several input stream decorators in Java.

eXtended Objects: ORM for No-SQL Databases in JavaEE

Since the appearance of object relational mappers (ORM) the programming of persistence layers became easier and less error prone. ORMs store objects in SQL databases by interpreting the object’s fields as columns of a table to store them. With the standardization of the Java Persistence API the success was programmed and a lot of Java EE and also Java SE projects rely on ORM.

For about ten years now, a new class (or better new classes) of databases appear which were developed in large companies like Amazon, Google and Facebook to solve storage issues which cannot be handled by SQL databases efficiently. These databases are divided into the classes: Big Table Databases, Graph Databases, Key-Value Databases and Document Databases.

This article is not about the databases itself, but about another issue: ORM is for SQL databases and does not work very well with No-SQL databases. Additionally, the ORM implementations for Java are, as far as I know, all JDBC based, which is also not very handy when it comes to migration to other databases which do not have a JDBC like driver implementation. A new approach needs to be thought of and also some short comings need to be extended to overcome struggles still in ORM.

A new approach comes with a project called eXtended Objects. Version 0.4.0 of eXtended Objects was released on Github: and even it is a quite stable release and the first implementations for databases are out, it is worth mentioning it.

eXtended Objects in JavaEE

For a PureSol Technologies’ project the Titan database ( is used and therefore, a Titan storage back end was developed ( for eXtended Objects to make it work with Titan. The current release  is version 0.1.1 and it is still based on eXtended Objects 0.3.0, but a new release may follow quickly for version 0.4.0.

Due to the release is hosted in Maven Central, a simple

is suitable to get eXtended Objects for Titan for Maven.

eXtended Objects is well suitable for JavaSE, but in JavaEE the current approach to get the mapper working is not suitable for that purpose we want to use @Inject to get a new xoManager. This can be done via a producer method (and also a destroyer method to close the xoManager). This may be another blog in future.

Supported Databases

As described in the features section already, several data store implementations are already out, in development or planned. One of the first implementation was done for another graph database to check the eXtended Objects SPI and to proof the concept.Due to the current usage the first  implementation was done for the Titan graph database which is provided by PureSol Technologies.

Other implementations are planned or already available in development state. Here is the current list of (known) projects:

  1. eXtended Objects for Neo4j by Buschmais GbR (stable): (included in eXtended Objects base library)
  2. eXtended Objects for JSON by Buschmais GbR (development): (included in eXtended Objects base library)
  3. eXtended Objects for Titan by PureSol Technologies  (development): (project page:
  4. eXtended Objects for Blueprints (development):

Test Approches for Green and Brown Field Projects

The picture  on the right side show the relationships between the different test levels. The tests are a kind of test stages in green field projects. Usually, the stages are done from bottom to top and from different parties or members of the time.



Green Field Projects (and also new code in brown field if possible)

Here it is to be pointed out, that the usual order of testing in Green Field Projects (and code writing) is done in the following direction:

  1. Writing Code (developer)
  2. Checking Code (developer and CI system with plugins)
  3. Writing Unit Tests (performed as block in TDD with steps 1 and 2)
  4. Writing Component Tests (developer and QA)
  5. Writing Integration Tests (developer and QA)
  6. Setting up automated GUI tests (QA)
  7. Manual Testing (QA and if needed, developers conducted by QA)
  8. Endurance Tests (QA)
  9. Performance Tests
  10. Load Tests (Stress Tests)
  11. Decoupled: Penetration Tests (continuously by all, but systematically by QA)

The first three steps which go hand in hand and in TDD it seems that the numbering is reorder to 3, 1 and 2, but the code of the tests is also code.

The tests become from one step to the next less detailed, but go more into use cases and real world scenarios. The tests become also more and more realistic in the configuration and setup in relationship to the target environment. Even patch levels need to be controlled, if needed.

Brown Field Projects

In Brown Field Projects it is the other way around. A legacy code base is available and not under test. The approach can not go here to write unit tests first, because it is too much work to do and in most cases it takes a long time to understand the old legacy code. To get a good unit test suite, one calculates roughly to have at least 1 unit test for every 100 lines. In a legacy code base for only 100,000 lines it would mean to write at least 1,000 meaningful unit tests. This work would really suck…

The approach goes here to write integration tests first to assure the basic functionality. The basic use cases need to be right and the functionality is assured. With integration tests not all paths of working can be checked, but the basic behavior, the correct results for the most important use cases and the error handling for the most common error scenarios can be checked.

As soon as new functionality is developed, functionality changed or extended, new integration tests are to be implemented first to assure the still needed functionality stays unchanged and the new functionality is tested as it is required. Unit test and other tests are implemented as needed and practical.

As soon as an integration check fails, the failure analysis will reveal the details of the issue. These details are then check with additionally developed component and unit tests. As long as this procedure is used, more and more meaningful unit tests are developed and the more test coverage is reached.

The same is with GUI tests. At first one has only a chance to perform manual GUI tests. Later on the most common use cases and the general GUI behavior can be tested automatically. This also leads to a better test coverage.

In Brown Field Projects the tests in different tests groups are also developed in parallel. Integration tests are developed in parallel with the performing of manual testing which assures the current ability to deliver of the software. The current situation at hand dictates what is to be done.

Programming is both: A Craft and Science

There is a latent conflict in software development. I experienced it several times in different contexts. It is no always obvious, but nevertheless it is there… There are two different approaches to software development: An academic and a craftsman approach. One needs to be a good craftsman to write solid, maintainable, and efficient code, but one also needs to be an academic to understand IT systems and their specialties.

The academic approach is all about theory. Students in universities get told that theory. These students become very good and they are very knowledgeable, but they lack the experience and practice to develop real world software. They also learn a lot of mathematics like calculus, numerical mathematics, statistics, and linear algebra. For academics it is often difficult to realize that real world software cannot be perfect (too expensive), needs to be maintainable (software needs to be maintained and bug fixed and also developed on in future) and cost efficient, robust solutions might not be modern, cool or state-of-the-art. Academics also tend to fall in love with some problems and try to solve them in the most beautiful way possible. This is economically meaningless and too expensive.

The other approach is, that software is a craft. Real craftsmanship needs a lot of time, practice and guidance by a senior. It is like building stuff. The first trials will be messy, unusable, ugly… But, the more practice a craftsman get the better the outcome is. But, good code does not solve hard and complex issues. These need an academic approach. For example, some problems need to be converted into problems of Linear Algebra to get them performed efficiently. A pure programmer who only learn to program is not able to do that.

The problem is now: Students are intellectually trained very well, but they are no craftsman. Pure craftsman can write good code, but may not be intellectually prepared for complex systems. As a consultant I see both sites. I train scientists and engineers which are experts in their specific domain to write good, maintainable code, but I also see people who write good code, but need some input on how to tackle some hard problems.

What can be done!?

Academics: Science and engineering is ruled by laws and intellectual challenges. It is fine, but there is more to be done in Software. For Academics I recommend to read some books like “The Pragmatic Programmer” by Andrew Hunt et. al. ( and think about initiatives like Software Craftsmanship ( and observe the own daily doing. Academics in most cases are very good in self reflection. It is easily understood what needs to be done to not hassle later on with bad code, unmaintainable software. (For questions, I can be booked, too… ;-)) The main concern should be: How can the job be done in a way with maximum of automation, minimum of repetitive work and as maintainable as possible.

Craftsman: A craft is doing, evaluation and redoing until perfection is reached. It is a good approach, but some problems cannot be solved so easily. The easiest way to proceed is to team up with a domain expert, the second best thing is to read about the domain of expertise which is missing to become an expert in this field to a certain level which is needed. The main concern should be: How can the problem be solved efficiently? The most problems are already solved to a certain level. This can be copied and studied. Only the remaining parts need to be invented.

The best software developers combine skills of academics and craftsman. Both sites are challenging on its own, but the combination needs to be developed over time. The result is worth it: The software developed by these people is amazing: Maintainable, understandable, simple, efficient…

SSL in Real Life

We live in the Post Snowden Era and we all need to think about what shall happen with our Privacy, with the Internet and the way we communicate. The situation is fundamentally different to what we had about 20 years before…

All the laws which should protect us were made in times were surveillance needed physical access to place phone taps and microphones. Communication was bound to a distributed physical medium which needed direct access to break into it. Additionally, communication was from point to point on shorted path.

In the internet everything is connected now and on infrastructure based on TCP/IP serves everything. The internet is not as distributed as most people think. Because some very large internet knots bundle a lot of communication on them. Their number is limited and some institutions are able to hook into them as revealed…

One of the recommendations is to encrypt everything. From simple web pages over email to all other forms of communications. As long as only some communication is encrypted, there is still to much information open, so everybody should encrypt it messages (even the unimportant ones) to increase the noise in the net to make it attackers more difficult to find out what are important communications and what is not. With this in mind, we decided at PureSol Technologies to provide all out Website via HTTPS (SSL encryption via HTTP). As soon as the decision was made, the issues started…

Trusted Communication only via Root CA

One of the current major issues is that only Root CAs are able to issue good certificate for SSL which bring Browsers to the “Green Lock Mode” which shows the site as trusted without any warning. Otherwise, browsers show a warning about not trusted certificates, because the authentication was not performed and nobody knows whether the certificate was provided by the correct organization or person. This is correct in so far, that I could create certificates which claim to be issued from Microsoft Corportation. To avoid this, some so called Root CAs are authorized to check the certificate to be authentic. A good idea so far…

Problem 1: There are not so much Root CAs and the prices are too high for small companies like PureSol Technologies. The prices for certificates can go up to the 1kEUR per year. I better spend this money for developers.

Problem 2: There were already issues with security breaches in Root CAs. The most prominent one was the erronous issues certificates to an individual in the name of Microsoft. The issue, this individual was not in any kind related to Microsoft. See Mircosoft database:

Why shouldn’t I provide my own certificates? I can create my keys and certificates on a completely disconnected computer and copy the stuff via Sneakernet. OK, there is still the issue, that I claim to be me, but knowbody knows whether this is correct or not… But to pay thousands of Euros for that kind of service which should be a one time event?

Not handled Certificate Revokation Lists

It is still not standardized when, how and by whom the certificates are to be rechecked for validity. The above mentioned issue with the VeriSign create Microsoft certificates showed, that even Microsoft had to create a hot patch to contain that issue, because Windows itself could not deal with certificate revokation lists. It is still a place of construction and a week spot in SSL and the network of trust.

So, Root CAs provide an expensive service to provide security, but it is not really clear, how the revokation lists are handled. The problem here is manyfold. What happens, if the service for revokation check is down? How often is to be checked?

Running an Own Certificate Authority

After checking prices, services and how much work it is to deal with SSL and certificates on our own, we decided to run our own Certificate Authority.

You can find the needed public information at

There is only one issue: The PureSol Technologies Certificate Authority is not a pre-installed CA in browsers. So, if an HTTPS connection is opened to PureSol Technologie site, the browsers will download the certificates and during the check they find, that no known Root CA has checked and signed these.

To overcome this, we provide the PureSol Technologies Certificate Authority Website for certificate download. To check the authenticity we provide the certificates to clients via Sneakernet for local installation.

Additionally, site which are “read-only” for clients are provided via HTTP and HTTPS. Only sites which enforce user interaction like logins, we provide via HTTPS only. All HTTP connections on these sites are automatically redirected to its HTTPS variant.

Hadoop Client in WildFly – A Difficult Marriage

(This article was triggered by a question “Hadoop Jersey conflicts with Wildfly resteasy” on StackOverflow, because I hit the same wall…)

For a current project, I evaluate the usage of Hadoop 2.7.1 for handling data. The current idea is to use Hadoop’s HDFS, HBase and Spark to handle bigger amount of data (1 TB range, no real Big Data).

The current demonstrator implementation uses Cassandra and Titan as databases. Due to some developments with Cassandra and Titan (Aurelius was aquired by DataStax.), the stack seems not to be future-proof. An alternative is to be evaluated.

The first goal is to use the Hadoop client

in WildFly 9.0.1. (The content of this artical should be also valid for WildFly >=8.1.0.) HDFS is to be used at first to store and retrieve raw files.

Setting up Hadoop in a pseudo distributed mode as it is described in “Hadoop: The Definitive Guide” was a breeze. I was full of hope and added the dependency above to an EJB Maven module and wanted to use the client to connect to HDFS to store and retrieve single files. Here it is, where the problems started…

I cannot provide all stack traces and error messages anymore, but roughly, this is what happend (one after another; when I removed one obstacle, the next came up):

  • Duplicate providers for Weld where brought up as errors due to multiple providers in the Hadoop client. Several JARs are loaded as implicit bean archives, because JavaEE annotations are included. I did not expect that and it seems strange to have it in a client library which is mainly used in Java SE context.
  • The client dependency is not self contained. During compile time an issue arised due to missing libraries.
  • The client libraries contains depencencies which provide web applications. These applications are also loaded and WildFly try to initialize them, but fails due to missing libraries which are set to provided, but not included in WildFly (but maybe in Geronimo?). Again, I am puzzled, why something like that is packaged in a client library.
  • Due to providers delivered in sub-dependencies of Hadoop client, the JSON provider was switched from Jackson 2 (default since WildFly 8.1.0) back to Jackson 1 causing infinite recursions in trees I need to marshall into JSON, because the com.fasterxml.jackson.*  annotations were not recognized anymore and the org.codehaus.jackson.*  annotations were not provided.

The issues are manyfold and  is caused by a very strange, no to say messy packaging of Hadoop client.

Following are the solutions so far:

Broken implicte bean archives

Several JARs contain JavaEE annotations which leads to an implicit bean archive loading (see: Implicit bean archive support needs to be switched off. For WildFly, it looks like this:

Change the Weld settings in WildFly’s standalone.conf from


This switches the implicit bean archive handling off. All libraries used for CDI need to have a /META-INF/beans.xml  file now. (After switching off the implicit archives, I found a lot of libraries with missing beans.xml  files.)

Missing dependencies

I added the following dependencies to fix the compile/linkage issues:

Services provided, but not working

After switching off the implicit bean archives and added new dependencies to get the project compilied, I run into issues during deployment. Mostly, the issues were missing runtime dependencies due to missing injection providers.

The first goal was to shut off all (hopefully) not needed stuff which was disturbing. I excluded the Hadoop MapReduce Client App and JobClient (no idea what these are for). Additionally, I excluded Jackson dependencies, beacause they are already provided in the WildFly container.

Broken JSON marshalling in RestEasy

After all the fixes above, the project compiled and deployed successfully. During test I found that JSON marshalling was broken due to infinite recursions I got during marshalling of my file trees. I drove me cracy to find out the issue. I was almost sure that WildFly 9 switched the default Jackson implementation back to Jackson 1, but I did not find any release note for that. After a long while and some good luck, I found a  YarnJacksonJaxbJsonProvider class which forces the container to use Jackson 1 instead of Jackson 2 messing up my application…

That was the final point to decide (maybe too late), that I need a kind of calvanic isolation. Hadoop client and WildFly need to talk through a proxy of some kind not sharing any dependencies except of one common interface.

Current Solution

I created now one Hadoop connector EAR archive which contains the above mentioned and fixed Hadoop client dependencies. Additionally, I create a Remote EJB and add it to the EAR which provides the proxy to use Hadoop. The proxy implements a Remote Interface which is also used by the client. The client performs a lookup on the remote interface of the EJB. That setup seems to work so far…

The only drawback in this scenario at the moment is, that I cannot use stream throug EJB, because streams cannot be serialized. I think about creating a REST interface for Hadoop, but I have no idea about the performance. Additionally, will the integration of HBase be as difficult as this!?

For the next versions, maybe a fix can be in place. I found a Jira ticket HDFS-2261 “AOP unit tests are not getting compiled or run”.