Local Indices on Partitioned Tables

I wrote an article lately about a simple model to understand indices in databases. It may help to read the other article first.

In a data processing project I worked on, the data is put into an Oracle database. The data is queried differently based on questions to be answered. Even worth, the queries are not known upfront, so a quick enough access is to be provided for all kind of usage scenarios.

During the design phase some decisions were made based on the information at the time. As the main point of view onto the data was time series based, the decision was made to use partitioned tables with local indices based on the time stamps. In the aftermath, that was a mistake, which was corrected.

Let’s assume, we want to store two years of data and the partition key is based on the year and week. For instance week 50 in 2017 has  the partition key 1750. Let’s assume we store data now for 2016 and 2017, we need about 104 partitions with partition keys 1601, 1602, …, 1652, 1701, 1702, …1751 and 1752. For the sake of simplicity for the next calculations, we assume we have in total 100 partitions. Each partition has local indices on different fields.

What is wrong with this scenario now? As long as we can directly access the table with partition keys, there is nothing wrong. We can create selects like:

Oracle knows in which partition to look at and which index to use to find the name. But, what happens in case we want to look up the name for the whole time? The query could be something like:

In this case we get terribly slow. But why?

Let’s calculate the time characteristics in Big O Notation:

Let’s assume we have a total number of N samples in the whole time series. In trivial, unpartitioned case, without any additional index, we have the time characteristics:

t(N) = O(N)

If we put an index onto the table which can be utilized, we get a much better time characteristics of:

t(N) = O(log2(N))

That’s actually, why indices are used. The speedup S is :

S(N) = O(N) / O(log2(N)) ~ N / log2(N)

If N is only 1, there is not speedup at all, but for let’s say 1000 samples, the speedup is already about 100. That’s great.

But, what happens in case of partitioning? Let’s assume we have in total P partitions. If we can use the partition key only without any additional index, Oracle just accesses the one partition and we get a behavior of

t(N) = O(N/P)

with a total speedup of

S(N) = O(N) / O (N/P) ~ P

In our example above, this is a speed up of a factor of 100 resulting in a 1/100 of query time, because we only need to access 1/100th of the data. Not bad at all.

The same characteristics we can calculate with an additional index. In this case we get:

t(N) = O(log(N/P))

What happens now in case we search for name over the whole time series? Because, we need to read the whole table again over all partitions with a local index only, this comes down to:

t(N) = O(P * log(N/P))

The problem is, we have to look P times into each partition and can use the local index only. This is P time utilizing an index. Compared with an index on unpartitioned table or with a global index, the speed up is is much smaller that one. In our example above with 100 partitions, this leads to:

S(N) = O(log(N)) / O(100*log( N/100) ~ 1/ 100

The reason is that the partition number increases linearly, but the gain of the performance due to the index is increasing much slower.

This calculations show, that in cases of random access on partitioned tables, global indices are recommend to not slow down the queries.

Simple Model for Understanding Database Indices

I had a lot of discussions in the past about databases and indices. The type of the database is not  important, as the general design of indices are always the same. The following model can be applied to relational databases, document stores, big table databases, columnar stores and also to graph databases.

The following model is used to explain what indices are and why they work so well and especially why there are certain restrictions to them:

1. Multiple single value indices do not help efficiently in where clauses with AND conditions.
2. For combined indices, the where clauses need to have certain AND clauses to get fully utilized. Missing some of the clauses may lead into not using the index.
3. The order of values in combined indices are important, when time series or continuous data is involved. It is shown what the correct order is to get good performance.

These three points, I have to discuss over and over again, as most people are not database specialists and the inner works of databases are not well known. The following simple model shall provide some insights.

Storing Data in General

Storing data itself is normally quite simple, just put the stuff to disk. I know, it is not meaningful, but as starting point, let’s assume it for now. It is also not important whether the data is put onto disk, into a storage like Hadoop or something else.

For the different databases we can assume we have basic data structures like

1. Relational Database: Row = Map<ColumnName, Value>
2. Big Table Database: ColumnGroup = Map<Key, Value>
3. Document Store: Document = <any data structure like JSON>
4. Graph Database: Vertex, Edge = <any structure>

Important for now is, we have different kinds of data structures, but it is not important for now, how they look like. We just put them onto the disk as a simple binary file.

If we do this, there is only one thing missing: We can’t distinguish them, that’s why we define a data store with Object IDs:

DataStore = Tuple<ID, Object>

We have an object ID which is unique defining a certain Object. We can read now the store Object by Object and can find the correct one by comparing the ID.

Primary Index

With the store above, we can store any data and can retrieve it via the unique ID. The issue is, to find the related Object, we need to read the whole store. This leads to a time characteristics in Big O Notation with a total number of N samples of:

t(N)=O(N)

This is a poor characteristics. This can be improved dramatically by adding a primary index. This primary index maps every ID to the storage position of the Object:

PrimaryIndex = Map<ID, DataFileAndPosition>

Do not mistake this primary index with a primary key from a relational database. Here we talk about a technical index to retrieve data from disk efficiently. The primary key in a relational database is logical construct on the database content for data organization and modelling.

As soon as we have a primary index in place, the time characteristics changes to:

t(N) = O(log(N))

We assume here, that the index is ordered by the IDs and searchable with a binary search.

The speedup S is:

S(N) = O(N) / O(log(N)) ~ N / log(N)

For N = 1000 is around 100 and for N = 1,000,000 it is around 50,000.

For a simple key values store, this would be good enough. But, what do we do if we look for a specific Object without knowing its ID? We still have to search the whole data store.

Single Value Secondary Indices

To find objects with certain attributes, values or properties (the name depends on the type of database), indices are used. For this to work, we need to have the possibility to extract this Values out of the stored Object:

Value = Object(Name)

The Name in a relational database is for instance the column name and for a graph database a property name. To find the value, we need to have an index like:

Index = Map<Value, Set<ID>>

What about the time characteristics? It was

t(N) = O(N)

before. Now, we can do an index lookup first, which has a time characteristics like the primary index:

t(N) = O(log(N))

This is pretty good and with the additional time characteristics of the primary index, we get an overall time characteristic of:

t(N) = O(log(N)) + O(log(N))= 2 * O(log(N)) = O(log(N))

That’s great, isn’t it? Again, we assume, we can access the Value with a binary search.

Definition of Cardinality and Selectivity

For a better understanding of the next chapters and for the sake of simplicity, we need to explain two terms: Cardinality and Selectivity.

Cardinality C, is the number of unique values. That means, C <= N for a single attribute. In case the Cardinality is known (an index should keep track of it),the number of returned results can be estimated as:

NumerOfResults = N / C

The simple idea is, that the number of each unique value is equal. The factor 1 / C can be defined as Selectivity like:

Selectivity = 1 / C

Single value indices in AND clauses

With the explanations above in mind, we can explain why multiple single value indices cannot be combined. Let’s assume a SQL query like

Additionally, we assume to have single value indices on both columns. What can the database do now?

The first implementation would normally find that both columns have a single value index and uses the index with the better Selectivity (which filters most and returns least number of results). It takes the IDs out of it, reads the Objects and filters the results again with the second conditions. In this case we have the same behavior as with one index only.

A second implementation could check both available indices and takes the ID sets out of both. From both sets the intersection is calculated and with the result the Objects are read. Depending on the amount of effected IDs and the overall number of Objects, this might be less efficient, than the first approach.

A real solution is only a combined index.

Combined Secondary Indices

A combined index is just an extension of a single value index with multiple values:

CombinedIndex = Map<Attribute1Value, Map<Attribute2Value, … Map<AttributeNValue, Set<ID>>…>>

Combined Indices and Where Clause

With the definition of the combined index above, we see why it is important to add the correct where clause to a query. Let us define a combined index over three attributes:

CombinedIndex = Map<Attribute1Value, Map<Attribute2Value, Map<AttributeNValue, Set<ID>>>>

As long as the AND clauses in the query are added from attribute1 to attribute3 in the correct order (the order in the query statement itself should not matter), the index can be utilized. For instance, the quere

is optimal. It runs the query directly on the index and finds all relevant rows.

The same holds true for the queries:

The only difference is, that only the first two or the first map can used to find the set. The remaining maps are scanned. All relevant rows are found with the index and the Objects can be read.

But, what about the following queries?

The index cannot be used efficiently, because the first Map needs to be scanned completely. Some benefit may be gained from the other maps, but it is not said, that the database can do it.

That’s why it is important to put some thoughts into the correct definition is combined indices. If the last queries are dominant, the order of the attributes should be changed. Maybe, an additional index might make sense. It heavily depends on the use cases expected.

Time Series and Continuous Data

For time series data and continuous data, a special look is needed. For this kind of data, a check for equality makes sense only in rare cases, but creating an index make sense anyways.

Continuous data (also time series) is queried with inequality to get defined ranges of the data. In case an index was created, a range scan can take place. With the same binary search algorithm the start of the range can be found and afterwards all values before the end are used to retrieve the IDs. We assume here again, that the Values are ordered in the index and a binary search can take place.

Continuous Data in Combined Secondary Indices

Let’s assume we have a table with two attributes Time and Name. What is the best order for the combined index? The query could look like:

For this example let’s assume the selectivity of name is 10% and also the time range is 10%.

If we create the index in the order

Index = Map<NameValue, Map<TimeValue, Set<ID>>>

the database can narrow the search with the name already to 10% of the index and can perform a range scan over the second map. This results into an index read of 1% of the size only.

If we create the index in the other order

Index = Map<TimeValue, Map<NameValue, Set<ID>>>

the database can perform a range scan over the first map. But, because time is a continuous value, the name will be most likely unique and for each time value the name needs to be checked again. This leads into a less efficient retrieval of data and a read of 10% of the index.

In case continuous data is to be put into an index, it is to be expected that it is most efficient if it is put at last position.

Short Summary

The explanations above are maybe not the most accurate, but for me they served quite well to understand some effects I experienced.

Another article about partitioned tables is in preparation. With these some other effects are to be considered.

Java: Performance issues with FileInputStream

During the studies of database design with the accompanying toy project DuctileDB, I found some disturbing performance issues.

During the implementation of a needed log structured storage, I found that the performance is terribly slow and only about 1/100 of the performance expected. A simple check of the provided source code of FileInputStream reveals the issue: There is absolutely no optimization included.

As widely known (at least I expect it to be widely known), file systems are based on blocks and all files are stored cut into this blocks. That’s why a hard disk is also called a block device. Block devices are normally read buffered. To get a good performance, the buffer size needs to be at least the block size of the file system.

Surprisingly, at least for me, the FileInputStream is just reading every byte as single byte in the read() method. A poor performance was not a surprise anymore after the finding.

The first test was, to just decorate the FileInputStream with a BufferedInputStream. The default buffer size of BufferedInputStream is 8192 bytes or 8kB. I checked and for my systems, the block size is 512 bytes only. The performance increase was astonishing.

As a studied physicist, I wanted to know it better. So I started some measurements…

Measurement Results

The measurements took place on the systems I had at hand. The measurements were performed with a simple JUnit test. The test is part of PureSol Technologies’ Steaming Library. The class is called SequentialReadPerformanceTest.

The test creates a simple 1MB file and reads the files byte by byte. The measurement was performed ten times and a simple statistics was calculated which returns a minimum, maximum and of course a mean throughput. For the reading, the buffer size is varied from 0 bytes (a pure FileInputStream) up to 256kB. In the charts the 0 bytes buffer is drawn as 0.1kB, because of the logarithmic scale, a 0 was not shown.

I had three systems at hand:

Windows 10 on SSD:

 BufferSize minThroughput maxThroughput meanThroughput sigmaThroughput 0 0.478313481960232 0.560977316112137 0.53931803772942 0.026823943405206 1 20.7921956295732 159.661168284835 106.530929541767 41.5452866636502 2 135.890172930229 199.804763831118 159.301353868548 21.9993284291508 4 114.080929031247 185.829790831254 162.480645695998 20.4335777478615 8 119.472762811058 207.246561853822 173.82688161485 22.5238524242766 12 152.112263334608 212.048153787975 174.07568558316 18.3455536745854 16 160.007641966543 206.408535265054 177.837585133614 15.110967892046 24 122.760600929588 200.196611853492 174.685818018924 22.1811746856282 32 130.191143000353 205.116448207889 175.602014380743 23.4884718567785 40 189.830287548959 238.60181506275 217.714568123906 14.6494451455874 48 179.301964860046 257.090941811432 211.133906654611 24.3405587930466 56 180.374229186133 253.724040842617 220.05957106526 22.9934585645487 64 169.576206649008 252.181203292139 214.61583945847 22.28789046186 128 172.901240725885 242.260583721361 214.784464231356 22.0707625968981 256 158.073193513394 235.072835911773 198.832874007776 28.4353501136663

This is my development laptop with a WIndows 10 installed on a SSD. Filesystem is NTFS with a block size of 512 bytes.

VirtualBox on SSD:

 BufferSize minThroughput maxThroughput meanThroughput sigmaThroughput 0 3.30897090085418 3.4685246511367 3.37156905146268 0.058287703014056 1 45.8442138958278 204.489511433226 149.196065250552 54.546756256114 2 169.028569641574 213.338498388938 191.078173076755 16.1087752022564 4 175.832342723626 220.551930904206 204.223397588209 14.7572527618542 8 167.40312316305 212.853555391402 200.302253142487 14.2208722932322 12 174.043085529545 221.548785267493 200.891618472135 15.2731165631726 16 178.626872017884 223.56519816547 204.200373947012 15.2758384736513 24 181.218954353416 225.566526761738 206.516876975593 13.137885697191 32 223.296922708024 265.152151154852 245.022329558141 13.661324942683 40 216.037271505894 256.377367956888 240.406641413718 13.4568412728096 48 194.605412997549 258.278920524087 238.66016565359 18.617845389739 56 218.857124728457 268.919769748526 241.338112649669 16.2068776298014 64 215.220983192018 262.003958883976 240.601463050768 14.6838561989163 128 207.162624999802 258.216081521909 237.674988725484 14.2341770113763 256 202.084528616352 262.459082128095 234.696119688756 19.2935202838983

This VirtualBox runs on the Windows 10 system mentioned above. As guest system an Ubuntu 17.10 is installed. The filesystem is btrfs with a block size of 512 bytes.

KVM on magnetic disks:

 BufferSize minThroughput maxThroughput meanThroughput sigmaThroughput 0 3.50369275983678 3.64641950704795 3.59516953215679 0.042635735226103 1 41.0512347657029 256.14289221404 210.779922949005 78.5887297940257 2 232.373524058612 285.055321462553 263.672741286731 12.7495562706978 4 255.566173924001 272.880340559129 264.701869296095 5.75897219156574 8 256.293085158907 278.836772514138 268.961324482104 6.65276931366453 12 255.882054365566 276.077129595719 267.372542695553 7.72968634110117 16 258.613027035372 275.235002420108 269.684282122623 5.18769137033597 24 171.856825532156 276.072041550241 251.919547286073 27.995974992746 32 252.29570040848 270.142097067889 264.173763360173 5.0979184809125 40 289.992165109599 304.345453637775 295.727761674292 3.72938830994006 48 290.484623991654 299.758266839562 295.794604575198 2.85277669207739 56 291.98249401528 305.188024108252 297.294492472613 3.86242932963834 64 292.089367018439 299.632351890619 294.804240737189 2.2082451746197 128 255.519590730416 302.032878151745 294.733138187711 13.2476788765596 256 285.858694354284 302.92348363624 295.621763218591 5.44402902250886

This is an Ubuntu 16.04 guest inside of  KVM running on another Ubuntu 16.04. The host has magnetic disks running in RAID1 soft raid with a btrf filesystem with 512 bytes block size. The guest has a XFS file system with a block size of 512 bytes.

Summary and Recommendations

There several are some findings in the graphs above:

1. A FileInputStream without any additional buffer provides a terrible performance. Astonishingly, the difference of the performance of a pure FileInputStream and the maximum performance is almost two orders of magnitude. At least, for me it is surprising.
2. On the virtual environments, the minimum performance of FileInputStream is not as bad as on the ‘bare metal’ Windows 10. I guess, it is related to additional buffering between the host and guest environment. More investigations would help here. Not sure, whether I can do that in future.
3. The maximum performance is for buffers of 32kB or 40kB. As the block size is only 512 bytes, this is surprising for me as well. At the moment, I do not have a concise explanation for that.
4. The default buffer size of BufferedInputStream of 8 kB, provides already a good performance gain.

These observations lead to some recommendations of mine:

1. As soon as you read from file in Java, always use a BufferedInputStream to dramatically speed up reading.
2. As long as the best buffer size is not known, the default buffer size of BufferedInputStream can be used.
3. I did not do measurements here, but I know from experience, that for every FileOutputStream a BufferedOutputStream also provides a much better performance. Without a better knowledge, we can assume that the throughput speed ups are similar. Therefore, for every FileOutputStream, put a BufferedOutputStream decorator around it.

If I find some time, I will also try to find out what the correct behavior of FileOutputStream is.

Based on the findings and recommendations, I will develop a special FileOutputStream in PureSol Technologies’ Streaming Library. The buffer size will be configurable via a system property. Additionally, I also think about a mechanism to automatically find the best performance during static initialization to maximize performance.

So, have a look from time to time at the library and this blog to find more information on this topic and news on the progress of the library.

Thoughts on the Different Kinds of Test

General Remarks

Testing shall assure the quality of a product and that development does not break the already implemented functionality. The quality is defined in ISO 9126 for example (see: http://rickrainerludwig.wordpress.com/2014/02/10/software-quality-charateristics-in-iso-9126). Testing is a proof of quality and tests are accepted delivery criteria.

Tests can be separated roughly into different kinds of tests:

1. Functional tests
2. GUI tests
3. Non-functional tests
4. Penetration tests

Functional Tests

Functional tests are all intended to run without any GUI interaction. The checks are on functional level and use data to check all kinds of transformation, calculation and handling. These tests assure the correct behavior and results of the application. There are three different kinds of functional tests which are explained below:

1. Unit tests
2. Component tests
3. Integration tests

Unit Tests

Unit tests are intended to test single methods and classes in total isolation on a functional level. Here the implementation of single methods and functions are checked like correct calculations, and correct error handling for invalid parameters. Tests should be performed for valid parameters with a check on the returned results, invalid parameters and the correct error behavior and for corner cases which are to be checked for correct behavior, too.

These tests should be provided by the developers. Unit tests can be created during development as simple ways to check the development code. Test Driven Development (TDD) would be best. Except of file IO no other external infrastructure must be used. Mocking is allowed, if needed, but as it is about single methods and classes, it will not be necessary in most cases.

Since in an optimal setup each method should have at least three unit tests (valid, invalid and corner case parameters), a lot of unit tests are to be expected. Because of this high number of tests, a time constraint should be kept of 0.1s for instance. 10.000 tests (which is around 3000 method, which is 500 classes) would take about 20min. This is a lot of time for a continuous integration test, which uses unit tests.

Component Tests

Component tests test several classes in relationship to each other. In component tests no environment should be involved. The environment is simulated with simple mock-ups. Only the filesystem should be involved in testing to load sample data for instance. For mocking the Mockito framework is used in a lot of projects at PureSol Technologies.

Ideally, these tests would be provided by the developers, too. The scope is also functionality, but with the interference of other classes. Component tests would also be a good check during development.

Component tests can and should be run in a continuous integration way, too. That’s why component tests should also have  a time constraint of 0.1s, too.

Integration Tests

Integration tests check the functionality of the product with environment interaction. These tests also include databases and other server systems which will be in place during normal product usage. The environment for integration tests should be as close as possible to the target environment.

Integration tests tend to run a longer time than component tests and unit tests. These tests should run after the normal continuous integration build separately as a single test stage. Integration tests do not have a fixed time constraint.

GUI Tests

The tests following in this section are tests on GUI level. The tests are either pure UI tests or end-product tests.

Pure UI tests  just check the correct appearance and functioning of the UI itself. These tests should be quick. Together with the integration test of interfaces like REST, the functionality of the final product is quite likely, but not completely proven.

That’s why, product tests can be performed via UI. These tests check the correct functionality and are intended to test the final product like the customer is going to experience it. The test environment setup and configuration shall be as close as possible to the customer installation.

There are just two possibilities for GUI tests:

1. Automated GUI tests
2. Manual GUI tests

Automated GUI Tests

Automated GUI tests check the standard behavior of the GUI and are a good tool for end-product testing.Here the most common use cases shall be tests and not only the happy tests, but also the error behavior.

The tests’ maintainability is quite expensive due to a changing GUI. The tests are easily broken and the repair needs to be done carefully. Therefore, these tests should only check the GUI and not the functionality. The functionality is checked with non-GUI Tests.

Manual GUI Tests

These tests are manual test. These tests are very expensive due to the binding of a lot of people for the task. These tests should be minimized to check the aesthetics of the product like colors, graphs, prints and layout. These tests can not be automated.

Non-Functional Tests

Strictly speaking, GUI tests are also non-functional, but here we need to test everything which is not related to the use cases of the product itself.

The possible tests are:

1. Endurance tests
2. Performance tests
3. Resource tests

Endurance Tests (Iteration Tests)

Endurance tests are tests which test one functionality for a large number of times. For example a certain calculation can be performed 1000 times one after another. You can see several effects here which might influence the quality:

1. Memory Leaks: After a large number of iterations one can see memory leaks. The function should be tested at least as often as it is expected to be run in a single application startup on client customer site.
2. Stability: One can see how stable the function under test is. This can be a mathematical stability as long as the formula relys on some environmental data or the stability in a communication module and so on.

Endurance tests can be coupled with performance tests to save time. During the endurance test the time can be measured for function execution and a statistics can be provided.

Performance Tests

Performance tests test the speed and answer time of applications. Some applications are time critical or time is a measure of service quality. Performance tests assure the correct performance behavior.

Load tests test software applications on heavy load. These tests are important to check how stable an application is on load due to memory consumption, concurrency effects and bottle necks within the applications. Some of these issues would be undetectable without such tests.

Penetration Tests

These tests are done mostly by humans and not robots due to its creative nature. Penetration tests try to break a product through use cases, to show the stability and reliability. The goal is to break the application and to provide information about the breaking mechanism to stabilize and harden the software application.

These kind of tests are explained below. Additionally, there are two other kind of tests: Static Source Code Analysis and Code Reviews. Both can be performed by Developers and/or QA and help to keep the maintainability high.

Test Approches for Green and Brown Field Projects

The picture  on the right side show the relationships between the different test levels. The tests are a kind of test stages in green field projects. Usually, the stages are done from bottom to top and from different parties or members of the time.

Green Field Projects (and also new code in brown field if possible)

Here it is to be pointed out, that the usual order of testing in Green Field Projects (and code writing) is done in the following direction:

1. Writing Code (developer)
2. Checking Code (developer and CI system with plugins)
3. Writing Unit Tests (performed as block in TDD with steps 1 and 2)
4. Writing Component Tests (developer and QA)
5. Writing Integration Tests (developer and QA)
6. Setting up automated GUI tests (QA)
7. Manual Testing (QA and if needed, developers conducted by QA)
8. Endurance Tests (QA)
9. Performance Tests
11. Decoupled: Penetration Tests (continuously by all, but systematically by QA)

The first three steps which go hand in hand and in TDD it seems that the numbering is reorder to 3, 1 and 2, but the code of the tests is also code.

The tests become from one step to the next less detailed, but go more into use cases and real world scenarios. The tests become also more and more realistic in the configuration and setup in relationship to the target environment. Even patch levels need to be controlled, if needed.

Brown Field Projects

In Brown Field Projects it is the other way around. A legacy code base is available and not under test. The approach can not go here to write unit tests first, because it is too much work to do and in most cases it takes a long time to understand the old legacy code. To get a good unit test suite, one calculates roughly to have at least 1 unit test for every 100 lines. In a legacy code base for only 100,000 lines it would mean to write at least 1,000 meaningful unit tests. This work would really suck…

The approach goes here to write integration tests first to assure the basic functionality. The basic use cases need to be right and the functionality is assured. With integration tests not all paths of working can be checked, but the basic behavior, the correct results for the most important use cases and the error handling for the most common error scenarios can be checked.

As soon as new functionality is developed, functionality changed or extended, new integration tests are to be implemented first to assure the still needed functionality stays unchanged and the new functionality is tested as it is required. Unit test and other tests are implemented as needed and practical.

As soon as an integration check fails, the failure analysis will reveal the details of the issue. These details are then check with additionally developed component and unit tests. As long as this procedure is used, more and more meaningful unit tests are developed and the more test coverage is reached.

The same is with GUI tests. At first one has only a chance to perform manual GUI tests. Later on the most common use cases and the general GUI behavior can be tested automatically. This also leads to a better test coverage.

In Brown Field Projects the tests in different tests groups are also developed in parallel. Integration tests are developed in parallel with the performing of manual testing which assures the current ability to deliver of the software. The current situation at hand dictates what is to be done.

Hadoop Client in WildFly – A Difficult Marriage

(This article was triggered by a question “Hadoop Jersey conflicts with Wildfly resteasy” on StackOverflow, because I hit the same wall…)

For a current project, I evaluate the usage of Hadoop 2.7.1 for handling data. The current idea is to use Hadoop’s HDFS, HBase and Spark to handle bigger amount of data (1 TB range, no real Big Data).

The current demonstrator implementation uses Cassandra and Titan as databases. Due to some developments with Cassandra and Titan (Aurelius was aquired by DataStax.), the stack seems not to be future-proof. An alternative is to be evaluated.

The first goal is to use the Hadoop client

in WildFly 9.0.1. (The content of this artical should be also valid for WildFly >=8.1.0.) HDFS is to be used at first to store and retrieve raw files.

Setting up Hadoop in a pseudo distributed mode as it is described in “Hadoop: The Definitive Guide” was a breeze. I was full of hope and added the dependency above to an EJB Maven module and wanted to use the client to connect to HDFS to store and retrieve single files. Here it is, where the problems started…

I cannot provide all stack traces and error messages anymore, but roughly, this is what happend (one after another; when I removed one obstacle, the next came up):

• Duplicate providers for Weld where brought up as errors due to multiple providers in the Hadoop client. Several JARs are loaded as implicit bean archives, because JavaEE annotations are included. I did not expect that and it seems strange to have it in a client library which is mainly used in Java SE context.
• The client dependency is not self contained. During compile time an issue arised due to missing libraries.
• The client libraries contains depencencies which provide web applications. These applications are also loaded and WildFly try to initialize them, but fails due to missing libraries which are set to provided, but not included in WildFly (but maybe in Geronimo?). Again, I am puzzled, why something like that is packaged in a client library.
• Due to providers delivered in sub-dependencies of Hadoop client, the JSON provider was switched from Jackson 2 (default since WildFly 8.1.0) back to Jackson 1 causing infinite recursions in trees I need to marshall into JSON, because the com.fasterxml.jackson.*  annotations were not recognized anymore and the org.codehaus.jackson.*  annotations were not provided.

The issues are manyfold and  is caused by a very strange, no to say messy packaging of Hadoop client.

Following are the solutions so far:

Broken implicte bean archives

Several JARs contain JavaEE annotations which leads to an implicit bean archive loading (see: http://weld.cdi-spec.org/documentation/#4). Implicit bean archive support needs to be switched off. For WildFly, it looks like this:

Change the Weld settings in WildFly’s standalone.conf from

to

This switches the implicit bean archive handling off. All libraries used for CDI need to have a /META-INF/beans.xml  file now. (After switching off the implicit archives, I found a lot of libraries with missing beans.xml  files.)

Services provided, but not working

After switching off the implicit bean archives and added new dependencies to get the project compilied, I run into issues during deployment. Mostly, the issues were missing runtime dependencies due to missing injection providers.

The first goal was to shut off all (hopefully) not needed stuff which was disturbing. I excluded the Hadoop MapReduce Client App and JobClient (no idea what these are for). Additionally, I excluded Jackson dependencies, beacause they are already provided in the WildFly container.

Broken JSON marshalling in RestEasy

After all the fixes above, the project compiled and deployed successfully. During test I found that JSON marshalling was broken due to infinite recursions I got during marshalling of my file trees. I drove me cracy to find out the issue. I was almost sure that WildFly 9 switched the default Jackson implementation back to Jackson 1, but I did not find any release note for that. After a long while and some good luck, I found a  YarnJacksonJaxbJsonProvider class which forces the container to use Jackson 1 instead of Jackson 2 messing up my application…

That was the final point to decide (maybe too late), that I need a kind of calvanic isolation. Hadoop client and WildFly need to talk through a proxy of some kind not sharing any dependencies except of one common interface.

Current Solution

I created now one Hadoop connector EAR archive which contains the above mentioned and fixed Hadoop client dependencies. Additionally, I create a Remote EJB and add it to the EAR which provides the proxy to use Hadoop. The proxy implements a Remote Interface which is also used by the client. The client performs a lookup on the remote interface of the EJB. That setup seems to work so far…

The only drawback in this scenario at the moment is, that I cannot use stream throug EJB, because streams cannot be serialized. I think about creating a REST interface for Hadoop, but I have no idea about the performance. Additionally, will the integration of HBase be as difficult as this!?

For the next versions, maybe a fix can be in place. I found a Jira ticket HDFS-2261 “AOP unit tests are not getting compiled or run”.

Software Engineering and the 4 Causes of Aristotle

I stumbled on the principle of the 4 causes of Aristotle some years ago when I did some reading about cause and effect, leadership and management. The question is: How do we get things done? When do we get things done? And also very important: How do we get things done in the right way?

The Four Causes of Aristotle

Everything what happens has four causes. There are no more and no less. Exactly four. Only if all four causes are present, something can come into existence. That is what Aristotle formulated in one of his most famous books: Physica. (Have also a look here: http://en.wikipedia.org/wiki/Four_causes)

In the next sections I describe the four causes. These causes appear in the order I present them. To give an accessible example, I use the picture of building a house. You will see how the four causes apply there.

Causa Finalis: The final cause

The causa finalis or final cause is the basic reason for anything to happen. You can also translate it as need, requirement or wish. Only with something like that a trigger is present to start some development.

For our house building example, we can think of it like the need to move into a new place because of a lack of space. For instance, a couple lives in a small flat and they are happy, but a child is to be born. They found out, that with a small child the place is too small, there is not a good chance to have a children room, the bath room is to narrow… They find that the current situation is changing and a need for more space is coming up.

This is the causa finalis. A need or requirement which is not detailed, yet. There is just an issue to be solved, but there is not a detailed plan, yet. But, the final result is formulated. In our case: More space.

Causa Formalis: The form

After the causa finalis is met, the second cause happens: The causa formalis or form. The final cause showed an issue and its final abstract solution and now it is thought about it and a plan or form is formulated. The causa formalis brings the vision for how the cause finalis can be solved.

In the house building example, our couple may have thought about renting the flat next to them and re-modelling a wall (what is maybe not allowed by the owner), moving into a bigger flat (which might be too expensive), or to build a house. Maybe, after deciding to build a house, they go to an architect and plan how large the house will be, what it will look like, and so forth. At the end of this process a clear vision exists how the situation is to be solved. After the abstract final result, more space, the plan has now a real, detailed form, but is still not physical, yet.

Causa Materialis: The material

After a clear vision exists due to the causa formalis, this vision needs to come to the final solution somehow. For that the last two causes are needed. The very next is the causa materialis or the material. For anything to happen, the material needs to be present or in other words: The bondary conditions need to be met.

In the house building example, the material is the material to build the house likes stones, concrete, wood and so forth, but also the knowledge on how to build it. The physical material is needed, what is obvious, but also the knowledge. Without these, there is no chance that a house can be build.

As boundary conditions other stuff is needed as well like some ground to put the house onto, time to build it and so forth. This is also part of the causa materialis.

Causa Efficiense: The execution

After the needs started the process, the vision was formed and the material was organized, the last cause is the execution or causa efficiense. A trainer of mine told me once: A vision without execution is just hallucination. He is right. One can dream of the best stuff, to have everything in place and so forth, but without execution nothing happens. That’s what the last cause is about. In this meaning, it also links to the block post I wrote some years before about The Trinity of Action.

In the house building example, this is the actual building of the house. We have everything in place now to perform the actual solution. We have the need to give as drive and energy, we have a vision and plan, we have the material and knowledge, and the last step is to put the material together with the knowledge we have to get the actual house build.

The Dependencies of the Four Causes

The four causes are sorted in another order in most cases, but in my opinion this is not optimal. There is a strict order of appearance in the order I described it above.

The first cause in my opinion is the causa finalis. It is a natures principle that without energy nothing takes place. Without the need, requirement or issue, there is no energy to change a current status quo. There is always a causa finalis needed as a seed for any change. I cannot think of a situation where it is different.

The second cause after the seeding by the causa finalis is always a kind of plan. A lot of actions in our world seem to be performed without plan, but the closer look reveals, that even there is a plan, but maybe not a well thought through one or just based on a pattern or experience, but there is one.

The cause materialis may be also the second reason, because the material might be already there for the solution, but it is not seen as such without a kind of plan. On the other side a plan also reveals what material might be missing which is needed to be organized or waited for.

At the very end the action, the causa efficiense can take place, but without a plan or material nothing can be done.

The principle of the Four Causes helps me a lot during my daily private and professional life, because it helps me to understand what happens on one site, but it also provides me a guideline on how to work in certain situation, because it gives me an order for what to work on.

The 4 Causes in Software Engineering

There is not so much to write anymore. In Software Engineering, this principles are at work as well.

At first a customer has a need to be solved which lead into some requirements which are the causa finalis. It is formulated what the final outcome has to be. After that architectural and design papers are written and planning done which are the causa formalis. With the organization of hardware, software, developers, office space and everything else, the boundary conditions for the actual development are met. The final development is the causa efficiense.

The trick is to reflect the current status of a software project from time to time and think about the four principles to find out, whether all causae are there. If something is missing, the project cannot be finished successfully:

1. Is the causa finalis not met, no customer will pay due to a lack of need.
2. Is the cause formalis not met, no customer will pay, because it is not useful.
3. Is the cause materialis not met, the product cannot be developed, shipped or run. Again, nobody will pay.
4. At least, if the causa efficiense is missing, the actual product is not build and cannot be sold either.

That’s all the secret in here…

JavaEE: Arquillian Tests support Multi-WAR-EARs

Before version 1.0.2 Arquillian did not support EAR deployments which contained multiple WAR files. The issue was the selection of the WAR into which the arquillian artifacts are to be placed for testing. The trick until then was to remove all WAR files not needed and to leave only one WAR within the EAR.

Starting from version 1.0.2 Arquillian does support Multi-WAR-EAR-Deployments as described in http://arquillian.org/blog/2012/07/25/arquillian-core-1-0-2-Final.

If you have an EAR which is used via

you can select a WAR with the following lines:

That’s it. After this selection Arquillian stops complaining about multiple WAR files within the EAR and the selected WAR is enriched and tested.

JavaEE WebSockets and Periodic Message Delivery

For a project I had the need to implement a monitoring functionality based on HTML5 and WebSockets. It is quite trivial with JavaEE 7, as I will explain below.

Let us assume the easy requirement of a simple monitoring which sends periodic status information to web clients. The web client shall show the information on a web page (inside a <div>…</div> for instance). For that scenario, the technical details are shown below…

The JavaScript code is quite easy and can be taken from JavaScript WebSocket books and tutorials. (A good introduction is for instance Java WebSocket Programming by Oracle Press). A simple client might look like:

The functions for onopen, onclose and onerror are neglected, because we want to focus on JavaEE. The important stuff is shown above: We connect with new WebSocket to the URL which shall provide the periodic updated and with onmessage we put the data somewhere into our web page. That’s it from the client site.

For JavaEE, there is a lot of documentation which shows how to create @ServerEndpoint classes. For instance:

But, how to make it send periodic messages easily? After some testing on WildFly 8.2, I came to this simple solution:

The trick is to make the @ServerEndpoint class also an EJB @Singleton. The @Singleton assures that only one instance is living at a time and this instance can keep also the session provided during @OnOpen. In other words: The actual server endpoint instance is exactly the same where the scheduler is running on. If it would not be @Singleton, multiple instances will or may exist and the session field is not set in @Schedule and might lead to a NullPointerException if not checked for.

Short answer: No. But, more efficient.

A happy new year to all of you! This is the first post in 2014 and it is a (not so) short post about a topic which follows me all the time during discussions about high performance computing. During discussions and in projects I get asked about how programs can be programmed to run faster. The problem is, that this mind set is misleading. It always takes me some minutes to explain the correct mind set: Programs cannot run faster, but more efficient to save time.

If we neglect that we can scale vertically by using faster CPUs, faster memory and faster disks, the speed of a computer is constant (by also neglecting CPUs which change there speed so save power). All programs run always with the same speed and we cannot do anything to speed them up by just changing the programming. What we can do is, to use the hardware we got as efficient as possible. The effect is: We get more done in less time. This reduces the program run time and the software seem to run faster. That is what people mean, but looking on efficiency brings the mind set to find the correct leverages on how to decrease run time.

A soon as a program returns the correct results it is effective, but there is also the efficiency which is to be looked at. Have a look to my post about effectiveness and efficiency for more details about the difference between effectiveness and efficiency. To gain efficiency, we can do the following:

Use all hardware available

All cores of a multi-core CPU can be utilized and all CPUs of the system if we have more than one CPU in the system. GPU or physical accelerator cards can be used for calculation if present.

Especially in brown field projects, where the original code comes from single core systems (before 2005 or so) or system which did not have appropriate GPUs (before 2009), developers did not pay attention multi-threaded, heterogeneous programming. These programs have a lot of potential for performance gains.

Look out for:

CPU utilization

Introduce mutli-thread programming into your software. Check the CPU utilization during an actual run and look for CPU idle tines. If there are any, check your software whether it can do something at the time the idle times occur.

GPU utilization

Introduce OpenCL or CUDA into your software to utilize the GPU board or physics accelerator cards if present. Check the utilization of the cards during calculation and look for optimizations.

Data partitioning for optimal hardware utilization

If a calculation does not need too much data, everything should be loaded into memory to have the data present there for efficient access. Data can also organized to have access in different modes for sake of efficiency. But, if there are calculations with amounts of data which do not fit into memory, a good strategy is needed for not to perform calculations on disk.

The data should be partitioned into smaller pieces. These pieces should fit into memory and the calculations on these pieces should run in memory completely. The bandwidth CPU to memory is about 100 to 1000 faster than CPU to disk. If you have done this, check with tools for cache misses and check whether you can optimize this.

The bottle neck for calculations are CPU and/or GPU. They need to be utilized, because only they bring relevant results. All other hardware a facilities around that. So, do everything to keep the CPUs and/or GPUs busy. It is not a good idea to load all data into memory (and let CPU/GPU idle), then start a calcuation (everything is busy) to store the results afterwards (and have the CPU/GPU idle again). Develop you software with dynamic data loading. During the time calculations run, new data can be caught from disk to prepare the next calculations. The next calculations can run during the time the former results are written onto disk.This maybe keeps a CPU core busy with IO, but the other cores do meaningful work and the overall utilization increases.

Do not do unnecessary things

Have a look to my post about the seven muda to get an impression about wastes. All these wastes can be found in software and these lead into inefficiency. Everything which does not directly contribute to the expected results of the software needs to be questioned. Everything which uses CPU power, memory bandwidth and disk bandwidth, but is not directly connected to the requested calculation may be treated as potential waste.

To have a starter look for, check and optimize:

Decide early

Decide early, when to abort loops, what calculations to do and how to proceed. Some decisions are made in code on a certain position, but sometimes these checks can be done earlier in code or before loops, because the information is already present. This is something to be checked. During refactorings there might be other, more efficient positions for these checks. Look out for them.

Validate economically

Do not check in functions the validity of your parameters. Check the model parameters at the beginning of the calculations. Do it once and thoroughly. If these checks are sufficient, there should be no illegal state afterwards related to the input data. So they do not need to be checked permanently.

Let it crash

Check only input parameters of functions or methods if a fail of those be fatal (like returning wrong results). Let there be a NullPointerException, IllegalArgumentException and what so ever if something happens. This is OK and exceptions are meant for situations like that. The calculation can be aborted that way and the exception can be caught in a higher function to abort the software or the calculation gracefully, but the cost to check everything permanently is high. On the other side: What will you do when a negative value come into a square root function with double output or the matrix dimensions do not fit in a matrix multiplication? There is no meaningful way to proceed, but to abort the calculation. Check the input model and everything is fine.

Crash early

Include sanity checks in your calculations. As soon as the calculation is not bringing more precision, runs into a wrong result, gives the first nan or inf values or behaves strangely in any way, abort the calculation and let the computer compute something more meaningful. It is a total waste of resources to let a program run, which does not do anything meaningful anymore. It is also very social to let other people calculate stuff in the meantime.

Organize data for efficient access

I have seen software which looks up data in arrays element wise by scanning from the first element to the position where the data is found. This leads into linear time behavior O(n) for the search. This can be done with binary search for instance which brings logarithmic time behavior O(log(n)). Sometimes, it is also possible to hold data in memory in a not normalized way to have access to it in different ways. Sometimes a mapping is needed from index to data and sometimes the other way around. If memory is not an issue, think about keeping the data in memory twice for optimized access.

Conclusion

I hope, I could show how the focus on efficiency can bring the right insights on how to reduce software run times. The correct mind set helps to identify the weak points in software and the selection of the points above should point out some directions to look into software to find inefficiencies. A starting point is presented, but the way to go is different for every project.