Tag Archives: engineering

Local Indices on Partitioned Tables

I recently wrote an article about a simple model for understanding indices in databases. It may help to read that article first.

In a data processing project I worked on, the data is put into an Oracle database. The data is queried differently based on the questions to be answered. Even worse, the queries are not known upfront, so quick enough access has to be provided for all kinds of usage scenarios.

During the design phase, some decisions were made based on the information available at the time. As the main point of view onto the data was time series based, the decision was made to use partitioned tables with local indices based on the time stamps. In hindsight, that was a mistake, which was corrected later.

Let’s assume we want to store two years of data and the partition key is based on the year and week. For instance, week 50 in 2017 has the partition key 1750. If we store data for 2016 and 2017, we need about 104 partitions with partition keys 1601, 1602, …, 1652, 1701, 1702, … 1751 and 1752. For the sake of simplicity in the following calculations, we assume we have 100 partitions in total. Each partition has local indices on different fields.

What is wrong with this scenario now? As long as we can directly access the table with the partition key, there is nothing wrong. We can create selects like this (table and column names are placeholders):
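
-- placeholder table and column names
SELECT * FROM measurements WHERE week_key = 1750 AND name = 'some_name';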

Oracle knows which partition to look into and which index to use to find the name. But what happens in case we want to look up the name over the whole time range? The query could be something like the following (again with placeholder names):
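
-- no partition key is given, so all partitions have to be searched
SELECT * FROM measurements WHERE name = 'some_name';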

In this case, the query gets terribly slow. But why?

Let’s calculate the time characteristics in Big O Notation:

Let’s assume we have a total number of N samples in the whole time series. In the trivial, unpartitioned case, without any additional index, we have the time characteristic:

t(N) = O(N)

If we put an index onto the table which can be utilized, we get a much better time characteristic of:

t(N) = O(log2(N))

That’s actually why indices are used. The speedup S is:

S(N) = O(N) / O(log2(N)) ~ N / log2(N)

If N is only 1, there is no speedup at all, but for, let’s say, 1000 samples, the speedup is already about 100. That’s great.

But, what happens in case of partitioning? Let’s assume we have in total P partitions. If we can use the partition key only without any additional index, Oracle just accesses the one partition and we get a behavior of

t(N) = O(N/P)

with a total speedup of

S(N) = O(N) / O (N/P) ~ P

In our example above, this is a speedup of a factor of 100, resulting in 1/100 of the query time, because we only need to access 1/100th of the data. Not bad at all.

The same characteristic can be calculated with an additional local index. In this case we get:

t(N) = O(log(N/P))

What happens now in case we search for a name over the whole time series? Because we need to read the whole table again over all partitions with a local index only, this comes down to:

t(N) = O(P * log(N/P))

The problem is, we have to look into each of the P partitions and can use the local index only. This means utilizing an index P times. Compared with an index on an unpartitioned table or with a global index, the speedup is much smaller than one. In our example above with 100 partitions, this leads to:

S(N) = O(log(N)) / O(100 * log(N/100)) ~ 1/100

The reason is that the number of partitions enters linearly, while the performance gain due to the index grows much more slowly.

These calculations show that in cases of random access on partitioned tables, global indices are recommended in order not to slow down the queries.

Simple Model for Understanding Database Indices

I had a lot of discussions in the past about databases and indices. The type of the database is not important, as the general design of indices is always the same. The following model can be applied to relational databases, document stores, big table databases, columnar stores and also to graph databases.

The following model is used to explain what indices are and why they work so well and especially why there are certain restrictions to them:

  1. Multiple single value indices do not help efficiently in where clauses with AND conditions.
  2. For combined indices, the where clauses need to have certain AND clauses to get fully utilized. Missing some of the clauses may lead to the index not being used.
  3. The order of values in combined indices is important when time series or continuous data is involved. It is shown what the correct order is to get good performance.

These three points I have to discuss over and over again, as most people are not database specialists and the inner workings of databases are not well known. The following simple model shall provide some insights.

Storing Data in General

Storing data itself is normally quite simple: just put the stuff to disk. I know, this is not that meaningful, but as a starting point, let’s assume it for now. It is also not important whether the data is put onto disk, into a storage like Hadoop or something else.

For the different databases we can assume we have basic data structures like

  1. Relational Database: Row = Map<ColumnName, Value>
  2. Big Table Database: ColumnGroup = Map<Key, Value>
  3. Document Store: Document = <any data structure like JSON>
  4. Graph Database: Vertex, Edge = <any structure>

The important point is that we have different kinds of data structures; how exactly they look does not matter for now. We just put them onto the disk as a simple binary file.

If we do this, there is only one thing missing: we can’t distinguish them. That’s why we define a data store with Object IDs:

DataStore = Tuple<ID, Object>

We have an object ID which uniquely identifies a certain Object. We can now read the store Object by Object and find the correct one by comparing the ID.

Primary Index

With the store above, we can store any data and retrieve it via the unique ID. The issue is, to find the related Object, we need to read the whole store. This leads to a time characteristic in Big O Notation, with a total number of N samples, of:

t(N) = O(N)

This is a poor characteristic. It can be improved dramatically by adding a primary index. This primary index maps every ID to the storage position of the Object:

PrimaryIndex = Map<ID, DataFileAndPosition>

Do not confuse this primary index with a primary key from a relational database. Here we talk about a technical index to retrieve data from disk efficiently. The primary key in a relational database is a logical construct on the database content for data organization and modelling.

As soon as we have a primary index in place, the time characteristics changes to:

t(N) = O(log(N))

We assume here, that the index is ordered by the IDs and searchable with a binary search.

The speedup S is:

S(N) = O(N) / O(log(N)) ~ N / log(N)

For N = 1000, the speedup is around 100, and for N = 1,000,000 it is around 50,000.

For a simple key-value store, this would be good enough. But what do we do if we look for a specific Object without knowing its ID? We still have to search the whole data store.

Single Value Secondary Indices

To find objects with certain attributes, values or properties (the name depends on the type of database), indices are used. For this to work, we need the possibility to extract these Values out of the stored Object:

Value = Object(Name)

The Name in a relational database is for instance the column name and for a graph database a property name. To find the value, we need to have an index like:

Index = Map<Value, Set<ID>>

What about the time characteristics? It was

t(N) = O(N)

before. Now, we can do an index lookup first, which has a time characteristics like the primary index:

t(N) = O(log(N))

This is pretty good and with the additional time characteristics of the primary index, we get an overall time characteristic of:

t(N) = O(log(N)) + O(log(N)) = 2 * O(log(N)) = O(log(N))

That’s great, isn’t it? Again, we assume, we can access the Value with a binary search.
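
To make the model a bit more tangible, here is a minimal sketch in Java of the structures described above (class and method names are my own illustration, not taken from any concrete database):

import java.util.*;

// Sketch of the model: a primary index (ID -> storage position) and a
// single value secondary index (Value -> set of IDs), both kept in sorted
// maps so lookups behave like the O(log(N)) searches assumed in the text.
public class IndexModel {
    private final NavigableMap<Long, Long> primaryIndex = new TreeMap<>();
    private final NavigableMap<String, Set<Long>> valueIndex = new TreeMap<>();

    public void put(long id, long storagePosition, String value) {
        primaryIndex.put(id, storagePosition);
        valueIndex.computeIfAbsent(value, v -> new TreeSet<>()).add(id);
    }

    // Secondary index lookup followed by primary index lookups:
    // O(log(N)) + O(log(N)) = O(log(N)) per hit, as calculated above.
    public List<Long> findStoragePositions(String value) {
        List<Long> positions = new ArrayList<>();
        for (long id : valueIndex.getOrDefault(value, Collections.emptySet())) {
            positions.add(primaryIndex.get(id));
        }
        return positions;
    }
}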

Definition of Cardinality and Selectivity

For a better understanding of the next chapters and for the sake of simplicity, we need to explain two terms: Cardinality and Selectivity.

Cardinality C is the number of unique values. That means C <= N for a single attribute. In case the Cardinality is known (an index should keep track of it), the number of returned results can be estimated as:

NumberOfResults = N / C

The simple idea is that each unique value occurs equally often. The factor 1 / C can be defined as the Selectivity:

Selectivity = 1 / C

Single value indices in AND clauses

With the explanations above in mind, we can explain why multiple single value indices cannot be combined. Let’s assume a SQL query like this (table and column names are placeholders):
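
-- placeholder table and column names
SELECT * FROM persons WHERE first_name = 'John' AND last_name = 'Doe';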

Additionally, we assume we have single value indices on both columns. What can the database do now?

The first implementation would normally find that both columns have a single value index and use the index with the better Selectivity (the one which filters most and returns the least number of results). It takes the IDs out of it, reads the Objects and filters the results again with the second condition. In this case we have the same behavior as with one index only.

A second implementation could check both available indices and take the ID sets out of both. From both sets the intersection is calculated, and with the result the Objects are read. Depending on the number of affected IDs and the overall number of Objects, this might be less efficient than the first approach.

The only real solution is a combined index.

Combined Secondary Indices

A combined index is just an extension of a single value index with multiple values:

CombinedIndex = Map<Attribute1Value, Map<Attribute2Value, … Map<AttributeNValue, Set<ID>>…>>

Combined Indices and Where Clause

With the definition of the combined index above, we see why it is important to add the correct where clause to a query. Let us define a combined index over three attributes:

CombinedIndex = Map<Attribute1Value, Map<Attribute2Value, Map<Attribute3Value, Set<ID>>>>

As long as the AND clauses in the query are added from attribute1 to attribute3 in the correct order (the order in the query statement itself should not matter), the index can be utilized. For instance, the query (table name is a placeholder)
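
SELECT * FROM objects WHERE attribute1 = ? AND attribute2 = ? AND attribute3 = ?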

is optimal. It runs the query directly on the index and finds all relevant rows.

The same holds true for queries like these (placeholder names again):
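
SELECT * FROM objects WHERE attribute1 = ? AND attribute2 = ?

SELECT * FROM objects WHERE attribute1 = ?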

The only difference is that only the first two maps or the first map can be used to find the set. The remaining maps are scanned. All relevant rows are found with the index and the Objects can be read.

But what about queries like the following (placeholders again)?
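
SELECT * FROM objects WHERE attribute2 = ? AND attribute3 = ?

SELECT * FROM objects WHERE attribute3 = ?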

The index cannot be used efficiently, because the first Map needs to be scanned completely. Some benefit may be gained from the other maps, but it is not guaranteed that the database can do that.

That’s why it is important to put some thought into the correct definition of combined indices. If the last queries are dominant, the order of the attributes should be changed. Maybe an additional index might make sense. It heavily depends on the expected use cases.

Time Series and Continuous Data

For time series data and other continuous data, a special look is needed. For this kind of data, a check for equality makes sense only in rare cases, but creating an index makes sense anyway.

Continuous data (including time series) is queried with inequalities to get defined ranges of the data. In case an index was created, a range scan can take place: with the same binary search algorithm the start of the range is found, and afterwards all values before the end of the range are used to retrieve the IDs. We assume here again that the Values are ordered in the index and a binary search can take place.

Continuous Data in Combined Secondary Indices

Let’s assume we have a table with two attributes, Time and Name. What is the best order for the combined index? The query could look like this (placeholder names again):
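
SELECT * FROM measurements WHERE name = ? AND time >= ? AND time < ?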

For this example let’s assume the selectivity of name is 10% and also the time range is 10%.

If we create the index in the order

Index = Map<NameValue, Map<TimeValue, Set<ID>>>

the database can narrow the search with the name already down to 10% of the index and can perform a range scan over the second map. This results in an index read of only 1% of the size.

If we create the index in the other order

Index = Map<TimeValue, Map<NameValue, Set<ID>>>

the database can perform a range scan over the first map. But, because time is a continuous value, each time value will most likely be unique, and for each time value the name needs to be checked again. This leads to a less efficient retrieval of data and a read of 10% of the index.

In case continuous data is to be put into an index, it is to be expected that it is most efficient to put it at the last position.

Short Summary

The explanations above are maybe not the most accurate, but for me they have served quite well to understand some effects I experienced.

Another article about partitioned tables is in preparation. There, some additional effects need to be considered.

Java: Performance issues with FileInputStream

During the studies of database design with the accompanying toy project DuctileDB, I found some disturbing performance issues.

During the implementation of a needed log structured storage, I found that the performance was terribly slow, only about 1/100 of the performance expected. A simple check of the provided source code of FileInputStream revealed the issue: there is absolutely no optimization included.

As is widely known (at least I expect it to be widely known), file systems are based on blocks and all files are stored cut into these blocks. That’s why a hard disk is also called a block device. Block devices are normally read buffered. To get a good performance, the buffer size needs to be at least the block size of the file system.

Surprisingly, at least for me, FileInputStream just reads every byte as a single byte in the read() method. The poor performance was not a surprise anymore after that finding.

The first test was to just decorate the FileInputStream with a BufferedInputStream. The default buffer size of BufferedInputStream is 8192 bytes or 8kB. I checked, and for my systems the block size is only 512 bytes. The performance increase was astonishing.
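
A minimal sketch of this decoration (file name and explicit buffer size chosen only for illustration):

import java.io.*;

public class BufferedReadExample {
    public static void main(String[] args) throws IOException {
        // Wrapping the FileInputStream in a BufferedInputStream reads the file in larger
        // chunks from the block device; single byte read() calls are then served from memory.
        try (InputStream in = new BufferedInputStream(new FileInputStream("data.bin"), 8192)) {
            int b;
            while ((b = in.read()) != -1) {
                // process the byte
            }
        }
    }
}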

As a studied physicist, I wanted to know it better. So I started some measurements…

Measurement Results

The measurements took place on the systems I had at hand. The measurements were performed with a simple JUnit test. The test is part of PureSol Technologies’ Streaming Library. The class is called SequentialReadPerformanceTest.

The test creates a simple 1MB file and reads the file byte by byte. The measurement was performed ten times and simple statistics were calculated which return a minimum, maximum and of course a mean throughput. For the reading, the buffer size is varied from 0 bytes (a pure FileInputStream) up to 256kB. In the charts the 0 byte buffer is drawn as 0.1kB, because on the logarithmic scale a 0 cannot be shown.

I had three systems at hand:

Windows 10 on SSD:

BufferSize (kB) minThroughput maxThroughput meanThroughput sigmaThroughput
0 0.478313481960232 0.560977316112137 0.53931803772942 0.026823943405206
1 20.7921956295732 159.661168284835 106.530929541767 41.5452866636502
2 135.890172930229 199.804763831118 159.301353868548 21.9993284291508
4 114.080929031247 185.829790831254 162.480645695998 20.4335777478615
8 119.472762811058 207.246561853822 173.82688161485 22.5238524242766
12 152.112263334608 212.048153787975 174.07568558316 18.3455536745854
16 160.007641966543 206.408535265054 177.837585133614 15.110967892046
24 122.760600929588 200.196611853492 174.685818018924 22.1811746856282
32 130.191143000353 205.116448207889 175.602014380743 23.4884718567785
40 189.830287548959 238.60181506275 217.714568123906 14.6494451455874
48 179.301964860046 257.090941811432 211.133906654611 24.3405587930466
56 180.374229186133 253.724040842617 220.05957106526 22.9934585645487
64 169.576206649008 252.181203292139 214.61583945847 22.28789046186
128 172.901240725885 242.260583721361 214.784464231356 22.0707625968981
256 158.073193513394 235.072835911773 198.832874007776 28.4353501136663

This is my development laptop with Windows 10 installed on an SSD. The filesystem is NTFS with a block size of 512 bytes.

VirtualBox on SSD:

BufferSize (kB) minThroughput maxThroughput meanThroughput sigmaThroughput
0 3.30897090085418 3.4685246511367 3.37156905146268 0.058287703014056
1 45.8442138958278 204.489511433226 149.196065250552 54.546756256114
2 169.028569641574 213.338498388938 191.078173076755 16.1087752022564
4 175.832342723626 220.551930904206 204.223397588209 14.7572527618542
8 167.40312316305 212.853555391402 200.302253142487 14.2208722932322
12 174.043085529545 221.548785267493 200.891618472135 15.2731165631726
16 178.626872017884 223.56519816547 204.200373947012 15.2758384736513
24 181.218954353416 225.566526761738 206.516876975593 13.137885697191
32 223.296922708024 265.152151154852 245.022329558141 13.661324942683
40 216.037271505894 256.377367956888 240.406641413718 13.4568412728096
48 194.605412997549 258.278920524087 238.66016565359 18.617845389739
56 218.857124728457 268.919769748526 241.338112649669 16.2068776298014
64 215.220983192018 262.003958883976 240.601463050768 14.6838561989163
128 207.162624999802 258.216081521909 237.674988725484 14.2341770113763
256 202.084528616352 262.459082128095 234.696119688756 19.2935202838983

This VirtualBox runs on the Windows 10 system mentioned above. As guest system an Ubuntu 17.10 is installed. The filesystem is btrfs with a block size of 512 bytes.

KVM on magnetic disks:

BufferSize (kB) minThroughput maxThroughput meanThroughput sigmaThroughput
0 3.50369275983678 3.64641950704795 3.59516953215679 0.042635735226103
1 41.0512347657029 256.14289221404 210.779922949005 78.5887297940257
2 232.373524058612 285.055321462553 263.672741286731 12.7495562706978
4 255.566173924001 272.880340559129 264.701869296095 5.75897219156574
8 256.293085158907 278.836772514138 268.961324482104 6.65276931366453
12 255.882054365566 276.077129595719 267.372542695553 7.72968634110117
16 258.613027035372 275.235002420108 269.684282122623 5.18769137033597
24 171.856825532156 276.072041550241 251.919547286073 27.995974992746
32 252.29570040848 270.142097067889 264.173763360173 5.0979184809125
40 289.992165109599 304.345453637775 295.727761674292 3.72938830994006
48 290.484623991654 299.758266839562 295.794604575198 2.85277669207739
56 291.98249401528 305.188024108252 297.294492472613 3.86242932963834
64 292.089367018439 299.632351890619 294.804240737189 2.2082451746197
128 255.519590730416 302.032878151745 294.733138187711 13.2476788765596
256 285.858694354284 302.92348363624 295.621763218591 5.44402902250886

This is an Ubuntu 16.04 guest inside of KVM running on another Ubuntu 16.04. The host has magnetic disks running in a RAID1 soft raid with a btrfs filesystem with 512 bytes block size. The guest has an XFS file system with a block size of 512 bytes.

Summary and Recommendations

There are several findings in the measurements above:

  1. A FileInputStream without any additional buffer provides a terrible performance. Astonishingly, the difference between the performance of a pure FileInputStream and the maximum performance is almost two orders of magnitude. At least for me, this is surprising.
  2. On the virtual environments, the minimum performance of FileInputStream is not as bad as on the ‘bare metal’ Windows 10. I guess it is related to additional buffering between the host and guest environment. More investigation would help here. I am not sure whether I can do that in the future.
  3. The maximum performance is reached for buffers of 32kB or 40kB. As the block size is only 512 bytes, this is surprising for me as well. At the moment, I do not have a concise explanation for that.
  4. The default buffer size of BufferedInputStream of 8 kB already provides a good performance gain.

These observations lead to some recommendations of mine:

  1. As soon as you read from a file in Java, always use a BufferedInputStream to dramatically speed up reading.
  2. As long as the best buffer size is not known, the default buffer size of BufferedInputStream can be used.
  3. I did not do measurements here, but I know from experience that for every FileOutputStream a BufferedOutputStream also provides a much better performance. Without better knowledge, we can assume that the throughput speedups are similar. Therefore, for every FileOutputStream, put a BufferedOutputStream decorator around it.

If I find some time, I will also try to find out what the correct behavior of FileOutputStream is.

Based on the findings and recommendations, I will develop a special FileOutputStream in PureSol Technologies’ Streaming Library. The buffer size will be configurable via a system property. Additionally, I also think about a mechanism to automatically find the best-performing buffer size during static initialization to maximize performance.
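
The idea could look roughly like this (class name and property name are my own assumptions, not the actual library code):

import java.io.*;

// Sketch only: a buffered file output stream whose buffer size is read from a
// system property, falling back to the BufferedOutputStream default of 8192 bytes.
public class ConfigurableBufferedFileOutputStream extends BufferedOutputStream {

    private static final int DEFAULT_BUFFER_SIZE = 8192;

    public ConfigurableBufferedFileOutputStream(File file) throws FileNotFoundException {
        super(new FileOutputStream(file), bufferSize());
    }

    private static int bufferSize() {
        // e.g. -Dstreaming.buffer.size=32768 on the command line (property name is hypothetical)
        return Integer.getInteger("streaming.buffer.size", DEFAULT_BUFFER_SIZE);
    }
}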

So, have a look from time to time at the library and this blog to find more information on this topic and news on the progress of the library.

Hadoop Client in WildFly – A Difficult Marriage

(This article was triggered by a question “Hadoop Jersey conflicts with Wildfly resteasy” on StackOverflow, because I hit the same wall…)

For a current project, I am evaluating the usage of Hadoop 2.7.1 for handling data. The current idea is to use Hadoop’s HDFS, HBase and Spark to handle a bigger amount of data (1 TB range, no real Big Data).

The current demonstrator implementation uses Cassandra and Titan as databases. Due to some developments around Cassandra and Titan (Aurelius was acquired by DataStax), the stack does not seem to be future-proof. An alternative is to be evaluated.

The first goal is to use the Hadoop client in WildFly 9.0.1. (The content of this article should also be valid for WildFly >= 8.1.0.) HDFS is to be used first to store and retrieve raw files. The Maven dependency is presumably the following, based on the version mentioned above:
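
<!-- presumably the standard Hadoop client artifact for version 2.7.1 -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
</dependency>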

Setting up Hadoop in pseudo-distributed mode as described in “Hadoop: The Definitive Guide” was a breeze. I was full of hope, added the dependency above to an EJB Maven module and wanted to use the client to connect to HDFS to store and retrieve single files. And here is where the problems started…

I cannot provide all stack traces and error messages anymore, but roughly, this is what happened (one after another; when I removed one obstacle, the next came up):

  • Duplicate providers for Weld were brought up as errors due to multiple providers in the Hadoop client. Several JARs are loaded as implicit bean archives, because JavaEE annotations are included. I did not expect that, and it seems strange to have it in a client library which is mainly used in a Java SE context.
  • The client dependency is not self-contained. At compile time, issues arose due to missing libraries.
  • The client libraries contain dependencies which provide web applications. These applications are also loaded and WildFly tries to initialize them, but fails due to missing libraries which are set to provided, but not included in WildFly (but maybe in Geronimo?). Again, I am puzzled why something like that is packaged in a client library.
  • Due to providers delivered in sub-dependencies of the Hadoop client, the JSON provider was switched from Jackson 2 (default since WildFly 8.1.0) back to Jackson 1, causing infinite recursions in trees I need to marshall into JSON, because the com.fasterxml.jackson.* annotations were not recognized anymore and the org.codehaus.jackson.* annotations were not provided.

The issues are manifold and are caused by a very strange, not to say messy, packaging of the Hadoop client.

Following are the solutions so far:

Broken implicit bean archives

Several JARs contain JavaEE annotations, which leads to implicit bean archive loading (see: http://weld.cdi-spec.org/documentation/#4). Implicit bean archive support needs to be switched off. For WildFly, it looks like this:

Change the Weld subsystem settings in WildFly’s standalone.xml so that bean descriptors are required; the configuration is presumably something like the following:
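
<!-- the weld namespace version depends on the WildFly release -->
<subsystem xmlns="urn:jboss:domain:weld:2.0" require-bean-descriptor="true"/>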


This switches the implicit bean archive handling off. All libraries used for CDI now need to have a /META-INF/beans.xml file. (After switching off the implicit archives, I found a lot of libraries with missing beans.xml files.)

Missing dependencies

I added the following dependencies to fix the compile/linkage issues:

Services provided, but not working

After switching off the implicit bean archives and adding new dependencies to get the project compiled, I ran into issues during deployment. Mostly, the issues were missing runtime dependencies due to missing injection providers.

The first goal was to shut off all (hopefully) unneeded stuff which was disturbing. I excluded the Hadoop MapReduce Client App and JobClient (no idea what these are for). Additionally, I excluded the Jackson dependencies, because they are already provided in the WildFly container.
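
The exclusions could look roughly like this (the artifact IDs are my assumption based on the description above, not the exact original list):

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-app</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-mapper-asl</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-core-asl</artifactId>
        </exclusion>
    </exclusions>
</dependency>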

Broken JSON marshalling in RestEasy

After all the fixes above, the project compiled and deployed successfully. During tests I found that JSON marshalling was broken due to infinite recursions I got during marshalling of my file trees. It drove me crazy to find out the issue. I was almost sure that WildFly 9 had switched the default Jackson implementation back to Jackson 1, but I did not find any release note for that. After a long while and some good luck, I found a YarnJacksonJaxbJsonProvider class which forces the container to use Jackson 1 instead of Jackson 2, messing up my application…

That was the final point to decide (maybe too late) that I needed a kind of galvanic isolation: the Hadoop client and WildFly need to talk through a proxy of some kind, not sharing any dependencies except for one common interface.

Current Solution

I have now created one Hadoop connector EAR archive which contains the above mentioned and fixed Hadoop client dependencies. Additionally, I created a Remote EJB and added it to the EAR; it provides the proxy to use Hadoop. The proxy implements a remote interface which is also used by the client. The client performs a lookup on the remote interface of the EJB. That setup seems to work so far…
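
A sketch of the proxy idea (interface and bean names are hypothetical, not the actual project code):

import javax.ejb.Remote;
import javax.ejb.Stateless;

// The remote interface is the only type shared between the client and the
// Hadoop connector EAR; all Hadoop dependencies stay inside the EAR.
@Remote
public interface HadoopConnector {
    byte[] readFile(String hdfsPath);
    void storeFile(String hdfsPath, byte[] content);
}

// In a separate file inside the connector EAR, a stateless bean implements
// the interface and delegates to the Hadoop FileSystem API.
@Stateless
public class HadoopConnectorBean implements HadoopConnector {
    @Override
    public byte[] readFile(String hdfsPath) {
        // sketch only: open the HDFS file via org.apache.hadoop.fs.FileSystem and read it
        throw new UnsupportedOperationException("not implemented in this sketch");
    }

    @Override
    public void storeFile(String hdfsPath, byte[] content) {
        // sketch only: write the byte array to HDFS
        throw new UnsupportedOperationException("not implemented in this sketch");
    }
}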

The only drawback in this scenario at the moment is that I cannot use streams through the EJB, because streams cannot be serialized. I am thinking about creating a REST interface for Hadoop, but I have no idea about the performance. Additionally, will the integration of HBase be as difficult as this!?

For the next versions, maybe a fix can be in place. I found a Jira ticket HDFS-2261 “AOP unit tests are not getting compiled or run”.

Software Engineering and the 4 Causes of Aristotle

I stumbled on the principle of the 4 causes of Aristotle some years ago when I did some reading about cause and effect, leadership and management. The question is: How do we get things done? When do we get things done? And also very important: How do we get things done in the right way?

The Four Causes of Aristotle

Everything that happens has four causes. There are no more and no fewer. Exactly four. Only if all four causes are present can something come into existence. That is what Aristotle formulated in one of his most famous books: Physica. (Have also a look here: http://en.wikipedia.org/wiki/Four_causes)

In the next sections I describe the four causes. These causes appear in the order I present them. To give an accessible example, I use the picture of building a house. You will see how the four causes apply there.

Causa Finalis: The final cause

The causa finalis or final cause is the basic reason for anything to happen. You can also translate it as need, requirement or wish. Only with something like that is a trigger present to start some development.

For our house building example, we can think of it as the need to move into a new place because of a lack of space. For instance, a couple lives in a small flat and they are happy, but a child is to be born. They find out that with a small child the place is too small, there is no good chance to have a children’s room, the bathroom is too narrow… They realize that the current situation is changing and a need for more space is coming up.

This is the causa finalis. A need or requirement which is not detailed, yet. There is just an issue to be solved, but there is not a detailed plan, yet. But, the final result is formulated. In our case: More space.

Causa Formalis: The form

After the causa finalis is met, the second cause comes into play: the causa formalis or form. The final cause showed an issue and its final abstract solution; now it is thought through and a plan or form is formulated. The causa formalis brings the vision for how the causa finalis can be solved.

In the house building example, our couple may have thought about renting the flat next to theirs and remodelling a wall (which is maybe not allowed by the owner), moving into a bigger flat (which might be too expensive), or building a house. Maybe, after deciding to build a house, they go to an architect and plan how large the house will be, what it will look like, and so forth. At the end of this process, a clear vision exists of how the situation is to be solved. After the abstract final result, more space, the plan now has a real, detailed form, but it is still not physical, yet.

Causa Materialis: The material

After a clear vision exists due to the causa formalis, this vision needs to come to the final solution somehow. For that, the last two causes are needed. The very next is the causa materialis or the material. For anything to happen, the material needs to be present, or in other words: the boundary conditions need to be met.

In the house building example, the material is the material to build the house, like stones, concrete, wood and so forth, but also the knowledge of how to build it. The physical material is needed, which is obvious, but also the knowledge. Without these, there is no chance that a house can be built.

As boundary conditions other stuff is needed as well like some ground to put the house onto, time to build it and so forth. This is also part of the causa materialis.

Causa Efficiens: The execution

After the need started the process, the vision was formed and the material was organized, the last cause is the execution or causa efficiens. A trainer of mine once told me: a vision without execution is just hallucination. He is right. One can dream of the best stuff, have everything in place and so forth, but without execution nothing happens. That’s what the last cause is about. In this meaning, it also links to the blog post I wrote some years before about The Trinity of Action.

In the house building example, this is the actual building of the house. We have everything in place now to perform the actual solution. We have the need giving us drive and energy, we have a vision and plan, we have the material and knowledge, and the last step is to put the material together with the knowledge we have to get the actual house built.

The Dependencies of the Four Causes

The four causes are sorted in another order in most cases, but in my opinion this is not optimal. There is a strict order of appearance in the order I described it above.

The first cause, in my opinion, is the causa finalis. It is a principle of nature that without energy nothing takes place. Without the need, requirement or issue, there is no energy to change the current status quo. There is always a causa finalis needed as a seed for any change. I cannot think of a situation where it is different.

The second cause after the seeding by the causa finalis is always a kind of plan. A lot of actions in our world seem to be performed without a plan, but a closer look reveals that even then there is a plan, maybe not a well-thought-through one or just one based on a pattern or experience, but there is one.

The causa materialis may also come second, because the material might already be there for the solution, but it is not seen as such without a kind of plan. On the other hand, a plan also reveals what material might be missing and needs to be organized or waited for.

At the very end, the action, the causa efficiens, can take place, but without a plan or material nothing can be done.

The principle of the Four Causes helps me a lot during my daily private and professional life, because it helps me to understand what happens on one side, but it also provides me a guideline on how to work in certain situations, because it gives me an order for what to work on.

The 4 Causes in Software Engineering

There is not so much to write anymore. In Software Engineering, these principles are at work as well.

At first, a customer has a need to be solved, which leads to some requirements; these are the causa finalis. It is formulated what the final outcome has to be. After that, architectural and design papers are written and planning is done, which is the causa formalis. With the organization of hardware, software, developers, office space and everything else, the boundary conditions for the actual development are met. The final development is the causa efficiens.

The trick is to reflect the current status of a software project from time to time and think about the four principles to find out, whether all causae are there. If something is missing, the project cannot be finished successfully:

  1. If the causa finalis is not met, no customer will pay due to a lack of need.
  2. If the causa formalis is not met, no customer will pay, because the product is not useful.
  3. If the causa materialis is not met, the product cannot be developed, shipped or run. Again, nobody will pay.
  4. Lastly, if the causa efficiens is missing, the actual product is not built and cannot be sold either.

That’s all the secret in here…

5S Methodology and Software Development

The 5S methodology is used to keep the working environment clean, ordered and efficient. I came across this methodology when I was working in a production area of a semiconductor factory. This system was referenced from time to time when internal audits showed some weaknesses with regard to efficiency, order or cleanliness. For a short reference on the 5S methodology, have a look at: http://en.wikipedia.org/wiki/5S_%28methodology%29.

The key topics Sorting, Straightening, Shine, Standardize, and Sustain are quite weak translations of the Japanese words Seiri, Seiton, Seiso, Seiketsu, and Shitsuke, owed to the wish to translate them into English words which also start with the letter S. Nevertheless, they express roughly the idea behind them, and details are explained below. Again, like in the post about the Seven Muda, I also translate these topics into the field of Software Engineering as I understand it.

Sorting (Seiri)

This principle has a strong relation to the Muda Inventory, Over-Processing and Over-Production.

Classic meaning:

The meaning here is: remove everything unnecessary. Check all tools, materials, and machinery and remove everything which is not needed. This cleans out the workspace, makes space and removes distraction. The rate of defects decreases due to a lowered risk of using wrong tools or materials, and more space means fewer incidents.

Software Engineering:

For software engineering it is the same, but it is twofold:

  1. For the development process: Remove all tools and stuff in your workspace, IDE or PC which is not needed. These tools distract the developer; removing them makes the workspace cleaner, as mentioned above, and sometimes also more stable (everyone who uses Eclipse with a lot of plugins knows what I mean).
  2. For Architecture and Design: Remove all components, interfaces, libraries, and systems which are not needed. These are added as soon as they are needed. For example: to implement everything in patterns right from the beginning does not make much sense when it is not needed, yet. The requirements may change, and what was thought at the beginning to be needed will not be needed further on. Only use and implement what is needed and postpone everything else into the future until it is needed.

Not hitting the exact meaning, but also part of it, is duplicate code. Duplicate code is something which is redundant, and redundancy is also something which needs to be cleaned out. Duplicate code is a nightmare for maintainability and should be avoided by all means.

Straightening (Seiton)

This principle has a relationship to Muda Transport, Motion and Waiting.

Classic meaning:

This principle is about straightening the processes. Everything should be processed in an efficient way. Transport ways need to be shortened, motions avoided, unnecessary tasks eliminated and wait times reduced. This principle can only be applied in iterations with close observations.

Software Engineering:

In software engineering this principle can be used as a driving factor for lean architecture and design. The Muda Transport, Motion and Waiting in the post about the Seven Muda give hints where to look in software engineering.

Shine (Seiso)

Classic meaning:

This is about cleaning and ordering the workspace. As soon as everything is clean and ordered, the station is ready for usage. In each shift or on a daily basis, cleaning and ordering should be scheduled. On such a workspace, process flaws and defects are easier to find and the work is easier, cleaner and safer.

Software Engineering:

For software engineering, I would refer to Clean Code and Refactoring. Write clean code and clean the code as soon as bad smells are detected. This keeps the code clean and erosion is prevented. Bugs are easier to spot in clean code and also easier to fix.

With refactoring, the architecture and design stay clean, too. This assures an understandable architecture which supports bug fixing and improvements.

These actions should take place during normal work, but should also be scheduled at the end of sprints, for instance. Code reviews can also be put in place for critical parts of the software to assure the right measure of quality. A time budget from 5% to 50%, depending on the state of the code, should be scheduled. In brown field projects with a lot of legacy code, massive cleanup can help to improve the later development of new functionality dramatically. But also in green field projects, erosion takes place and should be fought with a 5% time budget at least. Have a look at the books Refactoring by Martin Fowler and Clean Code by Robert C. Martin for details, or Working Effectively with Legacy Code by Michael Feathers.

Standardize (Seiketsu)

Classic meaning:

In fabrication, all work stations should be standardized, which means that they should all look, function and feel the same. It is easier to train people on a new station that way, the quality is higher due to a lower defect rate, and it is also cheaper if it is possible to reuse procedures, tools and material.

Software Engineering:

For software engineering, there are two possible meanings:

  1. All developers should use the same tools for development. It is easier then to maintain a development environment where only one IDE is present, one build system, one OS and so forth. Only for testing there may be some variation, but for pure development, it is easier to deal with one kind of tool for one purpose.
  2. Within the software, everything should be handled in a standardized way. So the architecture and the design should define standards. For instance, it could be a standard that all components of a larger system communicate with each other via a REST interface, and that only one REST library is used. It would be worse if the components talked over different protocols like SOAP, RMI, EJB and so forth. For the design it is the same: it should be defined how exception handling is done, how files are handled, which coding guidelines apply, and so forth.

Standards help people identify parts of larger systems more easily. Understandability and maintainability improve dramatically.

Sustain (Shitsuke)

The first four points are hard enough to accomplish, but this sustaining point is even harder. What you did and accomplished in the first four sections is a large step towards an efficient production environment. This is something which is done in the form of a project, but a project is time limited. The real art of 5S is now to keep the state which was accomplished and to even improve it. That’s a huge leap! The goal is to establish a control system which checks the current situation from time to time, in a kind of internal audit, and raises the issues found. The issues should be fixed as soon as possible. With a regular check of the other 4S and an improvement of the findings, the current state can be sustained and even improved. But this needs a lot of attention and energy.

In Software Engineering, the buzzwords Code Smell (or just Smell), Refactoring and Architecture Refactoring come to mind. As soon as there are bad smells, an action needs to be taken to fix the issue. The longer the issue is present, the more it manifests itself and the harder it is to fix. As the Asian proverb says: it is easy to change the direction of a river at its source…

Further Enhancements

Sometimes some enhancements are added. These points express some additions which should be taken care of, too. I only explain them briefly, because they are quite self-explanatory.

Enhancement to 6S: ‘Get Used to it’ (Shukan)

This point is often added to the original 5S. With the checks and fixes in the Sustaining part, people get trained to keep an open eye on 5S. Over time, everybody should develop a habit of fixing everything which is not in order, to have an easier and more efficient life. It is good to create a company culture for 5S.


Enhancement to 7S: Safety

Classic meaning:

This is very easy: keep everything in order and, additionally, watch out for sources of accidents and prepare everything so that these accidents cannot happen.

Software Engineering:

Here it is about the quality of source code and architecture with regard to accidents like crashes, wrong results, instability and so forth.


Enhancement to 8S: Security

Classic meaning:

Keep people out who are not supposed to be in a certain area. Keep secrets secret and confidential data confidential.

Software Engineering:

Build your software in a way that only authenticated people are allowed to change the settings they are authorized to change, and keep data protected from people who are not allowed to see it.

The Seven Muda (Wastes) and Software Engineering

I was introduced to the term Muda (waste) when I was working for a semiconductor fab. I learned that there are seven of them and that a close lookout for these wastes can reduce costs and increase quality dramatically. A short introduction can be found here: http://en.wikipedia.org/wiki/Muda_%28Japanese_term%29.

I want to write about these wastes with a shifted focus. The seven Muda were formulated for classical production and ‘real world’, but as we can see later on, these principles can be used for software engineering, too. Looking out for these wastes can help to make architecture, design and code lean and clean.


Waste of Transport

Classic meaning:

The classic meaning is the reduction of transportation. This means one should look at everything that is transported and how these transports can be minimized. In fabrication it means, for example, that the transport of goods can be optimized by ordering larger quantities, using vendors which are closer by and so forth. The transportation costs can be reduced and the margin can be increased. Transportation does not add value to the product, but it adds risk: during any transport operation there is the risk of breaking, losing or delaying the product.

Software engineering:

Transportation can be translated here, for example, to IO. Avoid transportation over the network, to disk and so on. If IO can be reduced, the timing is better, bandwidth is saved and other applications and systems are not affected negatively by bandwidth exhaustion due to excessive IO of one system. To save IO, good data models should be used for a high reuse of data. Double fetches should be avoided, and only data which is needed is fetched from a DB, for example, instead of reading the whole database just in case… Savings here lead directly to more responsive systems, reduced costs for IO facilities and higher throughput.


Waste of Inventory

Classic meaning:

The meaning of Waste of Inventory is quite easy. It is about the waste of having raw material, finished products and work in progress lying around without the prospect of monetizing it. It may or may not be sold. In this state it is a potential waste and should be avoided. A waste of inventory may lead to selling the products under value or even to dumping them.

Software engineering:

In software engineering inventory is twofold:

  1. Source Code: Writing source code which is not requested by the customer (directly or indirectly) may not lead to revenue. So it is a waste of time and also resources to produce it. Only functionality which brings in revenue is to be developed. Everything else is potential waste, like finished products lying in a warehouse without demand by customers. A lot of money can be saved by letting developers produce software which adds value and brings in money. Also, trial and error development (or Programming by Coincidence, as described in The Pragmatic Programmer by Andrew Hunt et al.) is waste. All development versions which are dumped on the way to the final version are waste. Some thoughts in advance can save a lot of time, money and trouble.
  2. Data: With the ‘Big Data’ discussion, the Waste of Inventory has been put into the public eye again. Every piece of data stored costs some money. Even if one pays only cents per gigabyte of storage, the sheer amount of data makes it expensive. Data should be stored only if needed and selected carefully. Costs for storage can be reduced, and because less data is transported to the storage facilities, transport costs are also reduced.


Waste of Motion

Classic meaning:

While Transport is about bringing the goods from one facility to another or from one machine to another, Motion is about the handling of products during the production process. The more handling is involved during production, the more time is needed for that action, and risk of damage and low quality is added. Motion is also mechanical motion and leads to wear of the machines used. For people it is the same: too much handling of products leads to illnesses and other issues which also cost money in the form of sick days. Avoiding motion reduces costs for maintenance, sick days and broken products.

Software engineering:

Motion in software engineering is not so easy to define. The closest equivalents in my opinion are:

  1. Motion = unnecessary things done in software. This may be one animation too many, which is not needed but might break the application by its presence and wastes CPU time. It may be one storage operation too many, just to save some data temporarily for the case of a power failure, although the data could be recalculated if needed. There might be one watchdog too many. Defensive programming is fine, but too much of it is not needed. Things done unnecessarily lead to a waste of time and resources. The software runs longer and wastes CPU time. Sorry, I do not have a better explanation, but maybe you get the point.
  2. Motion = unnecessary things done during the software development process. This can mean unnecessary work due to Programming by Coincidence caused by missing design sessions. It can also mean writing unnecessary documentation, design papers and such stuff. It can mean unnecessary meetings, conference calls and status presentations.


Waste of Waiting

Classic meaning:

Every product in work which waits for something does not add value, consumes space, and the delays may lead to a bad reputation. All wait times need to be reduced. This can be done by queue management. Have a look at the book The Principles of Product Development Flow: Second Generation Lean Product Development by Donald G. Reinertsen for more information.

Software engineering:

In software engineering, the most obvious waste due to waiting would be a programmed delay or sleep in a program to wait for something. This is obviously not a good design. Better do a design with asynchronous execution and notification. A program should always do something meaningful if possible. Do not wait for something to happen; do something in the meantime and wait for a notification, for example, or have some processes in parallel which fill up the CPU time of a sleeping thread. All waits are a waste of the customer’s time. This should be avoided, otherwise it leads to frustration without adding value to anything.


Waste of Over-Processing

Classic meaning:

In classic fabrication this means: you do something better, more accurately or more beautifully than required. The customer is paying for a product with a negotiated specification. This specification needs to be met, but not more. More work on the product will lead to higher production costs, more time needed and more risk of damage, without monetary compensation.

Pay attention: over-processing can be a part of a marketing strategy and a customer satisfaction program. By over-delivering, a customer may be positively surprised, which may lead to a returning customer, a higher order the next time and so forth. This is not over-processing as it is meant above; this is part of a strategy which brings higher revenue in the future.

Software engineering:

Over-processing is quite the same as in classic engineering. A software product calculates more accurately than needed. The performance tuning was done extensively to get the last microseconds out of the calculations. And there are many more things like that. As soon as the product is good enough, we should stop working on features that are already done. It does not bring more value to the customer.

Here, too: please pay attention to over-delivering. This is a magic tool if done right. See above in the classic engineering section.


Waste of Over-Production

Classic meaning:

Over-production is simply the production of more pieces of a product than needed at the time of production. There is a risk that not all products which were produced can be sold. The avoidance of over-production reduces costs, reduces the amount of resources needed for production and is also good for the environment.

Software engineering:

Over-production has two meanings in software engineering, as far as I can see it:

  1. Over-production of results: A software product which produces more results than needed wastes resources and time. This is not what customers want, and it is also nothing they want to pay for. At least, provide configuration possibilities.
  2. Over-production of features: In software engineering (as in all engineering disciplines), engineers tend towards over-engineering. Full-blooded engineers want to make the product perfect, feature rich, shiny and so forth. This might lead to feature bloat. Every feature which is not requested by the customer does not add value. A customer will not pay more money for functions they do not want to use. That’s why a lot of products come in different flavors like community, basic and enterprise versions. The customer chooses which features are needed and pays for exactly those.


Waste of Defects

Classic meaning:

Defective products need repairing or, if they cannot be repaired, need to be dumped. Both choices cost money. Additionally, the reputation is influenced negatively, which costs future money due to customers not wanting to pay again for a product from the same manufacturer. It becomes even worse as soon as the defect damages something on the customer’s site and the customer asks (im-)politely for redress. A good customer support division can compensate a lot, but this is expensive, too. So: defects should be avoided. They always waste a lot of money.

Software engineering:

For software defects, the same facts are valid as for mechanical engineering. Defects cost money and reputation. So, the best is not to have any. Avoiding defects by extensive testing and quality control is cheaper than handling angry customers, doing failure analysis, bug fixing, patch releasing and losing future customers.

Additionally as 8th Muda: Latent skill

There is an additional, unofficial 8th Muda: latent skill. Officially, it is about utilizing the skills of employees. People who were hired to fill a certain position might be able to do much more or more valuable work than what the position requires. These people should be given an opportunity to grow and do what they are capable of. Additionally, a lot of employees want to learn more and want to be trained. It is not only about getting a higher salary, but also about personal growth and satisfaction.

In my opinion there is another side to Latent Skill Waste: it is about machinery. Some high-tech machines are capable of doing more than they were bought to do. They can be utilized further if it is possible. In IT, this is where cloud computing was invented. It is partly waste by waiting and partly waste by latent skill when servers are not utilized due to too little work. With cloud computing, the utilization of servers can be increased. This utilization comes in two flavors: doing more of the same work (reducing waste of waiting) or running other services in parallel (reducing waste due to latent skill). A higher utilization means more revenue and therefore more profit, because the depreciation costs are the same.

A Final Thought

The Muda are not meant to be used for cost reduction in the first place. That mindset is not correct, in my opinion. The Muda are about efficiency. To do cost reduction, efficiency needs to be increased, that is correct. But thinking about cost reduction only leads to decisions which might hurt quality and effectiveness. I might write about the difference in mindset and practical approach later in another post.

Thoughts on the Agile Manifesto

From time to time, I discuss the agile methodology with clients and friends. The Agile Manifesto was published at http://agilemanifesto.org about 12 years ago. It was debated a lot and the debates are still going on. I try to present here a small insight I had during the last years.

The main statement is

“Manifesto for Agile Software Development

We are uncovering better ways of developing
software by doing it and helping others do it.
Through this work we have come to value:

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

That is, while there is value in the items on
the right, we value the items on the left more.”

Interestingly, the focus is shifted from the product, its documentation and the technical process (the planning) to customer focus and customer satisfaction. This is also part of another system which is called Total Quality Management (http://en.wikipedia.org/wiki/Total_quality_management). The focus shift is very obvious and necessary when we think about the only income source each company has: the customer. The customer (or different customers if an organization has different services to offer) is the only source of income and therefore of revenue, profit and growth. Any revenue increase can only happen when customers pay more for services or products or when more customers are willing to spend money on the company’s services or products. It is therefore obvious that the focus needs to be on the customer and that the organization needs to be aligned to meet the needs and expectations of customers.

That’s why the first point about ‘Individuals and interactions’ is the most important one. Translated into simple actions it means: identify your customers, treat them individually and implement processes for easy communication and interaction. Only customers can tell you what they need, what they expect and what they are ready to pay for. Individual customers treated well will tell you in more detail what they need and bring new business ideas. Ask a group of people and there is no detail. But ask a single individual and listen closely, and you might get a lot of insights.

In software development, the main purpose of organizations is delivering working software and systems. These are the primary needs of their customers. They do not need a fancy manual telling them what the software might be able to do after spending hours reading it and trying things out in tutorials; they need software which brings business value. That’s why the second point is important. Have running, valuable software which is self-explanatory and the customer is willing to pay for it. You can save a tree by dropping the printed documentation. Have a look at Apple products: how many manuals are sold with this complex and feature-rich software? Some online help and a self-explanatory UI, and everything is fine. This is one of the fundamentals of Total Quality Management: only the customer can tell you what she wants, and she is also the one who pays. Is there another way to work with that knowledge?

The third point is an enhancement of the first point. If the chance is there, try to work closely together with your customer. In Scrum and XP this is done via short release cycles and demos to show the customer progress on a regular basis after each release cycle and to ask for critique, comments and new ideas. It helps to deliver software which is valuable for the customer and therefore is paid for. An even better idea is to embed a representative of the customer into the development team. The responses are immediate and the customer’s acceptance testing is done the whole time. The possibility of developing software which the customer does not want to pay for is reduced dramatically. And again: the customer pays for the product. There is no better way to make a product than to build it together with your customer, and when the customer is part of the team, she is even more engaged and willing to help with development. At the very end, the willingness to pay is much higher when the product was kind of custom built.

By doing all this, be prepared: with each demo, feedback session and communication with the customer, there might be new ideas, comments and critique. The requirements are about to change on a daily basis. That’s what the fourth value is about: be open to changes. Customers only have a vague idea at the beginning about what they want. But during the development, more ideas arise, some faulty ideas are dropped and new wishes pop up. That’s kind of normal and part of the process. This helps to make the product better in the end, and the business value is increased. A product like that can be priced higher, though. What is better than that?

The Killer Difference between Mechanical Engineering and Software Engineering

I always run into the same discussions when it comes to project management and quality assurance in software development. Some people (mostly people without a technical background in software development) do not really understand why classic project management and quality assurance methods fail in software development. Here is the killer argument why this does not work:

In hardware engineering, for each piece of hardware product the same procedure is applied repeatedly. In software, the procedure is applied once and then the product is copied.

This is now my killer argument in most discussions related to that topic to end debates about why these methods fail. The difference should be easy to understand. For a deeper understanding, I explain a little more:

  1. Hardware engineering has a lot of physical constraints which limit the choices, and therefore timelines are more predictable.
  2. In hardware engineering it is not so easy to define new requirements, and the requirements do not change on a regular basis like in software development. A car needs to go fast and be safe, but it is not intended to dive from one day to the next. Nobody would think about requesting something like this and expect the car to be able to dive in the next release within a short period of time. For software this is a common pattern, because it is just easily put in, there is only some code to be written…
  3. In hardware production, due to its repetitive nature, one can track Cp and Cpk values and drive the process accordingly. After some prototypes (which are fixed manually later or disposed of), the production process is optimized and controlled. The products become better in quality over time due to continuous improvements. That’s classic quality assurance. For software: a product is produced once (and hopefully well tested), but shipped not just once, but up to thousands of times until a bug is reported. Hardware products are each a bit special; software is copied.

I could go on in this way. I just take the classic methodologies and apply them to software development with this difference in mind.