
Hadoop Client in WildFly – A Difficult Marriage

(This article was triggered by the question „Hadoop Jersey conflicts with Wildfly resteasy“ on StackOverflow, because I hit the same wall…)

For a current project, I am evaluating the use of Hadoop 2.7.1 for handling data. The current idea is to use Hadoop’s HDFS, HBase and Spark to handle larger amounts of data (1 TB range, so no real Big Data).

The current demonstrator implementation uses Cassandra and Titan as databases. Due to some developments around Cassandra and Titan (Aurelius was acquired by DataStax), the stack does not seem future-proof, so an alternative is to be evaluated.

The first goal is to use the Hadoop client in WildFly 9.0.1. (The content of this article should also be valid for WildFly >= 8.1.0.) HDFS is to be used first to store and retrieve raw files.
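The dependency in question is the hadoop-client artifact; a minimal sketch of the Maven declaration (standard coordinates for Hadoop 2.7.1) might look like this:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.1</version>
    </dependency>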

Setting up Hadoop in pseudo-distributed mode as described in „Hadoop: The Definitive Guide“ was a breeze. I was full of hope, added the dependency above to an EJB Maven module and wanted to use the client to connect to HDFS to store and retrieve single files. This is where the problems started…

I cannot provide all stack traces and error messages anymore, but roughly, this is what happened (one after another; when I removed one obstacle, the next came up):

  • Duplicate providers for Weld were reported as errors due to multiple providers in the Hadoop client. Several JARs are loaded as implicit bean archives, because JavaEE annotations are included. I did not expect that, and it seems strange to have this in a client library which is mainly used in a Java SE context.
  • The client dependency is not self-contained. At compile time, issues arose due to missing libraries.
  • The client library contains dependencies which provide web applications. These applications are also loaded and WildFly tries to initialize them, but fails due to missing libraries which are set to provided, but not included in WildFly (but maybe in Geronimo?). Again, I am puzzled why something like that is packaged in a client library.
  • Due to providers delivered in sub-dependencies of the Hadoop client, the JSON provider was switched from Jackson 2 (the default since WildFly 8.1.0) back to Jackson 1, causing infinite recursions in the trees I needed to marshal into JSON, because the com.fasterxml.jackson.* annotations were no longer recognized and the org.codehaus.jackson.* annotations were not provided.

The issues are manifold and are caused by the very strange, not to say messy, packaging of the Hadoop client.

Following are the solutions so far:

Broken implicit bean archives

Several JARs contain JavaEE annotations, which leads to implicit bean archive loading (see http://weld.cdi-spec.org/documentation/#4). Implicit bean archive support needs to be switched off. For WildFly, it looks like this:

Change the Weld subsystem settings in WildFly’s standalone.xml.
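A sketch of the change, using the require-bean-descriptor attribute of the Weld subsystem (the exact subsystem namespace version depends on the WildFly release):

    <!-- before: implicit bean archives are scanned by default -->
    <subsystem xmlns="urn:jboss:domain:weld:2.2"/>

    <!-- after: only archives with a beans.xml are treated as bean archives -->
    <subsystem xmlns="urn:jboss:domain:weld:2.2" require-bean-descriptor="true"/>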

This switches the implicit bean archive handling off. All libraries used for CDI now need to have a /META-INF/beans.xml file. (After switching off the implicit archives, I found a lot of libraries with missing beans.xml files.)

Missing dependencies

I added the following dependencies to fix the compile/linkage issues:

Services provided, but not working

After switching off the implicit bean archives and adding new dependencies to get the project compiled, I ran into issues during deployment. Mostly, the issues were missing runtime dependencies due to missing injection providers.

The first goal was to shut off all the (hopefully) unneeded stuff which was getting in the way. I excluded the Hadoop MapReduce Client App and JobClient (no idea what these are for). Additionally, I excluded the Jackson dependencies, because they are already provided in the WildFly container.
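A sketch of what such exclusions could look like in the POM; the artifact list is an assumption based on the Hadoop 2.7.1 dependency tree, not necessarily the complete set:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.1</version>
        <exclusions>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-mapreduce-client-app</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-core-asl</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-mapper-asl</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-jaxrs</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-xc</artifactId>
            </exclusion>
        </exclusions>
    </dependency>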

Broken JSON marshalling in RestEasy

After all the fixes above, the project compiled and deployed successfully. During testing I found that JSON marshalling was broken due to infinite recursions during marshalling of my file trees. It drove me crazy to find the cause. I was almost sure that WildFly 9 had switched the default Jackson implementation back to Jackson 1, but I did not find any release note for that. After a long while and some good luck, I found the YarnJacksonJaxbJsonProvider class, which forces the container to use Jackson 1 instead of Jackson 2, messing up my application…

That was the point at which I decided (maybe too late) that I need a kind of galvanic isolation: the Hadoop client and WildFly need to talk through a proxy of some kind, not sharing any dependencies except for one common interface.

Current Solution

I have now created one Hadoop connector EAR archive which contains the above mentioned and fixed Hadoop client dependencies. Additionally, I created a remote EJB and added it to the EAR; it provides the proxy used to talk to Hadoop. The proxy implements a remote interface which is also used by the client, and the client performs a lookup on the remote interface of the EJB. That setup seems to work so far…
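A minimal sketch of the proxy idea; the interface name, its methods and the HDFS URI are assumptions, not the actual project code:

    // HadoopConnector.java -- remote interface shared between the connector EAR and its clients
    import javax.ejb.Remote;

    @Remote
    public interface HadoopConnector {
        void storeFile(String path, byte[] content);
        byte[] retrieveFile(String path);
    }

    // HadoopConnectorBean.java -- the proxy EJB living inside the connector EAR
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URI;

    import javax.ejb.EJBException;
    import javax.ejb.Stateless;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    @Stateless
    public class HadoopConnectorBean implements HadoopConnector {

        // The HDFS URI is an assumption; in a real setup it would be configurable,
        // and the FileSystem lifecycle would be managed more carefully.
        private static final URI HDFS_URI = URI.create("hdfs://localhost:9000");

        @Override
        public void storeFile(String path, byte[] content) {
            try (FileSystem fs = FileSystem.get(HDFS_URI, new Configuration());
                    OutputStream out = fs.create(new Path(path))) {
                out.write(content);
            } catch (IOException e) {
                throw new EJBException("Could not store file in HDFS: " + path, e);
            }
        }

        @Override
        public byte[] retrieveFile(String path) {
            try (FileSystem fs = FileSystem.get(HDFS_URI, new Configuration());
                    InputStream in = fs.open(new Path(path))) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                IOUtils.copyBytes(in, buffer, 4096);
                return buffer.toByteArray();
            } catch (IOException e) {
                throw new EJBException("Could not read file from HDFS: " + path, e);
            }
        }
    }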

The only drawback of this scenario at the moment is that I cannot use streams through the EJB, because streams cannot be serialized. I am thinking about creating a REST interface for Hadoop, but I have no idea about the performance. Additionally, will the integration of HBase be as difficult as this!?

Maybe a fix can be in place for one of the next versions. I found the Jira ticket HDFS-2261 „AOP unit tests are not getting compiled or run“.

JavaEE: Arquillian Tests support Multi-WAR-EARs

Before version 1.0.2, Arquillian did not support EAR deployments which contained multiple WAR files. The issue was the selection of the WAR into which the Arquillian artifacts are to be placed for testing. The trick until then was to remove all WAR files that were not needed and to leave only one WAR within the EAR.

Starting from version 1.0.2, Arquillian does support Multi-WAR-EAR deployments, as described in http://arquillian.org/blog/2012/07/25/arquillian-core-1-0-2-Final.

If you have an EAR which is used as the test deployment, you can select the WAR to be tested with a few lines:
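A minimal sketch of such a deployment, assuming ShrinkWrap-built archives and using Testable.archiveToTest() to mark the WAR to be tested (archive names are placeholders):

    import org.jboss.arquillian.container.test.api.Deployment;
    import org.jboss.arquillian.container.test.api.Testable;
    import org.jboss.shrinkwrap.api.ShrinkWrap;
    import org.jboss.shrinkwrap.api.spec.EnterpriseArchive;
    import org.jboss.shrinkwrap.api.spec.WebArchive;

    public class MultiWarDeployments {

        @Deployment
        public static EnterpriseArchive createDeployment() {
            // The WAR that should be enriched and tested by Arquillian.
            WebArchive testableWar = ShrinkWrap.create(WebArchive.class, "ui.war");

            // Another WAR that simply stays part of the EAR.
            WebArchive otherWar = ShrinkWrap.create(WebArchive.class, "admin.war");

            return ShrinkWrap.create(EnterpriseArchive.class, "app.ear")
                    .addAsModule(Testable.archiveToTest(testableWar))
                    .addAsModule(otherWar);
        }
    }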

That’s it. After this selection, Arquillian stops complaining about multiple WAR files within the EAR, and the selected WAR is enriched and tested.

JavaEE WebSockets and Periodic Message Delivery

For a project I needed to implement monitoring functionality based on HTML5 and WebSockets. It is quite trivial with JavaEE 7, as I will explain below.

Let us assume the easy requirement of a simple monitoring which sends periodic status information to web clients. The web client shall show the information on a web page (inside a <div>…</div> for instance). For that scenario, the technical details are shown below…

The JavaScript code is quite easy and can be taken from JavaScript WebSocket books and tutorials. (A good introduction is for instance Java WebSocket Programming by Oracle Press). A simple client might look like:
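A minimal sketch of such a client; the endpoint URL and the target element id are assumptions:

    var socket = new WebSocket("ws://localhost:8080/monitoring/status");

    socket.onmessage = function (event) {
        // Show the periodic status message inside the <div id="status">...</div>.
        document.getElementById("status").innerHTML = event.data;
    };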

The functions for onopen, onclose and onerror are left out, because we want to focus on JavaEE. The important stuff is shown above: we connect with new WebSocket to the URL which provides the periodic updates, and in onmessage we put the received data somewhere into our web page. That’s it from the client side.

For JavaEE, there is a lot of documentation which shows how to create @ServerEndpoint classes. For instance:

But how can it be made to send periodic messages easily? After some testing on WildFly 8.2, I came up with this simple solution:
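A minimal sketch of that solution (endpoint path and schedule interval are assumptions; the idea itself is explained below):

    import java.time.LocalDateTime;

    import javax.ejb.Schedule;
    import javax.ejb.Singleton;
    import javax.websocket.OnClose;
    import javax.websocket.OnOpen;
    import javax.websocket.Session;
    import javax.websocket.server.ServerEndpoint;

    @Singleton
    @ServerEndpoint("/monitoring/status")
    public class MonitoringEndpoint {

        // With @Singleton there is exactly one endpoint instance, so the session
        // opened in @OnOpen can simply be kept in a field.
        private Session session;

        @OnOpen
        public void onOpen(Session session) {
            this.session = session;
        }

        @OnClose
        public void onClose(Session session) {
            this.session = null;
        }

        // Push a status message every 10 seconds (the interval is an assumption).
        @Schedule(hour = "*", minute = "*", second = "*/10", persistent = false)
        public void sendStatus() {
            if (session != null && session.isOpen()) {
                session.getAsyncRemote().sendText("status at " + LocalDateTime.now());
            }
        }
    }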

The trick is to make the @ServerEndpoint class also an EJB @Singleton. The @Singleton ensures that only one instance is alive at a time, and this instance can also keep the session provided during @OnOpen. In other words: the actual server endpoint instance is exactly the same one the scheduler runs on. If it were not a @Singleton, multiple instances might exist, and the session field would not be set in @Schedule, which could lead to a NullPointerException if not checked for.

Can Programs Be Made Faster?

Short answer: No. But, more efficient.

A happy new year to all of you! This is the first post in 2014, and it is a (not so) short post about a topic which follows me all the time in discussions about high performance computing. In discussions and in projects I get asked how programs can be written to run faster. The problem is that this mindset is misleading. It always takes me some minutes to explain the correct mindset: programs cannot run faster, but they can run more efficiently and therefore save time.

If we neglect that we can scale vertically by using faster CPUs, faster memory and faster disks, the speed of a computer is constant (also neglecting CPUs which change their speed to save power). All programs always run at the same speed, and we cannot do anything to speed them up just by changing the programming. What we can do is use the hardware we have as efficiently as possible. The effect is: we get more done in less time. This reduces the program run time, and the software seems to run faster. That is what people mean, but looking at efficiency puts the mind in the right place to find the correct levers for decreasing run time.

As soon as a program returns the correct results it is effective, but there is also the efficiency to be looked at. Have a look at my post about effectiveness and efficiency for more details about the difference between the two. To gain efficiency, we can do the following:

Use all hardware available

All cores of a multi-core CPU can be utilized, as well as all CPUs of the system if we have more than one CPU. A GPU or physical accelerator cards can be used for calculations if present.

Especially in brown-field projects, where the original code comes from single-core systems (before 2005 or so) or from systems which did not have appropriate GPUs (before 2009), developers did not pay attention to multi-threaded, heterogeneous programming. These programs have a lot of potential for performance gains.

Look out for:

CPU utilization

Introduce multi-threaded programming into your software. Check the CPU utilization during an actual run and look for CPU idle times. If there are any, check whether your software can do something useful at the times the idle periods occur.
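As a hedged illustration (hypothetical code, not taken from any particular project), splitting a computation across all available cores in Java might look like this:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelSum {

        public static double sum(double[] data) throws InterruptedException, ExecutionException {
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);
            try {
                int chunk = (data.length + cores - 1) / cores;
                List<Future<Double>> parts = new ArrayList<>();
                // Hand one chunk of the array to each core.
                for (int start = 0; start < data.length; start += chunk) {
                    final int from = start;
                    final int to = Math.min(start + chunk, data.length);
                    parts.add(pool.submit(() -> {
                        double s = 0.0;
                        for (int i = from; i < to; ++i) {
                            s += data[i];
                        }
                        return s;
                    }));
                }
                // Combine the partial results.
                double total = 0.0;
                for (Future<Double> part : parts) {
                    total += part.get();
                }
                return total;
            } finally {
                pool.shutdown();
            }
        }
    }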

GPU utilization

Introduce OpenCL or CUDA into your software to utilize the GPU or physics accelerator cards if present. Check the utilization of the cards during calculations and look for optimizations.

Data partitioning for optimal hardware utilization

If a calculation does not need too much data, everything should be loaded into memory to have the data present for efficient access. Data can also be organized to allow access in different modes for the sake of efficiency. But if there are calculations with amounts of data which do not fit into memory, a good strategy is needed to avoid performing calculations on disk.

The data should be partitioned into smaller pieces. These pieces should fit into memory, and the calculations on these pieces should run completely in memory. The bandwidth from CPU to memory is about 100 to 1000 times higher than from CPU to disk. If you have done this, check with tools for cache misses and see whether you can optimize them.

Intelligent, parallel data loading

The bottlenecks for calculations are the CPU and/or GPU. They need to be utilized, because only they produce the relevant results; all other hardware only supports them. So, do everything to keep the CPUs and/or GPUs busy. It is not a good idea to load all data into memory (and let the CPU/GPU idle), then start a calculation (everything is busy) and store the results afterwards (with the CPU/GPU idle again). Develop your software with dynamic data loading. While calculations run, new data can be fetched from disk to prepare the next calculations, and the next calculations can run while the former results are written to disk. This may keep one CPU core busy with I/O, but the other cores do meaningful work and the overall utilization increases.
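A hedged sketch of this idea (hypothetical code): one loader thread keeps a queue filled from disk while worker threads run the calculations, so the cores do not wait for I/O:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class LoadComputePipeline {

        private static final byte[] POISON_PILL = new byte[0];

        public static void run(Iterable<byte[]> chunksFromDisk, int workers) throws InterruptedException {
            BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(16);
            ExecutorService pool = Executors.newFixedThreadPool(workers);

            // Workers: take the next chunk and compute while the loader reads ahead.
            for (int i = 0; i < workers; ++i) {
                pool.submit(() -> {
                    try {
                        for (byte[] chunk = queue.take(); chunk != POISON_PILL; chunk = queue.take()) {
                            compute(chunk);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }

            // Loader: keeps the queue filled so the workers stay busy.
            for (byte[] chunk : chunksFromDisk) {
                queue.put(chunk);
            }
            for (int i = 0; i < workers; ++i) {
                queue.put(POISON_PILL); // one stop marker per worker
            }
            pool.shutdown();
        }

        private static void compute(byte[] chunk) {
            // Placeholder for the actual calculation on one chunk of data.
        }
    }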

Do not do unnecessary things

Have a look at my post about the seven muda to get an impression of the wastes. All these wastes can be found in software, and they lead to inefficiency. Everything which does not directly contribute to the expected results of the software needs to be questioned. Everything which uses CPU power, memory bandwidth or disk bandwidth, but is not directly connected to the requested calculation, may be treated as potential waste.

As a starter, look for, check and optimize the following:

Decide early

Decide early when to abort loops, which calculations to do and how to proceed. Some decisions are made at a certain position in the code, but sometimes these checks can be done earlier in the code or before loops, because the information is already present. This is something to be checked; during refactorings there might be other, more efficient positions for these checks. Look out for them.
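A tiny, hypothetical illustration: the decision does not depend on the loop variable, so it can be made once before the loop instead of on every iteration:

    public class DecideEarly {

        interface Model {
            double approximate(double x);
            double computeExactly(double x);
        }

        // Before: the mode check is evaluated on every iteration.
        static double[] transformNaive(double[] samples, Model model, boolean useFastApproximation) {
            double[] result = new double[samples.length];
            for (int i = 0; i < samples.length; ++i) {
                result[i] = useFastApproximation
                        ? model.approximate(samples[i])
                        : model.computeExactly(samples[i]);
            }
            return result;
        }

        // After: decide once, then run a tight loop.
        static double[] transform(double[] samples, Model model, boolean useFastApproximation) {
            double[] result = new double[samples.length];
            if (useFastApproximation) {
                for (int i = 0; i < samples.length; ++i) {
                    result[i] = model.approximate(samples[i]);
                }
            } else {
                for (int i = 0; i < samples.length; ++i) {
                    result[i] = model.computeExactly(samples[i]);
                }
            }
            return result;
        }
    }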

Validate economically

Do not check the validity of your parameters in every function. Check the model parameters at the beginning of the calculation; do it once and thoroughly. If these checks are sufficient, there should be no illegal state related to the input data afterwards, so it does not need to be checked permanently.

Let it crash

Check input parameters of functions or methods only if a failure would be fatal (like returning wrong results). Let there be a NullPointerException, an IllegalArgumentException or whatever if something happens. This is OK, and exceptions are meant for situations like that. The calculation can be aborted that way, and the exception can be caught in a higher function to abort the software or the calculation gracefully; the cost of checking everything permanently, however, is high. On the other side: what will you do when a negative value comes into a square root function with double output, or when the matrix dimensions in a matrix multiplication do not fit? There is no meaningful way to proceed other than to abort the calculation. Check the input model and everything is fine.

Crash early

Include sanity checks in your calculations. As soon as the calculation is not gaining any more precision, runs into a wrong result, gives the first NaN or Inf values, or behaves strangely in any other way, abort the calculation and let the computer compute something more meaningful. It is a total waste of resources to let a program run which does not do anything meaningful anymore. It is also very social to let other people calculate their stuff in the meantime.

Organize data for efficient access

I have seen software which looks up data in arrays element by element, scanning from the first element to the position where the data is found. This leads to linear time behavior, O(n), for the search. With binary search, for instance, this can be done with logarithmic time behavior, O(log n). Sometimes it is also possible to hold data in memory in a denormalized way to have access to it in different ways: sometimes a mapping is needed from index to data and sometimes the other way around. If memory is not an issue, think about keeping the data in memory twice for optimized access.
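As a hedged illustration (hypothetical code), keeping the data sorted allows a binary search instead of a linear scan:

    import java.util.Arrays;

    public class LookupExample {

        // O(n): scan from the first element until the value is found.
        static int linearSearch(long[] ids, long id) {
            for (int i = 0; i < ids.length; ++i) {
                if (ids[i] == id) {
                    return i;
                }
            }
            return -1;
        }

        // O(log n): requires the array to be sorted once up front.
        static int binarySearch(long[] sortedIds, long id) {
            int index = Arrays.binarySearch(sortedIds, id);
            return index >= 0 ? index : -1;
        }
    }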

Conclusion

I hope I could show how a focus on efficiency brings the right insights on how to reduce software run times. The correct mindset helps to identify the weak points in software, and the points above should give some directions in which to look for inefficiencies. A starting point is presented, but the way to go is different for every project.

Software Development and Licensing Issues: The License Maven Plugin

There are some common pitfalls in software development. Most can be overcome by the developers themselves, but when it comes to software licensing, they are completely lost (at least all developers I have met so far). For me, it is the same; I tried to find a lawyer who knows about this topic, but I have not found one yet. I still have some questions which I cannot get answered professionally, but some websites and books give at least some information and hints…

One good source is the book „Understanding Open Source and Free Software Licensing“ (http://shop.oreilly.com/product/9780596005818.do). It presents information on which licenses can be used to produce Open Source and proprietary software and which are to be avoided, and terms like copyright and copyleft are explained. But do you always know which licenses the libraries you use and link against are under? Are you sure that the declared licenses of these libraries are correct? Might it be possible that a library under the Apache License uses a GPL-licensed library? In that case, the Apache-licensed library is effectively copyleft as well… 🙁

At least for Java in combination with Maven, there is a possibility to check this automatically, and it is advised to do so regularly in the build system. Maven artifacts should contain a license hint in their POMs, and their dependencies should contain the same hint, so it is possible to check the licenses transitively for all dependencies in a project.

For an easy implementation, have a look at https://github.com/RickRainerLudwig/license-maven-plugin and http://oss.puresol-technologies.com/license-maven-plugin.

Quality Assurance in Software Development from its Root

I looked at the V-Model of software development again a couple of days ago during my holidays. (For example, have a look at Wikipedia: http://en.wikipedia.org/wiki/V-Model_%28software_development%29). The model in general is very helpful for explaining the different levels of testing and their meaning. It makes a difference whether I use unit testing to check the different functions of the software or integration testing to check the system as a whole and whether the different parts work together. From a functional point of view the V-Model is a good model, but it lacks the most basic building block of software: the source code itself.

Let us take car manufacturing as an example. The V-Model tells us to specify the basic requirements, like: the car needs to be able to go from A to B with four passengers and a big trunk. We then specify the system design, like: it is a passenger car of a certain size and shape, and go down via the engine design to details like the different parts of the engine. We take the requirements and design everything from top to bottom by separating out more and more smaller building blocks, which are separated again until we end up with the most basic building blocks like screws, nuts and bolts. In production, testing goes the other way around: first the screws, nuts and bolts are checked for their correct size before they are put into an engine, for example. The engine is tested before it is put into the car, and so on.

But when we look at car manufacturing, there is something still missing: the part where engineers think about the maintainability of the car. They already think about maintainability during the design phase of new car models. It would be a catastrophe if a mechanic had to disassemble the whole car just to change a front light bulb. (In my car I need to work my way to the front light starting from the front wheel! So they did not do a good job here…)

In software development it is the same. We think about the correct behavior of the code, but we forget that we need to maintain the code in the future and that we also need to develop it further. So maintainability is a big deal, too. Here we need to pay more attention to software architecture, design and source code quality, so as not to block the development of further functionality. It becomes even more obvious that we need to do this when we consider that the source code of the current product is the base for all following products.

I18n and L10n in current software developments

When I look at current software developments and keep in mind that the world is becoming smaller and smaller, I start to wonder why more attention is not drawn to internationalization (I18n) and localization (L10n).

Why should software be internationalized?

As long as there are only engineers and managers using my software, it might be OK to assume that my potential users speak English. However, my experience is that especially in Asian countries managers and engineers often do not speak English very well. Some of these countries have not been open for global business long enough, so their ‚older‘ managers and engineers had no chance to learn and practice English over a long time. Young managers and engineers can be expected to be better at speaking English, but as soon as one knows the significant differences between Eastern and Western languages, one knows that there will still be a significant number of people not speaking English very well. I know from experience what it means to learn such a language, having learned Vietnamese myself.

It is even worse for software used by operators and fabrication staff. Imagine an MES system not localized to the language of the country. Even in Germany, operators are not well educated in English. Should I prefer an English-language MES system or a German-language one? If the price, the functionality and the performance are almost equal, I prefer the German one. If I purchased the software for an internationally operating corporation, I would invest in multilingual software.

Marketing requirements

If I developed software for a global market, I would need to meet the demands of as many people as possible to increase the number of potential customers and thereby my sales. One of the easiest things to do is to present cutting-edge software to customers in their mother tongue, with physical units and time presented in the way the customers are used to. If I worked in Vietnam, I would like to be able to switch my computer from Vietnamese to English (or even better: German). I would have a chance to use my computer more efficiently, to make fewer mistakes and to deal with the software in a more reliable way. I guess this is valid for a lot of other people, too.

Quality Management

„Usability“ is also a quality requirement; have a look at the PureSol Technologies website for some more information. The positively influenced characteristics are „Understandability“, „Learnability“ and „Operability“. For quality-focused software projects this is a factor to take into account as well.

Simple Conclusion: Do it

In current software developments we should always start projects with an additional focus on I18n and L10n. The additional time and work needed to enable the software to change its language and output units is not that high. Using good frameworks, it is very easy to achieve.

A simple framework example: I18n4Java

Let’s have a look at I18n4Java. It’s a small framework I created for my own applications. It’s stable and usable, but not very user-friendly yet. The L10n functionality is taken from the Java API, which is a very good and reliable framework. Only the suggested I18n implementation is something I did not like. Sun’s recommendation can be found at http://download.oracle.com/javase/tutorial/i18n. The basic idea is: put all messages into property files and exchange them depending on the language to be displayed.
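For reference, that recommended approach looks roughly like the following sketch (bundle name, key and message text are assumptions):

    import java.text.MessageFormat;
    import java.util.Locale;
    import java.util.ResourceBundle;

    public class MessagesExample {

        public static void main(String[] args) {
            // Loads Messages_de.properties, Messages_vi.properties, ... depending on the locale.
            ResourceBundle bundle = ResourceBundle.getBundle("Messages", Locale.GERMANY);

            // The message text lives in the property file, not in the code, e.g.:
            // greeting=Hallo {0}, du hast {1} neue Nachrichten.
            String pattern = bundle.getString("greeting");
            System.out.println(MessageFormat.format(pattern, "Rick", 3));
        }
    }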

This approach has some issues:

  1. The message to be displayed is separated from the source code. I need to look into additional files to find out what is actually displayed, and when I need to change messages, I need to open additional files.
  2. The situation gets worse as soon as I need messages with parameters. Without seeing the actual message, I do not know in which order I need to put the parameters into the message.
  3. I need extra code for opening the property files, additional error handling for loading failures and extra code to read the messages.

These limitations are overcome by I18n4Java. Additionally, it is very easy to use in code.

The only thing to do is to add a translator to each class which needs I18n. In each class one just needs to create a Translator object:

Afterwards, within the class, I only need to pass every message to be displayed to the translator and show the translated message:

The message is then automatically translated into the selected language, provided a translation file is found for that language.

That’s all that needs to be done to enable I18n in Java software. The rest is a simple configuration file to set the needed language. I18n4Java automatically sets itself to the current OS locale and tries to translate everything into the language the OS is set to.

Thoughts on High Performance Computing

During my work as a consultant, I was asked about high performance computing (HPC) and how to implement it. As always, one of the strongest constraints is a tight budget.

In the last years, techniques for HPC changed as the hardware changed. Several years ago, HPC was only possible on computers made of special HPC processors like NEC’s vector CPUs, or on a large mainframe with thousands of standard CPUs working together to reach an astonishing speed; sometimes combinations of both were installed. The complexity of programming such machines is massive, and special knowledge about the programming paradigms and the hardware is needed to get optimal results.

Today the situation is a little different due to several factors:

  1. Standard CPUs will not get significantly faster. The physical constraints are reached, and shrinking the chips further is not that easy anymore, or even impossible. In some dimensions, production specifications are in the range of atoms. As long as we do not want to split atoms, we cannot reduce those dimensions any further.
  2. Due to the constraints in the point above, CPU architectures change. The most significant change is the multi-core processor: Moore’s law on speed is extended by multiplying the number of cores in a processor.
  3. The gaming and graphics processing industries have led the computer industry into the development of high performance graphics cards. As it turns out, with some minor constraints, these cards are very well suited for HPC. Even on my „old“ nVidia GeForce 8600 GTS, I found 4 multi-core processors with 8 cores per processor.

Possibilities for HPC

I do not want to write about special computer hardware and specially designed machines. The standard PC technologies are presented here for customers with small budgets, for whom the purchase of an HPC server with thousands of cores is not an option.

Therefore, the following possibilities for HPC are available today:

  1. Even if it is an older approach, cluster computing with PVM or MPI is still a valid possibility. In cluster computing, several PCs or servers are interconnected with a standard Ethernet network. The big drawbacks are the latency and the speed of the network. If large computations can be run in parallel, and the time spent on latency and data transfer is much smaller than the computation time, this approach can and should be used. A very prominent example is movie rendering: the scenery information is sent to a client and the calculation is performed on the client. Hundreds of clients can share the work and speed up the whole process dramatically.
  2. Multi-core and multi-processor parallelization on a single machine is a common choice today. The current number of cores in a standard PC is limited to 2 to 8. Multi-core processors with more cores can be expected within the next years, that is for sure. The total speed-up of a software is therefore limited by the number of available cores. Even if no HPC is done, the parallelization of software should be a topic, because customers want their machines to run as fast as possible and the investment should be used efficiently. For HPC itself it is not a real option, because standard software should use it, too; there is nothing especially high performance about it.
  3. Real HPC can be done with GPU programming. One constraint of GPUs is the limitation to single precision floating point operations. That is quite OK for the calculation of 3D graphics, but for some scientific calculations it is not good enough. nVidia has met this demand by creating the so-called Tesla cards. These cards contain up to 448 cores and 6 GB RAM and operate in double precision mode. Programmed with nVidia’s CUDA framework or in the OpenCL language, high speed-ups can be achieved. This is a real low budget HPC solution for many customers.

Example

For a small test with OpenCL, I wrote a small C program which performs a simple matrix multiplication. In C, a classical sequential matrix multiplication looks like this:
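A sketch of such a classical triple loop (not the original listing; the matrix contents and the output are only illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    #define MATRIX_SIZE 1024

    /* Classical sequential matrix multiplication c = a * b for quadratic
       matrices of size MATRIX_SIZE x MATRIX_SIZE. */
    static void multiply(const float *a, const float *b, float *c)
    {
        for (int i = 0; i < MATRIX_SIZE; ++i) {
            for (int j = 0; j < MATRIX_SIZE; ++j) {
                float r = 0.0f;
                for (int k = 0; k < MATRIX_SIZE; ++k) {
                    r += a[i * MATRIX_SIZE + k] * b[k * MATRIX_SIZE + j];
                }
                c[i * MATRIX_SIZE + j] = r;
            }
        }
    }

    int main(void)
    {
        float *a = malloc(sizeof(float) * MATRIX_SIZE * MATRIX_SIZE);
        float *b = malloc(sizeof(float) * MATRIX_SIZE * MATRIX_SIZE);
        float *c = malloc(sizeof(float) * MATRIX_SIZE * MATRIX_SIZE);
        if (a == NULL || b == NULL || c == NULL) {
            return EXIT_FAILURE;
        }
        for (int i = 0; i < MATRIX_SIZE * MATRIX_SIZE; ++i) {
            a[i] = 1.0f; /* illustrative input data */
            b[i] = 2.0f;
        }
        multiply(a, b, c);
        printf("c[0][0] = %f\n", c[0]);
        free(a);
        free(b);
        free(c);
        return EXIT_SUCCESS;
    }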

I assumed here that we have quadratic matrices with a size of MATRIX_SIZE in each direction. For a size of 1024, this algorithm needs about 51.9 seconds on my AMD Opteron 2600.

The same algorithm was implemented in OpenCL. The kernel code looks like this:
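A sketch of such a kernel, with one work item per element of the result matrix (again not the original listing):

    /* OpenCL C kernel: each work item computes one element of the result matrix.
       The private variable r collects the sum before the single write to global
       memory. */
    __kernel void matrix_mult(__global const float *a,
                              __global const float *b,
                              __global float *c,
                              const int size)
    {
        const int row = get_global_id(0);
        const int col = get_global_id(1);

        float r = 0.0f;
        for (int k = 0; k < size; ++k) {
            r += a[row * size + k] * b[k * size + col];
        }
        c[row * size + col] = r;
    }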

The kernel is started on my nVidia GeForce 8600 GTS, after copying the needed matrix data into the graphics card’s RAM, with:
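The host-side launch over a 1024 x 1024 global work size might look roughly like this; context, command queue, program, buffers and kernel arguments are assumed to be set up already:

    #include <CL/cl.h>

    /* Hypothetical helper: enqueue the matrix multiplication kernel with a
       two-dimensional global work size of 1024 x 1024 (= 1,048,576 work items).
       The OpenCL runtime maps these work items onto the available GPU cores. */
    cl_int enqueue_matrix_mult(cl_command_queue queue, cl_kernel kernel)
    {
        const size_t global_work_size[2] = { 1024, 1024 };

        return clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                      global_work_size, NULL, 0, NULL, NULL);
    }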

This leads to 1,048,576 threads being started on 32 cores. The whole operation is finished in roughly 3.3 seconds, a total speed-up of 15.7.

One of the specialties to be taken into account is that GPU processors are not cached and that therefore no cache coherence is to be expected: all processors write directly into RAM. The host process has to take care of concurrency and avoid conflicting writes. In the example above, the two index variables for the result matrix are independent, and so is the calculation for each element, so we could create independent threads for these two variables. The third variable is dependent and cannot be parallelized without additional locking mechanisms.

The situation on graphics cards becomes much more interesting as soon as we take into account the different kinds of memory which exist on a graphics card. In the example above I used the global memory, which is readable and writable by all processors, and the private memory, which is private to each core. The private variable r was used because of the fast read and write capabilities of the private memory: it is faster to sum up the result in private memory first and write it to global memory later. There is also a read-only memory for constants on the graphics board (read-only for the GPU processors, but writable by the host), texture memory and some more…

Conclusion

As shown above, massively parallel GPU programming with OpenCL is a big chance for HPC on a small budget. Taking into account that my graphics card is not state of the art anymore, and considering the performance of the nVidia Tesla cards, HPC is possible for science and research institutes and organizations with strong budget constraints.

The Curse of GOTO

When I started to learn programming professionally in 1991 at the Schülerrechenzentrum Dresden, I was always told: „Do not use GOTO statements! They are never needed, because there are other possibilities available. GOTOs disturb the program flow, the understanding of the program becomes more difficult and the program is no longer well structured.“ I had started to program some years before that on a C64, where there was practically no other way than using GOTOs, and therefore this sounded strange to me at first.

At the Schülerrechenzentrum Dresden I learned to program Turbo Pascal, and I was taught never ever to use GOTOs, but to think about a good program flow and to implement decisions and jumps with IF statements and calls to functions and procedures. Over the years I always followed this advice, though from time to time I thought about their usage again; I never found a situation where a GOTO statement would be a better choice than the other possibilities of implementation.

Following are some facts and thoughts on why not to use GOTOs.

Complexity

GOTOs increase the source code’s complexity, or seem to increase it. If there is a label in the source, one never knows exactly which GOTOs jump there, how many there are, under which circumstances they jump and where they are located. If there is a GOTO around, the search for the corresponding label starts. The flow is not obvious anymore, and a label is not always unambiguous.

In the name of complexity, GOTOs should be replaced by IF statements and function calls. These are easy to understand, and with the right indentation and naming it is quite obvious what the program does and what the intentions are.
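A small, hypothetical C example of the difference:

    /* With GOTO: the reader has to find the label and reconstruct the conditions. */
    int parse_with_goto(const char *input)
    {
        if (input == 0)
            goto error;
        if (input[0] == '\0')
            goto error;
        /* ... parsing ... */
        return 0;
    error:
        return -1;
    }

    /* Structured: the conditions and the flow are visible at a glance. */
    int parse(const char *input)
    {
        if (input == 0 || input[0] == '\0') {
            return -1;
        }
        /* ... parsing ... */
        return 0;
    }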

Maintainable Code and Human Understanding

Try to understand code which is written with GOTOs. You always have to find out where the matching label is, under which circumstances the GOTO is invoked and what the current state of all variables is. It becomes even more difficult in languages where a GOTO is allowed to jump out of the current function into another one, or where a GOTO is allowed to jump into loops from outside. It is very difficult to see and understand what the initial values after the jump are and how the program will proceed. A GOSUB and a RETURN are the same mess.

„Any fool can write code that a computer can understand. Good programmers write code that humans can understand.“
– Martin Fowler –

For maintainable code, source code without such difficulties is needed. Clean source code exposes all its conditions and its program flow. A GOTO messes this up and should therefore never be used.

Refactoring

In Martin Fowler’s book „Refactoring – Improving the Design of Existing Code“, a lot of techniques and patterns are described for improving existing code and developing good code. Refactoring, i.e. improving the code, can only be done if the program flow is obvious and the behavior of any change can be foreseen. If a GOTO is within the area of the refactored code, a change or move of the corresponding label can dramatically change the program’s behavior, and this is not obvious at first glance.

In the name of maintainable code, refactoring should be possible. GOTOs lead to source code which is not easily refactorable, and they should therefore be avoided under any circumstances.

For further information about clean code, have a look at the book Clean Code by Robert C. Martin.

Hook Scripts Revised

Hook scripts are flexible in their implementation. I want to show an easy way to overcome the obstacle of system routines which contain highly volatile code due to environmental changes like company reorganizations, mergers and other changes within the logic of any system.

Definition

Hook scripts are normally scripts which are started in special situations on a server or within a software system. They can be used to do some customized work during a process in a standard environment, to trigger external systems, or to do a lot of other work which could not be done within a closed system. Therefore, hook scripts are a great way to provide customers with well defined points in a system which can be customized for special needs. I strongly recommend considering hook scripts as a customization solution for larger software systems. A good example are Subversion’s hook scripts.

Issue

During my professional life I have faced several situations where I already knew that one or more parts of my newly designed and implemented system would have to be changed some weeks later, and that this might not be the last time a change was necessary. One time, I faced this situation shortly after a merger of my company: we had to change the convention for lot numbers to the new company standard and had to deal with a transition period during which we were forced to handle two different types of lot numbers which were totally different in nomenclature. What could be done here? I had designed a system in C/C++ with approximately 100k lines at that time, and I did not want to implement the reason for the next minor release straight into the system.

Solution

My idea for solving this situation was to use hook scripts for tasks driven by volatile specifications and volatile production environments, which are characterized by continuous improvements and permanent updates and optimizations. I had had this idea some years before for a part of the system which was also threatened by a volatile production environment. Volatile environments mean a permanent potential reason for change in any software system. I am a lazy guy, and therefore I did not want to make the changes deep within the C/C++ code every time again.

To stay with the example from above, I had designed my system in a way that lot number conventions usually did not matter: a lot number was equal to another if the string comparison said so. Plain and easy. But I faced the need to convert split lot numbers to root lot numbers, and the conversion for split lot numbers also changed. The old convention was plain numbers for lots like „nnnnnnn“ (n = decimal digit); a zero in the 7th position marked root lots and was changed to another decimal digit for splits. The new convention was „lnnnnn“ (l = letter, n = decimal digit), and split lots were marked with additional letters at the end, which even changed the length of the lot number. So I needed a method which could translate any lot number into a root lot number…

The implementation was done in C/C++ with a simple call to POSIX’s system command, which invoked a script stored with my normal shared system files somewhere under /usr/share/. The script was called hook_split2rootlot.pl, and I implemented the simple translation logic there in less than 20 lines. Every time I have to change the convention for translating these lot numbers, I just need to change this small and easy to understand script, and my work is done. Due to the implementation in Perl, it is also very easy to add more sophisticated logic for other purposes.
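A hypothetical sketch of such a script, based on the conventions described above (the real script is not shown here):

    #!/usr/bin/perl
    # Sketch of hook_split2rootlot.pl: translate a split lot number to its root lot.
    use strict;
    use warnings;

    my $lot = shift @ARGV or die "usage: hook_split2rootlot.pl <lot number>\n";

    if ($lot =~ /^(\d{6})\d$/) {
        # Old convention "nnnnnnn": a root lot has a 0 in the 7th position.
        print "${1}0\n";
    } elsif ($lot =~ /^([A-Za-z]\d{5})[A-Za-z]*$/) {
        # New convention "lnnnnn": split lots carry additional trailing letters.
        print "$1\n";
    } else {
        die "unknown lot number format: $lot\n";
    }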

Trade-off

A serious trade-off is speed, because POSIX system calls are quite expensive in terms of time. Hook scripts should therefore only be used for functions which are not called often. For system parts which run very often and have to be fast, another solution has to be implemented.

Summary

Hook scripts, if implemented in the right way, can make a system very flexible, customizable and also easy to administrate. I recommend considering hook scripts during architectural development as a possible solution for flexibility.