Programming Is Both: A Craft and a Science

There is a latent conflict in software development. I have experienced it several times in different contexts. It is not always obvious, but it is nevertheless there… There are two different approaches to software development: an academic approach and a craftsman’s approach. One needs to be a good craftsman to write solid, maintainable, and efficient code, but one also needs to be an academic to understand IT systems and their peculiarities.

The academic approach is all about theory. Students at universities are taught that theory. These students become very knowledgeable, but they lack the experience and practice to develop real-world software. They also learn a lot of mathematics, like calculus, numerical mathematics, statistics, and linear algebra. For academics, it is often difficult to accept that real-world software cannot be perfect (that would be too expensive), that it needs to be maintainable (software has to be maintained, bug-fixed, and developed further in the future) and cost-efficient, and that robust solutions might not be modern, cool, or state-of-the-art. Academics also tend to fall in love with certain problems and try to solve them in the most beautiful way possible. This is economically meaningless and too expensive.

The other approach treats software as a craft. Real craftsmanship needs a lot of time, practice, and guidance by a senior. It is like building things: the first attempts will be messy, unusable, ugly… But the more practice a craftsman gets, the better the outcome. Good code alone, however, does not solve hard and complex problems. These need an academic approach. For example, some problems need to be translated into problems of linear algebra to be solved efficiently, as sketched below. A pure programmer who only learned to program is not able to do that.
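
To make this concrete, here is a toy illustration (hypothetical, not taken from a real project): counting the paths of length two in a directed graph. Written purely as code, it is three nested loops; recognizing the adjacency matrix behind it turns the task into a matrix product, which can be handed to optimized linear algebra libraries once the graphs grow large.

    // Toy example: paths of length two in a directed graph.
    // (A * A)[i][j] counts the paths i -> k -> j, so the "coding
    // problem" collapses into a linear algebra problem.
    public final class PathCounting {

        static int[][] multiply(int[][] a, int[][] b) {
            int n = a.length;
            int[][] c = new int[n][n];
            for (int i = 0; i < n; i++)
                for (int k = 0; k < n; k++)
                    for (int j = 0; j < n; j++)
                        c[i][j] += a[i][k] * b[k][j];
            return c;
        }

        public static void main(String[] args) {
            // adjacency[i][j] == 1 means: there is an edge from i to j
            int[][] adjacency = { { 0, 1, 1 }, { 0, 0, 1 }, { 1, 0, 0 } };
            int[][] pathsOfLength2 = multiply(adjacency, adjacency);
            System.out.println(pathsOfLength2[0][2]); // prints 1: the path 0 -> 1 -> 2
        }
    }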

The problem is now: students are intellectually trained very well, but they are no craftsmen. Pure craftsmen can write good code, but may not be intellectually prepared for complex systems. As a consultant, I see both sides. I train scientists and engineers who are experts in their specific domain to write good, maintainable code, but I also meet people who write good code and need some input on how to tackle hard problems.

What can be done!?

Academics: Science and engineering are ruled by laws and intellectual challenges. That is fine, but there is more to be done in software. For academics, I recommend reading books like „The Pragmatic Programmer“ by Andrew Hunt and David Thomas (http://astore.amazon.de/pureso-21/detail/020161622X), thinking about initiatives like Software Craftsmanship (http://manifesto.softwarecraftsmanship.org), and observing one’s own daily work. Academics are in most cases very good at self-reflection. It is easily understood what needs to be done to avoid hassling later with bad code and unmaintainable software. (For questions, I can be booked, too… ;-)) The main concern should be: how can the job be done with a maximum of automation, a minimum of repetitive work, and in a way that is as maintainable as possible?

Craftsmen: A craft means doing, evaluating, and redoing until perfection is reached. That is a good approach, but some problems cannot be solved this way so easily. The easiest way to proceed is to team up with a domain expert; the second best is to read up on the missing domain of expertise and become an expert in this field, up to the level that is needed. The main concern should be: how can the problem be solved efficiently? Most problems are already solved to a certain level. That work can be studied and copied; only the remaining parts need to be invented.

The best software developers combine the skills of academics and craftsmen. Each side is challenging on its own, and the combination needs to be developed over time, but the result is worth it: the software developed by these people is amazing: maintainable, understandable, simple, efficient…

SSL in Real Life

We live in the post-Snowden era, and we all need to think about what shall happen with our privacy, with the internet, and with the way we communicate. The situation is fundamentally different from what we had about 20 years ago…

All the laws which should protect us were made in times when surveillance needed physical access to place phone taps and microphones. Communication was bound to a distributed physical medium which required direct access to break into it. Additionally, communication ran from point to point on the shortest path.

In the internet, everything is connected now, and one infrastructure based on TCP/IP serves everything. The internet is not as distributed as most people think, because a few very large internet exchange points bundle a lot of communication. Their number is limited, and some institutions are able to hook into them, as has been revealed…

One of the recommendations is to encrypt everything: from simple web pages over email to all other forms of communication. As long as only some communication is encrypted, there is still too much information in the open, so everybody should encrypt their messages (even the unimportant ones) to increase the noise in the net and make it more difficult for attackers to find out which communications are important and which are not. With this in mind, we decided at PureSol Technologies to provide our whole website via HTTPS (HTTP with SSL encryption). As soon as the decision was made, the issues started…

Trusted Communication only via Root CA

One of the current major issues is that only root CAs are able to issue certificates for SSL which put browsers into the „green lock mode“ that shows a site as trusted without any warning. Otherwise, browsers show a warning about untrusted certificates, because no authentication was performed and nobody knows whether the certificate was provided by the correct organization or person. This is correct insofar as I could create certificates which claim to be issued by Microsoft Corporation. To avoid this, so-called root CAs are authorized to check that certificates are authentic. A good idea so far…

Problem 1: There are not many root CAs, and the prices are too high for small companies like PureSol Technologies. The prices for certificates can go up to 1,000 EUR per year. I would rather spend this money on developers.

Problem 2: There have already been security breaches at root CAs. The most prominent one was the erroneous issuing of certificates to an individual in the name of Microsoft; the problem was that this individual was not in any way related to Microsoft. See the Microsoft knowledge base: https://support.microsoft.com/en-us/kb/293818

Why shouldn’t I provide my own certificates? I can create my keys and certificates on a completely disconnected computer and copy everything via sneakernet. OK, there is still the issue that I claim to be me, but nobody knows whether this is correct or not… But should I pay thousands of euros for that kind of service, which should be a one-time event?

Unhandled Certificate Revocation Lists

It is still not standardized when, how, and by whom certificates are to be rechecked for validity. The above-mentioned issue with the VeriSign-created Microsoft certificates showed that even Microsoft had to create a hot patch to contain the problem, because Windows itself could not deal with certificate revocation lists. This is still a work in progress and a weak spot in SSL and the network of trust.

So, root CAs provide an expensive service which is meant to provide security, but it is not really clear how the revocation lists are handled. The problem here is manifold: What happens if the service for revocation checks is down? How often are certificates to be checked?
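
For illustration, this is what such a check looks like when done manually with openssl (the file names and the responder URL are placeholders); when and how often clients should do the equivalent is exactly the unanswered question:

    # inspect a downloaded certificate revocation list
    openssl crl -in current.crl -inform DER -noout -text

    # ask an OCSP responder whether a certificate has been revoked
    openssl ocsp -issuer ca.crt -cert server.crt \
        -url http://ocsp.example.com -resp_text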

Running Our Own Certificate Authority

After checking prices and services and estimating how much work it is to deal with SSL and certificates on our own, we decided to run our own certificate authority.
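
In a nutshell, such a CA boils down to a handful of openssl calls (a minimal sketch; file names, key sizes, and validity periods are illustrative, not our production setup):

    # create the CA key and a self-signed root certificate
    openssl genrsa -out ca.key 4096
    openssl req -x509 -new -key ca.key -sha256 -days 7300 -out ca.crt

    # create a server key plus a signing request and sign it with the CA
    openssl genrsa -out server.key 2048
    openssl req -new -key server.key -out server.csr
    openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key \
        -CAcreateserial -sha256 -days 730 -out server.crt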

You can find the needed public information at http://ca.puresol-technologies.com.

There is only one issue: the PureSol Technologies certificate authority is not pre-installed in browsers. So, when an HTTPS connection is opened to a PureSol Technologies site, the browser downloads the certificates and finds during its check that no known root CA has checked and signed them.

To overcome this, we provide the PureSol Technologies Certificate Authority website for certificate download. To guarantee authenticity, we also provide the certificates to clients via sneakernet for local installation.
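
Before installing it, a client can check that the downloaded certificate matches the one handed over via sneakernet by comparing fingerprints (the file name is a placeholder):

    openssl x509 -in PureSolTechnologiesCA.crt -noout -fingerprint -sha256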

Additionally, sites which are „read-only“ for clients are provided via both HTTP and HTTPS. Sites which involve user interaction, like logins, we provide via HTTPS only; all HTTP connections to these sites are automatically redirected to their HTTPS variants.
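
The redirect itself is a small piece of web server configuration; assuming Apache httpd in front (an assumption for illustration, with a placeholder host name), it looks like this:

    <VirtualHost *:80>
        ServerName apps.puresol-technologies.com
        # send every plain HTTP request to the HTTPS variant of the site
        Redirect permanent / https://apps.puresol-technologies.com/
    </VirtualHost>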

Hadoop Client in WildFly – A Difficult Marriage

(This article was triggered by the question „Hadoop Jersey conflicts with Wildfly resteasy“ on Stack Overflow, because I hit the same wall…)

For a current project, I am evaluating the use of Hadoop 2.7.1 for handling data. The current idea is to use Hadoop’s HDFS, HBase, and Spark to handle larger amounts of data (in the 1 TB range; no real Big Data).

The current demonstrator implementation uses Cassandra and Titan as databases. Due to some developments around Cassandra and Titan (Aurelius was acquired by DataStax), that stack does not seem future-proof, so an alternative is to be evaluated.

The first goal is to use the Hadoop client in WildFly 9.0.1. (The content of this article should also be valid for WildFly >= 8.1.0.) HDFS is to be used first to store and retrieve raw files.
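
The client is pulled in with the usual Maven dependency:

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.1</version>
    </dependency>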

Setting up Hadoop in pseudo-distributed mode as described in „Hadoop: The Definitive Guide“ was a breeze. Full of hope, I added the dependency above to an EJB Maven module and wanted to use the client to connect to HDFS to store and retrieve single files. This is where the problems started…

I cannot provide all the stack traces and error messages anymore, but roughly, this is what happened (one after another; whenever I removed one obstacle, the next came up):

  • Duplicate providers for Weld were reported as errors due to multiple providers in the Hadoop client. Several JARs are loaded as implicit bean archives because they contain Java EE annotations. I did not expect that, and it seems strange to have this in a client library which is mainly used in a Java SE context.
  • The client dependency is not self-contained. At compile time, issues arose due to missing libraries.
  • The client libraries contain dependencies which provide web applications. These applications are also loaded, and WildFly tries to initialize them, but fails due to missing libraries which are set to provided, but are not included in WildFly (but maybe in Geronimo?). Again, I am puzzled why something like that is packaged in a client library.
  • Due to providers delivered in sub-dependencies of the Hadoop client, the JSON provider was switched from Jackson 2 (the default since WildFly 8.1.0) back to Jackson 1, causing infinite recursions in the trees I need to marshal into JSON, because the com.fasterxml.jackson.* annotations were not recognized anymore and the org.codehaus.jackson.* annotations were not provided.

The issues are manifold and are caused by the very strange, not to say messy, packaging of the Hadoop client.

Following are the solutions so far:

Broken implicit bean archives

Several JARs contain Java EE annotations, which leads to their being loaded as implicit bean archives (see: http://weld.cdi-spec.org/documentation/#4). Implicit bean archive support needs to be switched off. For WildFly, it looks like this:

Change the Weld subsystem settings in WildFly’s standalone.xml.
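
The attribute that does this is require-bean-descriptor; the subsystem namespace version below is the one I would expect for WildFly 8/9 and may need adjusting for other releases:

    <!-- before: archives with Java EE annotations are scanned implicitly -->
    <subsystem xmlns="urn:jboss:domain:weld:2.0"/>

    <!-- after: only archives with a /META-INF/beans.xml are bean archives -->
    <subsystem xmlns="urn:jboss:domain:weld:2.0" require-bean-descriptor="true"/>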

This switches the implicit bean archive handling off. All libraries used for CDI now need to have a /META-INF/beans.xml file. (After switching off the implicit archives, I found a lot of libraries with missing beans.xml files.)
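
A minimal beans.xml to re-enable such a library for CDI can look like this (CDI 1.1 schema):

    <?xml version="1.0" encoding="UTF-8"?>
    <beans xmlns="http://xmlns.jcp.org/xml/ns/javaee"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee
                               http://xmlns.jcp.org/xml/ns/javaee/beans_1_1.xsd"
           bean-discovery-mode="all" version="1.1">
    </beans>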

Missing dependencies

I added the following dependencies to fix the compile/linkage issues:
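
Which artifacts are missing depends on what the code actually touches; for plain HDFS access with client 2.7.1, explicit declarations along these lines are the kind of fix meant here (the exact list is an assumption; the versions must match the client):

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>2.7.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-hdfs</artifactId>
        <version>2.7.1</version>
    </dependency>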

Services provided, but not working

After switching off the implicit bean archives and adding new dependencies to get the project compiled, I ran into issues during deployment. Mostly, these were missing runtime dependencies due to missing injection providers.

The first goal was to shut off all the (hopefully) unneeded stuff which was causing trouble. I excluded the Hadoop MapReduce Client App and JobClient modules (I have no idea what these are for), as sketched below. Additionally, I excluded the Jackson dependencies, because they are already provided in the WildFly container.
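
As a sketch, the exclusions on the client dependency look like this (the Jackson 1 coordinates live under org.codehaus.jackson; the list may be incomplete):

    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.1</version>
        <exclusions>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-mapreduce-client-app</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.apache.hadoop</groupId>
                <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-core-asl</artifactId>
            </exclusion>
            <exclusion>
                <groupId>org.codehaus.jackson</groupId>
                <artifactId>jackson-mapper-asl</artifactId>
            </exclusion>
        </exclusions>
    </dependency>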

Broken JSON marshalling in RESTEasy

After all the fixes above, the project compiled and deployed successfully. During testing, I found that the JSON marshalling was broken due to infinite recursions when marshalling my file trees. It drove me crazy to find the cause. I was almost sure that WildFly 9 had switched the default Jackson implementation back to Jackson 1, but I did not find any release note for that. After a long while and with some good luck, I found the YarnJacksonJaxbJsonProvider class, which forces the container to use Jackson 1 instead of Jackson 2, messing up my application…
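
For illustration, this is the kind of entity that breaks (class and field names are made up): the back reference carries the Jackson 2 annotation, which Jackson 1 does not see, so parent and children keep serializing each other until the stack overflows:

    import java.util.ArrayList;
    import java.util.List;

    import com.fasterxml.jackson.annotation.JsonIgnore;

    public class FileNode {
        private String name;
        private List<FileNode> children = new ArrayList<>();
        // Honored by Jackson 2 only: under Jackson 1 this back reference
        // is serialized, too, which causes the infinite recursion.
        @JsonIgnore
        private FileNode parent;

        // getters and setters omitted
    }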

That was the point at which I decided (maybe too late) that I needed a kind of galvanic isolation: the Hadoop client and WildFly need to talk through a proxy of some kind, not sharing any dependencies except for one common interface.

Current Solution

I have now created a Hadoop connector EAR archive which contains the above-mentioned, fixed Hadoop client dependencies. Additionally, I created a remote EJB and added it to the EAR; it provides the proxy through which Hadoop is used. The proxy implements a remote interface which is also used by the client. The client performs a lookup on the remote interface of the EJB. That setup seems to work so far…
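
A minimal sketch of that setup (interface and bean names are illustrative, not our real API): the remote interface is the single shared artifact, and the bean inside the connector EAR is the only code that touches Hadoop classes.

    // HadoopRemote.java (shared API JAR, the only artifact known to both sides)
    public interface HadoopRemote {
        void storeFile(String path, byte[] content);
        byte[] retrieveFile(String path);
    }

    // HadoopConnectorBean.java (inside the connector EAR, the only module
    // that has hadoop-client on its classpath)
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import javax.ejb.EJBException;
    import javax.ejb.Remote;
    import javax.ejb.Stateless;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    @Stateless
    @Remote(HadoopRemote.class)
    public class HadoopConnectorBean implements HadoopRemote {

        @Override
        public void storeFile(String path, byte[] content) {
            try (FileSystem fs = FileSystem.get(createConfiguration());
                    OutputStream out = fs.create(new Path(path))) {
                out.write(content);
            } catch (IOException e) {
                throw new EJBException("Could not store " + path, e);
            }
        }

        @Override
        public byte[] retrieveFile(String path) {
            try (FileSystem fs = FileSystem.get(createConfiguration());
                    InputStream in = fs.open(new Path(path))) {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                IOUtils.copyBytes(in, buffer, 4096, false);
                return buffer.toByteArray();
            } catch (IOException e) {
                throw new EJBException("Could not retrieve " + path, e);
            }
        }

        private Configuration createConfiguration() {
            Configuration configuration = new Configuration();
            // assumption: pseudo-distributed Hadoop on the default HDFS port
            configuration.set("fs.defaultFS", "hdfs://localhost:9000");
            return configuration;
        }
    }

The client only needs the API JAR and a JNDI lookup of HadoopRemote via EJB remoting.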

The only drawback of this scenario at the moment is that I cannot stream through the EJB, because streams cannot be serialized. I am thinking about creating a REST interface for Hadoop, but I have no idea about the performance. And additionally: will the integration of HBase be as difficult as this!?

For the next versions, maybe a fix can be put in place. I found the Jira ticket HDFS-2261 „AOP unit tests are not getting compiled or run“.