Hadoop Client in WildFly – A Difficult Marriage

(This article was triggered by a question “Hadoop Jersey conflicts with Wildfly resteasy” on StackOverflow, because I hit the same wall…)

For a current project, I evaluate the usage of Hadoop 2.7.1 for handling data. The current idea is to use Hadoop’s HDFS, HBase and Spark to handle bigger amount of data (1 TB range, no real Big Data).

The current demonstrator implementation uses Cassandra and Titan as databases. Due to some developments with Cassandra and Titan (Aurelius was aquired by DataStax.), the stack seems not to be future-proof. An alternative is to be evaluated.

The first goal is to use the Hadoop client

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.7.1</version>
</dependency>

in WildFly 9.0.1. (The content of this artical should be also valid for WildFly >=8.1.0.) HDFS is to be used at first to store and retrieve raw files.

Setting up Hadoop in a pseudo distributed mode as it is described in “Hadoop: The Definitive Guide” was a breeze. I was full of hope and added the dependency above to an EJB Maven module and wanted to use the client to connect to HDFS to store and retrieve single files. Here it is, where the problems started…

I cannot provide all stack traces and error messages anymore, but roughly, this is what happend (one after another; when I removed one obstacle, the next came up):

  • Duplicate providers for Weld where brought up as errors due to multiple providers in the Hadoop client. Several JARs are loaded as implicit bean archives, because JavaEE annotations are included. I did not expect that and it seems strange to have it in a client library which is mainly used in Java SE context.
  • The client dependency is not self contained. During compile time an issue arised due to missing libraries.
  • The client libraries contains depencencies which provide web applications. These applications are also loaded and WildFly try to initialize them, but fails due to missing libraries which are set to provided, but not included in WildFly (but maybe in Geronimo?). Again, I am puzzled, why something like that is packaged in a client library.
  • Due to providers delivered in sub-dependencies of Hadoop client, the JSON provider was switched from Jackson 2 (default since WildFly 8.1.0) back to Jackson 1 causing infinite recursions in trees I need to marshall into JSON, because the com.fasterxml.jackson.*  annotations were not recognized anymore and the org.codehaus.jackson.*  annotations were not provided.

The issues are manyfold and  is caused by a very strange, no to say messy packaging of Hadoop client.

Following are the solutions so far:

Broken implicte bean archives

Several JARs contain JavaEE annotations which leads to an implicit bean archive loading (see: http://weld.cdi-spec.org/documentation/#4). Implicit bean archive support needs to be switched off. For WildFly, it looks like this:

Change the Weld settings in WildFly’s standalone.conf from

<subsystem xmlns="urn:jboss:domain:weld:2.0"/>

to

<subsystem xmlns="urn:jboss:domain:weld:2.0" require-bean-descriptor="true"/>

This switches the implicit bean archive handling off. All libraries used for CDI need to have a /META-INF/beans.xml  file now. (After switching off the implicit archives, I found a lot of libraries with missing beans.xml  files.)

Missing dependencies

I added the following dependencies to fix the compile/linkage issues:

        <dependency>
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-server</artifactId>
            <version>1.19</version>
        </dependency>
        <dependency>
            <groupId>com.sun.jersey</groupId>
            <artifactId>jersey-json</artifactId>
            <version>1.19</version>
        </dependency>

Services provided, but not working

After switching off the implicit bean archives and added new dependencies to get the project compilied, I run into issues during deployment. Mostly, the issues were missing runtime dependencies due to missing injection providers.

The first goal was to shut off all (hopefully) not needed stuff which was disturbing. I excluded the Hadoop MapReduce Client App and JobClient (no idea what these are for). Additionally, I excluded Jackson dependencies, beacause they are already provided in the WildFly container.

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>${hadoop.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-app</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-jobclient</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-core-asl</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-mapper-asl</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-jaxrs</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-xc</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.sun.jersey</groupId>
    <artifactId>jersey-server</artifactId>
    <version>1.19</version>
</dependency>
<dependency>
    <groupId>com.sun.jersey</groupId>
    <artifactId>jersey-json</artifactId>
    <version>1.19</version>
    <exclusions>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-core-asl</artifactId>
        </exclusion>
        <exclusion>
             <groupId>org.codehaus.jackson</groupId>
             <artifactId>jackson-mapper-asl</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-jaxrs</artifactId>
        </exclusion>
        <exclusion>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-xc</artifactId>
       </exclusion>
    </exclusions>
</dependency>

Broken JSON marshalling in RestEasy

After all the fixes above, the project compiled and deployed successfully. During test I found that JSON marshalling was broken due to infinite recursions I got during marshalling of my file trees. I drove me cracy to find out the issue. I was almost sure that WildFly 9 switched the default Jackson implementation back to Jackson 1, but I did not find any release note for that. After a long while and some good luck, I found a YarnJacksonJaxbJsonProvider class which forces the container to use Jackson 1 instead of Jackson 2 messing up my application…

That was the final point to decide (maybe too late), that I need a kind of calvanic isolation. Hadoop client and WildFly need to talk through a proxy of some kind not sharing any dependencies except of one common interface.

Current Solution

I created now one Hadoop connector EAR archive which contains the above mentioned and fixed Hadoop client dependencies. Additionally, I create a Remote EJB and add it to the EAR which provides the proxy to use Hadoop. The proxy implements a Remote Interface which is also used by the client. The client performs a lookup on the remote interface of the EJB. That setup seems to work so far…

The only drawback in this scenario at the moment is, that I cannot use stream throug EJB, because streams cannot be serialized. I think about creating a REST interface for Hadoop, but I have no idea about the performance. Additionally, will the integration of HBase be as difficult as this!?

For the next versions, maybe a fix can be in place. I found a Jira ticket HDFS-2261 “AOP unit tests are not getting compiled or run”.

Leave a Reply