A Peek at Documentation Building Process in Muse Federated Search

The Muse manuals are written in DocBook and a build process generates the PDF files which are actually the documents for the end users. We have improved the building process of Muse manuals so that it does not hang up forever. We used a workaround, more details about this and what the issues are for reference, after explaining the steps involved.

Basically each document is built through a call to the Ant script ${MUSE_HOME}/doc/tools/build.xml where the FOP is launched in Java task using fork = true (i.e. another process). The FOP was blocking here sometimes when many manuals are run in a row – this happens when invoking ${MUSE_HOME}/buildall.xml doc which calls through all the other ants and finally reaches this FOP.

We have ended up adding a timeout to the Java process. We have added a big enough timeout value (15 minutes) that a single manual cannot take that long. Upon killing the Java process the return value is -1. Previously the return value was tested to see if it is 0 (OK) or non-zero (wrong). Upon non-zero it was considered a FOP error and the whole process stopped. We have added a test against -1, and when -1 is detected we output this:

FOP should not be returning the value -1 upon various errors. We have checked the exact FOP source code for this and it can only return 0 (on success), 1 (a File not found) or 2 (a real FOP issue). But for covering everything, the Release Manager should inspect the manuals, as well as actually run through all the build log for any other operation. Note that every time ${MUSE_HOME}/buildall.xml is called, a log file ${MUSE_HOME}/buildall.log is created (overwritten). So, this file should be inspected on all the steps for unusual errors.

This timeout scenario will be recognized as below:

FOP (actually Java Advanced Images – JAI) hangs upon JVM termination so we will always see that FOP has done its job:

This is the state the Release Manager was founding the building process, and here he stopped (^C) the ant process. Now it will take up to 15 minutes (or what was left of them) and the Java process is killed by timeout and the above messages are an indication for this.

If one is following live the building process and sees it stalled in:

he can do the usual stopping to avoid waiting up to 15 minutes. If the process is in the background it will do its job. We have tested this and it actually blocked 2 times and the total amount of time spent for building all the manuals was 71 minutes. For a background process this is fine. Usually it happened that manuals were stalled for hours there because the Release Manager was doing other tasks in parallel so this would be an improvement. As well for the Muse Control Center task that weekly builds the latest manuals. It should not hang up as before.

——–

Now about how a set of bugs and “features” of open source components could chain each other making almost impossible to achieve a reliable environment.

So, we had this bug of Muse Manuals hanging. Based on the observation CPU was 50% (as there are two units). A stack trace revealed that the JAI codec was doing some clean-up and intermittently filling up a stack trace…and JVM could not stop…

Now we have jumped into to try to see the whole DocBook related process. There is a hierarchy of Ant builds from the global buildall.xml to buildall.xml of each project, to build of each sub-project and finally calling ${MUSE_HOME}/doc/tools/build.xml where this line was invoked

We were trying to read the FOP documentation. We do not use the recommended way (http://xmlgraphics.apache.org/fop/0.95/anttask.html), that is, using the Ant Task and we were calling the command line from an external process. We considered doing this (using the FOP Ant task), although we had a very very old FOP (fop-0.20.5). However we needed to split what the command line was able to do and the Ant task was not into two operations. But that is the recommended way. Namely first create the *fo file, through an XSLT and then run the FOP converter. The command line was running the XSLT inside and without the need of a temporary file.

We considered doing this because we were thinking in the direction of switching to FOP 1.0 in the future, which actually accepts the XML dockbook as parameter and not the FO. Also we were thinking there could be improvements to speed because we could be running in the future multiple docs through a file set, without the need to step into each document. Considering this we tried using the documented way of using FOP in Ant and…we ran into a PermGen issue. After increasing it to 256 MB we still get OOM PermGen. We considered then that this is due to the recommended way of doing a task def…

We had many classpath elements not few as the example above, and although this task def was in the context of the build script for the manual finished with each manual, still persist in Ant. This is … an ANT bug:

“Sub-builds (antcall, subant) load a task each time the task is defined, but do not release it when the sub-build project completes.[...]“
[https://issues.apache.org/bugzilla/show_bug.cgi?id=49021]

Following the workarounds mentioned we ended up trying to define the FOP task at the upper level and at the lowest level we tested if it is defined and only then we were defining it. To make it simpler we used the recommended Antlib approach. We ended up with this in the ${MUSE_HOME}/doc/tools/build.xml

…and now everything seemed to be running without Perm Gen and CPU consumption. Using temp files we did not considered to be bringing a big overhead. But after looking inside the generated docs we see they had no image. This time another bug probably of the old FOP, although not mentioned. It was related to not being able to find the relative referenced images unless specifying basedir, although the doc says that the fofile directory is considered as the base. And the fo file was in the same directory which contained the images directory.

“basedir Base directory to resolve relative references (e.g., graphics files) within the FO document. No, for single FO File entry, default is to use the location of that FO file.”
[http://xmlgraphics.apache.org/fop/0.95/anttask.html]

Hence we used basedir and were happy that it worked. Another run, forgot a document was opened, the whole process needed to be revert, run again…finished in about 60 minutes…forgot to mention that we left the ANT_OPTS on the build machine to use more memory (-Xmx896M -XX:MaxPermSize=128M). This should make all the building process take a little less.

Then we were happy that the images appeared in the docs. But we said we should be comparing various new rendered PDFs with old ones. Many of them had the same dimension, but there were others with differences. Muse Testing.pdf (one containing many images) was smaller than the one from CVS with 1 MB.
By looking into it we saw images from Muse Designer Console.pdf and from ICE MARC to XML Converter.pdf. Well we said it has to do with some cache, because now there is a single instance of FOP to the entire run…and, indeed that was it, actually a feature of FOP:

“FOP caches images between runs. There is one cache per FopFactory instance. The URI is used as a key to identify images which means that when a particular URI appears again, the image is taken from the cache. If you have a servlet that generates a different image each time it is called with the same URI you need to use a constantly changing dummy parameter on the URI to avoid caching.”
[http://xmlgraphics.apache.org/fop/1.0/graphics.html#caching]

At this point we said stop to this whole chain as it wasn’t worth. Although we are keeping, for reference, the work done, so that when these issues may be resolved in the future we use the recommended way, adding a dummy parameter (hadn’t done this anyway) or trying to use the name of the file in the image or even mentioning the directory (which is the document name) means many modifications, and also in case a document name change is necessary then there are too many troubles in modifying it. So, using different file names for images should never be considered.

Meanwhile, while doing this by following the right documentation, we came across the timeout parameter of the Java fork, and had always had this workaround in mind. But we were curios why that JAI (Java Advanced Imaging) blocks there in the stack trace from above, and actually found that this is a JAI bug, not resolved in the latest version either. The bug is detailed here:

“The shutdownHook TempFileCleanupThread throws a ConcurrentModificationException in fileIter.next() sometimes. This exception is ignored and because .next() doesn’t succeed the loop never ends. It is about this code in com/sun/media/jai/codec/FileCacheSeekableStream.java:”

[http://java.net/jira/browse/JAI_CORE-121]

According to the comments on Java Net Jira this isn’t resolved so far. We have also a peek in the latest JAI code from trunk and it is not yet resolved. It is strange something can block upon termination. So, we ended up doing the timeout trick…nothing else came to mind, and there was already too much time spent on this.

We wrote all this in case someone else confronts with something similar, and if not to see how not to design things. Recently we have come across many Swing bugs, even filled one to Oracle.

The Core team was fighting a similar stopping bug with RMI from Sun (in the context of Tomcat and Jackrabit)…all these are not nice, in case that there is not even a workaround for some bugs.

This entry was posted in Blog, The Muse Platform. Bookmark the permalink.

Leave a Reply

Your e-mail is never published nor shared. Required fields are marked *.

*


*


captcha