A Peek at Documentation Building Process in Muse Federated Search

The Muse manuals are written in DocBook and a build process generates the PDF files which are actually the documents for the end users. We have improved the building process of Muse manuals so that it does not hang up forever. We used a workaround, more details about this and what the issues are for reference, after explaining the steps involved.

Basically each document is built through a call to the Ant script ${MUSE_HOME}/doc/tools/build.xml where the FOP is launched in Java task using fork = true(i.e. another process). The FOP was blocking here sometimes when many manuals are run in a row – this happens when invoking ${MUSE_HOME}/buildall.xml doc which calls through all the other ants and finally reaches this FOP.

We have ended up adding a timeout to the Java process. We have added a big enough timeout value (15 minutes) that a single manual cannot take that long. Upon killing the Java process the return value is -1. Previously the return value was tested to see if it is 0 (OK) or non-zero (wrong). Upon non-zero it was considered a FOP error and the whole process stopped. We have added a test against -1, and when -1 is detected we output this:


Warning: FOP probably timed out (15 minutes of execution), due to the JAI infinite looping bug.
Although it should be unaffected inspect the PDF document ${docName}.pdf.

FOP should not be returning the value -1 upon various errors. We have checked the exact FOP source code for this and it can only return 0 (on success), 1 (a File not found) or 2 (a real FOP issue). But for covering everything, the Release Manager should inspect the manuals, as well as actually run through all the build log for any other operation. Note that every time ${MUSE_HOME}/buildall.xml is called, a log file ${MUSE_HOME}/buildall.log is created (overwritten). So, this file should be inspected on all the steps for unusual errors.

This timeout scenario will be recognized as below:

[echo] Converting Muse Designer Console.xml
DocBook2PDFSingle:
[INFO] Using org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser as SAX2 Parser

[WARNING] Zero-width table column!
[WARNING] Zero-width table column!
[INFO] Parsing of document complete, stopping renderer
Timeout: killed the sub-process
Java Result: -1
[echo] Warning: FOP probably timed out (15 minutes of execution), due to the JAI infinite looping bug.
Although it should be unaffected inspect the PDF document Muse Designer Console.pdf.

FOP (actually Java Advanced Images – JAI) hangs upon JVM termination so we will always see that FOP has done its job:

Parsing of document complete, stopping renderer.

This is the state the Release Manager was founding the building process, and here he stopped (^C) the ant process. Now it will take up to 15 minutes (or what was left of them) and the Java process is killed by timeout and the above messages are an indication for this.

If one is following live the building process and sees it stalled in:

Parsing of document complete, stopping renderer.

he can do the usual stopping to avoid waiting up to 15 minutes. If the process is in the background it will do its job. We have tested this and it actually blocked 2 times and the total amount of time spent for building all the manuals was 71 minutes. For a background process this is fine. Usually it happened that manuals were stalled for hours there because the Release Manager was doing other tasks in parallel so this would be an improvement. As well for the Muse Control Center task that weekly builds the latest manuals. It should not hang up as before.

——–

Now about how a set of bugs and “features” of open source components could chain each other making almost impossible to achieve a reliable environment.

So, we had this bug of Muse Manuals hanging. Based on the observation CPU was 50% (as there are two units). A stack trace revealed that the JAI codec was doing some clean-up and intermittently filling up a stack trace…and JVM could not stop…


Thread-0? prio=2 tid=0×03088000 nid=0xee0 runnable [0x03fbf000]
java.lang.Thread.State: RUNNABLE
at java.lang.Throwable.fillInStackTrace(Native Method)
- locked <0×06cf3958> (a java.util.ConcurrentModificationException)
at java.lang.Throwable.<init>(Throwable.java:181)
at java.lang.Exception.<init>(Exception.java:29)
at java.lang.RuntimeException.<init>(RuntimeException.java:32)
at java.util.ConcurrentModificationException.<init>(ConcurrentModificationException.java:57)
at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
at java.util.HashMap$KeyIterator.next(HashMap.java:828)
at com.sun.media.jai.codec.TempFileCleanupThread.run(FileCacheSeekableStream.java:323)
Locked ownable synchronizers:
- None
“DestroyJavaVM” prio=6 tid=0×002b7400 nid=0×8a0 in Object.wait() [0x0090f000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
– waiting on <0×0b71ad30> (a com.sun.media.jai.codec.TempFileCleanupThread)
at java.lang.Thread.join(Thread.java:1143)
- locked <0×0b71ad30> (a com.sun.media.jai.codec.TempFileCleanupThread)
at java.lang.Thread.join(Thread.java:1196)
at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
at java.lang.Shutdown.runHooks(Shutdown.java:79)
at java.lang.Shutdown.sequence(Shutdown.java:123)
at java.lang.Shutdown.shutdown(Shutdown.java:190)
- locked <0×2b79e3e0> (a java.lang.Class for java.lang.Shutdown)
Locked ownable synchronizers:
- None

Now we have jumped into to try to see the whole DocBook related process. There is a hierarchy of Ant builds from the global buildall.xml to buildall.xml of each project, to build of each sub-project and finally calling ${MUSE_HOME}/doc/tools/build.xml where this line was invoked


<java classname=”org.apache.fop.apps.Fop” fork=”true” failonerror=”false” maxmemory=”512m” resultproperty=”errorCode” timeout=”900000?>
<sysproperty key=”MUSE_HOME” value=”${MUSE_HOME}”/>
<!--...-->
</java>

We were trying to read the FOP documentation. We do not use the recommended way (http://xmlgraphics.apache.org/fop/0.95/anttask.html), that is, using the Ant Task and we were calling the command line from an external process. We considered doing this (using the FOP Ant task), although we had a very very old FOP (fop-0.20.5). However we needed to split what the command line was able to do and the Ant task was not into two operations. But that is the recommended way. Namely first create the *fo file, through an XSLT and then run the FOP converter. The command line was running the XSLT inside and without the need of a temporary file.

We considered doing this because we were thinking in the direction of switching to FOP 1.0 in the future, which actually accepts the XML dockbook as parameter and not the FO. Also we were thinking there could be improvements to speed because we could be running in the future multiple docs through a file set, without the need to step into each document. Considering this we tried using the documented way of using FOP in Ant and…we ran into a PermGen issue. After increasing it to 256 MB we still get OOM PermGen. We considered then that this is due to the recommended way of doing a task def…


<taskdef classname="org.apache.fop.tools.anttasks.Fop">
<classpath>
<fileset dir="${fop.home}/lib">
<include/>
</fileset>
<fileset dir="${fop.home}/build">
<include/>
<include />
</fileset>
</classpath>
</taskdef>

We had many classpath elements not few as the example above, and although this task def was in the context of the build script for the manual finished with each manual, still persist in Ant. This is … an ANT bug:

“Sub-builds (antcall, subant) load a task each time the task is defined, but do not release it when the sub-build project completes.[...]“
[https://issues.apache.org/bugzilla/show_bug.cgi?id=49021]

Following the workarounds mentioned we ended up trying to define the FOP task at the upper level and at the lowest level we tested if it is defined and only then we were defining it. To make it simpler we used the recommended Antlib approach. We ended up with this in the ${MUSE_HOME}/doc/tools/build.xml


<condition property=”alreadyDefined” value=”true” else=”false”>
<typefound name=”antlib:org.apache.fop.tools.anttasks.Fop:fop”/>
</condition>
<echo message=”alreadyDefined:${alreadyDefined}”/>
<if value=”false”>
<echo message=”redifining”/>
<taskdef uri=”antlib:org.apache.fop.tools.anttasks.Fop” resource=”antlib.xml” classpath=”${MUSE_HOME}/doc/tools/fop/lib/”/>
</if>
<!--...-->
<fop:fop format=”application/pdf” userConfig=”${MUSE_HOME}/doc/tools/fop/conf/userconfig.xml”
basedir=”${docBaseDir}/${docName}” fofile=”${temp.fopFile}” outfile=”${docBaseDir}/${docName}.pdf”/>

“basedir Base directory to resolve relative references (e.g., graphics files) within the FO document. No, for single FO File entry, default is to use the location of that FO file.”
[http://xmlgraphics.apache.org/fop/0.95/anttask.html]

Hence we used basedir and were happy that it worked. Another run, forgot a document was opened, the whole process needed to be revert, run again…finished in about 60 minutes…forgot to mention that we left the ANT_OPTS on the build machine to use more memory (-Xmx896M -XX:MaxPermSize=128M). This should make all the building process take a little less.

Then we were happy that the images appeared in the docs. But we said we should be comparing various new rendered PDFs with old ones. Many of them had the same dimension, but there were others with differences. Muse Testing.pdf (one containing many images) was smaller than the one from CVS with 1 MB.
By looking into it we saw images from Muse Designer Console.pdf and from ICE MARC to XML Converter.pdf. Well we said it has to do with some cache, because now there is a single instance of FOP to the entire run…and, indeed that was it, actually a feature of FOP:

“FOP caches images between runs. There is one cache per FopFactory instance. The URI is used as a key to identify images which means that when a particular URI appears again, the image is taken from the cache. If you have a servlet that generates a different image each time it is called with the same URI you need to use a constantly changing dummy parameter on the URI to avoid caching.”
[http://xmlgraphics.apache.org/fop/1.0/graphics.html#caching]

At this point we said stop to this whole chain as it wasn’t worth. Although we are keeping, for reference, the work done, so that when these issues may be resolved in the future we use the recommended way, adding a dummy parameter (hadn’t done this anyway) or trying to use the name of the file in the image or even mentioning the directory (which is the document name) means many modifications, and also in case a document name change is necessary then there are too many troubles in modifying it. So, using different file names for images should never be considered.

Meanwhile, while doing this by following the right documentation, we came across the timeout parameter of the Java fork, and had always had this workaround in mind. But we were curios why that JAI (Java Advanced Imaging) blocks there in the stack trace from above, and actually found that this is a JAI bug, not resolved in the latest version either. The bug is detailed here:

“The shutdownHook TempFileCleanupThread throws a ConcurrentModificationException in fileIter.next() sometimes. This exception is ignored and because .next() doesn’t succeed the loop never ends. It is about this code in com/sun/media/jai/codec/FileCacheSeekableStream.java:”

/**
* Deletes all <code>File</code>s in the internal cache.
*/
public void run() {
if(tempFiles != null && tempFiles.size() > 0) {
Iterator fileIter = tempFiles.iterator();
while(fileIter.hasNext()) {
try {
File file = (File)fileIter.next();
file.delete();
} catch(Exception e) {
// Ignore
}
}
}
}

[http://java.net/jira/browse/JAI_CORE-121]

According to the comments on Java Net Jira this isn’t resolved so far. We have also a peek in the latest JAI code from trunk and it is not yet resolved. It is strange something can block upon termination. So, we ended up doing the timeout trick…nothing else came to mind, and there was already too much time spent on this.

We wrote all this in case someone else confronts with something similar, and if not to see how not to design things. Recently we have come across many Swing bugs, even filled one to Oracle.

The Core team was fighting a similar stopping bug with RMI from Sun (in the context of Tomcat and Jackrabit)…all these are not nice, in case that there is not even a workaround for some bugs.