Document clustering


History

To demonstrate the benefits of XML as part of my PhD work (report.ps.gz), I wrote a program in 1998 to cluster AML (Astronomical Markup Language) documents. It used both the meaningful links between the documents and the keywords associated with them, applied a noising partitioning technique, and displayed the result on a topic map. The documents could be retrieved automatically from various sources, starting from an initial document and following its AML links to the related documents. It was a success but, like much cool PhD software, it disappeared from the web because it could no longer be maintained.

In 2004, I needed a program to cluster other documents and couldn't find any free software for this simple task. I decided to resurrect the project, and found a way to specify the list of documents, keywords and links in an external XML document. As a result, it now works for any collection of documents, even non-XML ones.

Using the program

Here is a sample document list, with keywords and links. The DTD is included in the package.

<DOCLIST>
    <DOCUMENT id="108">
        <URL>section2_1_2_7_APPRENDRE.html</URL>
        <TITLE>Vitesse orbitale</TITLE>
        <KEYWORDS>
            <KEYWORD>KEPLER</KEYWORD>
            <KEYWORD>MASSE</KEYWORD>
            <KEYWORD>MOUVEMENT</KEYWORD>
            <KEYWORD>TRAJECTOIRE</KEYWORD>
            <KEYWORD>VITESSE</KEYWORD>
        </KEYWORDS>
        <LINKS>
            <LINK toid="110"/>
        </LINKS>
    </DOCUMENT>
</DOCLIST>
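To check a document list before clustering, it can help to load it and print what was understood. The following is only a minimal sketch using the standard Java DOM parser; the class DocListReader and its Doc type are hypothetical names, not part of the Clustering package, and the element names are taken from the sample above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of the Clustering package.
class DocListReader {

    /** One DOCUMENT entry: id, URL, title, keywords and link targets. */
    static final class Doc {
        final String id;
        final String url;
        final String title;
        final List<String> keywords = new ArrayList<>();
        final List<String> linkTargets = new ArrayList<>();

        Doc(String id, String url, String title) {
            this.id = id;
            this.url = url;
            this.title = title;
        }
    }

    /** Parses a DOCLIST given as a string and returns its documents. */
    static List<Doc> parse(String xml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<Doc> docs = new ArrayList<>();
            NodeList nodes = dom.getElementsByTagName("DOCUMENT");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                Doc doc = new Doc(e.getAttribute("id"), text(e, "URL"), text(e, "TITLE"));
                NodeList keywords = e.getElementsByTagName("KEYWORD");
                for (int k = 0; k < keywords.getLength(); k++) {
                    doc.keywords.add(keywords.item(k).getTextContent().trim());
                }
                NodeList links = e.getElementsByTagName("LINK");
                for (int l = 0; l < links.getLength(); l++) {
                    doc.linkTargets.add(((Element) links.item(l)).getAttribute("toid"));
                }
                docs.add(doc);
            }
            return docs;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Invalid document list", ex);
        }
    }

    /** Text of the first child element with the given tag, or "" if absent. */
    private static String text(Element parent, String tag) {
        NodeList list = parent.getElementsByTagName(tag);
        return list.getLength() == 0 ? "" : list.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<DOCLIST><DOCUMENT id=\"108\">"
                + "<URL>section2_1_2_7_APPRENDRE.html</URL>"
                + "<TITLE>Vitesse orbitale</TITLE>"
                + "<KEYWORDS><KEYWORD>KEPLER</KEYWORD><KEYWORD>MASSE</KEYWORD></KEYWORDS>"
                + "<LINKS><LINK toid=\"110\"/></LINKS>"
                + "</DOCUMENT></DOCLIST>";
        for (Doc d : parse(xml)) {
            System.out.println(d.id + " " + d.title + " " + d.keywords + " -> " + d.linkTargets);
        }
    }
}
```

The sketch does not validate against the DTD; it only reads the elements shown in the sample.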

When the document list is ready, the clustering program can be launched (just double-click on Clustering.jar).

clustering.png

The clustering algorithm first spreads the documents randomly on the grid, then moves them to reduce the "cost" progressively. After a while, it stops and the result is recorded in a grid.xml file.
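The idea of random placement followed by cost-reducing moves can be sketched as below. This is not the program's actual code: it is a plain greedy local search rather than the noising technique, and the cost function (similarity multiplied by Manhattan distance on the grid), the similarity matrix and all names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; the real Clustering program may work differently.
class GridClusteringSketch {

    /**
     * Assumed cost: for each pair of documents, similarity times the
     * Manhattan distance between their cells. sim[i][j] could count
     * shared keywords and links; cells are numbered row by row.
     */
    static long cost(int[] cellOf, int[][] sim, int width) {
        long total = 0;
        for (int i = 0; i < cellOf.length; i++) {
            for (int j = i + 1; j < cellOf.length; j++) {
                int dx = Math.abs(cellOf[i] % width - cellOf[j] % width);
                int dy = Math.abs(cellOf[i] / width - cellOf[j] / width);
                total += (long) sim[i][j] * (dx + dy);
            }
        }
        return total;
    }

    /** Puts each document in a distinct, randomly chosen cell. */
    static int[] randomLayout(int docCount, int width, int height, Random rnd) {
        if (docCount > width * height) {
            throw new IllegalArgumentException("Grid too small for the documents");
        }
        List<Integer> cells = new ArrayList<>();
        for (int c = 0; c < width * height; c++) {
            cells.add(c);
        }
        Collections.shuffle(cells, rnd);
        int[] cellOf = new int[docCount];
        for (int i = 0; i < docCount; i++) {
            cellOf[i] = cells.get(i);
        }
        return cellOf;
    }

    /** Tries random moves (or swaps) and keeps those that do not raise the cost. */
    static int[] improve(int[] start, int[][] sim, int width, int height,
                         int iterations, Random rnd) {
        int[] cellOf = start.clone();
        long best = cost(cellOf, sim, width);
        for (int it = 0; it < iterations; it++) {
            int doc = rnd.nextInt(cellOf.length);
            int target = rnd.nextInt(width * height);
            int old = cellOf[doc];
            if (target == old) {
                continue;
            }
            int occupant = -1; // document already in the target cell, if any
            for (int j = 0; j < cellOf.length; j++) {
                if (cellOf[j] == target) {
                    occupant = j;
                }
            }
            cellOf[doc] = target;
            if (occupant >= 0) {
                cellOf[occupant] = old;
            }
            long candidate = cost(cellOf, sim, width);
            if (candidate <= best) {
                best = candidate;
            } else { // undo the move
                cellOf[doc] = old;
                if (occupant >= 0) {
                    cellOf[occupant] = target;
                }
            }
        }
        return cellOf;
    }

    public static void main(String[] args) {
        // Documents 0 and 1 are similar, as are 2 and 3 (made-up values).
        int[][] sim = {{0, 3, 0, 0}, {3, 0, 0, 0}, {0, 0, 0, 2}, {0, 0, 2, 0}};
        Random rnd = new Random();
        int[] start = randomLayout(4, 3, 3, rnd);
        int[] end = improve(start, sim, 3, 3, 2000, rnd);
        System.out.println("start " + Arrays.toString(start) + " cost " + cost(start, sim, 3));
        System.out.println("end   " + Arrays.toString(end) + " cost " + cost(end, sim, 3));
    }
}
```

A greedy search like this can get stuck in a local minimum, which is the kind of problem that noising methods are meant to reduce by perturbing the cost during the search.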

The grid.xml file can then be displayed with the DispGrid applet, from an HTML file containing this code:

<applet code="dispgrid.DispGrid" archive="DispGrid.jar" width="100" height="100">
    <param name="url" value="http://server/grid.xml">
</applet>

dispgrid.png

Download

The software is available under the GPL licence.

Clustering.tar.gz

Problems

Some web browsers prevent applets from opening a new window: Internet Explorer on Windows XP SP2 (it worked before SP2), the Google toolbar's popup blocker, and Firefox 1.5 (it worked before version 1.5). Because of this, the applet cannot display a selected web page.

Author: Damien Guillaume