This project develops an advanced multi-document summarization system that can deal efficiently with large sets of 50 to 500+ documents, such as may be returned by an information retrieval or filtering system in response to a broad query or a user profile. Our approach is to detect topics and themes within the set, summarize each of them, and then compose an overall summary. If the topics are not related to one another, a “meta-document” similar to the Wall Street Journal’s What’s News is produced. Where topic correlations exist, a condensed cross-document summary is derived.

overall architecture

A two-stage approach is used to identify the main topics and themes in the document set. Documents are compared using a similarity map drawn by correlating paragraph-size passages. Documents displaying varying degrees of passage similarity and overlap are then clustered. Several document-level cluster types are identified; each produces a different summary style. Clustering establishes topic or theme “seeds” based on a sample of the documents in the set. The number of seeds depends upon the complexity and diversity of the original document set, and it can also be regulated by the user. The seeds are then populated by the remaining documents. This approach also works for dynamic data sets (such as streaming news feeds): the stream is sampled repeatedly to build seeds, and incoming documents are classified on the fly.
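The seed-and-populate idea can be illustrated with a minimal sketch. This is not the XDoX implementation (which is written in Java and C++); it is a simplified Python illustration under stated assumptions: passages are blank-line-separated paragraphs, passage similarity is bag-of-words cosine, and seeds are formed greedily against a fixed threshold. The function names, the stopword list, and the threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "and", "as", "of", "to", "in", "with"}

def passages(doc):
    """Split a document into paragraph-size passages (blank-line separated)."""
    return [p for p in re.split(r"\n\s*\n", doc) if p.strip()]

def vector(text):
    """Bag-of-words term counts for one passage, minus stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def doc_similarity(d1, d2):
    """Average best passage match, taken in both directions (assumed measure)."""
    v1 = [vector(p) for p in passages(d1)]
    v2 = [vector(p) for p in passages(d2)]
    if not v1 or not v2:
        return 0.0
    forward = sum(max(cosine(x, y) for y in v2) for x in v1) / len(v1)
    backward = sum(max(cosine(y, x) for x in v1) for y in v2) / len(v2)
    return (forward + backward) / 2

def build_seeds(sample, threshold):
    """Greedily group sampled documents; each seed is a list of documents."""
    seeds = []
    for doc in sample:
        # Compare against the first document of each seed as its representative.
        scores = [doc_similarity(doc, seed[0]) for seed in seeds]
        if scores and max(scores) >= threshold:
            seeds[scores.index(max(scores))].append(doc)
        else:
            seeds.append([doc])  # no close seed: start a new topic seed
    return seeds

def populate(seeds, remaining):
    """Assign each remaining document to its most similar seed."""
    for doc in remaining:
        scores = [doc_similarity(doc, seed[0]) for seed in seeds]
        seeds[scores.index(max(scores))].append(doc)
    return seeds
```

In this sketch the threshold plays the role of the user-adjustable control over the number of seeds: raising it yields more, narrower seeds. For a streaming source, `build_seeds` could be re-run on periodic samples while `populate` handles documents as they arrive.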

The XDoX system is written in Java and C++ and runs on Solaris. A Windows NT release is being prepared.

the GUI

For more information, please contact Ms. Hilda Hardy (hardyh [at] cs [dot] albany [dot] edu) or Prof. Tomek Strzalkowski (tomek [at] albany [dot] edu).

Selected Publications

G. Stein, T. Strzalkowski and B. Wise, Interactive, Text-Based Summarization of Multiple Documents, Computational Intelligence, vol. 16(4), pp. 606-613. 2000

T. Strzalkowski, G. Stein, J. Wang and B. Wise, Robust Practical Text Summarization, In M. Maybury and I. Mani (eds), Advances in Automatic Text Summarization, MIT Press. 1999, pp. 301-320.

T. Strzalkowski, J. Wang and B. Wise, Summarization-based Query Expansion in Information Retrieval, Proceedings of Coling-ACL’98, Montreal. 1998, pp. 1258-1264.

T. Strzalkowski, J. Wang and B. Wise, A Robust, Practical Text Summarization, Proceedings of AAAI Spring Symposium, Stanford University, 1998, pp. 28-33.

NOTE: To download these papers and to find other publications, see the papers section.