Document structuring is a subtask of natural language generation (NLG) that involves deciding the order and grouping (for example, into paragraphs) of sentences in a generated text. It is closely related to the content determination NLG task.
Assume we have four sentences which we want to include in a generated text:
- It will rain on Saturday
- It will be sunny on Sunday
- Max temperature will be 10C on Saturday
- Max temperature will be 15C on Sunday
There are 24 (4!) orderings of these messages, including:
- (1234) It will rain on Saturday. It will be sunny on Sunday. Max temperature will be 10C on Saturday. Max temperature will be 15C on Sunday.
- (2341) It will be sunny on Sunday. Max temperature will be 10C on Saturday. Max temperature will be 15C on Sunday. It will rain on Saturday.
- (4321) Max temperature will be 15C on Sunday. Max temperature will be 10C on Saturday. It will be sunny on Sunday. It will rain on Saturday.
Some of these orderings are better than others. For example, of the texts shown above, human readers prefer (1234) over (2341) and (4321).
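The count of 24 can be checked by enumerating the orderings directly. The following is a minimal sketch; the `messages` dictionary and the `render` helper are illustrative names, not part of any NLG system.

```python
from itertools import permutations

# The four example messages, numbered 1-4 as in the text.
messages = {
    1: "It will rain on Saturday.",
    2: "It will be sunny on Sunday.",
    3: "Max temperature will be 10C on Saturday.",
    4: "Max temperature will be 15C on Sunday.",
}

def render(order):
    """Join the messages in the given order into one text."""
    return " ".join(messages[i] for i in order)

# Every permutation of the four message numbers: 4! = 24 orderings.
orderings = list(permutations(messages))

print(len(orderings))
print(render((2, 3, 4, 1)))
```

Enumeration only lists the candidates; deciding which ordering readers prefer is the actual structuring problem.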
For any ordering, there are also many ways in which sentences can be grouped into paragraphs and higher-level structures such as sections. For example, there are 8 (2**3) ways in which the sentences in (1234) can be grouped into paragraphs, since each of the three boundaries between adjacent sentences either is or is not a paragraph break. These include:
- (12)(34)
  - It will rain on Saturday. It will be sunny on Sunday.
  - Max temperature will be 10C on Saturday. Max temperature will be 15C on Sunday.
- (1)(23)(4)
  - It will rain on Saturday.
  - It will be sunny on Sunday. Max temperature will be 10C on Saturday.
  - Max temperature will be 15C on Sunday.
As with ordering, some groupings are preferred over others; for example, (12)(34) is preferred over (1)(23)(4).
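The 8 groupings of a fixed ordering can be enumerated by treating each boundary between adjacent sentences as a yes/no choice of paragraph break. This is a minimal sketch; the function name `groupings` is illustrative, and it assumes the ordering is non-empty.

```python
from itertools import product

def groupings(order):
    """Yield every way to split `order` into contiguous paragraphs.

    Each of the len(order) - 1 boundaries between adjacent sentences
    either holds a paragraph break or does not, giving 2**(n-1) results.
    Assumes `order` contains at least one sentence.
    """
    boundaries = len(order) - 1
    for breaks in product([False, True], repeat=boundaries):
        paragraphs, current = [], [order[0]]
        for sentence, is_break in zip(order[1:], breaks):
            if is_break:
                # Close the current paragraph and start a new one.
                paragraphs.append(tuple(current))
                current = [sentence]
            else:
                current.append(sentence)
        paragraphs.append(tuple(current))
        yield tuple(paragraphs)

all_groupings = list(groupings((1, 2, 3, 4)))
print(len(all_groupings))
for grouping in all_groupings:
    print(grouping)
```

Both (12)(34) and (1)(23)(4) appear among the results, as `((1, 2), (3, 4))` and `((1,), (2, 3), (4,))` respectively.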
The document structuring task is to choose an ordering and grouping of sentences which results in a coherent and well-organised text from the reader's perspective.
Algorithms and Models
There are three basic approaches to document structuring: schemas, corpus-based, and heuristic.
Schemas [1] are templates which explicitly specify sentence ordering and grouping for a document (as well as content determination information). Typically they are constructed by manually analysing a corpus of human-written texts in the target genre and extracting a document template from these texts. Schemas work well in practice for texts which are short (5 sentences or fewer) and/or have a standardised structure, but have problems in generating texts which are longer and do not have a fixed structure.
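For the weather example above, a schema can be sketched as a fixed list of paragraphs, each listing the message slots it contains in order. This is only an illustration: the `(type, day, sentence)` message representation, `WEEKEND_SCHEMA`, and `apply_schema` are hypothetical names, and real schemas are typically richer than a flat slot list.

```python
# Hypothetical message representation: (message type, day, sentence).
# The input order is deliberately scrambled.
messages = [
    ("temperature", "Sunday", "Max temperature will be 15C on Sunday."),
    ("weather", "Saturday", "It will rain on Saturday."),
    ("temperature", "Saturday", "Max temperature will be 10C on Saturday."),
    ("weather", "Sunday", "It will be sunny on Sunday."),
]

# A hand-written schema: each inner list is one paragraph,
# and its slots are filled in the order given.
WEEKEND_SCHEMA = [
    [("weather", "Saturday"), ("weather", "Sunday")],
    [("temperature", "Saturday"), ("temperature", "Sunday")],
]

def apply_schema(schema, messages):
    """Fill each schema slot with its matching message; skip unfilled slots."""
    lookup = {(kind, day): text for kind, day, text in messages}
    paragraphs = []
    for slots in schema:
        sentences = [lookup[slot] for slot in slots if slot in lookup]
        if sentences:
            paragraphs.append(" ".join(sentences))
    return paragraphs

paragraphs = apply_schema(WEEKEND_SCHEMA, messages)
print("\n\n".join(paragraphs))
```

The schema reproduces the preferred (12)(34) structure regardless of input order, but it has no way to place a message type it does not list, which reflects the difficulty schemas have with texts that lack a fixed structure.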
Corpus-based structuring techniques use statistical corpus analysis to automatically build ordering and/or grouping models. Such techniques are common in automatic summarisation, where a computer program automatically generates a summary of a textual document [2]. In principle they could be applied to text generated from non-linguistic data, but this work is in its infancy; part of the challenge is that texts generated by natural language generation systems are generally expected to be of fairly high quality, which is not always the case for texts generated by automatic summarisation systems.
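The general idea of learning an ordering model from a corpus can be illustrated with a very small sketch. It counts which message label follows which in a set of example documents, then picks the ordering with the highest smoothed score. The three-document corpus is invented for illustration, the labels are hypothetical, and this simple bigram model is a stand-in rather than the method of any published system.

```python
import math
from collections import Counter
from itertools import permutations

START = "<start>"

# Invented toy corpus: each document is a sequence of message labels.
corpus = [
    ["weather_sat", "weather_sun", "temp_sat", "temp_sun"],
    ["weather_sat", "weather_sun", "temp_sat", "temp_sun"],
    ["weather_sat", "temp_sat", "weather_sun", "temp_sun"],
]

def train(corpus):
    """Count how often each label follows another or starts a document."""
    counts = Counter()
    for doc in corpus:
        for prev, cur in zip([START] + doc, doc):
            counts[(prev, cur)] += 1
    return counts

def score(order, counts):
    """Sum of log add-one-smoothed transition counts along the ordering."""
    pairs = zip((START,) + tuple(order), order)
    return sum(math.log(counts[pair] + 1) for pair in pairs)

counts = train(corpus)
labels = ["temp_sun", "weather_sat", "temp_sat", "weather_sun"]

# Exhaustive search is feasible only for a handful of messages.
best = max(permutations(labels), key=lambda order: score(order, counts))
print(best)
```

Exhaustive search over all permutations grows factorially with the number of messages, so a practical system would need a cheaper search strategy.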
The final approach is heuristic-based structuring. Such algorithms perform the structuring task based on heuristic rules, which can come from theories of rhetoric [3], psycholinguistic models [4], and/or a combination of intuition and feedback from pilot experiments with potential users [5]. Heuristic-based structuring is appealing intellectually, but it can be difficult to get it to work well in practice, in part because heuristics often depend on semantic information (how sentences relate to each other) which is not always available. On the other hand, heuristic rules can focus on what is best for text readers, whereas the other approaches focus on imitating authors (and many human-authored texts are not well structured).
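A heuristic structurer can be sketched with two hand-written rules for the weather example: messages on the same topic share a paragraph, and within a paragraph days appear chronologically. The rules, the `(topic, day, sentence)` representation, and the names `TOPIC_ORDER`, `DAY_ORDER` and `structure` are assumptions made for illustration; they are not drawn from the cited theories.

```python
from itertools import groupby

# Hypothetical semantic annotations: (topic, day, sentence).
messages = [
    ("temperature", "Sunday", "Max temperature will be 15C on Sunday."),
    ("weather", "Sunday", "It will be sunny on Sunday."),
    ("weather", "Saturday", "It will rain on Saturday."),
    ("temperature", "Saturday", "Max temperature will be 10C on Saturday."),
]

# Assumed rule parameters: topic precedence and chronological day order.
TOPIC_ORDER = {"weather": 0, "temperature": 1}
DAY_ORDER = {"Saturday": 0, "Sunday": 1}

def structure(messages):
    """Order by topic, then by day; start a new paragraph when the topic changes."""
    ordered = sorted(
        messages, key=lambda m: (TOPIC_ORDER[m[0]], DAY_ORDER[m[1]])
    )
    return [
        " ".join(text for _, _, text in group)
        for _, group in groupby(ordered, key=lambda m: m[0])
    ]

paragraphs = structure(messages)
print("\n\n".join(paragraphs))
```

The rules only work because every message carries a topic and a day. This is the kind of semantic information that heuristic approaches depend on and that is not always available.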
Perhaps the ultimate document structuring challenge is to generate a good narrative: a text which starts by setting the scene and giving an introduction or overview, then describes a set of events clearly enough that readers can easily see how the individual events relate to each other, and concludes with a summary or ending. Note that narrative in this sense applies to factual texts as well as stories. Current NLG systems do not do a good job of generating narratives, and this is a major source of user criticism [6].
Generating good narratives is a challenge for all aspects of NLG, but the most fundamental challenge is probably in document structuring.
References
- [1] K McKeown (1985). Text Generation. Cambridge University Press.
- [2] M Lapata (2003). Probabilistic Text Structuring: Experiments with Sentence Ordering. Proceedings of ACL-2003.
- [3] D Scott and C de Souza (1990). Getting the message across in RST-based text generation. In Dale, Mellish, Zock (eds) Current research in natural language generation, pages 47-73.
- [4] N Karamanis, M Poesio, C Mellish, J Oberlander (2004). Evaluating Centering-based metrics of coherence for text structuring using a reliably annotated corpus. Proceedings of ACL-2004.
- [5] S Williams and E Reiter (2008). Generating basic skills reports for low-skilled readers. Natural Language Engineering 14:495-535.
- [6] E Reiter, A Gatt, F Portet, M van der Meulen (2008). The Importance of Narrative and Other Lessons from an Evaluation of an NLG System that Summarises Clinical Data. In Proceedings of INLG-2008.
Wikimedia Foundation. 2010.