Writing and fractality.
The recent availability of writing-process data at keystroke granularity has enabled research on how documents evolve as they are written. A natural next step is to develop systems that display this data and make it understandable. Here we propose a data structure that captures a document's fine-grained history and an organic visualization that serves as an interface to it. We evaluate a proof-of-concept implementation of the system through a pilot study with documents written by students at a public university. Our results are promising and reveal facets such as the general strategies adopted, local editing density, and the hierarchical structure of the final text.
Writing is an old and relevant human skill whose standard product is text. Due to the widespread use of information technologies, most text available today is stored on digital platforms, such as the Web. Although static in their final form, documents (collaborative ones in particular) are works in progress, meaning they are still subject to their writing process. Currently, we do not fully understand this everyday phenomenon, even though the way we write is tied to how we learn and structure our knowledge. This may be in part because, as stated by Grésillon and Perrin in the Handbook of Writing and Text Production, “The written (the product) aims at overcoming the writing (the process)”, which means that the better the quality of a text, the more work was spent obscuring and deleting the traces of its own development.
This lack of understanding, and of research on the topic, can be explained by the fact that data was not easily available until recently. Today, Web services such as Google Drive keep records of document changes at keystroke level, and as this data becomes widely available, new avenues of research open up. However, to the best of our knowledge, this data has not yet been used to better understand the process of writing in order, for example, to make writing easier and more effective or to support teaching. In this paper we show that visualizing this data could be an important step towards understanding it.
There is a vast body of work on text visualization. It follows two main research directions: on the one hand, visualizations for analyzing text, which focus on the finished product; on the other hand, systems that address the evolution of documents, which do so over coarsely versioned text. These latter works focus on collaboration (e.g., Wikipedia content) and are not suited to research on individual writing.
In this paper we propose an interactive visualization method in the largely unexplored field of fine-grained text production data. Built upon organic information design guidelines, the proposed visualization shows the whole fine-grained history of a document in one image and displays its development over time with animation. It can also provide interactive access to the textual content, through which a naturally occurring segmentation of the text can be produced. By allowing the complex behavior of text production to emerge visually, it fosters exploration of the text's structure and evolution through time.
We evaluated our interactive visualization through a pilot study, where we visualized and analyzed documents written by engineering students. The results show different characteristics of the writing process that emerged from the visualization: general strategies adopted, local editing density, and, in some cases, the hierarchical structure of the final text. These results suggest applications for our proposed system in fields such as reviewing, the teaching of writing, and assessing depth of knowledge, among others. We conclude that the study and visualization of fine-grained text data enables a deep understanding of text, as it allows the final product to be augmented with the trace of the decisions made during its production.
Currently, most user interfaces for text visualization focus on finished text, i.e., the final product of the writing process of an agent, or a corpus of such products: visual text summarization, topic modeling, mapping content structure, and recommender systems, among others (see a survey).
There is little work on visualizing the process that generates a document, particularly in the case of the human writing process. The challenge here is to understand the structure and evolution of text across several updates, each of which may add new content and/or delete prior content. Currently, there are two prominent sources for this kind of visual interaction: collaborative, user-generated content and individual, research-generated content.
Regarding collaborative content, the most common data source is Wikipedia, with tools like History Flow, which visualizes the revision history of Wikipedia articles, and the Notabilia project, which visualizes collective deliberation. DocuViz applied the History Flow approach to collaborative documents in Google Docs, and Kim et al. proposed using only document deltas in this same line of visualization.
There are studies that focus on the individual writing process. Perrin and Wildi developed a statistical method to infer writing phases using cursor movement data. Caporossi and Leblay showed a graph-based visualization of the writing of a paragraph with data from ScriptLog (a keystroke logging program), where nodes represent operations and edges represent their topological and temporal relations.
The evolution of single documents has thus been researched either from a collaborative, large-scale perspective using coarse data, or from an individual, fine-grained perspective but only at very small scale. To the best of our knowledge, there is no visualization in between that encompasses these dimensions as a whole and thereby unites writing process research. The system we describe next aims to fill this gap.
We follow the ecological and organic information design approaches to create a natural-looking structure of interdependent units. We implement our prototype in the Processing language, as it is commonly used for organic visualization systems, using data from Google Docs. Our approach is a departure from the linear, bar chart-style schema found in most current work and aims at a similar change in the understanding of a document: not as something linear and static, but as emergent and dynamic, and also irreversible, meaning that nothing is really deleted but rather submerged.
Here we describe the stages of the pipeline needed to arrive at such a depiction: the definition of text operations, the data structure holding those operations, and the visual design that depicts the data structure.
We define a document as a chain of atomic (distinguishable in time) operations (insertions and deletions). As in Perrin’s S-notation, we group adjacent operations in such a way that no voluntary change in cursor position takes place between any two of them. This results in condensed operations we call (linear keystroke) bursts, which are more coherent and significant than single keystrokes because insertions that were immediately deleted, such as corrections of typographical errors, are lost (these correspond to low-level information in the writing process). Finally, we reorder bursts spatially rather than temporally, as pieces of a puzzle that join one to another at the structural points we call Places of Insertion (POIs), which are the points between characters and elements in a document where the blinking cursor can rest.
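The grouping of keystrokes into bursts can be sketched as follows. This is a minimal illustration rather than the system's actual implementation: the `Keystroke` and `Burst` records, the backspace-only deletion model, and the rule that a burst ends whenever a keystroke arrives at a position other than where the previous one left the cursor are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Keystroke:
    # Hypothetical log record; real keystroke logs carry more fields.
    kind: str        # "ins" for a typed character, "del" for a backspace
    pos: int         # cursor position (POI index) before the keystroke
    char: str = ""   # the typed character; empty for deletions


@dataclass
class Burst:
    start: int        # cursor position at the first keystroke of the burst
    text: str = ""    # net text inserted by the burst
    deleted: int = 0  # number of pre-existing characters removed


def group_bursts(keys: List[Keystroke]) -> List[Burst]:
    """Group keystrokes so that no cursor jump occurs inside a burst."""
    bursts: List[Burst] = []
    current: Optional[Burst] = None
    cursor: Optional[int] = None  # where the previous keystroke left the cursor
    for k in keys:
        if current is None or k.pos != cursor:
            # The cursor was moved voluntarily: start a new burst.
            current = Burst(start=k.pos)
            bursts.append(current)
        if k.kind == "ins":
            current.text += k.char
            cursor = k.pos + 1
        else:
            if current.text:
                # An insertion that is immediately deleted leaves no trace.
                current.text = current.text[:-1]
            else:
                current.deleted += 1
            cursor = k.pos - 1
    # Drop bursts whose keystrokes cancelled out completely.
    return [b for b in bursts if b.text or b.deleted]
```

For instance, typing "helo", pressing backspace once, and then typing "lo" collapses into a single burst whose net text is "hello"; the typo never reaches the data structure.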
We store the operations and their spatial relations as a Directed Acyclic Graph, where nodes represent operations and edges represent topological relations between operations, with their direction following the arrow of time (see Fig. 1). Each edge initially points to a “null” node, meaning a free POI. An empty document in its original state maps to a root node, which contains the time of the file’s creation and a single edge that stands for its only POI. At this point, only an insertion can take place (as a deletion needs more than one POI), so the next step is the addition of a new operation node containing the inserted string at the end of the root edge. From this node n + 1 edges emerge, where n is the number of characters inserted, creating new POIs from the original one. This process goes on recursively, always maintaining a tree structure, until deletions are considered. A deletion is a node that bundles n + 1 adjacent edges, where n is the number of characters deleted, back into a single place of operation. Note that this radically changes the original insertions-only tree structure, since a deletion may encompass many levels of the hierarchy.
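A minimal sketch of such a graph is given below, under stated assumptions: the `Node` and `History` classes, the ordered list of free POIs, and the parallel `text` string are illustrative choices for this example, not the paper's implementation. It relies on the fact that a document of N characters has N + 1 POIs, so inserting n characters turns one free edge into n + 1, and deleting n characters merges n + 1 adjacent free edges into one.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class Node:
    kind: str          # "root", "ins" or "del"
    payload: str = ""  # inserted text, or the text removed by a deletion
    # One entry per out-edge; None means the edge points to a free POI.
    out: List[Optional["Node"]] = field(default_factory=list)


class History:
    """Hypothetical container for the operation graph of one document."""

    def __init__(self) -> None:
        self.root = Node("root", out=[None])
        # Free POIs in document order, each stored as (owner node, edge index).
        self.pois: List[Tuple[Node, int]] = [(self.root, 0)]
        self.text = ""

    def insert(self, poi: int, s: str) -> Node:
        """Insert s at the given POI: one free edge becomes len(s) + 1."""
        node = Node("ins", s, [None] * (len(s) + 1))
        parent, slot = self.pois[poi]
        parent.out[slot] = node
        self.pois[poi:poi + 1] = [(node, i) for i in range(len(s) + 1)]
        self.text = self.text[:poi] + s + self.text[poi:]
        return node

    def delete(self, start: int, n: int) -> Node:
        """Delete n characters from index start: n + 1 edges merge into one."""
        node = Node("del", self.text[start:start + n], [None])
        for parent, slot in self.pois[start:start + n + 1]:
            parent.out[slot] = node  # several parents may share this child
        self.pois[start:start + n + 1] = [(node, 0)]
        self.text = self.text[:start] + self.text[start + n:]
        return node
```

Inserting "abc" into an empty document, then "XY" after the "a", and finally deleting "Yb" yields a deletion node with three incoming edges, two from the second insertion and one from the first. This is the case in which the structure stops being a tree.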
We used a glyph-based approach to visualize the aforementioned data structure, where glyphs act as interdependent units and build upon each other. Intuitively, an insertion “opens up” space in the document, by splitting one POI into many, while deletions “close” it, by joining many POIs back into one. The glyph designed to represent insertion nodes is, therefore, a stylized multiplexer. Deletion nodes, on the other hand, do not have their own glyph but retroactively affect insertion glyphs. Figure 2 illustrates this.
Seeing the visualization as a mapping from the data structure to the visual space, the rules that define this mapping are:
For each insertion node, there is one glyph that represents it and its first-level out-edges.
An edge leading from one insertion node to another means that the corresponding glyphs are related: the latter is placed on top of the former, at the position corresponding to its relative POI within its parent.
An edge leading to a deletion node changes the glyph as shown in Figure 2.
Cosmetics. To avoid spiraling branches, a “phototropism factor” is applied to the growth of the tree, mimicking the plant behavior of growing upwards. Time is represented using a cyclical eight-color categorical palette (see Figure 2, right): nodes are colored according to the session (considered here as a day of writing) in which they were added. The radius of an arc doubles whenever its central angle would otherwise exceed a set threshold.
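The two cosmetic rules can be illustrated with a short sketch. Everything specific here is an assumption: the hex colors are placeholders rather than the palette of Figure 2, a session is taken to be a calendar day counted from the first edit, and the phototropism factor is modeled as a simple linear pull of a branch angle towards the vertical.

```python
from datetime import datetime

# Placeholder eight-color categorical palette (not the one used in the paper).
PALETTE = [
    "#1b9e77", "#d95f02", "#7570b3", "#e7298a",
    "#66a61e", "#e6ab02", "#a6761d", "#666666",
]


def session_index(timestamp: datetime, first_edit: datetime) -> int:
    """Number of calendar days between the first edit and this operation."""
    return (timestamp.date() - first_edit.date()).days


def session_color(timestamp: datetime, first_edit: datetime) -> str:
    """Cycle through the palette, so colors repeat after eight sessions."""
    return PALETTE[session_index(timestamp, first_edit) % len(PALETTE)]


def bend_upwards(angle: float, factor: float) -> float:
    """Pull a branch angle towards the vertical.

    angle is measured in radians from the upward vertical; factor ranges
    from 0.0 (no correction) to 1.0 (the branch grows straight up).
    """
    return angle * (1.0 - factor)
```

Because the palette is cyclical, two nodes written exactly eight sessions apart receive the same color, so color alone distinguishes sessions only within a window of eight.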
Interaction. “Phototropism”, as well as the ratio of arc length to node size, can be manipulated dynamically to change the overall shape of the tree and improve visibility. When a glyph is selected, the textual contents of its branch (deleted and active children) are displayed on screen in a notation similar to S-notation. Parts of the tree may also be hidden at will.
We performed a pilot study in which computer science students from a public university were asked to share documents they had written in Google Drive. In total, we obtained 60 documents of different lengths (from a few paragraphs to full-length articles) and purposes (though they were all course assignments). Most of them were ruled out before visual analysis because they were incomplete or did not show enough complexity. We selected five documents to show here because they complement one another (see Table 1). For each one, we identified the visualization’s branching structure, which leads to a hierarchical segmentation of the tree. Then, we inspected each branch’s content and identified which part of the document corresponded to the branch. We also took note of branch length and breadth, and of important deletions, which we interpreted in the context of each document.
There are three dimensions of document evolution that, according to the analysis, are well captured by our system:
The internal organization of the text and its hierarchical structure (Fig. 3). We observed that the branches of a tree mostly correspond to the hierarchical structure of the text. In Cases A and C, branches match paragraph divisions, as these documents have no other hierarchical level. Cases B and E have a typical hierarchical organization (cover information, sections, and bibliography), which is matched by the relations among the corresponding branches. Case D also has no structure beyond the paragraph level, as its “one big branch” appearance suggests.
Some patterns and strategies adopted by users (Fig. 4). The structure of the tree also reveals the strategies used to write the document. Cases B and E, for example, show a well-defined hierarchical structure, meaning they were written with the final structure in mind from the beginning, something that can be expected in a course assignment. Case B shows a draft that was rewritten and erased, while A was written almost linearly, without important deletions.
The amount of work put into the document and its different parts. This dimension emerges from the color heterogeneity and glyph density of a branch. Cases A, B, C and E have branches of only one color, meaning they were introduced during one session with no later rewriting, whereas D (Fig. 5) has branches showing many appendices of different colors, meaning they were reread and edited in later sessions. Moreover, the highest editing density is concentrated around the deletion of a large piece of text that was pasted from another document.
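In the study these cues were read visually, but the third dimension could also be approximated numerically. The sketch below is an assumption-laden illustration, not part of the described system: it uses a hypothetical `Glyph` record holding only a session index and child glyphs, and it measures color heterogeneity as the set of distinct sessions found in a branch and density as the number of glyphs it contains.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Glyph:
    # Hypothetical minimal glyph: the session it was written in and its children.
    session: int
    children: List["Glyph"] = field(default_factory=list)


def branch_sessions(glyph: Glyph) -> Set[int]:
    """Distinct sessions in a branch, a proxy for its color heterogeneity."""
    seen = {glyph.session}
    for child in glyph.children:
        seen |= branch_sessions(child)
    return seen


def branch_size(glyph: Glyph) -> int:
    """Number of glyphs in a branch, a proxy for its editing density."""
    return 1 + sum(branch_size(child) for child in glyph.children)


# A branch written in one sitting versus one revisited in later sessions.
one_pass = Glyph(0, [Glyph(0), Glyph(0)])
revised = Glyph(0, [Glyph(1, [Glyph(3)]), Glyph(0)])
```

Under these assumptions, a branch with a single session in `branch_sessions` corresponds to the one-color branches described above, while several sessions indicate later rereading and editing.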
In summary, we observed that the system captures important components of the writing process. Next, we discuss the implications of these findings and directions for future work.
Our results shed light on the dynamic origins of text and the structures underlying the process of writing. These findings could be useful in education (e.g., evaluation and assessment of learning), work (e.g., matching thinking structure to teams, which could be used in hiring processes), and natural language processing (e.g., by including human writing processes in automated document generation or document summarization). A direct application of our system is a real-time writing aid in document writing tools, which would return to the document its heterogeneity by showing, for example, the relative age of parts of the text, their need for update, and the thread they belong to.
Scope and Future Work. A fair critique is that, because deletion nodes lack a glyph of their own, the visualization captures only a subset of the data structure, i.e., it is only a spanning tree of the whole graph. This makes a document’s representation non-unique, which is a design fault because it introduces a degree of freedom not present in the data. Future work, then, should include the design of deletion nodes so that they play a structural role. Branch overlapping is also a major problem, which currently makes it impossible to analyze larger documents. A solution would be to implement glyph space-awareness and interactive expansion. Finally, the pilot study showed that coloring a tree by its branching structure is a necessary step for analysis, so a useful intelligent feature would be to automate this segmentation and highlighting and link it to the final document.
Conclusions. We have presented a novel visualization design for document evolution, which combines an operational view of the document with an organic visual scheme, and have shown that it renders visible some complex behavior in writing. It can be used, for example, to get an overview of a document’s whole history in a single image, which is enough to give an idea of the amount of work put into it and the general strategy adopted. Examples of such strategies are rewriting from a draft, writing with a structure in mind, and one-session versus many-session writing. To the best of our knowledge, available systems cannot provide these features for a single-session or single-user document at this level of granularity. Also, with its interactive functions, the system can be used to produce a segmentation of a document, which in some cases coincides with its hierarchical structure, but in any case is a naturally occurring segmentation that follows the user’s thread of thought. We present this approach and system to help integrate computer-aided writing research by proposing a clear focus on the document as a well-defined temporal object. Topic segmentation should not be abstracted from a document’s history when that history is available, and this approach proves a fair candidate for segmenting a document through its own writing history.