Now that the summer’s over, there are still a few things for me to finish up on my project. While the tool’s core functionality is implemented, it still lacks a few features needed to make it fully usable, such as a dialog for setting options like how many topics to generate. Additionally, I plan to implement a system overview visualization that will present all of the classes in a project, colored according to the topic they belong to. Like the single topic view, this view will use a tree map, but it will visualize the distribution of topics over all of the system’s classes.

Finally, once my plugin is polished, I’ll also be working on a paper introducing the tool to submit to conferences early next year. Anyway, I hope you’ve enjoyed what you’ve read about my research; if you happen to have any questions, I’ll certainly do my best to illuminate things if you post a comment or send me an email.

The Design

As alluded to in the previous post, the dependency link view of X-Ray and the Tree Map were two visualizations that I thought might prove particularly useful in presenting our topic information. Incorporating these elements, the design I came up with uses two primary views. The first aims to introduce all of the topics that our tool has found, giving the user an idea of what concepts they encapsulate and how they might relate to each other. The second view presents a single topic in depth, showing the source documents that make it up so that the user can extend their understanding of the abstract concepts in a body of source code to the concrete source that implements the concepts.


Topics Overview

The overview presents all of the topics, along with the top few words that are most associated with each topic and the top few packages or classes that are associated with the topic. This information will hopefully be enough to get an idea of what concepts the topic encompasses. Additionally, this view presents the dependency links between documents in different topics; each time code in one topic refers to code in another topic, that’s represented by a corresponding arrow between the two topics. This should hopefully give the user some general idea of how the concepts relate to each other.
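
To make the arrows a little more concrete, here is a rough sketch in Java of how class-level dependencies could be rolled up into topic-to-topic links. This is not the plugin's actual code: the class name TopicLinks, the method countLinks, and the sample class names are all invented for illustration, and it assumes each class has already been assigned to a single topic.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: rolls class-level dependencies up into topic-level links.
class TopicLinks {

    // Each dependency is a two-element array {fromClass, toClass}.
    // Returns a map from "sourceTopic->targetTopic" to the number of
    // class-level dependencies that cross between those two topics.
    static Map<String, Integer> countLinks(Map<String, Integer> topicOfClass,
                                           List<String[]> dependencies) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] dep : dependencies) {
            Integer src = topicOfClass.get(dep[0]);
            Integer dst = topicOfClass.get(dep[1]);
            // Skip classes without a topic and links that stay inside one topic.
            if (src == null || dst == null || src.equals(dst)) {
                continue;
            }
            counts.merge(src + "->" + dst, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data: two UI classes in topic 0, two I/O classes in topic 1.
        Map<String, Integer> topicOfClass = Map.of(
                "ui.Window", 0, "ui.Button", 0, "io.FileReader", 1, "io.Parser", 1);
        List<String[]> deps = List.of(
                new String[] {"ui.Window", "ui.Button"},   // same topic, ignored
                new String[] {"ui.Window", "io.FileReader"},
                new String[] {"ui.Button", "io.Parser"},
                new String[] {"io.Parser", "ui.Window"});
        System.out.println(countLinks(topicOfClass, deps));
    }
}
```

Each entry in the resulting map would correspond to one arrow in the overview, and the count could be used to decide how heavily to draw it.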


Single Topic View

The single topic view, obtained by clicking on one of the topics in the topic overview, displays the classes that are associated with the topic the user has selected. These are displayed in a tree map according to their place in the package hierarchy, the typical means of organizing code in a system written in Java. The size of each box in the tree map represents a particular variable: at the moment, either that class’s size in lines of code or the degree to which it belongs to the topic it has been placed in. For now, the user navigates back to the overview by right-clicking anywhere on this view.
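
As a rough illustration of the data behind this view, the sketch below builds a package hierarchy from fully qualified class names and sums a weight up the tree. It is only a sketch under my own assumptions: PackageNode, addClass, and totalWeight are hypothetical names, and the weight stands in for whichever variable is chosen (lines of code or topic membership).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical tree node: a package, or a class when it has no children.
class PackageNode {
    final String name;
    final Map<String, PackageNode> children = new TreeMap<>();
    double ownWeight; // nonzero only for classes

    PackageNode(String name) {
        this.name = name;
    }

    // Inserts a class by its fully qualified name, creating package nodes as needed.
    void addClass(String qualifiedName, double weight) {
        PackageNode node = this;
        for (String part : qualifiedName.split("\\.")) {
            node = node.children.computeIfAbsent(part, PackageNode::new);
        }
        node.ownWeight = weight;
    }

    // A node's box size: its own weight plus the weight of everything below it.
    double totalWeight() {
        double sum = ownWeight;
        for (PackageNode child : children.values()) {
            sum += child.totalWeight();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up classes weighted by lines of code.
        PackageNode root = new PackageNode("");
        root.addClass("org.example.ui.Window", 120);
        root.addClass("org.example.ui.Button", 80);
        root.addClass("org.example.io.Parser", 300);
        System.out.println(root.totalWeight());
    }
}
```

A tree map layout would then give each package a box proportional to its totalWeight and nest its children's boxes inside it.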

And that’s an overview of the design’s goals and how it tries to accomplish them.

Visualization Options

After getting the interface lined up between an LDA library, JGibbLDA, and the Eclipse IDE that our tool will be based in, I started looking into the myriad options for visualizing the information we can generate. An incredible number of visualizations of software systems have been proposed over the years, from the 3D CodeCity, which maps source files to buildings and groups of files to blocks, to SeeSoft’s line view, which simply maps each line of source code in a file to a line of pixels. The difficult part of our project is determining which visualization, or combination of visualizations, might actually be best for the information we want to present.

Figures: CodeCity visualization; SeeSoft line visualization

One of the visualizations previously proposed for topics, by Ducasse et al., is the Distribution Map. This quite simply displays source documents in their respective packages, and uses each document’s color to show the topic to which it most belongs. While this view provides an interesting overview of a system and the spread of topics within it, it doesn’t seem to be the best view for our main goal, which is allowing for the exploration of unfamiliar systems. Another related view, investigated by Kuhn et al., is the rather literal Software Map, which draws a topographic-style map reflecting similarities between code documents by the distance between them, and the size of the documents by the topographic size of their “hills”.

Figures: Distribution Map; Software Map

Since one of the additional pieces of data that I’m interested in integrating into our visualizations is the structural links between source documents, another interesting view is that implemented by the X-Ray tool. One of X-Ray’s views involves drawing arrows between code elements, arrows that represent one code document’s dependency upon another. I think this information could be very useful in addition to the topic information I discussed in a previous post, and so X-Ray’s visualization is very interesting for my project.

Figures: X-Ray dependency views (xray1.png, xray2.png)

Finally, another visualization, introduced in 1992, that I think could be interesting to use in our tool is the Tree Map. Put to good use in a number of tools that visualize how your hard drive’s space is used up, Tree Maps can be useful in any context in which you have hierarchical documents with a size property.
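
For a sense of how a tree map turns sizes into boxes, here is a minimal Java sketch of a single slice-and-dice style split, where a rectangle is cut into strips whose areas are proportional to the sizes. It is an illustration only, not necessarily the layout our tool uses; SliceLayout and split are hypothetical names. A full tree map would apply the same split recursively to each child, typically alternating the direction at each level.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical one-level tree map split.
class SliceLayout {

    // Splits the rectangle (x, y, w, h) into one strip per size.
    // Each result is {x, y, width, height}; strips sit side by side when
    // horizontal is true and are stacked top to bottom otherwise.
    static List<double[]> split(double x, double y, double w, double h,
                                double[] sizes, boolean horizontal) {
        double total = 0;
        for (double s : sizes) {
            total += s;
        }
        List<double[]> rects = new ArrayList<>();
        if (total <= 0) {
            return rects; // nothing to lay out
        }
        double offset = 0;
        for (double s : sizes) {
            double fraction = s / total;
            if (horizontal) {
                rects.add(new double[] {x + offset, y, w * fraction, h});
                offset += w * fraction;
            } else {
                rects.add(new double[] {x, y + offset, w, h * fraction});
                offset += h * fraction;
            }
        }
        return rects;
    }

    public static void main(String[] args) {
        // Two made-up items, the second three times the size of the first.
        for (double[] r : split(0, 0, 100, 50, new double[] {1, 3}, true)) {
            System.out.println(java.util.Arrays.toString(r));
        }
    }
}
```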

Figures: two Tree Map examples

Having considered all of these visualization options, along with a number of other possibilities, I had to select the combination that would best fit our goal of supporting the exploration of a topic map of a software system, and then implement the visualization in our Eclipse plugin.

The Generation and Use of Linguistic Topics

The underlying assumption behind our project is that by presenting the developer with the general ideas present in a particular document of source code (the set of instructions that make up any piece of software), we can help them understand that source code without having to read the code itself to discover these ideas. By mining these general ideas, or topics, that a document is made up of, we can attempt to take over the job of comprehending the purpose of every section of code.

For this we use an algorithm that has been gaining popularity recently, Latent Dirichlet allocation. This technique, also known as LDA, considers each document as being composed of a mixture of one or more topics, based on the words used within that document. After its analysis is complete, we’re left with a selection of topics, each of which should relate to some abstract concept that the code deals with, as well as information about what mixture of topics each document is made of and what words are closely related to each topic.
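
To give a feel for what we do with that output, here is a small Java sketch that takes one document's topic mixture and one topic's word weights and pulls out the dominant topic and the top few words. It does not run LDA itself and does not use any particular library's API; TopicSummary, dominantTopic, and topWords are hypothetical names, and the numbers are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helpers for reading the results of a topic model.
class TopicSummary {

    // Index of the topic with the largest share in one document's mixture.
    static int dominantTopic(double[] mixture) {
        int best = 0;
        for (int k = 1; k < mixture.length; k++) {
            if (mixture[k] > mixture[best]) {
                best = k;
            }
        }
        return best;
    }

    // The n words with the highest weight in one topic, strongest first.
    static List<String> topWords(double[] wordWeights, String[] vocabulary, int n) {
        Integer[] order = new Integer[vocabulary.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort word indices by descending weight.
        Arrays.sort(order, (a, b) -> Double.compare(wordWeights[b], wordWeights[a]));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, order.length); i++) {
            result.add(vocabulary[order[i]]);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up example: a document that is mostly topic 1.
        double[] mixture = {0.2, 0.7, 0.1};
        String[] vocabulary = {"file", "button", "window", "parse"};
        double[] topicOneWeights = {0.05, 0.5, 0.3, 0.15};
        System.out.println("dominant topic: " + dominantTopic(mixture));
        System.out.println("top words: " + topWords(topicOneWeights, vocabulary, 2));
    }
}
```

The dominant topic is the sort of value that could drive a class's color in a visualization, while the top words give each topic a human-readable label.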

Since these topics generally relate closely to concepts in the source code, they should provide a simplified overview of the software system being studied before the developer has to drill down into the specific part that they’re interested in. Thus the quandary becomes how best to present this information to the developer.

In addition to the topic information that is our primary concern, there are a number of other pieces of information that we can also visualize, such as the size of code documents, their number of methods or attributes, or the links between documents that can be found in the structure of the source code. These other potentially useful pieces of information must also be considered when designing the display for our tool.

Introduction: Linguistic Topics in Source Code

Hello! I’m Trevor Savage, a rising Senior at the College pursuing a Computer Science major and a Music minor. This summer I’m doing research with Professor Denys Poshyvanyk in the field of software engineering, specifically working on a tool that will automatically analyze source code in order to determine some of the code’s overarching structure. It does this by looking at textual similarities in the natural, human language contained in different chunks of source code.

While Computer Aided Software Engineering (CASE) tools have failed to live up to their promise of revolutionizing the efficiency of software engineers, these tools have proved quite capable of improving developer efficiency in a more evolutionary manner. Much of the software engineering and development process is creative, requiring a human decision maker to actually design the system. But computers are far better at processing large amounts of information than humans are, and that’s where CASE tools can be particularly effective.

By processing the source code of a software system and creating a map of sorts that can guide a developer in their own exploration of the system, our tool should be able to ease some of the initial work of a developer approaching an unfamiliar project. Hopefully, such a developer will then be able to start working on a new project more quickly than if they set out to get a grasp on it without any aids. In particular, the tool should be helpful in situations where the project’s own documentation is inadequate, as can frequently be the case.

Ultimately, after our tool is complete, we’ll be able to study how effective it actually is at easing a developer’s comprehension of a program. One variable we may study is the usefulness of different ways of visualizing the topic map. The design of these visualizations should prove very interesting, since there are so many different ways to display different pieces of information, and ultimately we need a design that displays plenty of relevant information and yet is relatively easy to use and understand.

Currently, I’m working out the design for the tool, sketching out plans for the graphical user interface and working on the back end that will actually generate the topic maps of source code. By the end of the summer, we should have a complete prototype with which we can hopefully conduct more in-depth user testing next semester.