Monthly Archive for July, 2009

The Generation and Use of Linguistic Topics

The underlying assumption behind our project is that presenting a developer with the general ideas present in a particular document of source code (the set of instructions that make up any piece of software) helps them understand that code without having to read it line by line to discover those ideas. By mining the general ideas, or topics, that a document is made up of, we can attempt to take over the job of comprehending the purpose of every section of code.

For this we use an algorithm that has been gaining popularity recently, Latent Dirichlet allocation. This technique, also known as LDA, considers each document as being composed of a mixture of one or more topics, based on the words used within that document. After its analysis is complete, we’re left with a set of topics, each of which should relate to some abstract concept that the code deals with, along with two further pieces of information: the mixture of topics that each document is made of, and the words that are closely related to each topic.
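To make those outputs concrete, here is a minimal sketch of LDA using collapsed Gibbs sampling, which is one common way of fitting the model and not necessarily the one our tool uses. The function name `lda_gibbs`, its parameters, and the toy token lists are all assumptions made for illustration; a real run would start from identifiers and comments extracted from actual source files.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling (illustrative)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens in doc d assigned to topic k
    topic_word = [Counter() for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                        # total tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic in proportion to how much the document
                # uses each topic and how much each topic uses this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed topic mixture per document, and the most frequent words per topic.
    mixtures = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [w for w, n in topic_word[k].most_common(3) if n > 0]
        for k in range(num_topics)
    ]
    return mixtures, top_words

# Hypothetical token lists standing in for words mined from four source files.
docs = [
    ["socket", "connect", "send", "packet", "socket"],
    ["window", "button", "click", "render", "window"],
    ["packet", "send", "socket", "timeout"],
    ["render", "button", "window", "layout"],
]
mixtures, top_words = lda_gibbs(docs, num_topics=2)
```

Here `mixtures[d]` is the topic mixture of document `d` and `top_words[k]` lists the words most associated with topic `k`, the two kinds of result described above. The sampler is random, so the exact topics depend on the seed and the number of iterations.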

Since these topics generally relate closely to concepts in the source code, they should provide a simplified overview of the software system being studied before the developer has to drill down into the specific part that they’re interested in. Thus the quandary becomes how best to present this information to the developer.

In addition to the topic information that is our primary concern, there are a number of other pieces of information that we can also visualize, such as the size of code documents, their number of methods or attributes, or the links between documents that can be found in the structure of the source code. These other potentially useful pieces of information must also be considered when designing the display for our tool.
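As a rough sketch of what the display has to work with, the per-document information listed above could be gathered into a single record. The class name `CodeDocument` and all of its field names are assumptions made for illustration, not the structure our tool actually uses.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class CodeDocument:
    """Hypothetical record of everything we might visualize for one source document."""
    name: str
    topic_mixture: List[float]          # share of each topic in this document
    lines_of_code: int                  # one possible measure of document size
    num_methods: int
    num_attributes: int
    links: Set[str] = field(default_factory=set)  # names of documents this one refers to

    def dominant_topic(self) -> int:
        """Index of the topic with the largest share in this document."""
        return max(range(len(self.topic_mixture)), key=self.topic_mixture.__getitem__)

# Example record with made-up values.
doc = CodeDocument(
    name="NetworkClient",
    topic_mixture=[0.2, 0.7, 0.1],
    lines_of_code=340,
    num_methods=12,
    num_attributes=5,
    links={"Packet", "Logger"},
)
dominant = doc.dominant_topic()
```

Keeping the topic mixture alongside the structural measures in one record means a display can map any of them to a visual property, for example size to area and dominant topic to colour.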