[NOTES] On the Naturalness of Software

by Hindle et al.

September 19, 2018 - 3 minute read
Tags: ml, notes, code, NLP

ACM Digital Library - 2012

What?

Measuring the repetitiveness and predictability of source code, and comparing them to those of natural language.

Why?

Natural language is rich and complex, but the majority of utterances fall within a fairly restricted and repetitive set of words. This regularity allows the use of statistical models instead of formal language models. The assumption here is that source code follows a similar pattern, which might be accentuated by its even more rigid structure.

Statistical language models have enabled huge successes in translation, generation, search, and more. If the assumption of naturalness holds true, the same progress could be made with source code.

How?

RQ1: Do n-gram models capture regularities in software?

Build n-gram models of large text and source-code corpora and compute cross-entropy: how surprising an unseen document is to a model trained on the rest of the corpus. The authors compare English cross-entropy against code cross-entropy and find that “software is far more regular than English”.
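A minimal sketch of this measurement, assuming whitespace-tokenized documents and simple add-one smoothing (the paper uses more careful smoothing; `NGramModel` and its interface are my own reconstruction, not the authors’ code):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Pad a token sequence and return its n-grams."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

class NGramModel:
    """Toy n-gram model: counts n-grams, scores documents by cross-entropy."""

    def __init__(self, n):
        self.n = n
        self.counts = Counter()          # n-gram occurrence counts
        self.context_counts = Counter()  # (n-1)-gram context counts
        self.vocab = set()

    def train(self, documents):
        for tokens in documents:
            self.vocab.update(tokens)
            for gram in ngrams(tokens, self.n):
                self.counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def prob(self, gram):
        # Add-one (Laplace) smoothing for brevity; the paper relies on
        # proper smoothing techniques to handle unseen n-grams.
        v = len(self.vocab) + 1
        return (self.counts[gram] + 1) / (self.context_counts[gram[:-1]] + v)

    def cross_entropy(self, tokens):
        # H(d) = -(1/m) * sum_i log2 p(t_i | previous n-1 tokens)
        grams = ngrams(tokens, self.n)
        return -sum(math.log2(self.prob(g)) for g in grams) / len(grams)
```

A lower cross-entropy on held-out documents means the corpus is more regular; the paper’s central result is that code corpora score consistently lower than English under the same model family.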


RQ2: Is the local regularity that the statistical LM captures merely language-specific or is it also project-specific?

Evaluate cross-entropy across different software projects, finding that the captured structure is largely project-specific. In other words, each project carries strong local regularity rather than following a single global pattern.
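A sketch of that comparison, reusing the hypothetical `NGramModel` above (`projects` maps a project name to its list of tokenized files; the split and averaging details are my assumptions):

```python
def self_vs_cross(projects, name, n=3):
    """Compare within-project entropy to cross-project entropy for `name`."""
    docs = projects[name]
    split = len(docs) // 2

    self_model = NGramModel(n)   # trained on the project's own files
    self_model.train(docs[:split])

    cross_model = NGramModel(n)  # trained on every other project
    for other, other_docs in projects.items():
        if other != name:
            cross_model.train(other_docs)

    test_docs = docs[split:]

    def avg(model):
        return sum(model.cross_entropy(d) for d in test_docs) / len(test_docs)

    return avg(self_model), avg(cross_model)
```

The paper’s finding corresponds to the first number being consistently lower than the second.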


RQ3: Do n-gram models capture similarities within and differences between project domains?

The authors compare cross-entropies within and across application domains, finding that local regularities do appear within a domain and much less so across domains. This indicates an influence of the software’s function, not just its form.
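The same machinery extends to the domain question; a sketch, where `domains` is a hypothetical mapping from project name to domain label:

```python
def domain_gap(projects, domains, n=3):
    """Average cross-entropy for same-domain vs different-domain project pairs."""
    models = {name: NGramModel(n) for name in projects}
    for name, model in models.items():
        model.train(projects[name])

    same, diff = [], []
    for a in projects:
        for b in projects:
            if a == b:
                continue
            # Score project a's files under project b's model.
            score = sum(models[b].cross_entropy(d)
                        for d in projects[a]) / len(projects[a])
            (same if domains[a] == domains[b] else diff).append(score)

    return sum(same) / len(same), sum(diff) / len(diff)
```

A lower same-domain average would reflect the within-domain regularity the authors report.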


Evaluation

  • Text corpora
    • Brown corpus
    • Gutenberg Corpus
  • Software corpora
    • 10 Java projects
    • 10 Ubuntu applications

The authors also apply their findings to a code-completion extension for Eclipse, which is shown to improve on the built-in engine. Its gains do depend on the length of the proposed token, but it does not simply propose reserved keywords, demonstrating that it captures context.
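A hedged sketch of how such an engine can rank candidates with the model above (the actual plug-in merges n-gram suggestions with Eclipse’s own; this standalone ranking is my simplification):

```python
def suggest(model, context_tokens, k=5):
    """Return the k tokens the n-gram model finds most likely next."""
    context = tuple(context_tokens[-(model.n - 1):])
    # Score every known token as a continuation of the current context;
    # smoothing guarantees a nonzero probability even for unseen contexts.
    scored = [(model.prob(context + (tok,)), tok) for tok in model.vocab]
    return [tok for _, tok in sorted(scored, reverse=True)[:k]]
```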

Comments

  • More of a comment on the usage of code than on its nature. Then again, that is what made NLP so effective: doing away with theoretical models of language to focus on how it is used.
  • No comparison of project and domain-based cross-entropies for text.
  • Nicely written paper, often answering a question that arises while reading in the very next section.
  • The regularity of software is not so surprising, but the fact that it goes beyond pure structure is interesting.
  • Would have liked to see a comparison of the suggestion ranks of proposed words in the code-completion setting.
  • More discussion of the “non-language-specific” tokens being proposed would have been welcome: are they class names, variable names? How are the predictions distributed?
  • Funny quote from the paper: “Every time a linguist leaves our group the performance of our speech recognition goes up” - Fred Jelinek

Selected references


  1. “A study of the uniqueness of source code”, by Gabel and Su

  2. “Data Mining for Software Engineering”, by Tao et al.

  3. “Evaluating a Software Word Usage Model for C++”, by Malik et al.

  4. “IDE 2.0: collective intelligence in software development”, by Bruch et al.

  5. “What’s in a Name? A Study of Identifiers”, by Lawrie et al.