eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Clone Detection in R

Abstract

Copy-and-paste code offers an immediate convenience in exchange for latent risk. Clones scattered across a project become difficult to modify consistently.

Simple awareness can mitigate this condition. A clone detection tool can identify code duplicates, empowering the user to eliminate the clone.

Despite the rising popularity of data science, the R community has yet to see an effective, industrial-strength clone detector. This report presents a clone detection process specialized to R.

Our tool, RClone, combines a metric-based approach with a post-processing step inspired by token-based techniques. Following Deckard, R source code is converted to an abstract syntax tree, and each subtree is encoded as a characteristic vector. Comparing these vectors gives a scalable and effective similarity calculation. To compare code structure more precisely, we derive a program abstraction technique from CCFinder: string comparison is applied to generalized source code that has been stripped of superficial identifiers.
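The metric-based stage can be illustrated with a minimal sketch. It is not RClone's implementation: the tree encoding, the node-type names, and the size-based normalization in `relative_distance` are all assumptions made for illustration. Each subtree is summarized by a vector counting how often each node type occurs, and two subtrees are compared by the distance between their vectors.

```python
from collections import Counter
from math import sqrt

# Toy AST node: (node_type, [children]). The node-type names used below are
# hypothetical, loosely modeled on what an R parser might emit.
def node(kind, *children):
    return (kind, list(children))

def characteristic_vector(tree):
    """Count occurrences of each node type in the subtree rooted at `tree`."""
    kind, children = tree
    counts = Counter({kind: 1})
    for child in children:
        counts.update(characteristic_vector(child))
    return counts

def euclidean_distance(u, v):
    """Euclidean distance between two sparse count vectors."""
    keys = set(u) | set(v)
    return sqrt(sum((u[k] - v[k]) ** 2 for k in keys))

def relative_distance(u, v):
    """Distance scaled by the combined vector size (an assumed normalization)."""
    size = sum(u.values()) + sum(v.values())
    return euclidean_distance(u, v) / size if size else 0.0

# x <- a + b
t1 = node("assign", node("symbol"),
          node("binary_op", node("symbol"), node("symbol")))
# y <- c + d  (same shape, different identifiers)
t2 = node("assign", node("symbol"),
          node("binary_op", node("symbol"), node("symbol")))
# z <- f(a + b, 1)  (structurally different)
t3 = node("assign", node("symbol"),
          node("call", node("symbol"),
               node("binary_op", node("symbol"), node("symbol")),
               node("constant")))

v1, v2, v3 = (characteristic_vector(t) for t in (t1, t2, t3))
print(euclidean_distance(v1, v2))  # identical shapes give distance 0
print(relative_distance(v1, v3))   # larger for structurally different code
```

Because the vectors ignore identifier names, the first two assignments are indistinguishable at this stage, which is why a finer-grained comparison is useful as a second step.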

A systematic mutation test by Roy et al. is adapted to evaluate RClone's performance. The tool is also applied to 43K SLOC of production source code from three R libraries: ggplot2, broom, and knitr. RClone effectively detected useful Type-1, Type-2, and Type-3 clones across the production source code. A sensitivity analysis, based on the broom library, suggests optimal thresholds of 7.5% for vector distance and 67% for least common sequence. The evaluation also reveals opportunities to decrease the false positive rate and to prune the output to the most useful results.
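One way the two thresholds might interact is sketched below, under several stated assumptions. The tokenizer, the `KEEP_AS_IS` word list, and the function names are hypothetical. Python's `difflib.SequenceMatcher` ratio stands in for the sequence measure used in the report, and the direction of each threshold (distance at most 7.5%, sequence similarity at least 67%) is assumed. The relative vector distance is taken as an already computed input.

```python
import re
from difflib import SequenceMatcher

# Words left untouched by this sketch; an illustrative, incomplete list.
KEEP_AS_IS = {"function", "if", "else", "for", "while", "repeat", "in",
              "TRUE", "FALSE", "NULL", "NA"}

# Numbers, R-style identifiers (dots allowed), the assignment arrow,
# then any other single non-space character.
TOKEN = re.compile(r"\d+(?:\.\d+)?|[A-Za-z.][A-Za-z0-9._]*|<-|\S")

def normalize(source):
    """Replace identifiers with ID and numeric literals with NUM."""
    out = []
    for tok in TOKEN.findall(source):
        if tok in KEEP_AS_IS:
            out.append(tok)
        elif tok[0].isdigit():
            out.append("NUM")
        elif tok[0].isalpha() or tok[0] == ".":
            out.append("ID")
        else:
            out.append(tok)
    return out

def sequence_similarity(code_a, code_b):
    """Similarity in [0, 1] between the normalized token sequences."""
    a, b = normalize(code_a), normalize(code_b)
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def is_clone(vector_distance, code_a, code_b,
             max_distance=0.075, min_similarity=0.67):
    """Accept a candidate pair only if it passes both (assumed) thresholds."""
    if vector_distance > max_distance:
        return False
    return sequence_similarity(code_a, code_b) >= min_similarity

a = "total <- sum(x) / length(x)"
b = "avg <- sum(v) / length(v)"
c = "for (i in 1:10) print(i)"

print(normalize(a))
print(sequence_similarity(a, b))  # renamed identifiers no longer matter
print(is_clone(0.0, a, b))
print(is_clone(0.0, a, c))
```

Filtering on the cheap vector distance first and running the token comparison only on surviving candidates keeps the more expensive step off most pairs.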
