SimilaCode: Programming Source Code Similarity Detection System Based on NLP

Diego Vallejo Huanga, Jair Morocho, Juan Salgado

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

2 Scopus citations

Abstract

Several tools have been developed in the scientific field to detect similarities in texts; however, some of this software is not very effective at detecting plagiarism in programming source code. In computing, cases of source code plagiarism are to be expected, and tools that measure the degree of similarity exist, but they require paid licenses. This article proposes a system that uses Natural Language Processing (NLP), vector space models, and similarity metrics to identify the degree of divergence between pairs of source codes written in the Python programming language, with the possibility of extending its applicability to other programming languages. The proposed system is structured in several modules, each with a specific function in either the back-end or the front-end of the prototype, which is deployed on the web. The experiments used pairs of source codes that had been modified at a linguistic and structural level. The results show that our system, SimilaCode, reports 100% similarity between source code pairs whose comments have been changed. The system also identified similarities when the names of variables and functions had been modified, reaching similarity levels higher than 88%. In addition, SimilaCode was compared with two other plagiarism detection tools, and the similarity results differed by less than 1% between the tools. The experiments with SimilaCode yielded satisfactory results, demonstrating the system's effectiveness in detecting similarities in the analyzed source codes.
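
The general idea named in the abstract (a vector space model combined with a cosine similarity metric) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `code_tokens`, `cosine_similarity`, and `similarity` are hypothetical, and the choice of Python's standard `tokenize` module with raw token counts as the vector representation is an assumption made only for illustration.

```python
import io
import math
import tokenize
from collections import Counter

# Token types that carry no program logic (assumption: comments and layout are ignored).
_SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def code_tokens(source: str) -> list[str]:
    """Return the lexical tokens of Python source, dropping comments and layout tokens."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type not in _SKIPPED]


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def similarity(src_a: str, src_b: str) -> float:
    """Similarity of two Python sources in the range [0, 1]."""
    return cosine_similarity(Counter(code_tokens(src_a)),
                             Counter(code_tokens(src_b)))


original = "def add(a, b):\n    # sum two numbers\n    return a + b\n"
recommented = "def add(a, b):\n    # adds the operands\n    return a + b  # result\n"
renamed = "def total(x, y):\n    return x + y\n"

print(similarity(original, recommented))
print(similarity(original, renamed))
```

Because comments are discarded before the vectors are built, the pair that differs only in its comments scores 1.0 in this sketch (up to floating-point rounding). The renamed pair scores lower, since raw token counts treat every identifier as a distinct term; a real system would need some form of identifier normalization to approach the figures reported in the abstract.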

Original language: English
Title of host publication: Proceedings - 2023 15th International Congress on Advanced Applied Informatics Winter, IIAI-AAI-Winter 2023
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 171-178
Number of pages: 8
ISBN (Electronic): 9798350383829
State: Published - 2023
Event: 15th International Congress on Advanced Applied Informatics Winter, IIAI-AAI-Winter 2023 - Bali, Indonesia
Duration: 11 Dec 2023 - 13 Dec 2023

Publication series

Name: Proceedings - 2023 15th International Congress on Advanced Applied Informatics Winter, IIAI-AAI-Winter 2023

Conference

Conference: 15th International Congress on Advanced Applied Informatics Winter, IIAI-AAI-Winter 2023
Country/Territory: Indonesia
City: Bali
Period: 11/12/23 - 13/12/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

  • Code Clone
  • Code Plagiarism
  • Programming Languages
  • Python
  • Vector Cosine Model
