Maggio, Valerio (2013) Improving Software Maintenance using Unsupervised Machine Learning techniques. [Doctoral thesis]

maggio_valerio_25.pdf

Item Type: Doctoral thesis
Language: English
Title: Improving Software Maintenance using Unsupervised Machine Learning techniques
Creators: Maggio, Valerio (valerio.maggio@unina.it)
Date: 2 April 2013
Number of Pages: 200
Institution: Università degli Studi di Napoli Federico II
Department: Matematica e applicazioni "Renato Caccioppoli"
Doctoral school: Scienze matematiche e informatiche
Doctoral programme: Scienze computazionali e informatiche
Doctoral cycle: 25
Doctoral programme coordinator: Moscariello, Gioconda (gioconda.moscariello@unina.it)
Tutors: Di Martino, Sergio (sergio.dimartino@unina.it); Corazza, Anna (anna.corazza@unina.it)
Uncontrolled Keywords: Software Maintenance; Machine Learning; Software Remodularisation; Clone Detection; Code Normalisation; Kernel Methods; Unsupervised Learning; Expectation-Maximisation; Maximum Likelihood Estimation; Source code analysis
MIUR scientific-disciplinary sectors: Area 01 - Scienze matematiche e informatiche > INF/01 - Informatica
Date Deposited: 04 Apr 2013 11:19
Last Modified: 10 Dec 2014 14:10
URI: http://www.fedoa.unina.it/id/eprint/9079
DOI: 10.6092/UNINA/FEDOA/9079

Abstract

Software maintenance is an essential step in the evolution of software systems and represents one of the most expensive, time-consuming, and challenging phases of the whole development process. In particular, the cost and effort required for both maintenance and evolution operations (e.g., corrective, adaptive, etc.) are mainly related to the effort necessary to comprehend the system and its source code. As a consequence, many "reverse engineering" tools and solutions have been proposed to support maintainers in their activities. An important resource for maintainers is the architectural information of the system. However, such information is usually not documented, or the documentation is outdated. Therefore, the existing code remains the most up-to-date source of information to exploit in order to automatically retrieve and reconstruct the architecture of a system. Many research efforts are being devoted to supporting this task, in order to define solutions that are able to "re-modularise" a given software application. The main purpose of re-modularisation techniques is to automatically partition the system into meaningful subsystems, in order to locate and group together software components that are in some way related, e.g., because they implement the same functionality. Many of these approaches attempt to discover these groups (or clusters) by exploiting the lexical information provided in the source code, such as terms in comments and the names of identifiers (e.g., variables, methods, and classes). Nevertheless, the source code lexicon has some specific peculiarities that make it conceptually different from a typical textual resource: identifiers are often created by concatenating multiple words (e.g., getAttribute, MINHEIGHT), which may be additionally shortened (e.g., getAttr, MINHGT) to avoid long names.
As a consequence, tools and techniques that analyse the source code lexicon must integrate algorithms to "normalise" its vocabulary. Another well-known and widely investigated issue in software maintenance is "clone detection", which focuses on the identification of source code duplications. Software clones might affect the reliability and the maintainability of large software systems. For example, an error affecting a fragment of code must be fixed in every one of its duplicates. Clones are usually not documented, and their identification is usually complicated because programmers adapt the copied code by applying multiple modifications (e.g., adding new statements and renaming variables). Therefore, automatic and reliable approaches are required in order to tackle this problem. In this thesis we propose new Machine Learning (ML) based approaches that mine the relevant information directly from the source code to cope with the three issues introduced above, namely software re-modularisation, source code vocabulary normalisation, and clone detection. In particular, the proposed contributions leverage the benefits of ML algorithms, which have been properly tailored and customised in order to make them suitable for the considered domain. All the presented approaches have been extensively assessed with empirical evaluations conducted on large software systems, and the results have been compared with other related techniques whenever possible. The results achieved outperform the state-of-the-art solutions for all three considered problems, thus confirming the benefits derived from the definition and application of ML algorithms to maintenance tasks.
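
To illustrate why the identifier examples in the abstract call for vocabulary normalisation, here is a minimal Python sketch of a purely rule-based splitter. It is not the approach proposed in the thesis; the function name `split_identifier` and the regular expression are assumptions made only for this illustration. The sketch separates words on underscores and case changes, which works for getAttribute but cannot split MINHEIGHT or expand the shortened forms getAttr and MINHGT.

```python
import re


def split_identifier(name):
    """Naively split an identifier on separators and case boundaries.

    Hypothetical helper for illustration only: it uses no dictionary and
    no learning, so it cannot split same-case concatenations or expand
    abbreviations.
    """
    words = []
    # First break on underscores and any other non-alphanumeric characters.
    for chunk in re.split(r"[_\W]+", name):
        # Then pick out, in order of preference: an acronym followed by a
        # capitalised word, a (capitalised) lowercase word, a run of
        # uppercase letters, or a run of digits.
        words.extend(
            re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", chunk)
        )
    return [w.lower() for w in words]


print(split_identifier("getAttribute"))  # ['get', 'attribute']
print(split_identifier("MIN_HEIGHT"))    # ['min', 'height']
print(split_identifier("MINHEIGHT"))     # ['minheight'] (not split)
print(split_identifier("getAttr"))       # ['get', 'attr'] (not expanded)
print(split_identifier("MINHGT"))        # ['minhgt'] (neither split nor expanded)
```

The last three calls show the gap that a normalisation step has to close: recovering "min" and "height" from MINHEIGHT, and mapping "attr" and "hgt" back to full words, requires knowledge that case and separator rules alone do not provide.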
