Kaynak Kodlardaki Kötü Kokuların Analizinde Hata Komşuluk Etkisi

Rahime Belen Sağlam1, Özkan Kılıç2, and Yusuf Şevki Günaydın3

1 Ankara Yıldırım Beyazıt University, Ankara, Türkiye

2 Ankara Yıldırım Beyazıt University, Ankara, Türkiye

3 Ankara Yıldırım Beyazıt University, Ankara, Türkiye

Özet (Abstract). Code smells in source code are design problems that must be resolved in order to maintain code quality. Many tools and techniques for detecting code smells have been proposed in the literature; however, these techniques tend to produce false positives that require no action. Consequently, further methods have been developed to identify, among the detected smells, those that do require action. These approaches rest on the assumption that every alert resolved before the latest revision is actionable, regardless of its priority. In this study, we analysed resolved alerts and tried to uncover the factors developers use when selecting alerts to resolve. Focusing on resolved low priority alerts, we examined the other alerts located in the same class and observed that developers tend to resolve low priority alerts when they occur in the same class as high priority ones. These findings can improve the accuracy of research on actionable alert identification by revising the oversimplified assumption made in the literature.

Anahtar Kelimeler (Keywords): Software Quality, Machine Learning, Code Smells, Actionable Alerts, Classification.

Bug Neighbourhood Effect on Analysis of Code Smells in Source Codes

Rahime Belen Sağlam1, Özkan Kılıç2, and Yusuf Şevki Günaydın3

1,2,3 Ankara Yıldırım Beyazıt University, Ankara, Türkiye

Abstract. Code smells are indicators of design flaws or problems in source code that need to be resolved to maintain code quality. In the literature, various tools and techniques have been proposed to detect code smells; however, those techniques tend to generate spurious false positive warnings, which do not require any action. Consequently, further approaches have been developed to identify the actionable alerts among them. Those approaches rely on a naive assumption: each alert that is resolved before the latest revision is assumed to be actionable, regardless of its priority. In this study, we analysed alerts that were resolved by developers and tried to uncover the factors that professional developers use to select code smells to resolve. Focusing on the resolved low priority alerts, we investigated the other smells located in the same classes (i.e., collocated smells) and observed a tendency of developers to resolve low priority issues when they co-occur with high priority ones in the same file. In addition, by applying classification algorithms with features that capture information about the neighborhood of an alert, we observed that the accuracy of actionable alert detection significantly increases.

Keywords: Software Quality, Machine Learning, Code Smells, Actionable Alerts, Classification

1 Introduction

The quality of program source code has long been one of the key concerns in software engineering, and code smells [8], symptoms of decay in code quality, have received great attention from both academic researchers and industry. One way to improve the quality of source code and to reduce bugs is code review, in which other developers review changes. However, this approach relies on human inspection and requires considerable effort. In this context, static code analysis tools offer developers an excellent opportunity by analysing the code automatically and providing unbiased, objective results.

Static code analysis tools analyse source code with the aim of highlighting potential issues that can arise in a software system. Automatically detected issues help developers maintain high-quality software by flagging problems such as uninitialised or unused variables, empty catch blocks, and poorly commented or poorly organized code (overly long lines or methods). Over the years, different tools have been proposed with different facilities, including automatically checking code style, detecting bugs and prioritizing them. Studies on static code analysis have now matured enough that industrial-grade tools are commonly used in both open source and industrial projects.

Despite their advantages, the main challenge complicating the adoption of these tools is spurious false positive warnings (i.e., alerts that are not actual issues requiring an action). Taking the program as input, a static analysis tool outputs a list of alerts, where an alert indicates that a check the tool performs is violated. When an alert is reported, the developers may choose to act on it or to ignore it. In this context, the number of generated alerts matters if a developer is to make the right decision for each alert. However, static analysis tools often over-estimate possible program behaviors and generate alerts that do not correspond to true defects [9]. Kremenek et al. [13] report that at least 30% of the warnings reported by sophisticated tools are false positives. In order to improve the output of static analysis tools, several studies have sought to detect ‘actionable alerts’, that is, true positives that need to be resolved by developers. In those studies, researchers proposed several post-processing methods that classify or rank the list of generated alerts [1] [2] [4] [5] [14] [15], and an actionable alert is defined as one that will be acted upon.

Because obtaining actionable alerts through developer feedback is expensive, Heckman et al. proposed a naive approach to classifying alerts: an alert that is still open in the latest revision is assumed to be un-actionable, whereas all others, which disappear before the latest revision, are assumed to be actionable, unless the file containing the alert disappeared [10]. Alerts that belong to a removed file are discarded. This definition has been commonly accepted and used in several studies. Allier et al. slightly changed the definition of un-actionable alerts. Arguing that not every alert left unresolved in the latest revision can be defined as un-actionable, they defined them as discarded [2]. However, they kept the definition of actionable alerts the same.

Setting aside the definition of non-actionable alerts, in this study we focused on the definition of actionable alerts and analysed the risk of labelling every alert that is acted upon before the latest revision as actionable. We argue that a developer's decision to act on an alert may depend on the other alerts generated for the same file, and that the action taken does not necessarily mean that the alert is important enough to be resolved. Most static code analysis tools provide some details about each alert, including its type and priority. Engineers generally focus on higher-priority alerts because finding important bugs is the primary goal [14], and they tend to ignore low priority alerts. This practice is in line with the assumption made in the actionable alert definition. However, the definition leads all resolved alerts to be labelled as actionable regardless of priority. We argue that the low priority alerts resolved by developers may cause misleading results in studies based on this definition. The definition also makes low priority issues less likely to be detected as non-actionable, which makes it harder to reduce false positives. Consequently, in this study, we focused on resolved low priority issues and analysed possible reasons behind the action taken by the developer. We argue that developers tend to resolve low priority issues when they co-exist with high priority ones in the same file, while they may ignore them when they appear on their own.

In order to test this hypothesis, we collected alerts from different versions of two open source platforms, namely FIWARE and JDOM, and extracted a set of features for each alert. We designed two experiments, in one of which we expanded the feature set with the proposed features related to neighborhood information. We ran different machine learning based classification algorithms to detect actionable alerts and observed that the success rate of actionable alert identification increases significantly when neighborhood information is used as an additional feature.

2 Data set

We used the PMD static code analysis tool to collect alerts from two open source platforms, FIWARE and JDOM. PMD runs a set of static code analysis rules on source code files and generates a list of problems that violate those rules [3].

FIWARE is an open source framework that provides API modules, called Generic Enablers (GEs), for rapidly developing smart solutions. Generic Enablers are a set of building blocks that make it easy to develop smart Internet applications. In this study, we focused on one of the GEs, AuthZForce, which has 12 versions released between 2016 and 2018. This GE is written in Java [7]. Similarly, JDOM is an open-source Java-based document object model for XML, with 8 versions released between 2011 and 2013 [11].

In order to obtain alerts that were resolved or ignored by the developers, we first ran static code analysis on each version of the projects. With the alerts generated by PMD for each version in hand, we defined an approach to determine whether a specific alert generated for a specific version was resolved in later versions. For this purpose, we took each pair of consecutive releases and compared all the alerts in the same methods between the two versions. If an alert with the same priority and the same rule ID is not detected in the later versions of a method, we assumed that the alert was resolved, which indicates that it was actionable. The remaining alerts were assumed to be non-actionable. The following 9 features were extracted for each alert.

· Priority: An integer value between 1 and 5 produced by the static code analysis tool, where 1 indicates the highest priority and 5 the lowest.

· Line: An integer value indicating the line number of the alert within a file.

· FileLOC: The total lines of code of the file containing the alert.

· Line/FileLOC: The line number of the alert in the file divided by the total lines of code in that file.

· RuleSet: A string value stating the category of the violated rule. PMD has 8 categories, including Code Style, Error Prone, Security and Performance.

· Rule: A string value stating the type of the alert. There are more than 100 rule types in PMD.

· VersionNumber: An integer value stating the version of the platform in which an alert arises. We collected data from 12 versions of FIWARE and 8 versions of JDOM.

· ErrBtw2Version: An integer value stating the number of errors between two consecutive versions.

· DayBtw2Version: An integer value stating the number of days between the release dates of two consecutive versions.
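The labelling rule described before the feature list (an alert counts as resolved when no alert with the same rule ID and priority remains in the same method of the later version) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Alert` record, the `label_alerts` helper, and the file, method and rule names are hypothetical and do not come from the study, and repeated occurrences of the same rule within one method are collapsed into a single entry.

```python
from typing import NamedTuple


class Alert(NamedTuple):
    # Hypothetical record; the study's actual data layout is not given.
    file: str
    method: str
    rule: str
    priority: int


def label_alerts(older, newer):
    """Map each alert of the older version to True if it looks resolved.

    An alert is treated as resolved (actionable) when the newer version
    contains no alert with the same file, method, rule and priority.
    """
    still_present = set(newer)
    return {alert: alert not in still_present for alert in older}


# Made-up alerts for two consecutive versions.
v1 = [
    Alert("A.java", "parse", "RULE_X", 3),
    Alert("A.java", "parse", "RULE_Y", 5),
    Alert("B.java", "run", "RULE_Y", 5),
]
v2 = [
    Alert("B.java", "run", "RULE_Y", 5),
]

labels = label_alerts(v1, v2)
for alert, resolved in labels.items():
    print(alert.file, alert.rule, "actionable" if resolved else "non-actionable")
```

Matching on a set keeps the comparison presence-based, which mirrors the rule as stated; a real pipeline would also need to handle renamed methods and removed files, which this sketch ignores.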

We have run classification algorithms using these features. Then, we applied the same algorithms on our new feature set that has been expanded by 10 more features that reflect the neighborhood between alerts:

· SP[1/2/3/4/5]: Five integer-valued features indicating the number of resolved alerts with priority values 1 to 5 within the same file.

· UP[1/2/3/4/5]: Five integer-valued features indicating the number of unresolved alerts with priority values 1 to 5 within the same file.
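A minimal sketch of how the ten neighborhood features could be derived is given below. It assumes each alert is a dictionary with `file`, `priority` and `resolved` keys (a hypothetical layout), and it assumes that an alert is not counted in its own neighborhood; the text does not state whether the original feature extraction includes or excludes the alert itself.

```python
from collections import Counter


def neighborhood_features(alerts):
    """Return one dict of SP1..SP5 and UP1..UP5 counts per alert.

    alerts: list of dicts with keys 'file', 'priority' (1-5), 'resolved' (bool).
    """
    # Count alerts per (file, priority, resolved) combination once.
    counts = Counter((a["file"], a["priority"], a["resolved"]) for a in alerts)
    rows = []
    for a in alerts:
        row = {}
        for p in range(1, 6):
            row[f"SP{p}"] = counts[(a["file"], p, True)]
            row[f"UP{p}"] = counts[(a["file"], p, False)]
        # Assumption: leave the alert itself out of its own neighborhood.
        own = ("SP" if a["resolved"] else "UP") + str(a["priority"])
        row[own] -= 1
        rows.append(row)
    return rows


# Made-up alerts: one file with mixed priorities, one file with a lone alert.
alerts = [
    {"file": "A.java", "priority": 1, "resolved": True},
    {"file": "A.java", "priority": 5, "resolved": True},
    {"file": "A.java", "priority": 5, "resolved": False},
    {"file": "B.java", "priority": 5, "resolved": False},
]
rows = neighborhood_features(alerts)
print(rows[1])  # features of the resolved priority-5 alert in A.java
print(rows[3])  # features of the lone alert in B.java
```

Excluding the alert itself keeps its own label out of its feature vector; if the original features did include it, the subtraction step would simply be dropped.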

We worked on 22,933 alerts generated by PMD for FIWARE, of which 1,029 were resolved and 21,904 were not. Similarly, a total of 186,071 alerts were detected for JDOM, of which 4,825 were resolved by the developers while 181,246 were ignored.

3 Methodology and Results

In order to test our hypothesis, we focused on low priority alerts that were resolved by the developers and counted the higher priority alerts in the same file. The results for the JDOM and FIWARE data are shown in Table 1 and Table 2, respectively. The tables show that most of the resolved low priority alerts appear in the same file as higher priority alerts. This observation constitutes the basis of the current study and motivated us to apply classification algorithms to features that include neighbourhood information.

Table 1. Bug Neighbourhood on JDOM Data

                       Actionable Alerts                     Nonactionable Alerts
                       Priority 1  Priority 2  Priority 3    Priority 1  Priority 2  Priority 3
Actionable Priority 4          31           1         236             6          32         104
Actionable Priority 5         208          38        2889           138         120        2343

Table 2. Bug Neighbourhood on FIWARE Data

                       Actionable Alerts                     Nonactionable Alerts
                       Priority 1  Priority 2  Priority 3    Priority 1  Priority 2  Priority 3
Actionable Priority 5           0           5         718            39          34         817

We ran the k-Nearest Neighbors (kNN), Support Vector Machine (SVM), and Artificial Neural Network (ANN) algorithms on the four data files (each of the two projects, with and without the neighborhood features), as explained in the previous section. For kNN, we empirically selected k = 5 and used Euclidean distance for classification. The SVM was trained with a Gaussian kernel, with the aim of minimizing both the estimation and approximation errors. Finally, the ANN had 3 hidden layers with the ReLU activation function and a single output neuron with a sigmoid function. The network was fully connected.
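As an illustration of the kNN configuration described above (k = 5, Euclidean distance), here is a minimal pure-Python sketch. The text does not say which library or feature scaling was used, so this is not the study's implementation: `knn_predict` is a hypothetical helper, and the two-dimensional feature vectors (loosely, a priority and a neighbor count) and their labels are made-up toy data.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Majority vote among the k training points closest to query."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy training set: label 1 = actionable, 0 = non-actionable.
train_X = [(5, 0), (5, 0), (5, 1), (5, 2), (4, 3), (5, 3), (4, 0)]
train_y = [0, 0, 1, 1, 1, 1, 0]

with_neighbors = knn_predict(train_X, train_y, (5, 2))
without_neighbors = knn_predict(train_X, train_y, (5, 0))
print(with_neighbors, without_neighbors)
```

With an odd k and two classes the vote can never tie, which is one practical reason to prefer values such as 5.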

For all experiments, 80% of the data were used to train the classifiers and the remaining 20% were held out for testing. The following tables report the precision, recall and F1 score for the experiments.

Table 3. Experiment Results on FIWARE Data

             Without neighborhood info      With neighborhood info
             KNN      SVM      ANN          KNN      SVM      ANN
precision    77.51%   49.76%   63.64%       86.60%   72.25%   89.00%
recall       85.26%   95.41%   85.81%       89.60%   90.96%   86.11%
f-1 score    81.20%   65.41%   73.08%       88.08%   80.53%   87.53%

Table 4. Experiment Results on JDOM Data

             Without neighborhood info      With neighborhood info
             KNN      SVM      ANN          KNN      SVM      ANN
precision    74.16%   61.85%   65.82%       80.98%   64.70%   79.65%
recall       84.96%   87.10%   70.25%       90.56%   93.12%   82.95%
f-1 score    79.20%   72.34%   67.96%       85.50%   76.35%   81.27%

The results indicate that when bug neighborhood information is added as additional features, precision and F1 score improve for every classifier on both data sets. Recall also improves in every case except the SVM on FIWARE, where it drops from 95.41% to 90.96% while precision rises from 49.76% to 72.25%. The k-nearest neighbors algorithm achieved the highest F1 score in both experiments.
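The F1 scores in Tables 3 and 4 can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The short script below recomputes each value; the tuples are copied from the two tables, while the 0.05 percentage-point tolerance is an assumption chosen to absorb the rounding of the published figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Tables 3 and 4.
reported = {
    ("FIWARE", "KNN", "without"): (77.51, 85.26, 81.20),
    ("FIWARE", "SVM", "without"): (49.76, 95.41, 65.41),
    ("FIWARE", "ANN", "without"): (63.64, 85.81, 73.08),
    ("FIWARE", "KNN", "with"): (86.60, 89.60, 88.08),
    ("FIWARE", "SVM", "with"): (72.25, 90.96, 80.53),
    ("FIWARE", "ANN", "with"): (89.00, 86.11, 87.53),
    ("JDOM", "KNN", "without"): (74.16, 84.96, 79.20),
    ("JDOM", "SVM", "without"): (61.85, 87.10, 72.34),
    ("JDOM", "ANN", "without"): (65.82, 70.25, 67.96),
    ("JDOM", "KNN", "with"): (80.98, 90.56, 85.50),
    ("JDOM", "SVM", "with"): (64.70, 93.12, 76.35),
    ("JDOM", "ANN", "with"): (79.65, 82.95, 81.27),
}

TOLERANCE = 0.05  # assumed allowance for rounding in the published tables

mismatches = [
    key
    for key, (p, r, f1) in reported.items()
    if abs(f1_score(p, r) - f1) > TOLERANCE
]
print("inconsistent entries:", mismatches)
```

An empty list means every reported F1 score agrees with its precision and recall to within the tolerance.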

4 Conclusion

In this study, we aimed to investigate the effect of bug neighborhood on developers as they select alerts to resolve. For this purpose, alerts from different versions of two open source platforms, FIWARE and JDOM, were collected. The distribution of the resolved low priority alerts was investigated, and it was observed that the majority of them co-exist with high priority alerts in the same file. To build on these findings, we proposed features that reflect the neighborhood information between alerts of different priorities and ran experiments to classify alerts. Classifiers identified actionable alerts better when neighborhood information was included in the features, which may be due to better detection of the low priority issues that developers resolve.

Machine learning methods have shown impressive success in predicting human decisions when trained on large amounts of data [6] [12]. Software developers are no exception.

When a developer decides to act on an alert, he or she generally starts with a high priority one. The developer searches for and locates the alert, and then implements a change to eliminate it. These steps account for most of the cost of acting on an alert. This study indicates that software developers tend to resolve low priority alerts in the vicinity of high priority ones, which can be explained by the low effort required to fix them. Once a developer has opened a file with high priority alerts, he or she can resolve the low priority ones there with relatively little effort compared with alerts in files that have not been opened and edited. The developer may therefore perceive that resolving a low priority alert in the same context would not take much effort. Finally, we recommend that static code analysis tools take bug neighborhood into account when prioritizing alerts.

References

1. A. Aggarwal and P. Jalote. Integrating static and dynamic analysis for detecting vulnerabilities. In 30th Annual International Computer Software and Applications Conference (COMPSAC’06), volume 1, pages 343–350. IEEE, 2006.

2. S. Allier, N. Anquetil, A. Hora, and S. Ducasse. A framework to compare alert ranking algorithms. In 2012 19th Working Conference on Reverse Engineering, pages 277–285. IEEE, 2012.

3. PMD Source Code Analyzer. PMD: An extensible cross-language static code analyzer, 2019. Accessed: 2019-06-23.

4. C. Boogerd and L. Moonen. Prioritizing software inspection results using static profiling. In 2006 Sixth IEEE International Workshop on Source Code Analysis and Manipulation, pages 149–160. IEEE, 2006.

5. C. Csallner and Y. Smaragdakis. Check’n’crash: combining static checking and testing. In Proceedings of the 27th international conference on Software engineering, pages 422–431. ACM, 2005.

6. D. Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.

7. FIWARE Foundation. FIWARE: The open source platform for our smart digital future, 2019. Accessed: 2019-06-23.

8. M. Fowler, K. Beck, J. Brant, W. Opdyke, and D. Roberts. Refactoring: improving the design of existing code. Addison-Wesley Professional, 1999.

9. S. Heckman and L. Williams. A model building process for identifying actionable static analysis alerts. In 2009 International Conference on Software Testing Verification and Validation, pages 161–170. IEEE, 2009.

10. S. Heckman and L. Williams. A comparative evaluation of static analysis actionable alert identification techniques. In Proceedings of the 9th International Conference on Predictive Models in Software Engineering, page 4. ACM, 2013.

11. JDOM. JDOM, 2019. Accessed: 2019-06-23.

12. J. Kleinberg, H. Lakkaraju, J. Leskovec, J. Ludwig, and S. Mullainathan. Human decisions and machine predictions. Q. J. Econ., 133:237–293, 2017.

13. T. Kremenek and D. Engler. Z-ranking: Using statistical analysis to counter the impact of static analysis approximations. In International Static Analysis Symposium, pages 295–315. Springer, 2003.

14. J. R. Ruthruff, J. Penix, J. D. Morgenthaler, S. Elbaum, and G. Rothermel. Predicting accurate and actionable static analysis warnings: an experimental approach. In Proceedings of the 30th international conference on Software engineering, pages 341–350. ACM, 2008.

15. H. Shen, J. Fang, and J. Zhao. Efindbugs: Effective error ranking for findbugs. In 2011 Fourth IEEE International Conference on Software Testing, Verification and Validation, pages 299–308. IEEE, 2011.