Research Article

A Framework to Analyze Business Process Log in XML Format

Ang Jin Sheng1*, Jastini Mohd Jamil2, Izwan Nizal Mohd Shaharanee3

1,2,3 Department of Decision Science, School of Quantitative Sciences, Universiti Utara Malaysia, 06010 UUM Sintok, Kedah, Malaysia.

3 Othman Yeop Abdullah Graduate School of Business, Universiti Utara Malaysia, 06010 UUM Sintok, Kedah, Malaysia.

[email protected]*, [email protected], [email protected]

Article History: Received: 10 November 2020; Revised: 12 January 2021; Accepted: 27 January 2021; Published online: 05 April 2021

Abstract: XML is used in a wide variety of web pages and applications. Common uses include web publishing, web searching and automation, and general applications such as storing, transferring and displaying business process log data. The amount of information expressed in XML has grown rapidly, and much work has been done on practical approaches to handling and reviewing XML documents. Mining XML documents offers a way to understand both their structure and their content. A common approach to analysing XML documents is frequent subtree mining, a data mining technique that finds relationships between transactions in a tree-structured database. Because of the structure and content of the XML format, traditional data mining and statistical analysis techniques can hardly be applied directly to obtain accurate results. This paper proposes a framework that flattens tree-structured data into flat, structured data while preserving its structure and content. Converting XML documents into relational structured data allows a range of data mining techniques and statistical tests to be applied to extract more information from the business process log.

Keywords: Flatten Sequential Structure Model, Business Process Log, XML, Statistical Analysis, Data Mining

1. Introduction

The volume of eXtensible Markup Language (XML) data is increasing exponentially every day, together with modern technologies and techniques for storing and querying such data. Furthermore, XML is one of the standard formats for exchanging and representing data on the Internet (Bača et al., 2017). XML documents therefore make up a large share (58%) of the web (Hakawati, Yacob, Raof, Jabiry, & Alhudiani, 2020). Business processes are also recorded in XML format.

A business process is a structured series of events carried out in a business with the goal of producing a desired result (Aguilar-Saven, 2004). Business process management systems (BPMS) have emerged to support these processes, and they produce business process logs, or event logs, while a business process is performed. These event logs are usually saved in XML format (Kim, Yeon, Jeong, & Kim, 2017; Mannhardt, 2016) and can be mined and analysed to obtain more insight into the business processes (Van der Aalst et al., 2003). For instance, business processes in the Halal industry that are logged in XML format are monitored to ensure the quality of Halal products (Belkhatir, Bala, & Belkhatir, 2020). However, the complexity of the data structure and content of the XML format makes the mining process more difficult (Romei & Turini, 2010).

To mine such XML documents, frequent subtree mining (FSM) has been developed over the past years (Chi, Muntz, Nijssen, & Kok, 2005; Li, Xu, & Liu, 2019; Zaki, 2005a). FSM is used to discover interesting relationships between patterns in a tree database and generates rules based on a minimum support set by the user. However, the usefulness of FSM decreases when the generated rules are not interesting or useful. Shaharanee and Jamil (2015) suggest that filtering out irrelevant variables can prevent uninteresting rules from being generated. Furthermore, the structural properties of XML documents are usually ignored by FSM. In short, a framework that accounts for the structural properties of XML and allows statistical analysis to be applied to XML data is needed to extract more information from business process logs.

2. Related Works

2.1. Business Process

A business process is a method that outlines how specific tasks are achieved in a firm (Davenport & Short, 1990). Several researchers (Guha, Grover, Kettinger, & Teng, 1997; Lindsay, Downs, & Lunn, 2003; Trkman, 2010) regard a business process as a complete set of events and tasks for achieving business objectives and targets. Thus, business processes play an important role in understanding how a company works (Weske, 2007). To ease routine business processes, business process management systems (BPMS) were developed to handle and execute operational business processes such as invoice management, account management and customer relationship management (Van der Aalst, 2013).

2.2. Event Logs

An event log is a file that records a set of events produced by a BPMS (Studiawan, Sohel, & Payne, 2020). Event logs in XML format usually follow one of two standards: Mining eXtensible Markup Language (MXML) and eXtensible Event Stream (XES). MXML, introduced in 2003, was the first standard for XML-based business process event logs (Tibeme, Shahriar, & Zhang, 2018). Figure 1 and Figure 2 show an example event log and the meta model of MXML, respectively (van Dongen & Van der Aalst, 2005). XES succeeded MXML in 2009 and became an IEEE standard in 2016, which has made it popular and widely used today (Janes, Maggi, Marrella, & Montali, 2017). Figure 3 and Figure 4 illustrate an example event log and the meta model of the XES format, respectively.

Figure 1. Example of Event Log in MXML format

Figure 3. Example of Event Log in XES format

Figure 4. Example of XES Meta Model

2.3. Frequent Subtree Mining (FSM)

A tree consists of a root, vertices (nodes), vertex labels, edges and edge labels. A directed edge from one vertex to another makes the two vertices parent and child. Because the order of children matters and every node has a label, an XML document can be modelled as an ordered, labelled, rooted tree (Zaki, 2005b). Thus, frequent subtree mining (FSM) can be applied to XML data. FSM is a mining technique that finds all subtrees in a tree-structured database whose support is at least the minimum support threshold (Sadredini, Rahimi, Wang, & Skadron, 2017). Support is the frequency with which an item appears in a dataset. Table 1 summarizes the FSM algorithms developed by past researchers. There are four types of tree structure to mine: free trees, unordered trees, ordered trees and hybrid trees. For ordered trees, the position and order of each element matter in the mining. Several types of frequent subtree mining algorithm have been developed to mine different types of subtree, such as maximal, closed, induced, embedded and phylogenetic.
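To make support and the minimum support threshold concrete, the following minimal sketch counts transaction-based support for the simplest possible subtrees, single parent-child edges, in a tiny tree database. It is an illustration only and not one of the algorithms summarized in Table 1; the nested-tuple tree representation, the function names and the two example trees are assumptions introduced here.

```python
from collections import Counter

# Assumed representation: a tree is a (label, [children]) tuple.

def edges(tree):
    """Return the set of (parent_label, child_label) edges found in one tree."""
    label, children = tree
    found = set()
    for child in children:
        found.add((label, child[0]))
        found |= edges(child)
    return found

def frequent_edges(database, min_support):
    """Count in how many trees each edge occurs and keep the frequent ones."""
    counts = Counter()
    for tree in database:
        # edges() returns a set, so each tree contributes at most once per edge.
        counts.update(edges(tree))
    return {edge: n for edge, n in counts.items() if n >= min_support}

# Two small hypothetical trees.
t1 = ("a", [("b", []), ("c", []), ("d", [])])
t2 = ("a", [("b", [("c", [])]), ("d", [])])

result = frequent_edges([t1, t2], min_support=2)
print(result)
```

With a minimum support of 2, only the edges a-b and a-d are kept, because a-c and b-c each occur in just one of the two trees. Full FSM algorithms extend this idea to larger subtrees and prune candidates that cannot reach the threshold.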

Table 1. FSM Algorithms Summary

Type of tree mining   Algorithm   Authors   Maximal   Closed   Induced   Embedded   Phylogenetic

Free Tree Mining
FreeTreeMiner (Chi, Yang, & Muntz, 2003) X
FreeTreeMiner (Rückert & Kramer, 2004) X
HybridTreeMiner (Chi, Yang, & Muntz, 2004) X
GASTON (Nijssen & Kok, 2004) X
Phylominer (Zhang & Wang, 2007) X
EvoMiner (Deepak, Fernández-Baca, Tirthapura, Sanderson, & McMahon, 2014) X

Unordered Tree Mining
TreeFinder (Termier, Rousset, & Sebag, 2002) X X
uFreqT (Nijssen & Kok, 2003) X
PathJoin (Xiao & Yao, 2003) X X
Unot (Asai, Arimura, Uno, & Nakano, 2003) X
CousinPair (Shasha, Wang, & Zhang, 2004) X
RootedTreeMiner (Chi, Yang, & Muntz, 2005) X
SLEUTH (Zaki, 2005a) X
Uni3 (Hadzic, Tan, & Dillon, 2007) X
BEST (Chowdhury & Nayak, 2014) X
IRTM (Liu & Chen, 2012) X
BOSTER (Chowdhury & Nayak, 2014) X

Ordered Tree Mining
FREQT (Asai et al., 2004) X
Chopper and Xspanner (Wang et al., 2004) X
AMIOT (Hido & Kawano, 2005) X
TreeMiner (Zaki, 2005b) X
MB3-Miner (Chang, Tan, Dillon, Hadzic, & Feng, 2005) X
IMB3-Miner (Tan, Dillon, Hadzic, Chang, & Feng, 2006) X X

Hybrid Tree Mining
CMTreeMiner (Chi, Yang, Xia, & Muntz, 2004) X X X
TRIPS and TIDES (Tatikonda, Parthasarathy, & Kurc, 2006) X X
POTMINER (Jiménez, Berzal, & Cubero, 2010) X X

3. Proposed Framework

This study proposes a new way to analyze and mine business process log data in XML format. The framework converts tree-structured data into flat, structured data, which enables a range of data mining techniques and statistical analyses to be conducted on XML business process logs. The motivation behind the proposed framework is to investigate how data mining and statistical measurement techniques can be combined to arrive at a more reliable and interesting set of rules. Generally speaking, interesting rules are those that have a sound statistical basis and are not redundant. Such an approach requires a sampling process, hypothesis development, model building and, finally, measurement using statistical analysis techniques to verify the usefulness and quality of the discovered rules. This filters out redundant, misleading, random and coincidentally occurring rules. The proposed framework is explained in detail in the phases below.

Phase 1: Pre-processing of XML Data

Real-world data need to be pre-processed before any data mining or statistical analysis is performed, mainly because such data are usually inconsistent, incomplete and noisy (Singhal & Jena, 2013). Extract, transform and load (ETL) is a process that transforms data into the user's desired format before loading it into a target destination. Because business process log data come in a variety of formats, ETL plays an important role in ensuring that the log data are in XES format. Event logs are transformed to XES because XES is the IEEE standard for event logs and is widely used nowadays. Filtering is the process of removing unwanted data. Thus, corrupted transactions and data that are not related to transactions or business processes are filtered out in this phase. After ETL and filtering, the data are ready for the second phase of the proposed framework.
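As a rough illustration of the filtering step, the sketch below drops traces that contain no events or whose events lack an activity name. It is only one possible filtering rule under stated assumptions: the sample log is hypothetical and simplified (no XML namespace, only a few attributes), and the helper names `concept_name`, `is_valid` and `filter_log` are introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified XES-like log. Real XES files usually declare an
# XML namespace and carry more attributes per trace and event.
SAMPLE_LOG = """<log>
  <trace>
    <string key="concept:name" value="case-1"/>
    <event><string key="concept:name" value="register"/></event>
    <event><string key="concept:name" value="approve"/></event>
  </trace>
  <trace>
    <string key="concept:name" value="case-2"/>
  </trace>
  <trace>
    <string key="concept:name" value="case-3"/>
    <event><date key="time:timestamp" value="2020-01-01T00:00:00"/></event>
  </trace>
</log>"""

def concept_name(element):
    """Return the value of the element's own concept:name attribute, if any."""
    for attr in element.findall("string"):  # direct children only
        if attr.get("key") == "concept:name":
            return attr.get("value")
    return None

def is_valid(trace):
    """Assumed rule: a trace needs at least one event, and every event a name."""
    events = trace.findall("event")
    return bool(events) and all(concept_name(e) for e in events)

def filter_log(xml_text):
    """Parse the log and remove traces that fail the validity rule."""
    root = ET.fromstring(xml_text)
    for trace in list(root.findall("trace")):
        if not is_valid(trace):
            root.remove(trace)
    return root

clean = filter_log(SAMPLE_LOG)
kept = [concept_name(t) for t in clean.findall("trace")]
print(kept)
```

Here case-2 is removed because it has no events and case-3 because its only event has no activity name, so only case-1 remains. What counts as a corrupted or unrelated transaction depends on the business process, so the rule in `is_valid` would need to be adapted.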

Phase 2: Flatten Sequential Structure Model (FSSM) Extraction

The Flatten Sequential Structure Model (FSSM) Extraction phase extracts the data from the XML format together with its structural information. In this phase, the data are converted from a tree structure into a flat structure. An example of a tree-structured database with two transactions is shown in Figure 5. The two transactions, labelled t1 and t2, have different structures and illustrate how data are converted from a tree structure into a structured format through FSSM. Through the FSSM extraction phase, the structural properties of every instance in the tree database are preserved and recorded. Table 2 shows the resulting flat data. The tree is read from top to bottom, then left to right. Each '-1' in Table 2 denotes a backtrack: before moving on to the next sibling of a node, a backtrack to its parent node is required.

Figure 5. Tree Structure Database

Table 2. Example of FSSM Extraction Flat Data

Te   x0   x1   x2   x3   x4   x5   x6
t1   a    b    -1   c    -1   d    -1
t2   a    b    c    -1   -1   d    0
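The row for t1 in Table 2 is consistent with a pre-order traversal that emits a backtrack each time it leaves a child node. The sketch below shows one possible reading of that encoding and should not be taken as the authors' implementation; the function name `flatten`, the use of XML element tags as node labels, and the shape of the example tree (root a with children b, c and d, inferred from row t1) are assumptions.

```python
import xml.etree.ElementTree as ET

BACKTRACK = "-1"

def flatten(node):
    """Pre-order encoding: emit the label, then each subtree plus a backtrack."""
    sequence = [node.tag]
    for child in node:                    # children in document (left-to-right) order
        sequence.extend(flatten(child))   # descend into the child's subtree
        sequence.append(BACKTRACK)        # return to the parent before the next sibling
    return sequence

# Assumed shape of transaction t1: root a with three leaf children.
t1 = ET.fromstring("<a><b/><c/><d/></a>")
flat_t1 = flatten(t1)
print(flat_t1)  # ['a', 'b', '-1', 'c', '-1', 'd', '-1']

# Positions become the columns x0, x1, ... of the flat table.
row_t1 = {f"x{i}": value for i, value in enumerate(flat_t1)}
```

Because every descent is matched by a backtrack, the original tree shape can be rebuilt from the sequence, which is how the flat representation keeps the structural information.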

Phase 3: Flatten Sequential Structure Model (FSSM) Conversion

The Flatten Sequential Structure Model (FSSM) Conversion phase transforms the Phase 2 data into a structured table so that more statistical analysis and data mining techniques can be applied. Table 3 shows the structured data obtained from Table 2. After conversion to flat data in Phase 2, the data are grouped by variable to form structured data.

Table 3. Example of FSSM Conversion Structured Data

     a     b     c     d
t1   t1a   t1b   t1c   t1d
t2   t2a   t2b   t2c   t2d
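A minimal sketch of the grouping step, assuming that each node's content is stored as the text of its XML element and that node labels are unique within a transaction, could look as follows. The function names `to_record` and `to_table` and the two example documents are hypothetical; the paper does not specify how node content is stored.

```python
import xml.etree.ElementTree as ET

def to_record(transaction):
    """Map each node label to its content; later duplicates would overwrite."""
    record = {}
    for node in transaction.iter():       # iter() visits nodes in document order
        text = (node.text or "").strip()
        record[node.tag] = text or None   # empty content is stored as None
    return record

def to_table(transactions):
    """Build one row per transaction with one column per node label."""
    records = {tid: to_record(tree) for tid, tree in transactions.items()}
    columns = []
    for record in records.values():
        for label in record:
            if label not in columns:      # keep first-seen column order
                columns.append(label)
    rows = {tid: [rec.get(col) for col in columns] for tid, rec in records.items()}
    return columns, rows

# Hypothetical transactions whose node contents mirror the cells of Table 3.
transactions = {
    "t1": ET.fromstring("<a>t1a<b>t1b</b><c>t1c</c><d>t1d</d></a>"),
    "t2": ET.fromstring("<a>t2a<b>t2b<c>t2c</c></b><d>t2d</d></a>"),
}
columns, rows = to_table(transactions)
print(columns)
print(rows)
```

Although t1 and t2 have different shapes, both end up as rows with the same columns, which is what allows standard statistical tests and data mining tools to be applied. A label missing from a transaction appears as None in that row.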

Phase 4: Knowledge discovery

Depending on the goals the user wants to achieve, different data mining and statistical analysis techniques can be used in this phase. For example, the process instances can be labelled according to different business requirements and then used to train classifiers for prediction purposes. Frequent pattern mining can then be applied to identify the descriptive characteristics of each group. A chi-square test or correlation test can be used to determine the relationship between variables. In most cases of mining XML data, FSM is used to generate rules that reveal more about the structure and nature of the data. However, the rules generated may not be interesting or relevant. Thus, statistical analysis can be performed on the FSSM data to remove unrelated variables. Among the many trees generated by pattern mining and FSM, statistical significance can be used to identify the more significant subtrees.
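As an example of the kind of statistical check this phase allows, the sketch below computes the Pearson chi-square statistic for a contingency table built from two variables of the structured data. It is written in plain Python without a continuity correction; the function name `chi_square` and the counts are hypothetical and serve only to illustrate the calculation.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Hypothetical counts: rows = a node is present / absent in a process instance,
# columns = the instance's assigned class label.
counts = [[10, 20], [20, 10]]
stat, dof = chi_square(counts)
print(round(stat, 3), dof)  # 6.667 1
```

With one degree of freedom the critical value at the 0.05 level is 3.841, so a statistic of about 6.667 would suggest that the two variables are related, and the variable would be kept rather than filtered out.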

Phase 5: Interpretation

In this phase, the results obtained from the previous phases should be analyzed and interpreted in a way that is understandable and actionable for domain experts.

4. Conclusions and Future Work

In conclusion, this paper proposes a framework for analyzing business process log data in XML format. To analyze such semi-structured data, the framework flattens the tree-structured data and converts it into structured data. It thereby enables the direct application of a wide range of statistical analysis and data mining techniques to tree-structured data such as business process event logs. By applying this framework to XML business process logs, decision makers can extract more information or knowledge from them. Thus, better and more accurate decisions can be made to reduce losses and increase profit in a business. In future work, the framework should be evaluated experimentally using simulated data and real-world datasets.

5. Acknowledgement

This research was funded by a grant from the Ministry of Higher Education of Malaysia and Universiti Utara Malaysia.

References

1. Aguilar-Saven, R. S. (2004). Business process modelling: Review and framework. International Journal of production economics, 90(2), 129-149.

2. Asai, T., Abe, K., Kawasoe, S., Sakamoto, H., Arimura, H., & Arikawa, S. (2004). Efficient substructure discovery from large semi-structured data. IEICE TRANSACTIONS on Information and Systems, 87(12), 2754-2763.

3. Asai, T., Arimura, H., Uno, T., & Nakano, S.-I. (2003). Discovering frequent substructures in large unordered trees. Paper presented at the International Conference on Discovery Science.

4. Bača, R., Krátký, M., Holubová, I., Nečaský, M., Skopal, T., Svoboda, M., & Sakr, S. (2017). Structural XML query processing. ACM Computing Surveys (CSUR), 50(5), 1-41.

5. Belkhatir, M., Bala, S., & Belkhatir, N. (2020). Business process re-engineering in supply chains examining the case of the expanding Halal industry. arXiv preprint arXiv:2004.09796.

6. Chang, E., Tan, H., Dillon, T. S., Hadzic, F., & Feng, L. (2005). MB3-Miner: Efficient mining eMBedded subTREEs using tree model guided candidate generation. Paper presented at the Proceedings of the First International Workshop on Mining Complex Data (MCD).

7. Chi, Y., Muntz, R. R., Nijssen, S., & Kok, J. N. (2005). Frequent subtree mining–an overview. Fundamenta Informaticae, 66(1-2), 161-198.

8. Chi, Y., Yang, Y., & Muntz, R. R. (2003). Indexing and mining free trees. Paper presented at the Third IEEE International Conference on Data Mining.

9. Chi, Y., Yang, Y., & Muntz, R. R. (2004). HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. Paper presented at the Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004.

10. Chi, Y., Yang, Y., & Muntz, R. R. (2005). Canonical forms for labelled trees and their applications in frequent subtree mining. Knowledge and Information Systems, 8(2), 203-234.

11. Chi, Y., Yang, Y., Xia, Y., & Muntz, R. R. (2004). Cmtreeminer: Mining both closed and maximal frequent subtrees. Paper presented at the Pacific-Asia Conference on Knowledge Discovery and Data Mining.

12. Chowdhury, I. J., & Nayak, R. (2014). BEST: an efficient algorithm for mining frequent unordered embedded subtrees. Paper presented at the Pacific Rim International Conference on Artificial Intelligence.

13. Chowdhury, I. J., & Nayak, R. (2014). BOSTER: an efficient algorithm for mining frequent unordered induced subtrees. Paper presented at the International Conference on Web Information Systems Engineering.

14. Davenport, T. H., & Short, J. E. (1990). The new industrial engineering: information technology and business process redesign.

15. Deepak, A., Fernández-Baca, D., Tirthapura, S., Sanderson, M. J., & McMahon, M. M. (2014). EvoMiner: frequent subtree mining in phylogenetic databases. Knowledge and Information Systems, 41(3), 559-590.

16. Guha, S., Grover, V., Kettinger, W. J., & Teng, J. T. (1997). Business process change and organizational performance: exploring an antecedent model. Journal of management information systems, 14(1), 119-154.

17. Hadzic, F., Tan, H., & Dillon, T. S. (2007). UNI3-efficient algorithm for mining unordered induced subtrees using TMG candidate generation. Paper presented at the 2007 IEEE Symposium on Computational Intelligence and Data Mining.

18. Hakawati, M. R., Yacob, Y., Raof, R. A. A., Jabiry, M. M. K., & Alhudiani, E. S. (2020). Data Cleaning Model for XML Datasets using Conditional Dependencies. European Journal of Electrical Engineering and Computer Science, 4(1).

19. Hido, S., & Kawano, H. (2005). AMIOT: induced ordered tree mining in tree-structured databases. Paper presented at the Fifth IEEE International Conference on Data Mining (ICDM'05).

20. Janes, A., Maggi, F. M., Marrella, A., & Montali, M. (2017). From Zero to Hero: A Process Mining Tutorial. Paper presented at the International Conference on Product-Focused Software Process Improvement.

21. Jiménez, A., Berzal, F., & Cubero, J.-C. (2010). POTMiner: mining ordered, unordered, and partially-ordered trees. Knowledge and Information Systems, 23(2), 199-224.

22. Kim, K., Yeon, M., Jeong, B.-S., & Kim, K. P. (2017). A Conceptual Approach for Discovering Proportions of Disjunctive Routing Patterns in a Business Process Model. TIIS, 11(2), 1148-1161.

23. Li, Z., Xu, C., & Liu, C. (2019). Frequent Subtree Mining Algorithm for Ribonucleic Acid Topological Pattern. Revue d'Intelligence Artificielle, 33(1), 75-80.

24. Lindsay, A., Downs, D., & Lunn, K. (2003). Business processes—attempts to find a definition. Information and software technology, 45(15), 1015-1019.

25. Liu, W., & Chen, L. (2012). An efficient way of frequent embedded subtree mining on biological data. J. Comput, 6, 2574-2581.

26. Mannhardt, F. (2016). XESLite-managing large XES event logs in ProM. BPM Center Report BPM-16-04, 224-236.

27. Nijssen, S., & Kok, J. N. (2003). Efficient discovery of frequent unordered trees. Paper presented at the First international workshop on mining graphs, trees and sequences.

28. Nijssen, S., & Kok, J. N. (2004). A quickstart in frequent structure mining can make a difference. Paper presented at the Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining.

29. Romei, A., & Turini, F. (2010). XML data mining. Software: Practice and Experience, 40(2), 101-130.

30. Rückert, U., & Kramer, S. (2004). Frequent free tree discovery in graph data. Paper presented at the Proceedings of the 2004 ACM symposium on Applied computing.

31. Sadredini, E., Rahimi, R., Wang, K., & Skadron, K. (2017). Frequent subtree mining on the automata processor: challenges and opportunities. Paper presented at the Proceedings of the International Conference on Supercomputing.

32. Shaharanee, I. N. M., & Jamil, J. M. (2015). Irrelevant feature and rule removal for structural associative classification. Journal of Information and Communication Technology, 14, 95-110.

33. Shasha, D., Wang, J. T.-L., & Zhang, S. (2004). Unordered tree mining with applications to phylogeny. Paper presented at the Proceedings. 20th International Conference on Data Engineering.

34. Singhal, S., & Jena, M. (2013). A study on WEKA tool for data preprocessing, classification and clustering. International Journal of Innovative technology and exploring engineering (IJItee), 2(6), 250-253.

35. Studiawan, H., Sohel, F., & Payne, C. (2020). Automatic event log abstraction to support forensic investigation. Paper presented at the Proceedings of the Australasian Computer Science Week Multiconference.

36. Tan, H., Dillon, T. S., Hadzic, F., Chang, E., & Feng, L. (2006). IMB3-Miner: mining induced/embedded subtrees by constraining the level of embedding. Paper presented at the Pacific-Asia Conference on Knowledge Discovery and Data Mining.

37. Tatikonda, S., Parthasarathy, S., & Kurc, T. (2006). TRIPS and TIDES: new algorithms for tree mining. Paper presented at the Proceedings of the 15th ACM international conference on Information and knowledge management.

38. Termier, A., Rousset, M.-C., & Sebag, M. (2002). Treefinder: a first step towards xml data mining. Paper presented at the 2002 IEEE International Conference on Data Mining, 2002. Proceedings.

39. Tibeme, B., Shahriar, H., & Zhang, C. (2018). Process Mining Algorithms for Clinical Workflow Analysis. Paper presented at the SoutheastCon 2018.

40. Trkman, P. (2010). The critical success factors of business process management. International journal of information management, 30(2), 125-134.

41. Van der Aalst, W. M. (2013). Business process management: a comprehensive survey. International Scholarly Research Notices, 2013.

42. Van der Aalst, W. M., van Dongen, B. F., Herbst, J., Maruster, L., Schimm, G., & Weijters, A. J. (2003). Workflow mining: A survey of issues and approaches. Data & knowledge engineering, 47(2), 237-267.

43. van Dongen, B. F., & Van der Aalst, W. M. (2005). A Meta Model for Process Mining Data. EMOI-INTEROP, 160, 30.

44. Wang, C., Hong, M., Pei, J., Zhou, H., Wang, W., & Shi, B. (2004). Efficient pattern-growth methods for frequent tree pattern mining. Paper presented at the Pacific-Asia conference on knowledge discovery and data mining.

45. Weske, M. (2007). Business Process Management: Concepts, Languages, Architectures. Berlin: Springer-Verlag.

46. Xiao, Y., & Yao, J.-F. (2003). Efficient data mining for maximal frequent subtrees. Paper presented at the Third IEEE International Conference on Data Mining.

47. Zaki, M. J. (2005a). Efficiently mining frequent embedded unordered trees. Fundamenta Informaticae, 66(1-2), 33-52.

48. Zaki, M. J. (2005b). Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE transactions on knowledge and data engineering, 17(8), 1021-1035.

49. Zhang, S., & Wang, J. T. (2007). Discovering frequent agreement subtrees from phylogenetic data. IEEE transactions on knowledge and data engineering, 20(1), 68-82.