Big Data Platform Development with a Domain Specific Language for Telecom Industries


Cüneyt Şenbalcı, Serkan Altuntaş, Zeki Bozkus, Taner Arsan*

Kadir Has University Computer Engineering Department Cibali Kampusu 34083 Istanbul, Turkey

* arsan@khas.edu.tr

Abstract – This paper introduces a big data analysis platform with a Domain Specific Language for the telecom industry. The platform has three main parts, which together form a new kind of domain-specific system for processing and visualizing large data files for telecom organizations.

These parts are a Domain Specific Language (DSL), a Parallel Processing/Analyzing Platform for Big Data, and an Integrated Result Viewer. In addition to these main parts, a Distributed File Descriptor (DFD) is designed to pass information between these modules and organize their communication. To identify the benefits of this domain-specific solution, the standard framework of the Big Data concept is examined carefully. Big Data relies on dedicated infrastructure and tools for storing, processing, and analyzing data. These can be grouped into four parts: infrastructure, programming models, high performance schema free databases, and processing-analyzing. Although the Big Data concept has many advantages, these systems are still very difficult for many enterprises to manage. Therefore, this study proposes a new higher level language, called DSL, which helps enterprises process big data without writing complex low level parallel processing code, together with a new kind of result viewer. The paper also presents a Big Data solution system called Petaminer.

Keywords – Big Data Analysis, Domain Specific Language, Parallel Processing and Analyzing

I. INTRODUCTION

With this platform, data processing, analysis, and visualization of results can be performed easily, without writing complex queries or C or Java code. This also avoids spending too much time on basic operations.

Global data usage has been increasing exponentially over the last ten years [1]. A huge share of this data is generated in daily life. Customer feedback, call detail records, billing, social media analysis, emails, web server logs, and databases are some of the most important sources of information for enterprises.

Collecting and correlating different kinds of data is not easy. Traditional storage and processing systems cannot handle these large datasets [2]. Such large datasets are collectively called Big Data.

Infrastructure forms the base of the Big Data concept. Distributed parallel processing and storage tools are placed on this structure, which is generally known as distributed servers or the cloud. These are easy-to-manage virtual systems offered as IaaS by large companies such as Google, Amazon, and Microsoft.

Traditional programming models were not designed for large datasets. Therefore, Google developed a new kind of programming framework called Map Reduce [3]. This programming model divides a problem into multiple tasks and solves them in parallel. Most of the important Big Data analysis and processing tools, such as Hadoop [4], Hive [5], Pig [6], Cascading, Cascalog, Sawzall, and Dremel, use this model to solve complex tasks more efficiently.
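The division of a problem into independent map and reduce tasks can be illustrated with a small single-process sketch. It is not Hadoop code: the function names and the sample call detail records are assumptions made for illustration, and a real framework would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict
from itertools import chain


def map_record(record):
    """Map step: emit one (key, value) pair per call detail record."""
    subscriber, seconds = record
    return [(subscriber, seconds)]


def shuffle(pairs):
    """Group intermediate values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce step: collapse all values of one key into a single result."""
    return key, sum(values)


def map_reduce(records):
    """Run map, shuffle, and reduce sequentially over the input records."""
    mapped = chain.from_iterable(map_record(r) for r in records)
    grouped = shuffle(mapped)
    return dict(reduce_group(k, v) for k, v in grouped.items())


# Hypothetical call detail records: (subscriber id, call duration in seconds).
records = [("A", 60), ("B", 30), ("A", 15), ("C", 45), ("B", 90)]
totals = map_reduce(records)
print(totals)
```

Because each `map_record` call depends only on its own record and each `reduce_group` call only on the values of one key, both phases can be distributed without coordination between tasks.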

For large datasets, relational database management systems are not sufficient. Therefore, High Performance Schema Free Databases, generally known as NoSQL databases, are designed to perform data operations efficiently and rapidly. BigTable, HBase, Cassandra, MongoDB, and CouchDB are some of the most popular NoSQL database systems.

When companies deal with big data, it is very hard to extract valuable information from it. It is therefore very important to process and analyze big data on distributed parallel systems. Special tools such as Hive, Pig, Mahout [7], and R address this issue.

Although the Big Data concept has many advantages, these systems are still very difficult for many enterprises to manage. Therefore, this study proposes a new higher level language, called DSL, which helps enterprises process big data without writing complex low level parallel processing code, together with a new kind of result viewer. This paper also presents a Big Data solution system called Petaminer.

Traditional databases, management and analysis tools, algorithms, and processes cannot easily be applied to these large datasets [8]. The Big Data concept has three main characteristics: Volume, Variety, and Velocity.

The volume of data stored for analysis is increasing rapidly. Today, there are almost 2.7 zettabytes of data in the digital world [9].

The Big Data concept also has to deal with variety. Data is not only structured but also semi-structured and unstructured, which makes it unsuitable for storage in an RDBMS.

Another characteristic of the Big Data concept is the time between the arrival of data, its storage, and its processing. It is very important to handle and process data within this short interval.

II. TYPES OF TOOLS IN BIG DATA CONCEPT

There are some important factors that determine which types of tools are suitable:

1) Infrastructure

Most of the existing wireless bandwidth estimation solutions focus on either probing techniques or cross-layer techniques and require either significant bandwidth resources or protocol modifications. The novel aspects in comparison with other works include the fact that no probing traffic is required and that no modification of the Media Access Control (MAC) protocol is needed.

2) Programming Models

Standard programming models generally do not take into account how much data is being processed. To simplify complex programming tasks in that case, the Map Reduce algorithm was developed by Google, and it is now the most popular programming model approach around the world [10].

3) High Performance Schema Free Databases

High Performance Schema Free Databases, also called NoSQL databases, are designed to perform data operations efficiently and quickly.

4) Processing Analyzing

The amount of data has been increasing exponentially. Processing and analyzing these large datasets is essential for obtaining valuable information from this sea of data. There are many processing and analyzing tools, such as Hive, Pig, Mahout, and R.

III. HIGHER LEVEL DOMAIN SPECIFIC LANGUAGE FOR BIG DATA CONCEPT

When dealing with big data, the first job is collecting data sources from different kinds of platforms. The second is converting unstructured data into a format. The third is processing and analyzing huge amounts of data. The final job is finding the best cost/performance ratio.

Petaminer is a big data management solution for the telecom industry, developed by Elkotek. It consists of four main layers: Collect, Manage, Analyze, and Report. Besides these features, it also has a Domain Specific Language Infrastructure, which helps business users run complex queries while staying in their domain, without having to learn to write scripts in Java, Pig, etc.

When customers of Elkotek need to analyze a file, Elkotek must write the necessary scripts every time. This is too time consuming for Elkotek and its customers. The situation is generally the same for every big data platform in all industries.

At this point, this study provides a new solution called a Domain Specific Language. It is a new higher-level language with which customers do not need to write any complex Pig or Hive queries or Map Reduce code. Customers of big data platforms can easily run analyses on their own.

We designed the DSL, the DFD, and the result viewer by examining Elkotek's powerful big data analysis platform, Petaminer. The integration of the whole system is described below:

Figure 1: DSL Solution for Big Data Platform

The user writes a DSL script to analyze files. This script is then sent to the DFD module of the Petaminer platform. There, DFD analyzes the DSL script and converts it into the right structure. DFD then determines the correct operation analysis frameworks and starts processing. Finally, the generated output file information is sent to the result viewer in JSON format. The Result Viewer reads the JSON and shows the related diagram to the user.

Domain Specific Language – DSL

The system works with two kinds of files: input files and output files. The predefined input files are:

NETWORK_FILE, SUBSCRIBER_FILE, BILLING_FILE, CUSTOMER_COMPLAIN_FILE, COMPETITOR_FILE, PAYMENT_FILE, CROSS_SALES_FILE

Predefined files can be treated as first-class objects called File Descriptors (FD). File objects are defined as follows:

    FD Input_FD;
    Input_FD.fileType = "NETWORK_FILE";

Analyzing these predefined files requires operators. All operators belong to the telecom domain and know about the predefined input files. These are:

Network Engineering Operators: DPI, NETWORK_TRAFFIC, NETWORK_BANDWITH, NETWORK_CLUSTERING

Customer Support Operators: USAGE_ANALYSIS, CHURN, COMPLAIN_ANALYSIS, TOP_APPLICATION, ANORMALY_DETECTION

Finance & Marketing Operators: CROSS_SALE, MOST_PROFITABLE_CUSTOMER, MOST_PROFITABLE_TRAFFIC
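As an illustration, the predefined file types and operator groups listed above could be modelled as plain lookup tables with a small descriptor class. This is only a sketch: the class `FileDescriptor`, the helper `operator_group`, and the validation behaviour are assumptions made for illustration and are not part of Petaminer. Only the file type and operator names are taken from the text, spelled as they appear there.

```python
from dataclasses import dataclass, field

# Predefined input file types named in the text.
INPUT_FILE_TYPES = {
    "NETWORK_FILE", "SUBSCRIBER_FILE", "BILLING_FILE",
    "CUSTOMER_COMPLAIN_FILE", "COMPETITOR_FILE",
    "PAYMENT_FILE", "CROSS_SALES_FILE",
}

# Operator groups named in the text.
OPERATORS = {
    "Network Engineering": {
        "DPI", "NETWORK_TRAFFIC", "NETWORK_BANDWITH", "NETWORK_CLUSTERING",
    },
    "Customer Support": {
        "USAGE_ANALYSIS", "CHURN", "COMPLAIN_ANALYSIS",
        "TOP_APPLICATION", "ANORMALY_DETECTION",
    },
    "Finance & Marketing": {
        "CROSS_SALE", "MOST_PROFITABLE_CUSTOMER", "MOST_PROFITABLE_TRAFFIC",
    },
}


@dataclass
class FileDescriptor:
    """Hypothetical in-memory form of an FD object."""
    name: str
    attributes: dict = field(default_factory=dict)

    def set(self, attribute, value):
        # Assumed rule: fileType must be one of the predefined input files.
        if attribute == "fileType" and value not in INPUT_FILE_TYPES:
            raise ValueError(f"unknown file type: {value}")
        self.attributes[attribute] = value


def operator_group(operator):
    """Return the group an operator belongs to, or fail for unknown names."""
    for group, names in OPERATORS.items():
        if operator in names:
            return group
    raise ValueError(f"unknown operator: {operator}")


input_fd = FileDescriptor("Input_FD")
input_fd.set("fileType", "NETWORK_FILE")
print(input_fd.attributes, operator_group("DPI"))
```

Keeping the vocabulary in data rather than in code means that new operators or file types could be added without touching the validation logic.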

Predefined operators can be used as follows:

    FD output_FD;
    output_FD.location = "http://test.user/defined/output/path";
    output_FD = DPI input_FD;

Predefined File Descriptor attributes for specific operations:

| Attributes | Input File Descriptor: Usage | Input File Descriptor: Explanation | Output File Descriptor: Usage | Output File Descriptor: Explanation |
|---|---|---|---|---|
| fileType | "NETWORK_FILE" | Defines input files | - | - |
| location | HTTP url of file | Location of single input file | HTTP url of file | Single output location |
| location[] | {loc1, loc2, ...} | Location of multiple input files | {loc1, loc2, ...} | Multiple output locations |
| failureCheck | "YES" or "NO" | Needs more processing power and storage | - | - |
| filterParams[] | {par1, par2, ...} | Under development, according to customer | - | - |
| getColumns[] | - | - | {col1, col5, col9, ...} | Columns from output |
| titles[] | - | - | {title1, title2, ...} | Titles according to column |
| graphType | - | - | "PieChart", "LineGraph", etc. | Graph type for Result Viewer |
| graphName | - | - | "MyGraph" | Graph name |

Table 1: File Descriptor Attributes

Distributed File Descriptor – DFD

DFD is the management system between the DSL, the Big Data platform (Petaminer), and the Result Viewer. It receives the script from the DSL and converts it into the right structure to be processed on the right operation framework. It then handles the results and sends them to the Result Viewer in JSON format. DFD is an embedded user solution, like user-defined functions (UDF).

Result Viewer

After the output files are generated, DFD sends graph operations to the Result Viewer in JSON format. The Result Viewer takes them and generates the related graph according to needs.
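The role of DFD as a translator between DSL scripts and JSON messages can be sketched as follows. The paper does not specify the JSON layout or the parsing rules, so the key names (`jobs`, `operator`, `input`, `output`), the regular expressions, and the input URL are all hypothetical; only the statement forms and attribute names follow the DSL examples and Table 1.

```python
import json
import re

# Statement forms seen in the DSL examples (patterns are assumptions).
DECLARATION = re.compile(r'^FD\s+(\w+)$')
ASSIGNMENT = re.compile(r'^(\w+)\.(\w+)\s*=\s*"([^"]*)"$')
OPERATION = re.compile(r'^(\w+)\s*=\s*(\w+)\s+(\w+)$')


def translate(script):
    """Turn a DSL script into a JSON job description (hypothetical layout)."""
    descriptors = {}
    jobs = []
    for raw in script.split(";"):
        statement = raw.strip()
        if not statement:
            continue
        if m := DECLARATION.match(statement):
            descriptors[m.group(1)] = {}
        elif m := ASSIGNMENT.match(statement):
            name, attribute, value = m.groups()
            descriptors[name][attribute] = value
        elif m := OPERATION.match(statement):
            target, operator, source = m.groups()
            jobs.append({
                "operator": operator,
                "input": descriptors[source],
                "output": descriptors[target],
            })
        else:
            raise SyntaxError(f"cannot parse statement: {statement!r}")
    return json.dumps({"jobs": jobs}, indent=2)


# The input location below is a made-up placeholder.
script = '''
FD input_FD;
input_FD.fileType = "NETWORK_FILE";
input_FD.location = "http://test.user/defined/input/path";
FD output_FD;
output_FD.location = "http://test.user/defined/output/path";
output_FD.graphType = "PieChart";
output_FD = DPI input_FD;
'''
print(translate(script))
```

A viewer that receives such a message would only need the `output` attributes (`location`, `graphType`) to locate the result file and choose a chart, which is one way to keep it independent of the processing framework.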

The Result Viewer will be implemented after all of these DSL and DFD structures are finished.

IV. PERFORMANCE RESULTS

We examined two kinds of performance results. The first is data processing and analyzing performance. The other is database write performance with HBase.

Processing and Analyzing Performance Results

A performance difference is observed when the number of nodes and their map-reduce tasks are changed. There are six throughput results: the first two use 2 mapper tasks, the next two use 3 mapper tasks, the fifth uses 6, and the last uses 8 (Table 2). Each node holds 10 GB of data and has 1024 MB of memory.

| #ofNodes | Execution (seconds) | Throughput (MB/s) |
|---|---|---|
| 3 | 2629 | 11 |
| 3 | 2233 | 13 |
| 3 | 2069 | 14 |
| 4 | 1514 | 27 |
| 4 | 1380 | 29 |
| 4 | 1250 | 32 |

Table 2: Node Scalability Table

Increasing the number of mapper tasks decreases execution time and increases throughput. Increasing the number of nodes raises throughput sharply: the throughput of 4 nodes with 3 mappers is about twice that of 3 nodes with the same number of mappers (Figure 2).

Figure 2: Node Scalability

Another important performance issue when dealing with big data is decreasing execution times. There are two sets of results. The first is the execution time with 2 and 3 mappers on each of three nodes. The second is the execution time with 3, 6, and 8 mappers on each of four nodes (Table 4).
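As a cross-check on the node scalability table, the reported throughput values can be reproduced from the execution times. The sketch assumes that every node processes its full 10 GB, that 1 GB is 1024 MB, and that the reported figures are truncated to whole MB/s; none of these conventions is stated in the paper.

```python
MB_PER_GB = 1024          # assumption: binary gigabytes
DATA_PER_NODE_GB = 10     # stated in the text: 10 GB of data per node

# (number of nodes, execution time in seconds, reported throughput in MB/s)
runs = [
    (3, 2629, 11), (3, 2233, 13), (3, 2069, 14),
    (4, 1514, 27), (4, 1380, 29), (4, 1250, 32),
]


def throughput_mb_per_s(nodes, seconds):
    """Total data volume across all nodes divided by the execution time."""
    total_mb = nodes * DATA_PER_NODE_GB * MB_PER_GB
    return total_mb / seconds


for nodes, seconds, reported in runs:
    computed = throughput_mb_per_s(nodes, seconds)
    print(f"{nodes} nodes, {seconds} s: {computed:.2f} MB/s (reported {reported})")
```

Under these assumptions the truncated values agree with all six reported figures, which suggests that the throughput column was derived from the total data volume rather than measured separately.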

| #ofMapReduceTasks | Execution1 | Execution2 | Mean |
|---|---|---|---|
| 2, n=3 | 2629 | 2233 | 2431 |
| 3, n=3 | 2069 | - | 2069 |
| 3, n=4 | 1514 | 1493 | 1503.5 |
| 6, n=4 | 1470 | 1380 | 1425 |
| 8, n=4 | 1250 | - | 1250 |

Table 2: Execution Time Table (n denotes number of nodes)

Increasing the number of mapper tasks decreases execution time for both 3 and 4 nodes (n = 3, n = 4). Moreover, the number of nodes has a large effect on execution time (Figure 3). There is a significant time difference when the total number of nodes is raised from 3 to 4 with a fixed number of mapper tasks.

Figure 3: Execution Time

Using the values in Table 2, we created a speedup table to determine the gain obtained with different numbers of map/reduce tasks and nodes (Table 3).

| #ofMapReduceTasks | ExecutionTime | SpeedUp |
|---|---|---|
| 2, n=3 | 2431 | 1 |
| 3, n=3 | 2069 | 1.175 |
| 3, n=4 | 1503.5 | 1.617 |
| 6, n=4 | 1425 | 1.706 |
| 8, n=4 | 1250 | 1.945 |

Table 3: Speedup Table (n denotes number of nodes)

Figure 4 shows the execution time gain as a speedup graph as the number of map/reduce tasks and nodes increases.

Figure 4: Speedup
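The speedup column follows directly from the mean execution times: each value is the baseline time (2 map/reduce tasks on 3 nodes) divided by the mean time of the configuration. The short sketch below recomputes it; the dictionary layout and variable names are illustrative only.

```python
# Mean execution times in seconds, taken from the execution time table.
mean_times = {
    "2, n=3": 2431.0,
    "3, n=3": 2069.0,
    "3, n=4": 1503.5,
    "6, n=4": 1425.0,
    "8, n=4": 1250.0,
}

# Speedup relative to the slowest configuration (2 tasks on 3 nodes).
baseline = mean_times["2, n=3"]
speedups = {
    config: round(baseline / seconds, 3)
    for config, seconds in mean_times.items()
}

for config, value in speedups.items():
    print(f"{config}: {value}")
```

Written as a formula, the speedup of configuration c is S(c) = T(2, n=3) / T(c), so a value of 1.945 means the 8-task, 4-node run took roughly half the baseline time.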

HBase NoSQL DB Write Performances

Writing large datasets as quickly as possible is important. HBase is designed to perform these kinds of operations efficiently. Compression options such as the Snappy driver and LZO increase the average import rate and decrease disk usage, which improves HBase performance. Table 6 lists several samples: 709 MB of data, 1065 MB of data and its compressed output, and 95827 MB of compressed data.

| Data Size (MB) | #of Records | Total Time (sec) | Avg. Import Performance (MB/sec) | Disk Usage | Compression (LZO, Snappy) |
|---|---|---|---|---|---|
| 709.11 | 10934599 | 73.995 | 9 | 1802 | NO |
| 1065 | 4863732 | 58.984 | 17 | 1331 | NO |
| 1065 | (same sample) | 49 | 21 | 530 | YES |
| 95827 | 437587321 | 4891 | 19 | 48784 | YES |
| 95827 | 437587321 | 5350 | 19 | 49029 | YES |
| 95827 | 437587321 | 5051 | 19 | 47246 | YES |

Table 6: HBase DB Write Performances

Average import performance is generally related to disk usage. Starting with a smaller dataset (709 MB) without compression helped to establish the average import performance. Approximately 1 GB of data was then tested both compressed and uncompressed. There is a difference in average import performance between these two samples, which have the same data size. Finally, the average import rate of a large compressed sample (95827 MB) was examined to find out the effects of large datasets (Figure 5).

Figure 5: HBase Avg. Import Performance

The write performance of the system is related to data size. LZO compression helps to decrease disk size. Figure 6 compares the disk usage of the 1065 MB sample with that of its compressed version.
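The effect of compression on disk usage can be quantified from the write performance table by dividing disk usage by data size for each run. The sketch assumes that disk usage is reported in MB, which the table does not state, and uses only the 1065 MB and 95827 MB samples.

```python
def disk_ratio(data_mb, disk_mb):
    """Disk space used per unit of imported data (assumes both are in MB)."""
    return disk_mb / data_mb


# Sample 1: the same 1065 MB dataset, uncompressed and compressed.
sample1_plain = disk_ratio(1065, 1331)
sample1_compressed = disk_ratio(1065, 530)

# Sample 2: three compressed imports of the 95827 MB dataset.
sample2_runs = [disk_ratio(95827, disk) for disk in (48784, 49029, 47246)]

# Fraction of disk space saved by compressing sample 1.
saving = 1 - 530 / 1331

print(f"sample 1 uncompressed: {sample1_plain:.2f}")
print(f"sample 1 compressed:   {sample1_compressed:.2f}")
print("sample 2 compressed:  ", [f"{r:.2f}" for r in sample2_runs])
print(f"saving on sample 1:    {saving:.0%}")
```

Both compressed samples use roughly half a megabyte of disk per megabyte of data, which is consistent with the observation that the disk usage rate of the large sample is almost the same as that of the first sample.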

Figure 6: HBase Disk Usage – Sample 1

The disk usage of another, larger sample (95827 MB) was tested three times with compression (Figure 7).

In Figure 8, the disk usage rate of the large compressed sample is almost the same as that of the first sample in Figure 6. Compressing the whole dataset decreases disk usage and increases the average import rate.

Figure 7: HBase Compression Tests of Sample 2

Figure 8: HBase Disk Usage – Sample 2

V. RESEARCH CONTRIBUTIONS AND CONCLUSIONS

In this study, we present a big data system for the telecom industry that offers a new higher level language for domain specific operations.

Writing Map Reduce programs is a very difficult and complex operation. The user has to write both processing and analyzing tasks using Map Reduce algorithms. There are many solutions for handling Map Reduce jobs, such as Dremel and Sawzall. Although they make processing more effective, they are not higher-level solutions that make these queries easy to write.

We provide a new higher-level language called DSL and design an integration framework, DFD, for messaging with the domain specific big data platform, Petaminer. With this solution, the user can easily perform predefined operations in the telecom domain and obtain the result through a result viewer. The messaging protocol is designed with JSON, which is a very basic and widely used solution.

There are some limitations in this study. First, the DFD framework is not yet completely designed. In addition, the result viewer is not implemented. Only the DSL operators, objects, and attributes have been designed, and their tests are still in progress.

As further research, the DFD framework and the result viewer should be implemented. New operators and attributes can then be defined to increase the analyzing power of the DSL solution.

VI. ACKNOWLEDGEMENT

We would like to thank Bahtiyar Karanlık and Halit Olalı from Elkotek for their assistance. They provided the Petaminer platform for our use.

REFERENCES

[1] Warden, P., Big Data Glossary, O'Reilly Media Publications, USA, 2011.
[2] Eaton, C., Deroos, D., Deutsch, T., Lapis, G., Zikopoulos, P., Understanding Big Data, McGraw-Hill, USA, 2012.
[3] Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107-113.
[4] Apache Hadoop Project. Web site: http://hadoop.apache.org/
[5] Apache Hive Project. Web site: http://hive.apache.org/
[6] Apache Pig Project. Web site: http://pig.apache.org/
[7] Apache Mahout Project. Web site: http://mahout.apache.org/
[8] Critiques of Big Data execution. Web site: http://en.wikipedia.org/wiki/Big_data#Critiques_of_the_Big_Data_paradigm
[9] Driving Marketing Effectiveness By Managing The Flood Of Big Data. Web site: http://www.ibmbigdatahub.com/infographic/flood-big-data
[10] Lam, C., Hadoop In Action, Manning Publications, USA, 2010.
[11] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton and T. Vassilakis. Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB, 2010.