Big Data Platform Development with a Domain Specific Language for Telecom Industries
Cüneyt Şenbalcı, Serkan Altuntaş, Zeki Bozkuş, Taner Arsan*
Kadir Has University, Computer Engineering Department, Cibali Campus, 34083 Istanbul, Turkey
* [email protected]
Abstract – This paper introduces a system that offers a special big data analysis platform with a Domain Specific Language for telecom industries. The platform has three main parts that together form a new kind of domain specific system for processing and visualizing large data files for telecom organizations. These parts are a Domain Specific Language (DSL), a Parallel Processing/Analyzing Platform for Big Data, and an Integrated Result Viewer. In addition to these main parts, a Distributed File Descriptor (DFD) is designed for passing information between these modules and organizing their communication. To find out the benefits of this domain specific solution, the standard framework of the big data concept is examined carefully. The big data concept relies on special infrastructure and tools for data storing, processing and analyzing operations. This infrastructure can be grouped into four different parts: infrastructure, programming models, high performance schema free databases, and processing-analyzing. Although the Big Data concept has many advantages, these systems are still very difficult for many enterprises to manage. Therefore, this study suggests a new higher level language, called DSL, which helps enterprises to process big data without writing any complex low level traditional parallel processing code, together with a new kind of result viewer, and this paper also presents a Big Data solution system called Petaminer. Steps of data processing/analyzing and visualization of results can be performed easily by using this platform, without writing complex queries or C or Java code. This also prevents spending too much time on basic operations.
Keywords – Big Data Analysis, Domain Specific Language, Parallel Processing and Analyzing
I. INTRODUCTION
Global data usage has been increasing exponentially over the last ten years [1]. If we look at the types of these data, we can see that a huge amount of them is generated in our daily life. Customer feedback, call detail records, billing, social media analysis, emails, web server logs and databases are some of the most important information sources for enterprises.
Collection and correlation of different kinds of data is not an easy operation, and traditional storage and processing systems cannot be used to handle these large datasets [2]. All of these large datasets are called Big Data.
Infrastructure creates the base of the Big Data concept. Distributed parallel processing and storing tools are placed on this structure, which is generally known as distributed servers or the cloud. These are easy-to-manage virtual systems that are offered as IaaS by big companies such as Google, Amazon, and Microsoft.
Traditional programming models are not designed for large datasets. Therefore, Google developed a new kind of programming framework called MapReduce [3]. This programming model divides a problem into multiple tasks and solves them in parallel. The most important Big Data analyzing and processing tools, such as Hadoop [4], Hive [5], Pig [6], Cascading, Cascalog, Sawzall and Dremel, use this algorithm to increase the efficiency of solving complex tasks.
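As an illustration of the programming model, the following sketch is a minimal MapReduce job written against Hadoop's Java API. It is not part of the platform described in this paper; the task (counting call detail records per subscriber), the class names and the comma-separated input layout are hypothetical.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CdrCountPerSubscriber {

    // Map phase: emit (subscriberId, 1) for every CDR line.
    public static class CdrMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text subscriber = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assumption: the first comma-separated field is the subscriber id.
            String[] fields = value.toString().split(",");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                subscriber.set(fields[0]);
                context.write(subscriber, ONE);
            }
        }
    }

    // Reduce phase: sum the counts for each subscriber.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cdr-count-per-subscriber");
        job.setJarByClass(CdrCountPerSubscriber.class);
        job.setMapperClass(CdrMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Reusing the reducer as a combiner pre-aggregates the counts on the map side and reduces shuffle traffic, which is the usual choice for this kind of counting job.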
When it comes to large datasets, relational database management systems are not enough to achieve this goal. Therefore, High Performance Schema Free Databases, generally known as NoSQL databases, are designed to perform data operations efficiently and rapidly. BigTable, HBase, Cassandra, MongoDB, and CouchDB are some of the most popular NoSQL database systems.
When companies deal with big data, it is very hard to obtain the valuable information in it, so it is very important to process and analyze big data on distributed parallel systems. There are special kinds of tools to overcome this issue, such as Hive, Pig, Mahout [7], R and so on.
Although the Big Data concept has many advantages, these systems are still very difficult for many enterprises to manage. Therefore, this study suggests a new higher level language, called DSL, which helps enterprises to process big data without writing any complex low level traditional parallel processing code, together with a new kind of result viewer, and this paper also presents a Big Data solution system called Petaminer.
Traditional databases, management and analysis tools, algorithms and processes cannot be easily applied to these large datasets [8]. The Big Data concept has three main characteristics: Volume, Variety, and Velocity.
The volume of data that is stored for analysis is increasing extremely fast. Today, there are almost 2.7 zettabytes of data in the digital world [9].
The Big Data concept also has to deal with variety. There are not only structured data types but also semi-structured and unstructured ones, which are not suitable to store in an RDBMS.
Another characteristic of the Big Data concept, velocity, is the time between the arrival of data and the moment it has to be stored and processed. It is very important to handle and process data within this short interval.

II. TYPES OF TOOLS IN BIG DATA CONCEPT

There are some important factors to determine which of these tools are suitable:

1) Infrastructure
Infrastructure creates the base of the Big Data concept. Distributed parallel processing and storage tools are placed on distributed servers or cloud systems, which are offered as easy-to-manage IaaS services by companies such as Google, Amazon, and Microsoft.

2) Programming Models
Standard programming models generally do not deal with how much data is being processed. In that case, to simplify complex programming tasks, the MapReduce algorithm was developed by Google, and it is now the most popular programming model approach all around the world [10].

3) High Performance Schema Free Databases
High Performance Schema Free Databases, which are called NoSQL databases, are designed to perform data operations efficiently and quickly.

4) Processing and Analyzing
The amount of data has been increasing exponentially. Processing and analyzing these large datasets is most important to obtain valuable information from this sea of data. There are lots of processing and analyzing tools such as Hive, Pig, Mahout, R and so on.

III. HIGHER LEVEL DOMAIN SPECIFIC CONCEPT FOR BIG DATA

While dealing with big data, the first job is collecting data sources from different kinds of platforms. The second is formatting unstructured data into a suitable format. The third is processing and analyzing a huge amount of data. The final job is finding the best cost/performance ratio.

Petaminer is a big data management solution for telecom industries developed by Elkotek. It consists of four main layers: Collect, Manage, Analyze, and Report. Besides all these features, it also has a Domain Specific Language infrastructure, which helps business users run complex queries while staying in their domain, eliminating the need to learn to write scripts in Java, Pig, etc.

When customers of Elkotek need to analyze a file, Elkotek must write the necessary scripts every time. This is too time consuming for Elkotek and its customers, and this case is generally the same for every big data platform in all industries.

At that point, we provide in this study a new solution called the Domain Specific Language. This is a new higher-level language with which customers do not need to write any complex Pig or Hive queries or MapReduce code; customers of big data platforms can easily run analyses on their own.

We designed the DSL, the DFD and the Result Viewer by examining the powerful big data analysis platform of Elkotek, Petaminer. Integration of the whole system is described below:

Figure 1: DSL Solution for Big Data Platform

The user writes a DSL script to analyze files. Then, this script is sent to the DFD module of the Petaminer platform. There, the DFD analyzes the DSL script and converts it into the right structure. The DFD then determines the correct operation analysis framework and starts processing. Finally, the generated output and file information are sent to the Result Viewer in JSON format. The Result Viewer reads the JSON and shows the related diagram to the user.
Domain Specific Language – DSL

The system works with two kinds of files: input files and output files. There are predefined input files such as: NETWORK_FILE, SUBSCRIBER_FILE, BILLING_FILE, CUSTOMER_COMPLAIN_FILE, COMPETITOR_FILE, PAYMENT_FILE, CROSS_SALES_FILE.

We can think of a predefined file as a first class object called a File Descriptor (FD). We define file objects like:

FD Input_FD;
Input_FD.fileType = "NETWORK_FILE";

Analyzing these predefined files therefore needs some operators. All operators are in the telecom domain and know about the predefined input files. These are:

Network Engineering Operators: DPI, NETWORK_TRAFFIC, NETWORK_BANDWIDTH, NETWORK_CLUSTERING
Customer Support Operators: USAGE_ANALYSIS, CHURN, COMPLAIN_ANALYSIS, TOP_APPLICATION, ANOMALY_DETECTION
Finance & Marketing Operators: CROSS_SALE, MOST_PROFITABLE_CUSTOMER, MOST_PROFITABLE_TRAFFIC
We can use predefined operators such as:

FD output_FD;
output_FD.location = "http://test.user/defined/output/path";
output_FD = DPI input_FD;

Predefined File Descriptor attributes for specific operations:

Attribute      | Input FD usage            | Input FD explanation                    | Output FD usage                 | Output FD explanation
fileType       | "NETWORK_FILE"            | Defines input files                     | -                               | -
location       | HTTP URL of file          | Location of a single input file         | HTTP URL of file                | Single output location
location[]     | {loc1, loc2, ...}         | Locations of multiple input files       | {loc1, loc2, ...}               | Multiple output locations
failureCheck   | "YES" or "NO"             | Needs more processing power and storage | -                               | -
filterParams[] | {par1, par2, ...}         | Under development, according to customer| -                               | -
Columns[]      | -                         | -                                       | {col1, col5, col9, ...}         | Columns from output
titles[]       | -                         | -                                       | {title1, title2, ...}           | Titles, one per column
graphType      | -                         | -                                       | "PieChart", "LineGraph", etc.   | Graph type for Result Viewer
graphName      | -                         | -                                       | "MyGraph"                       | Graph name

Table 1: File Descriptor Attributes
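To make the File Descriptor concrete, the following sketch models the FD object in Java using only the attributes listed in Table 1. The internal representation used by Petaminer and the DFD is not published here, so the class name, field types and example values below are illustrative assumptions only.

import java.util.Arrays;
import java.util.List;

public class FileDescriptor {
    // Input-side attributes (left half of Table 1)
    String fileType;            // e.g. "NETWORK_FILE"
    String location;            // HTTP URL of a single file
    List<String> locations;     // locations of multiple files
    boolean failureCheck;       // DSL "YES"/"NO"; needs more processing power and storage
    List<String> filterParams;  // under development, according to customer needs

    // Output-side attributes consumed by the Result Viewer (right half of Table 1)
    List<String> columns;       // columns selected from the operator output
    List<String> titles;        // one title per selected column
    String graphType;           // e.g. "PieChart", "LineGraph"
    String graphName;           // e.g. "MyGraph"

    public static void main(String[] args) {
        // Mirrors the DSL lines:  FD Input_FD;  Input_FD.fileType = "NETWORK_FILE";
        FileDescriptor input = new FileDescriptor();
        input.fileType = "NETWORK_FILE";

        // Mirrors:  FD output_FD;  output_FD = DPI input_FD;  (hypothetical operator result)
        FileDescriptor output = new FileDescriptor();
        output.location = "http://test.user/defined/output/path";
        output.graphType = "LineGraph";
        output.graphName = "DPI result";
        output.columns = Arrays.asList("col1", "col5");
        output.titles = Arrays.asList("Time", "Traffic (MB)");

        System.out.println(output.graphName + " computed from " + input.fileType);
    }
}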
Distributed File Descriptor – DFD

The DFD is the management system between the DSL, the big data platform (Petaminer) and the Result Viewer. It gets the script from the DSL and converts it into the right structure to process on the right operation framework. Then it handles the results and sends them to the user in JSON format. The DFD is an embedded solution like user-defined functions (UDF).

Result Viewer

After the output files are generated, the DFD sends the graph operations to the Result Viewer in JSON format. The Result Viewer takes them and generates the related graph according to the user's needs. The Result Viewer will be implemented after all of the DSL and DFD structures are finished.
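The exact JSON schema exchanged between the DFD and the Result Viewer is not fixed yet. The snippet below therefore only sketches one plausible graph message assembled from the output-side attributes of Table 1 (graphName, graphType, columns, titles, location); the field names and values are assumptions, not a finalized Petaminer format.

public class GraphMessageSketch {
    public static void main(String[] args) {
        // Hypothetical JSON payload a DFD-like component could hand to the Result Viewer.
        String json =
            "{\n" +
            "  \"graphName\": \"DPI result\",\n" +
            "  \"graphType\": \"LineGraph\",\n" +
            "  \"columns\":   [\"col1\", \"col5\"],\n" +
            "  \"titles\":    [\"Time\", \"Traffic (MB)\"],\n" +
            "  \"location\":  \"http://test.user/defined/output/path\"\n" +
            "}";
        System.out.println(json); // in a real system this would be sent to the Result Viewer
    }
}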
IV. PERFORMANCE RESULTS

We examined two kinds of performance results. The first is data processing and analyzing performance. The other is database write performance with HBase.

Processing and Analyzing Performance Results

A performance difference is observed when the number of nodes and the number of map-reduce tasks are changed. There are six different throughput results: the first two of them have 2 mapper tasks, the next two have 3 mapper tasks, the fifth one has 6, and the last one has 8 mapper tasks (Table 2). There are 10 GB of data per node, and each node has 1024 MB of memory.

#ofNodes | Execution (seconds) | Throughput (MB/s)
3        | 2629                | 11
3        | 2233                | 13
3        | 2069                | 14
4        | 1514                | 27
4        | 1380                | 29
4        | 1250                | 32

Table 2: Node Scalability Table

Increasing the number of mapper tasks decreases execution time and increases throughput, and increasing the number of nodes raises throughput sharply: the throughput of 4 nodes with 3 mappers is twice that of 3 nodes with the same number of mappers (Figure 2).

Figure 2: Node Scalability
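From the raw numbers, the throughput column of Table 2 appears to be the total data volume divided by the execution time, with 10 GB loaded per node: for example, 3 nodes x 10240 MB = 30720 MB, and 30720 MB / 2629 s is about 11.7 MB/s, matching the first row; likewise 4 x 10240 MB / 1250 s is about 32.8 MB/s for the last row.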
and ides th 3 pers
with sets:
h of
nd 8
#of Red 2 3 3 6 8
T I tim Mo dec tim nod
F B up num
# Tas
2 3 3 6 8
T I spe and
Fig
fMap
duceTasks E 2,n=3
3,n=3
3,n=4
6,n=4
8,n=4
Table 2: Exec Increasing num me when numb
oreover, numb crease executi me difference b
des 4 up from
Figure 3: Exec By using Tab
table to fin mber of map/r
#ofMapReduce ks
2,n=3
3,n=3
3,n=4
6,n=4
8,n=4
Table 3: Spee In Figure 4, it eedup graph w d nodes.
gure 4: Speedu
Execution1
2629
2069
1514
1470
1250
cution Time T mber of mapp ber of nodes e ber of nodes h ion time (Figu between while m 3 with fixed n
cution Time le 2 values an nd out gain
reduce tasks a
Executi
2431
2096
1503,5
1425
1250
edup Table (n t is possible to while increasi
up
Execution2
2233
Ͳ
1493
1380
Ͳ
able (n denote per tasks decre qual to 3 or 4 has a great per
ure 3). There i e changing tot number of ma
nd variables, w while dealin and nodes (Tab
ionTime
denotes numb o show execut ing number of
Mean
2431 2069
15 1425 1250
es num. of nod ease execution (n = 3, n = 4) formance effe s an importan tal number of apper tasks.
we created a s g with diffe ble 3).
SpeedUp 1
1,175 1,617 1,706 1,945
ber of nodes) tion time gain f map/reduce
503,5
des) n ).
ect to nt
speed erence
n with tasks
H W HBa effic incre usag some and i
Data
Size
709,1
1065 1065 95827 95827 95827
Ta Av the d comp perfo comp perfo same big s the e
Fi W lzo c that comp
HBase NoSQL Writing large d
ase is design ciently. Comp ease the avera ge to increase e different sam its compressed
#of
Records
11 10934599
4863732
4863732
7 437587321 7 437587321 7 437587321
able 6: HBase verage impor disk usage. St pression he ormance. Then pressed and ormance diff e data size. F size of sample effects of the l
igure 5: HBas Write performa
compression h disk usage of pressed sampl
L DB Write Pe datasets as qu ed to perform
ression option age import ra
HBase perfo mples such as d output, 9582
Total
Time
(sec)
Avg Per (M 73,995
58,984
49
4891
5350
5051
e DB Write Pe rt performanc tarting with sm
lped to fig n approximate uncompresse ference betwe Finally, averag
e data (95827 large datasets
e Avg. Import ance of the sy helps to decrea f 1065 MB sam
le.
erformances uick as possi m these kind ns like snapp ate and also d ormance. In T s 709 MB dat 27 MB compr
g.Import
rformance
MB/sec)
Dis Us
9 18
17 13
21 53
19 48
19 49
19 47
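As a rough illustration of the kind of workload measured in Table 5, the following sketch creates an HBase table with a Snappy-compressed column family and performs a buffered bulk import using the classic (0.9x-era) HBase client API. The table name, column family, record layout and buffer size are hypothetical and are not taken from the Petaminer setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.hfile.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseImportSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        // Create the table with a Snappy-compressed column family,
        // trading a little CPU for lower disk usage (compare rows 2 and 3 of Table 5).
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTableDescriptor tableDesc = new HTableDescriptor("cdr");
        HColumnDescriptor family = new HColumnDescriptor("d");
        family.setCompressionType(Compression.Algorithm.SNAPPY);
        tableDesc.addFamily(family);
        if (!admin.tableExists("cdr")) {
            admin.createTable(tableDesc);
        }

        // Buffer client-side writes so rows are flushed in batches instead of one RPC per Put.
        HTable table = new HTable(conf, "cdr");
        table.setAutoFlush(false);
        table.setWriteBufferSize(8 * 1024 * 1024); // 8 MB client write buffer

        for (long i = 0; i < 100000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("record"), Bytes.toBytes("payload-" + i));
            table.put(put);
        }
        table.flushCommits();
        table.close();
        admin.close();
    }
}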
Average import performance is generally related to the data size. Starting with the smaller data (709 MB) without compression helped to figure out the average import performance. Then approximately 1 GB of data was tested both compressed and uncompressed; there is an average import performance difference between these samples even though they have the same data size. Finally, the average import rate of the compressed big sample (95827 MB) was examined to find out the effects of large datasets (Figure 5).

Figure 5: HBase Avg. Import Performance

Write performance of the system is related to data size, and LZO compression helps to decrease disk usage. Figure 6 compares the disk usage of the 1065 MB sample data with that of its compressed sample.
Figure 6: HBase Disk Usage – Sample 1

Disk usage of another sample with a large size (95827 MB) was tested three times with compression (Figure 7).

Figure 7: HBase Compression Tests of Sample 2

As shown in Figure 8, the disk usage rate of the compressed big sample is almost the same as that of the first sample in Figure 6. Compressing the whole dataset decreases disk usage and increases the average import rate.

Figure 8: HBase Disk Usage – Sample 2
V. RESEARCH CONTRIBUTIONS AND CONCLUSIONS

In this study, we present a big data system for telecom industries, which offers a new higher level language and domain specific operations.

Writing MapReduce programs is a very difficult and complex operation: the user has to write both processing and analyzing tasks using MapReduce algorithms. There are many solutions for handling MapReduce jobs, such as Dremel and Sawzall. Although they increase the effectiveness of processing, they are not higher-level solutions that make these queries easy to write. We provide a new higher-level language called DSL and design an integration framework, the DFD, for messaging with the domain specific big data platform Petaminer. With this solution, the user can easily perform predefined operations in the telecom domain and get results through a result viewer. The messaging protocol is designed with JSON, which is a very basic and common solution all around the world.

There are some limitations in this study. Firstly, the DFD framework is not completely designed yet, and the result viewer is not implemented. Only the DSL operators, objects and attributes are designed, and their tests still continue.

As further research, the DFD framework and the result viewer should be implemented. Then new operators and attributes can be defined to increase the analyzing power of the DSL solution.

VI. ACKNOWLEDGEMENT

We would like to thank Bahtiyar Karanlık and Halit Olalı from Elkotek for their assistance. They provided the Petaminer platform for our usage.

REFERENCES
[1] Warden, P., Big Data Glossary, O'Reilly Media Publications, USA, 2011.
[2] Eaton, C., Deroos, D., Deutsch, T., Lapis, G., Zikopoulos, P., Understanding Big Data, McGraw-Hill, USA, 2012.
[3] Dean, J., & Ghemawat, S. (2008). MapReduce: Simplified Data Processing on Large Clusters. Communications of the ACM, 51(1), 107-113.
[4] Apache Hadoop Project. Web site: http://hadoop.apache.org/
[5] Apache Hive Project. Web site: http://hive.apache.org/
[6] Apache Pig Project. Web site: http://pig.apache.org/
[7] Apache Mahout Project. Web site: http://mahout.apache.org/
[8] Critiques of Big Data execution. Web site: http://en.wikipedia.org/wiki/Big_data#Critiques_of_the_Big_Data_paradigm
[9] Driving Marketing Effectiveness By Managing The Flood Of Big Data. Web site: http://www.ibmbigdatahub.com/infographic/flood-big-data
[10] Lam, C., Hadoop in Action, Manning Publications, USA, 2010.
[11] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton and T. Vassilakis. Dremel: Interactive Analysis of Web-Scale Datasets. PVLDB, 2010.