SESSION TA2: Embedded Systems, Sensors and MEMS
Chair: Christopher Ryan, Vitesse
Co-chair: Kenneth Hsu, Rochester Institute of Technology
Workload Clustering for Increasing Energy Savings on Embedded MPSoCs*
S.H.K. Narayanan, O. Ozturk, M. Kandemir
Department of Computer Science and Engineering
The Pennsylvania State University
{snarayan,ozturk,kandemir}@cse.psu.edu
ABSTRACT
Voltage/frequency scaling and processor low-power modes (i.e., processor shut-down) are two important mechanisms used for reducing energy consumption in embedded MPSoCs. While a unified scheme that combines these two mechanisms can achieve significant savings in some cases, such an approach is limited by the code parallelization strategy employed. In this paper, we propose a novel, integer linear programming (ILP) based workload clustering strategy across parallel processors, oriented towards maximizing the number of idle processors without impacting original execution times. These idle processors can then be switched to a low power mode to maximize energy savings, whereas the remaining ones can make use of voltage/frequency scaling. In order to check whether this approach brings any energy benefits over the pure voltage scaling based, pure processor shut-down based, or a simple unified scheme, we implemented four different approaches and tested them using a set of eight array/loop-intensive embedded applications. Our simulation-based analysis reveals that the proposed ILP based approach (1) is very effective in reducing the energy consumptions of the applications tested and (2) generates much better energy savings than all the alternate schemes tested (including a unified scheme that combines voltage/frequency scaling and processor shutdown).
I. INTRODUCTION
We can roughly divide the efforts on energy savings in embedded multi-processor system-on-a-chip architectures (MPSoCs) into two categories. In the first category are the studies that employ processor voltage/frequency scaling. The basic idea is to scale down voltage/frequency of a processor if its current workload is less than the workloads of other processors. In comparison, the studies in the second category shut down unused processors (i.e., put them into low-power states along with their private memory components) during the execution of the current computation. Both these techniques, i.e., voltage scaling and processor shut down, can be applied at the software level (e.g., directed by an optimizing compiler) or at the hardware level (e.g., based on a past history-based workload/idleness detection algorithm). It is also conceivable to combine these two techniques under a unified optimizer.

Each of these techniques has its advantages and drawbacks. For example, a processor shut-down based scheme may not be applicable if there is no unused processor (note that this does not mean that the workloads of all the processors in the MPSoC are similar). Similarly, the effectiveness of a voltage scaling based scheme is limited by the number of voltage/frequency levels supported by the underlying hardware. In general, exploiting processor/memory shutdown saves more energy when it is applicable (as it reduces leakage energy significantly) or when we have only a couple of voltage/frequency levels to use. If this is not the case, then voltage scaling can be effective (and in some cases it is the only choice). Based on this discussion, one can expect a unified scheme to be successful. However, we want to re-iterate that if there is no unused (idle) processor in the current workload assignment, such a unified scheme simply reduces to a voltage scaling based approach.

M. Karakoy
180 Queen's Gate
Imperial College
[email protected]
Our goal in this paper is to explore a workload (job) clustering scheme that combines voltage scaling with processor shut-down¹. The uniqueness of the proposed unified approach is that it maximizes the opportunities for processor shut-down by assigning workloads to processors carefully. It achieves this by clustering the original workloads of processors in as few processors as possible. In this paper, we discuss the technical details of this approach to energy saving in embedded MPSoCs. The proposed approach is ILP (integer linear programming) based; that is, it determines the optimal workload clusterings across the processors by formulating the problem using ILP and solving it using a linear solver. In order to check whether this approach brings any energy benefits over the pure voltage scaling based, pure processor shut-down based, or a simple unified scheme, we implemented four different approaches within our linear solver and tested them using a set of eight array/loop-intensive embedded applications. Our simulation-based analysis reveals that the proposed ILP based approach (1) is very effective in reducing the energy consumptions of the applications tested and (2) generates much better energy savings than all the alternate schemes tested (including one that combines voltage/frequency scaling and processor shutdown).

II. EMBEDDED MPSOC ARCHITECTURE, EXECUTION MODEL, AND RELATED WORK
The chip multiprocessor we consider in this work is a shared-memory architecture; that is, the entire address space is accessible by all processors. Each processor has a private L1 cache, and the shared memory is assumed to be off-chip. Optionally, we may include a (shared) L2 cache as well. Note that several architectures from academia and industry fit in this description [1, 10, 8, 9]. We keep the subsequent discussion simple by using a shared bus as the interconnect (though one could use fancier/higher bandwidth interconnects as well). We also use the MESI protocol (the choice is orthogonal to the focus of this paper) to keep the caches coherent across the CPUs. We assume that voltage level and frequency of each processor in this architecture can be set independently of the others, and also processors can be placed into low power modes independently. This paper focuses on a single-issue, five-stage (instruction fetch (IF), instruction decode/operand fetch (ID), execution (EXE), memory access (MEM), and write-back (WB) stages) pipelined datapath for each on-chip processor.

Our application execution model in this embedded MPSoC can be summarized as follows. We focus on array-based embedded applications that are constructed from loop nests. Typically, each loop nest in such an application is small but executes a large number of iterations and accesses/manipulates large datasets (typically multidimensional arrays of signals). We employ a loop nest based application parallelization strategy. More specifically, each loop nest is parallelized independently of the others. In this context, parallelizing a loop nest means distributing its iterations across processors and allowing processors to execute their portions in parallel. For example, a loop with 1000 iterations can be parallelized across 10 processors by allocating 100 iterations to each processor.

[Figure 1, panels (a) through (e): workloads of processors P0 to P5 under each scheme. Legend: idle/unused, voltage scaled, shut down.]
Figure 1: Comparison of different energy-saving approaches for a six processor architecture. Arrows indicate how the workloads (jobs) are clustered by our approach.

*This work is supported in part by NSF Career Award #0093082 and by a grant from GSRC.
¹In this paper, we use the terms "processor shut-down" and "low-power mode" interchangeably.
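The iteration distribution used in the execution model (a 1000-iteration loop split evenly across 10 processors) can be sketched as follows. This is only an illustration: the function name `block_partition` and the choice of contiguous blocks are assumptions, since the paper does not commit to a specific parallelization strategy.

```python
def block_partition(num_iterations, num_processors):
    """Split iterations 0..num_iterations-1 into contiguous blocks,
    one block per processor (hypothetical helper, block distribution assumed)."""
    base, extra = divmod(num_iterations, num_processors)
    blocks = []
    start = 0
    for p in range(num_processors):
        # The first `extra` processors take one additional iteration
        # when the division is not exact.
        size = base + (1 if p < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


# The example from the text: 1000 iterations on 10 processors.
blocks = block_partition(1000, 10)
sizes = [len(b) for b in blocks]
```

The per-processor block sizes play the role of the initial workloads (jobs) that the clustering approach later redistributes.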
There are many proposals for power management of a dynamic voltage scaling-capable processor. Most of them are at the operating system level and are either task-based [12] or interval-based [4]. While some proposals aim at reducing energy without compromising performance, a recent study by Grunwald et al. [5] observed noticeable performance loss for some interval-based algorithms using actual measurements. Most of the existing compiler based studies such as [11] target single processor architectures. In comparison, our work targets a chip multiprocessor based environment and combines voltage scaling and processor shutdown. [15] presents and analyzes a voltage/frequency scaling scheme but does not consider processor shut-down. [6] employs a processor shut-down based mechanism but does not consider voltage/frequency scaling. In our experimental evaluation, we compare our approach to pure voltage/frequency scaling and to pure processor shut-down as well.

III. OUR APPROACH
III.1 Overview
Figure 1 compares four different alternate schemes that save energy in an embedded MPSoC architecture. It is assumed, for illustrative purposes, that the architecture has six processors. Figure 1(a) shows the workloads of the processors (i.e., the jobs assigned to them) in a given loop nest. These are assumed to be the loads either estimated by the compiler or calculated through profiling and are for a single nest. Figures 1(b) and (c) show the scenarios with pure voltage/frequency scaling and pure processor shut-down based approaches, respectively. In (b), four out of our six processors take advantage of voltage scaling (note that P5 is not used in the computation at all). In (c), on the other hand, we can place only one processor (P5) into the low-power mode. A combination of these two approaches is depicted in Figure 1(d). Basically, this version combines the benefits of voltage/frequency scaling and processor shut-down. Finally, the result that can be obtained by the ILP approach proposed in this paper is illustrated in Figure 1(e). Note that what our approach essentially does is to cluster the total amount of computational load in as few processors as possible so that the number of unused processors is maximized. In this particular case, the original loads of three processors (P2, P3, and P4) are combined and assigned to processor P2. As a result, processors P3 and P4 can also be placed into the low-power mode (along with their private memory components) to maximize energy savings, in addition to P5. The next subsection gives the technical details of this approach.
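To make the clustering idea of Figure 1(e) concrete, the sketch below packs per-processor workloads onto as few processors as it can without exceeding the largest original workload. It deliberately uses a simple first-fit-decreasing heuristic rather than the ILP formulation of this paper, it ignores voltage levels and cache effects, and the workload numbers are hypothetical (they are not read from Figure 1).

```python
def cluster_first_fit_decreasing(loads, t_max=None):
    """Greedy stand-in for workload clustering (not the paper's ILP).

    loads: dict mapping a processor name to its original workload
           (time units); a workload of 0 means the processor is unused.
    Returns a list of [total_load, [original owners]] per busy processor.
    """
    if t_max is None:
        # The deadline is set by the largest original workload, so the
        # clustered schedule does not lengthen the loop nest.
        t_max = max(loads.values())
    jobs = sorted(((w, name) for name, w in loads.items() if w > 0),
                  reverse=True)
    clusters = []
    for w, name in jobs:
        for c in clusters:
            if c[0] + w <= t_max:      # job fits on an already busy processor
                c[0] += w
                c[1].append(name)
                break
        else:                          # no fit: keep one more processor busy
            clusters.append([w, [name]])
    return clusters


# Hypothetical loads for a six-processor system.
loads = {"P0": 100, "P1": 90, "P2": 40, "P3": 30, "P4": 25, "P5": 0}
clusters = cluster_first_fit_decreasing(loads)
idle = len(loads) - len(clusters)
```

With these made-up loads the jobs of P2, P3, and P4 end up on one processor, so three processors (instead of one) become candidates for shut-down, mirroring the situation described for Figure 1(e).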
When there are opportunities, our approach can also use voltage/frequency scaling for the clustered jobs. It is important to point out that the benefits from our approach can be expected to be even more significant when the number of voltage/frequency levels is small. In such a case, a pure voltage/frequency scaling based approach cannot stretch the execution time of a processor to fill the available slack completely.
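The effect of a small number of voltage/frequency levels on unused slack can be illustrated with a short calculation. The sketch assumes that execution time scales inversely with a normalized speed level, which is a simplification (the paper instead uses profiled latencies per level); the function name and all numbers are hypothetical.

```python
def slowest_feasible_level(full_speed_time, t_max, speed_levels):
    """Pick the slowest normalized speed that still meets the deadline.

    Assumes time = full_speed_time / speed (a simplifying assumption).
    Returns (speed, stretched_time, leftover_slack).
    """
    feasible = [s for s in speed_levels if full_speed_time / s <= t_max]
    if not feasible:
        raise ValueError("job cannot meet the deadline at any level")
    speed = min(feasible)
    stretched = full_speed_time / speed
    return speed, stretched, t_max - stretched


# A job needing 60 time units at full speed, with a deadline of 100.
four_levels = slowest_feasible_level(60, 100, [1.0, 0.75, 0.5, 0.25])
two_levels = slowest_feasible_level(60, 100, [1.0, 0.5])
```

With four levels the job can be stretched to 80 time units, leaving 20 units of slack; with only two levels it must stay at full speed and 40 units of slack remain unexploited, which is the gap that clustering plus shut-down can target.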
However, we first need to clarify two important issues. Someone may ask at this point "why has the application (corresponding to the scenario in Figure 1(a)) not been parallelized in the first place as shown in Figure 1(e)?" There are several reasons for this. First, most current code parallelizers do not consider any energy optimizations. Therefore, there is really little reason for calculating the workloads of individual processors, and thus little opportunity for workload clustering. Second, the conventional parallelizing compilers try to use as many processors as possible for executing a given computation unless there exists a compelling reason to do otherwise (e.g., the excessive synchronization costs). Third, in many cases, trying to cluster computation in very few processors can have an impact on execution cycles. Since most parallelizing compilers do not predict or quantify this impact, they do not attempt such clusterings, being on the conservative side.
The second issue is that the scenario depicted in Figure 1(e) may have poor data locality as compared to the scenarios in Figures 1(b), (c), and (d). This is because conventional code parallelizers generally try to achieve good data locality, by ensuring that each processor mostly uses the same set of data elements as much as possible (i.e., high data reuse). As a result, the scenario in Figure 1(e) can lead to an increase in data cache misses, which in turn increases overall energy consumption. This overhead should also be factored in our clustering approach to ensure a fair comparison.
The main contribution of the ILP approach proposed in this paper is to obtain, for each loop nest in an application, the result shown in Figure 1(e), given the initial scenario (workload assignment) shown in Figure 1(a) and thus reduce energy consumption.

III.2 Technical Details and Problem Formulation
This section elaborates on the ILP model used to represent the problem. In our problem, there exists a set of jobs (workloads) that have to be executed on a set of available processors in the embedded MPSoC such that the total energy spent by the system is minimal and that the execution of the jobs completes within a specified time limit, T_max.² The processors can run different jobs at different voltage and frequency levels, which affects energy consumption. The energy expended by each processor is the sum of the dynamic energy as well as the leakage energy expended while running. The rest of this section describes the ILP model in detail.

III.2.1 System and Job Model
We assume that the jobs are members of the set J consisting of J_max elements and the processors belong to the set P in which there are P_max elements. The processors can run at V_num discrete voltage/frequency levels (as supported by the architecture). It is assumed that only one job can run on a processor at any time and that once a job starts running on a processor, it runs uninterrupted to completion. However, a processor can be assigned to run more than one job, as a result of workload clustering. The duration for which a job occupies a processor depends on the supply voltage and the frequency at which the processor is running that particular job. The time (latency) each job takes up at different voltage levels is specified in the array Job_Length(j, v). Similarly, the dynamic energy spent by each job at different voltage levels varies and is captured by Job_Dynamic(j, v).³ Total_Energy is the sum of the energies spent by all jobs on all processors due to their running as well as the leakage energy consumed by the processors. This is the metric whose value we want to minimize.

III.2.2 Mathematical Programming Model
The constraints specified below give the mathematical representation of our model. We use 0-1 integer linear programming (ILP). This ILP formulation is executed for each loop nest separately. Table 1 gives the notation used in our formulation.

Table 1: Notation used in our model.
Notation             Explanation
Job_Dynamic(j, v)    Dynamic energy for running job (workload) j at voltage v
Job_Length(j, v)     Time taken to run job j at voltage v
X(p, j, v)           Value is 1 if job j runs on processor p at voltage v
J                    Set of jobs
P                    Set of processors
T_max                Time deadline before which all jobs must finish
J_max                Total number of jobs to be executed
P_max                Total number of processors available
V_num                Total number of voltage (and frequency) levels available
Total_Energy         Total energy consumption of the system (to be minimized)
Leakage_Value        Leakage energy spent by a processor if it is not [shut down]

Job Assignment Constraints. The 0-1 variable X(p, j, v) determines whether processor p runs job j at voltage/frequency level v. One job runs completely on one processor and all jobs are scheduled to run only once. This is specified as follows:

$$\forall p \in P,\ \forall j \in J,\ \forall v \in V:\quad X(p,j,v) \in \{0,1\} \qquad (1)$$

$$\forall j \in J:\quad \sum_{p=0}^{P_{max}-1}\ \sum_{v=0}^{V_{num}-1} X(p,j,v) = 1 \qquad (2)$$

Constraint (1) expresses the term X(p, j, v) as a binary variable; a processor either runs the job or it does not. Constraint (2) states that each job can be run only on one processor and that all jobs are assigned to some processors (i.e., no job is left unassigned). Notice that we want to determine the value of X(p, j, v) for all p, j, and v.

Deadline Constraints. Jobs are assigned to processors as long as they can meet the time deadline that is specified. Constraint (3) expresses this:

$$\forall p \in P:\quad \sum_{j=0}^{J_{max}-1}\ \sum_{v=0}^{V_{num}-1} X(p,j,v) \cdot \mathrm{Job\_Length}(j,v) \le T_{max} \qquad (3)$$

Note that T_max is determined, for each loop nest, by the longest (largest) workload.

Clustering and Processor Shut-Down Constraints. Multiple jobs are run on the same processor if the number of jobs, J_max, exceeds the number of processors, P_max, but also if such an arrangement reduces the overall energy spent by the system. In case a processor is not assigned any job, either because of clustering of jobs or because J_max < P_max or because of both these reasons, then it is shut down. Such a processor does not consume any dynamic energy as it has no jobs running on it and it does not consume any leakage energy since it is shut down (except for some small amount of leakage in memory components). Constraint (4) is introduced to capture processor shutdown:

$$\forall p \in P,\ \forall j \in J,\ \forall v \in V:\quad \mathrm{Busy}(p) \ge X(p,j,v) \qquad (4)$$

For a particular processor p, Busy(p) is necessarily 1 if any of the values in X(p, j, v) is 1. Through this constraint, the value of Busy(p) is not explicitly expressed if all values in X(p, j, v) are 0. However, a value of 1 in Busy(p) adds leakage to the overall energy. As the objective of the ILP-based model is to reduce energy, Busy(p) will be assigned to be 0 if all values in X(p, j, v) are 0.⁴

Leakage and Dynamic Energy Calculation. The following expressions capture the leakage energy and dynamic energy spent by the system as the sum of the leakage and dynamic energies, respectively, spent by each processor. The total amount of dynamic energy spent by a processor is the sum of the dynamic energies spent for each job that is run on (assigned to) that processor. This is captured by Expression (5):

$$\mathrm{D\_Energy} = \sum_{p=0}^{P_{max}-1}\ \sum_{j=0}^{J_{max}-1}\ \sum_{v=0}^{V_{num}-1} X(p,j,v) \cdot \mathrm{Job\_Dynamic}(j,v) \qquad (5)$$

Expression (6) calculates the leakage energy spent. As mentioned earlier, if Busy(p) is 1, then leakage is spent by processor p.

$$\mathrm{L\_Energy} = \mathrm{Leakage\_Value} \cdot \sum_{p=0}^{P_{max}-1} \mathrm{Busy}(p) \qquad (6)$$

Objective Function. The objective function, which is the total energy spent by the system, is the sum of the leakage and dynamic energies. This is the objective function that our approach tries to minimize:

$$\mathrm{Total\_Energy} = \mathrm{D\_Energy} + \mathrm{L\_Energy} \qquad (7)$$

The constraints and expressions mentioned in this section are sufficient to express our problem within ILP. We next look at the additional constraints that can be used in order to handle two special cases.

Voltage/Frequency Scaling without Clustering. To model classical voltage/frequency scaling within our ILP formulation, an input value Assign(p, j) should specify the processor on which each job runs. Further, by connecting this value to that of X(p, j, v), all jobs are forced to run on the assigned processors alone. This connection can be captured by the following constraint:

$$\forall p \in P,\ \forall j \in J:\quad \sum_{v=0}^{V_{num}-1} X(p,j,v) = \mathrm{Assign}(p,j) \qquad (8)$$

Clustering without Voltage/Frequency Scaling. To model job clustering without voltage and frequency scaling, we need to constrain the choice of available voltage/frequency levels to either each processor individually or all processors. In the case of constraining the voltage levels of all processors to one value, Constraint (9) can be used to ensure that no jobs are assigned voltage levels other than the one specified.

$$\forall p \in P,\ \forall j \in J,\ \forall v \in V - \{v'\}:\quad X(p,j,v) = 0 \qquad (9)$$

To constrain each individual processor to an independent voltage level, Constraint (10) below can be used.

$$\forall p \in P,\ \forall j \in J,\ \forall v \in V - \{v_p\}:\quad X(p,j,v) = 0 \qquad (10)$$

Here, v' and v_p are the universal and individual (for processor p) voltage levels, respectively. These constraints simply limit the voltage levels to be used. In this case, the decision to cluster jobs together on a processor is made by our solver and depends on whether it results in a lowered overall energy consumption.

²In this paper, we do not assume a specific code (loop nest) parallelization strategy. Rather, we assume that each loop nest is parallelized using one of the known techniques. For each loop nest, T_max is determined by the processor with the largest workload. This is to ensure that our workload clustering does not have a negative impact on execution times.
³Here j represents a job (workload) and v represents a voltage (frequency) level. In our implementation, the entries of Job_Length(j, v) and Job_Dynamic(j, v) are filled using profiling. All energy estimations are performed using Wattch [2] under the 70nm process technology. The increase in data cache misses as a result of clustering is captured during our profiling.
⁴To preserve data in memory components, a shut-down processor consumes some leakage [3]. Our experiments are performed based on this principle. However, in our presentation of the ILP formulation, we assume no leakage consumption in the shut-down state for ease of presentation.

IV. EXPERIMENTAL EVALUATION
We present only energy results in this section. The reason is that none of the techniques evaluated increases original execution cycles (i.e., we do not exceed T_max in any loop nest). Specifically, for each loop nest, the processor with the largest workload sets the limit for voltage/frequency scaling and processor shut-down. The ILP solver used in our experiments is lp_solve [7]. We observed that the ILP solution times with the application codes in our experimental suite varied between 56.7 seconds and 13.2 minutes. Considering the large energy savings, these solution times are within tolerable limits.

All the experimental results are obtained using the SIMICS simulation platform [13]. Specifically, we embedded in the SIMICS platform timing and energy models that help us simulate the behavior of the following four schemes: VS (pure voltage/frequency scaling based approach); SD (pure processor shut-down based approach); VS+SD (a unified approach that combines VS and SD); and CLUSTERING (the ILP-based approach proposed in this paper). The default simulation parameters used in our experiments are listed in Table 2. In the last three schemes, when a processor is unused in the current loop nest, it is shut down and its L1 instruction and data caches are placed into the low-power mode. The specific low-power mode employed in this paper is from [3].

Table 2: The default simulation parameters.
Simulation Parameter                          Value
Processor Speed                               400MHz
Number of Processors                          8
Lowest/Highest Voltage Levels                 0.8V/1.4V
Number of Voltage Levels                      4
8KB Instruction Cache                         2-way associative, 32 byte blocks
8KB Data Cache                                2-way associative, 32 byte blocks
Memory                                        32MB (banked)
Off-Chip Memory Access Latency                100 cycles
Bus Arbitration Delay                         5 cycles
Replacement Policy                            Strict LRU
Cache Dynamic Energy Consumption              0.6 nJ
Memory Dynamic Energy Consumption             1.17 nJ
Leakage Energy Consumption for 32 bytes
  Normal Operation                            4.49 pJ
  Shut-Down State                             0.92 pJ
Resynchronization Time for Shut-Down State    30 msec
Resynchronization Time for Voltage Scaling    5 msec
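The relationship between the CLUSTERING scheme and its restricted variants can be checked on a toy instance. The sketch below enumerates every assignment of jobs to (processor, level) pairs, which satisfies Constraints (1) and (2) by construction, filters by the deadline Constraint (3), sets Busy(p) to 1 exactly for processors that received a job (the smallest values allowed by Constraint (4)), and minimizes Total_Energy from Expression (7). It is an exhaustive search standing in for lp_solve, usable only for tiny inputs; the function name, its parameters, and all numeric values are hypothetical.

```python
from itertools import product


def solve_clustering(job_length, job_dynamic, num_procs, t_max, leakage_value,
                     fixed_assign=None, allowed_levels=None):
    """Brute-force search over the 0-1 model (illustrative, exponential time).

    job_length[j][v], job_dynamic[j][v]: latency and dynamic energy of job j
    at level v. fixed_assign[j] pins job j to a processor (Constraint (8));
    allowed_levels restricts usable levels (Constraint (9)).
    Returns (total_energy, ((p, v) for each job)) or None if infeasible.
    """
    num_levels = len(job_length[0])
    levels = (list(range(num_levels)) if allowed_levels is None
              else sorted(allowed_levels))
    choices = []
    for j in range(len(job_length)):
        procs = range(num_procs) if fixed_assign is None else [fixed_assign[j]]
        choices.append([(p, v) for p in procs for v in levels])

    best = None
    for combo in product(*choices):          # one (p, v) per job: (1), (2)
        load = [0] * num_procs
        d_energy = 0
        for j, (p, v) in enumerate(combo):
            load[p] += job_length[j][v]
            d_energy += job_dynamic[j][v]    # Expression (5)
        if max(load) > t_max:                # Constraint (3)
            continue
        busy = {p for p, _ in combo}         # Busy(p) = 1 only where needed
        total = d_energy + leakage_value * len(busy)   # (6) and (7)
        if best is None or total < best[0]:
            best = (total, combo)
    return best


# Hypothetical instance: 3 jobs, 3 processors, 2 levels (0 = fast, 1 = slow).
JOB_LENGTH = [[10, 20], [3, 6], [4, 8]]
JOB_DYNAMIC = [[10, 5], [4, 2], [5, 3]]
T_MAX, LEAKAGE = 10, 6

clustering = solve_clustering(JOB_LENGTH, JOB_DYNAMIC, 3, T_MAX, LEAKAGE)
vs_only = solve_clustering(JOB_LENGTH, JOB_DYNAMIC, 3, T_MAX, LEAKAGE,
                           fixed_assign=[0, 1, 2])
cluster_only = solve_clustering(JOB_LENGTH, JOB_DYNAMIC, 3, T_MAX, LEAKAGE,
                                allowed_levels={0})
```

On this made-up instance the unrestricted model reaches 29 energy units by clustering the two small jobs on one processor and slowing one of them, compared with 33 for scaling without clustering and 31 for clustering without scaling. The numbers carry no meaning beyond showing how the extra constraints shrink the search space.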
We used 8 array/loop-intensive applications for evaluating the four approaches mentioned above: 3D, DFE, LU, SPLAT, MGRID, WAVE5, SPARSE, and XSEL. 3D is an image-based modeling application that simplifies the task of building 3D models and scenes. DFE is a digital image filtering and enhancement code. LU is an LU decomposition program. SPLAT is a volume rendering application which is used in multi-resolution volume visualization through hierarchical wavelet splitting. MGRID and WAVE5 are C versions of two Spec95FP applications. SPARSE is an image processing code that performs sparse matrix operations, and finally, XSEL is an image rendering code. These C programs are written in such a fashion that they can operate on inputs of different sizes. We first ran these applications through our simulator without any voltage scaling or processor shut-down. This version of an application is referred to as the base version or the base execution in the remainder of this paper. The energy consumptions (which include energies spent in processors, caches, interconnects, and off-chip memory) under the base execution are 272.1mJ, 388.3mJ, 197.9mJ, 208.4mJ, 571.0mJ, 466.2mJ, 292.2mJ, and 401.5mJ for 3D, DFE, LU, SPLAT, MGRID, WAVE5, SPARSE, and XSEL, respectively. The energy results presented in this section are given as normalized values with respect to this base execution.

To calculate the dynamic energy consumptions for caches and memory, we used the Cacti tool [14]. We approximated the leakage energy consumption by assuming that the leakage energy per cycle for 4KB SRAM is equal to the dynamic energy consumed per access to a 32 byte data from the same SRAM. Note that this assumption tries to capture the anticipated importance of leakage energy in the future as leakage becomes the dominant part of energy consumption for 0.10 micron (and below) technologies for the typical internal junction temperatures in a chip. In the shut-down state, a processor and its caches consume only a small percentage of their original (per cycle) leakage energy. However, when a processor and its data and instruction caches in the shut-down state are needed, they need to be reactivated (resynchronized). This resynchronization costs extra execution cycles as well as extra energy consumption as noted in [3], and all these costs are captured in our simulations and included in all our results.
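A rough consistency check of the leakage approximation can be done with the Table 2 values. The sketch assumes that the "Leakage Energy Consumption for 32 bytes" entries are per-cycle figures for one 32-byte block; that reading is an interpretation, not something the paper states explicitly, and the variable names are illustrative.

```python
# Values taken from Table 2; the per-cycle, per-32-byte interpretation
# of the leakage entries is an assumption.
BLOCK_BYTES = 32
LEAK_NORMAL_PJ = 4.49       # leakage per 32 bytes, normal operation
LEAK_SHUTDOWN_PJ = 0.92     # leakage per 32 bytes, shut-down state
CACHE_DYNAMIC_NJ = 0.6      # dynamic energy per cache access

# Leakage per cycle of a 4KB SRAM under this interpretation.
blocks_in_4kb = (4 * 1024) // BLOCK_BYTES
leak_4kb_nj = blocks_in_4kb * LEAK_NORMAL_PJ / 1000.0

# Fraction of the normal leakage that remains in the shut-down state.
shutdown_fraction = LEAK_SHUTDOWN_PJ / LEAK_NORMAL_PJ
```

Under these assumptions the 4KB leakage per cycle comes out at about 0.57 nJ, close to the 0.6 nJ dynamic energy per access, which agrees with the stated approximation; the shut-down state retains roughly a fifth of the normal leakage.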
Our first set of results, the normalized energy consumptions with the different schemes, is presented in Figure 2. Each group of bars in this graph corresponds to an application, and the last group of bars gives the average results across all eight applications. The energy savings achieved by the VS scheme are not very large (6.55% on the average). There are two main reasons for this. The first one is the inherent characteristics of some applications. More specifically, when there are no long idle periods, VS is not applicable. The second reason is the limited number of voltage/frequency levels used in the default configuration (see Table 2). In comparison, the SD scheme behaves in a different manner. While it is not applicable in some cases (e.g., in applications DFE, MGRID, SPARSE, and XSEL), the energy savings it brings are significant in cases where it is applicable. VS+SD simply combines the benefits of the VS and SD schemes, reducing to VS when SD is not applicable. The average energy savings (across all eight applications) achieved by SD and VS+SD are 7.36% and 13.52%, respectively. The highest energy savings, 22.65% on the average, are obtained by our ILP-based approach. These results clearly show the potential benefits of our ILP-based workload clustering approach.

Figure 2: Normalized energy consumptions.
Figure 3: The active and idle periods of processors in the mx3-raw.c routine from MGRID.
To better illustrate where our energy benefits are coming from, we give in Figure 3 the percentage of time each processor spends in the active and idle states for procedure mx3-raw.c, one of the thirteen subprograms in application MGRID. We see from this graph that our ILP-based approach is able to increase the number of idle processors. We observed similar trends with most of the other procedures in our applications. These results explain the energy benefits observed in Figure 2.
V. CONCLUSIONS
This paper proposes a workload clustering scheme for embedded MPSoCs that combines voltage scaling with processor shut-down. The uniqueness of the proposed unified approach is that it maximizes the use of processor shut-down by clustering workloads (jobs) in as few processors as possible. We tested this approach along with three alternate schemes using a simulation-based platform and eight embedded applications. Our experiments show that this clustering approach is very effective in reducing energy consumption and generates better results than the three alternative schemes evaluated.
VI. REFERENCES
[1] L. A. Barroso et al. Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. Proceedings of ISCA'2000.
[2] D. Brooks et al. Wattch: a framework for architectural-level power analysis and optimizations. In Proceedings of ISCA, Canada 2000.
[3] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy Caches: Simple techniques for reducing leakage power. Proceedings of ISCA, 2002.
[4] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithms for Dynamic Speed-Setting of a Low-Power CPU. Proceedings of the 1st ACM International Conference on Mobile Computing and Networking, November 1995.
[5] D. Grunwald, P. Levis, K. Farkas, C. Morrey III, and M. Neufeld. Policies for Dynamic Clock Scheduling. Proceedings of OSDI'2000.
[6] I. Kadayif, M. Kandemir, and U. Sezer. An Integer Linear Programming Based Approach for Parallelizing Applications in On-Chip Multiprocessors. In Proceedings of DAC'2002.
[7] lp_solve. ftp://ftp.es.ele.tue.nl/pub/lp_solve/
[8] MAJC-5200. http://www.sun.com/microelectronics/MAJC/5200wp.html
[9] MP98: A Mobile Processor. http://www.labs.nec.co.jp/MP98/top-e.htm.
[10] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang. The Case for a Single Chip Multiprocessor. Proceedings of ASPLOS'1996.
[11] H. Saputra et al. Energy-Conscious Compilation Based on Voltage Scaling. Proceedings of ACM SIGPLAN Joint Conference LCTES'02 and SCOPES'02, Berlin, Germany, June, 2002.
[12] Y. Shin, K. Choi, and T. Sakurai. Power Optimization of Real-Time Embedded Systems on Variable Speed Processors. Proceedings of the International Conference on Computer-Aided Design, November 2000.
[13] SIMICS. http://www.virtutech.com/simics/simics.html.
[14] S. Wilton and N. Jouppi. Cacti: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, May 1996.
[15] Q. Wu, P. Juang, M. Martonosi, and D. W. Clark. Formal on-line methods for voltage/frequency control in multiple clock domain microprocessors. Proceedings of ASPLOS'2004.