journal homepage: www.intl.elsevierhealth.com/journals/cmpb
A prescription fraud detection model
Karca Duru Aral a, Halil Altay Güvenir b,∗, İhsan Sabuncuoğlu c, Ahmet Ruchan Akar d,e
a INSEAD, Technology & Operations Management Area, Fontainebleau, France
b Department of Computer Engineering, Bilkent University, Ankara, Turkey
c Department of Industrial Engineering, Bilkent University, Ankara, Turkey
d Department of Cardiovascular Surgery, Ankara University School of Medicine, Ankara, Turkey
e Ankara University Stem Cell Institute, Ankara, Turkey
Article info
Article history: Received 23 November 2010; Received in revised form 12 September 2011; Accepted 13 September 2011
Keywords: Healthcare fraud; Prescription fraud; Data mining; Outlier detection
Abstract
Prescription fraud is a major problem that causes substantial monetary loss in health care systems. We aimed to develop a model for detecting cases of prescription fraud and to test it on real-world data from a large multi-center medical prescription database. Conventionally, prescription fraud detection is conducted on random samples by human experts. However, the samples might be misleading, and manual detection is costly. We propose a novel distance-based data-mining approach for assessing the fraud risk of prescriptions with regard to cross-features. Final tests have been conducted on an adult cardiac surgery database. The experimental results reveal that the proposed model works considerably well, with a true positive rate of 77.4% and a false positive rate of 6% for fraudulent medical prescriptions. The proposed model has potential advantages including on-line risk prediction for prescription fraud, off-line analysis of high-risk prescriptions by human experts, and a self-learning ability through regular updates of the integrative data sets. We conclude that incorporating such a system in health authorities, social security agencies and insurance companies would improve the efficiency of internal review to ensure compliance with the law, and radically decrease human-expert auditing costs.
© 2011 Elsevier Ireland Ltd. All rights reserved.
1. Introduction
Fraud is defined as the abuse of a profit organization's system without necessarily leading to direct legal consequences. Levi and Burrows define fraud as a mechanism through which the fraudster gains an unlawful advantage or causes unlawful loss [1]. Fraud constitutes a critical problem in many areas such as health care [2], banking [3], insurance [4], and telecommunications [5]. Prescription fraud is defined as the illegal acquisition of prescription drugs for personal use or profit, and it can be observed in numerous ways. Any effort aiming to identify the fraudulent transactions in such domains is called a fraud detection process. Recent data suggest that traditional manual detection conducted by human experts is quite costly as a result of high expert wages and the large size of the databases. Other main drawbacks of manual detection are that individual human experts cannot recognize newly emerged fraud patterns spread out in the database, and cannot detect fraudulent behavior at the moment it is attempted. Thus, customized data mining algorithms should analyze the enormous databases of these large businesses, and then human experts can further inspect the identified risky transactions.

∗ Corresponding author. Tel.: +90 312 290 1252; fax: +90 312 266 4047. E-mail address: guvenir@cs.bilkent.edu.tr (H.A. Güvenir).
Having seen a yearly exponential increase in spending, abuse of health care systems is becoming more critical in Turkey, as in many other countries [6]. In the USA, according to the General Accounting Office, annual health care expenditures approached two trillion dollars, or 15.3% of the Gross Domestic Product, by 2007 [7]. The National Health Care Anti-Fraud Association (NHCAA) estimated that 3% of all health care spending, which adds up to $68 billion, is lost to health care fraud in the United States. Other estimates put this loss at around 10%, or $170 billion [8]. Examples of fraud in a health care system include billing for services and goods that are not rendered, performing medically unnecessary operations, and prescribing unnecessary medicines.

0169-2607/$ – see front matter © 2011 Elsevier Ireland Ltd. All rights reserved. doi:10.1016/j.cmpb.2011.09.003

Table 1 – Health care spending in Turkey by years.
Billion TL                                2002   2007   2008
Total social insurance spending            7.6   20     24
Total medicament spending                  4.3    8.6   10.5
Total hospital spending                    2.8   10.3   13
State hospital payments by social SSA      1.8    6.4    7.5
SSA: Social Security Agency in Turkey, known as SGK (Sosyal Güvenlik Kurumu).
The experts from the Social Security Agency (SSA, known as SGK) in Turkey commonly detect prescription fraud in their audits. Currently, while auditing hospitals, an SSA officer examines a small sample of the hospital's prescriptions, and the SSA then charges the hospital a proportional amount. This method is costly to conduct and does not guarantee any efficiency coefficient. It is worth noting, however, that undetected fraud continues to be an enormous burden on the Turkish health care system. According to the Turkish Health Care Syndicate 2008 Health Care Report, fraud in health care has boomed in Turkey recently [6]. With the yearly exponential increase in spending shown in Table 1, abuse of health care systems is becoming more and more critical. In 2008, health care fraud was committed principally in Van, Eskişehir, Erzurum, Siirt, Adana, Bursa, Zonguldak, Diyarbakır, and many other cities, and even in the Head Center of the Tuberculosis Fighting Department. These fraudulent acts took the form of fake medicament reports, fake invoices, and billing the SSA for examinations and treatments that were not rendered. The total cost of these fraudulent acts amounted to millions of TL, and about 300 people were recently arrested on fraud charges. Indeed, Turkish health care laws provide significant legal sanctions for fraud and abuse control (Turkish Penal Law-26.09.2004, No: 5237/204). In contrast, the perception in Turkish society that prescription fraud is a victimless crime makes it even more widespread and strengthens the fraudulent chain between pharmaceutical companies, physicians, pharmacies, and patients. Since nearly half of the spending of the SSA is on medical drug payments, which summed up to 10.5 billion TL in 2008 [6], the cost of fraudulent prescriptions to the SSA is not tolerable. This type of fraud comprises excessive medicine prescription and mismatch between patients' features and the prescribed medicines. The conventional manual detection is conducted by a committee of assigned medical doctors in the SSA. When inspecting a hospital, a human expert goes through a relatively small sample of the prescriptions associated with the hospital. If there are fraudulent and abusive claims in the sample, the agency charges the hospital an amount obtained by multiplying the percentage of fraudulent claims detected in the sample by the total cost of the prescriptions issued by the hospital in that inspection period. This method is costly to conduct and does not guarantee any efficiency coefficient for the outcome.
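The proportional charge used in the manual audit amounts to a single multiplication. The following is a minimal sketch in Python; the function name and the sample figures are hypothetical and are not taken from the SSA procedure or from the database.

```python
def sample_based_charge(sample_is_fraud, total_cost_tl):
    """Charge = share of fraudulent claims in the audited sample
    times the total cost of all prescriptions in the inspection period."""
    if not sample_is_fraud:
        raise ValueError("the audited sample must not be empty")
    fraud_share = sum(1 for flag in sample_is_fraud if flag) / len(sample_is_fraud)
    return fraud_share * total_cost_tl


# Hypothetical audit: 3 fraudulent claims in a sample of 50 prescriptions,
# hospital total of 200,000 TL for the inspection period.
example_charge = sample_based_charge([True] * 3 + [False] * 47, 200_000.0)
```

In this made-up case a single additional fraudulent claim in the sample would raise the charge from about 12,000 TL to about 16,000 TL, which illustrates how strongly the outcome depends on which prescriptions happen to be sampled.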
In order to provide an automated, user-friendly system that overcomes the above-mentioned handicaps, in this paper we propose a prescription fraud detection tool that highlights the prescriptions whose fraud risk exceeds a threshold set by the user. Risk measurements are calculated for cross-features in a knowledge-based setting, by comparison with common practice using certain distance metrics. The system incorporates an efficient on-line structure that can be integrated with the electronic on-line prescription provision systems already in use in health care institutions. Although the methodology was originally intended for prescription fraud detection, the supervision of any other medical claim (blood tests, X-rays, MRI scans, biopsies, etc.) constitutes a promising area of future application. The underlying assumption for building such a system is that fraudulent behaviors related to a cross-feature are outliers when considering the total data set.
The rest of the paper is organized as follows. Section 2 provides a comprehensive literature review on fraud detection studies. This survey indicates that there are three main types of fraud detection techniques proposed for health care: supervised, unsupervised, and hybrid systems. Since we work on a data set without any prior knowledge of whether a prescription is fraudulent or not, the proposed system is considered unsupervised. Section 3 discusses the data structure, the proposed methodology, and the related risk assessment formulations. Section 4 presents the results of computational experiments for both the off-line and on-line applications using real data. The empirical validation of the proposed system and its performance compared to a human expert are also given in this section. Finally, we give concluding remarks and further research directions in Section 5.
2. Related work
There are various resources relating to fraud detection. Fraud detection being a relatively large field, most of the studies consider outlier detection as a primary tool [9]. The investigators mainly incorporate artificial intelligence, data mining, expert systems, fuzzy logic, statistics and visualization. Nonetheless, studies on health care insurance fraud detection are limited. We can group the existing fraud detection methodologies as supervised, unsupervised, or hybrids of the two.
2.1. Supervised approaches
Supervised algorithms are trained on a previously labeled training set of fraudulent and legitimate transactions. The algorithms then apply mathematical methodologies to assign scores of similarity with the fraudulent profiles. The most popular supervised algorithms are neural networks. In this context, Kim et al. propose a neural network model for telecommunication subscription fraud [10]. In another study, Barse et al. introduce a multi-layer neural network to handle a synthetic Video-on-Demand database [11]. For the credit card fraud detection problem, Syeda et al. develop a fuzzy neural network model that works on parallel machines [12]. A feed-forward radial basis function neural network with three layers is introduced by Ghosh and Reilly [13]. This neural network is trained in two phases to assign risk scores to new credit card transactions periodically.
Maes et al. compare neural networks and Bayesian networks. The backpropagation algorithm is used to train the neural networks [14]. The results indicate that even though Bayesian networks are more accurate and require a short training time, they are slower when applied to new instances. Another Bayesian network is developed by Ezawa and Norton, which has four stages and two parameters [15]. The authors assert that regression, nearest neighbor, and neural network methods are all too slow for their data.
Other methods in the literature are decision trees, rule induction, and case-based reasoning. Metan et al. introduce a real-time dispatching rule selection system that extracts knowledge from the data stream coming from the manufacturer [16]. The incorporated decision tree is dynamically updated in response to changes in the manufacturer's conditions. The system enables flexible and higher-quality decisions; it is tested on simulation runs, which reveal that the proposed model outperforms the existing algorithms in the literature.
As for statistical modeling, Foster and Stine employ least squares regression and stepwise selection of predictors [17]. They assert that traditional statistical methods are effective for fraud detection. Belhadji et al. propose the cooperation of human experts in choosing the best indicators (attributes) for fraud detection [18]. The conditional probabilities of fraud for each indicator are then calculated accordingly. Afterwards, Probit regressions are used to identify the most important indicators. The thresholds are flexible and can be adjusted to the company's fraud policy. Some other techniques in the literature incorporate expert systems, association rules, and genetic algorithms. Pejic-Bach gives an overview of profiling intelligent systems applications in fraud detection and prevention [19].
2.2. Unsupervised approaches
In the area of telecommunications fraud detection, Cortes et al. study the temporal evolution of large dynamic graphs [20]. The graphs are built up from sub-graphs named Communities of Interest (COI). An exponentially weighted average method is used to update the sub-graphs daily. COIs are built from the mobile phone accounts using call quantity and durations. The study yields the characteristics of telecommunication fraudsters. In the medical insurance domain, Yamanishi et al. present the unsupervised SmartSifter [21]. This algorithm works with categorical and continuous variables. SmartSifter investigates statistical outliers using the Hellinger distance. On automobile insurance data, Brockett et al. employ Principal Component Analysis of RIDIT scores on rank-ordered categorical attributes [22].
2.3. Hybrid approaches
Two sub-categories are identified in the literature: supervised hybrids and unsupervised hybrids.
2.3.1. Supervised hybrids
In this category, supervised neural networks, Bayesian networks, and decision trees are the methodologies most often used to create hybrids. Chan et al. combine naive Bayes, C4.5, CART, and RIPPER classifiers [23]. The results give better efficiency on credit card transactions. Kim and Kim develop a decision tree algorithm to classify the data in hand [24]. They use a weighting function to compute fraud density, and then a backpropagation neural network is used to generate a weighted risk score on credit card transactions. He et al. classify a general practitioner data set with the k-nearest neighbor algorithm [25]. The optimal weights of the attributes are computed by genetic algorithms.
2.3.2. Unsupervised hybrids
Cortes and Pregibon propose the use of daily updated telecommunication account summaries (signatures) [20]. The signatures labeled as fraudulent are then inserted into the training set. This training set is used for training supervised algorithms such as tree, slipper, and model-averaged regression. The algorithm allows the authors to draw conclusions on the nature of the fraudulent calls. Moreover, Cortes et al. propose a graph-theoretic method [26]. This method is used to visually detect fraudulent international calls. Cahill et al. compute a risk score for each call based on its similarity to fraudulent profiles and its dissimilarity to the account's signature [27]. The signatures are updated with low-score calls. In this updating process, recent calls are given higher weight than older calls. The study by Moreau et al. indicates that supervised neural network and rule induction algorithms perform better than two types of unsupervised neural networks in identifying the shifts between short-term and long-term account behavior profiles [28]. The investigators used the area under the receiver operating characteristic curve (AUC) as the performance measure. There are also studies in which unsupervised approaches are used to classify the insurance data into clusters, to which supervised approaches are then applied. A three-step procedure is proposed by Williams and Huang in which k-means is employed for cluster detection, C4.5 is used for decision tree rule induction, and domain knowledge, statistical summaries and visualization tools are utilized for rule evaluation [29]. Williams employs a genetic algorithm in the second step to generate rules, which enables the user to explore the rules [30].
Brause et al. present RBF neural networks for screening the outputs of association rules for credit card transactions [31]. Ormerod et al. present a Mass Detection Tool (MDT) for detection of medical insurance fraud [32]. Ethnography is the core element of the proposal for capturing expertise to design the methodology. The MDT uses a dynamic Bayesian Belief Network of fraud indicators. Ortega et al. describe another medical claim fraud/abuse detection system based on data mining, used by a Chilean private health insurance company [33]. The proposed detection system employs multi-layer perceptron (MLP) neural networks. Huang et al. apply a filter-based feature selection method, using an inconsistency rate measure and discretization, to a medical claims database to predict the adequacy of duration of antidepressant medication utilization [34].

Table 2 – Attributes in the database.
Feature                                  Type          Number of values   Explanation
Commercial name of the prescribed drug   Categorical   2659               2659 medicines of different commercial names seen in the database.
Market price of the prescribed drug      Continuous    2659               Prices of each medicine in the Turkish market in 2007, fixed by the Health Ministry.
Prescription I.D. number                 Categorical   26,419             Identifying numbers for the 26,419 prescriptions in the database.
Age                                      Continuous    85                 All ages between 0 and 85
Sex                                      Categorical   2                  Female, male
Diagnosis                                Categorical   332                332 different diagnoses seen in the database
This study differs from the existing ones in health care fraud detection in that the domain knowledge learned can be used as: (a) an on-line system to check whether a given prescription carries risks of fraud and, if so, in what respects; (b) an off-line system to process a set of prescriptions and filter out those with a risk greater than a threshold for further checking by human experts; and (c) a self-learning system, through regular updates of the integrative data sets. The next section introduces the proposed methodology.
3. Proposed approach
In general, fraud detection research focuses on nonlinear, black-box supervised algorithms; nonetheless, we assert that less complex, reliable and faster algorithms are needed. Given that the instances (prescriptions) in our database are not labeled as fraudulent or legitimate, we adopt an unsupervised approach.
For auditing medical transactions, we need two tools: one for batch screening/auditing, which is an off-line system, and one for on-line/on-time transaction control. This requires building two systems that work interactively. Clearly, the on-line system should incorporate strategies that avoid re-processing the whole batch of prescriptions for every new transaction. The data structure and size are further design considerations. We fulfill these requirements under the assumption that the fraudulent cases are outliers in the database.
3.1. Data structure
The database in hand is already anonymized and allows us to consider the following features in prescription fraud detection: commercial name of the prescribed drug, market price of the prescribed drug, prescription number, age, sex, and the diagnosis for which the drug is prescribed. The characteristics of these features are given in Table 2. Examining the nature of the data in hand, we also see that the following features are correlated: medicine and diagnosis; medicine and age; medicine and sex; diagnosis and the total cost of drugs prescribed for this diagnosis; and medicine and medicine interactions in a prescription.
Since there is no correlation between features like age and sex, we ignore these cross-features. On the other hand, considering the interactions between diagnosis and age as well as diagnosis and sex, we can reason that we do not need to include these cross-features, since any such diagnosis should entail specific medicines in the prescription. These specific medicines should reveal any mismatch between the diagnosis and age or sex. These arguments transform our domain of 6 dimensions into sub-domains of 2 dimensions, which are given by the interactions discussed above.
3.2. Methodology
As argued above, our 6-dimensional domain is transformed into 2-dimensional sub-domains given by the above-mentioned interactions. Therefore, our problem is refined to deal with five two-dimensional spaces. Working with incidence and risk matrices, which are defined subsequently, and with two parts of consideration, on-line and off-line processing, the flowchart of our methodology is as shown in Fig. 1.
3.3. Off-line processing
We developed a Matlab 2008a m-file for the off-line batch processing of the database. This code processes the database to create the incidence matrices for all the domains.
• Medicine and age domain incidence matrix: MA.
• Medicine and sex domain incidence matrix: MS.
• Medicine and diagnosis domain incidence matrix: MD.
• Medicine and medicine domain incidence matrix: MM.
• Diagnosis and cost domain incidence matrix: DC.
An incidence matrix entry (i, j) corresponds to the number of times the ith and jth traits of the corresponding features are seen together in the database. As for the DC matrix, the row labels are diagnoses and the column labels are indices from 1 to 204. These indices represent 5 TL (Turkish currency) intervals, but the last interval is for the diagnosis costs that are above 2500 TL. For every diagnosis within a prescription, the total cost of the corresponding medicines is calculated, and the number of times the total cost of diagnosis i falls into cost interval j is the incidence matrix entry DC(i, j).
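The construction of the incidence matrices can be illustrated with a short sketch. It is written in Python rather than in the Matlab code used in the study, stores each matrix as a sparse dictionary of counts, and uses the rows of prescription 1592467 from Table 4 as sample data. All function and field names are hypothetical, and the binning of costs (interval width, number of intervals, and the handling of interval boundaries) is an assumption exposed as parameters, not the exact rule of the original code.

```python
from collections import defaultdict


def build_incidence(records, row_key, col_key):
    """Count how often each (row trait, column trait) pair occurs together."""
    counts = defaultdict(int)
    for rec in records:
        counts[(rec[row_key], rec[col_key])] += 1
    return dict(counts)


def cost_interval(total_cost_tl, width_tl=5.0, n_intervals=204):
    """1-based index of the cost interval; the last interval is open-ended.
    Assumption: a cost exactly on a boundary goes to the upper interval."""
    return min(int(total_cost_tl // width_tl) + 1, n_intervals)


def build_diagnosis_cost(records, width_tl=5.0, n_intervals=204):
    """DC incidence: per prescription and diagnosis, sum the medicine prices
    and count the cost interval that the total falls into."""
    totals = defaultdict(float)
    for rec in records:
        totals[(rec["prescription"], rec["diagnosis"])] += rec["price"]
    dc = defaultdict(int)
    for (_, diagnosis), total in totals.items():
        dc[(diagnosis, cost_interval(total, width_tl, n_intervals))] += 1
    return dict(dc)


# Rows of prescription 1592467 as listed in Table 4.
rows = [
    {"prescription": 1592467, "drug": "Iliadin", "age": 57, "sex": "M",
     "diagnosis": "Glaukoma", "price": 4.59},
    {"prescription": 1592467, "drug": "Cosopt", "age": 57, "sex": "M",
     "diagnosis": "Glaukoma", "price": 30.80},
    {"prescription": 1592467, "drug": "Cosopt", "age": 57, "sex": "M",
     "diagnosis": "Glaukoma", "price": 30.80},
    {"prescription": 1592467, "drug": "Coraspin", "age": 57, "sex": "M",
     "diagnosis": "Glaukoma", "price": 2.40},
]

md = build_incidence(rows, "drug", "diagnosis")   # Medicine-Diagnosis counts
ma = build_incidence(rows, "drug", "age")         # Medicine-Age counts
dc = build_diagnosis_cost(rows)                   # Diagnosis-Cost counts
```

With these four rows the Cosopt and Glaukoma pair is counted twice, and the prescription total of 68.59 TL falls into the 14th interval under the assumed binning.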
Having all the incidence matrices in hand, the code creates the risk matrices below:
• Medicine and age risks: MAR.
• Medicine and sex risks: MSR.
• Medicine and diagnosis risks: MDR.
• Medicine and medicine risks: MMR.
• Diagnosis and cost couple's risks: DCR.
These matrices are built by calculating the risks for the corresponding entries of the corresponding incidence matrices. For example, to calculate MSR(i, j), we apply the corresponding risk metric to MS(i, j). We need to keep the incidence matrices for on-line processing, so the risk computations do not overwrite the incidence matrices.
Having all the risk matrices in hand, the code goes through all the risks that are greater than the thresholds given by the user. The user can set any threshold for any of the risk matrices, keeping in mind that more prescriptions are classified as risky when the threshold is kept small. That is, there is a tradeoff between the true positive rate and the human expert screening time. The user should predefine the level of tradeoff that is acceptable.
Given the thresholds, the code outputs the fraudulent prescriptions, indicating which types of fraud are seen within each prescription. That way, the human expert has the chance to review the marked prescriptions, which saves time and money in auditing large databases, and also acquires a list of possible fraudulent transaction styles for the given database.
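The thresholding step can be sketched as a simple filter over per-prescription risk findings. The sketch below is in Python and is not the original Matlab code; the data layout and names are hypothetical. The threshold values are the refined ones reported in Section 4, the three risks of prescription 1592467 are those reported there, and the remaining risk values are made up for illustration.

```python
# Refined thresholds per domain, as reported in Section 4.
THRESHOLDS = {"MD": 0.85, "MA": 0.90, "MS": 0.96, "MM": 0.95, "DC": 0.85}


def flag_prescriptions(prescription_risks, thresholds):
    """Keep, for every prescription, only the findings whose risk is
    greater than the threshold of their domain; drop clean prescriptions."""
    report = {}
    for prescription_id, findings in prescription_risks.items():
        hits = [(domain, what, risk) for domain, what, risk in findings
                if risk > thresholds[domain]]
        if hits:
            report[prescription_id] = hits
    return report


risks = {
    "1592467": [
        ("MD", "Iliadin / Glaukoma", 0.96),
        ("MD", "Coraspin / Glaukoma", 0.92),
        ("MD", "Cosopt / Glaukoma", 0.10),      # hypothetical value
        ("DC", "Glaukoma / 70 TL", 0.87),
    ],
    "0000001": [                                # hypothetical prescription
        ("MD", "Cosopt / Glaukoma", 0.10),
        ("MS", "Cosopt / M", 0.20),
    ],
}

report = flag_prescriptions(risks, THRESHOLDS)
```

Lowering a threshold enlarges the report, which is the tradeoff between true positive rate and expert screening time described above.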
3.4. On-line processing
The on-line prescription fraud detection tool is an interactive tool coded in Matlab that has a graphical user interface. Considering the nature of the health care sector, where on-line processing of incoming invoices is the common practice, we assert that this kind of on-line tool is fundamental for instant, real-time auditing.
This interface is designed to enable the user to insert new prescriptions into the database and to audit a new prescription without the need to re-run the off-line code. Thus, new prescriptions can be audited once the off-line code has been run on the available prescription database. Please note that since the database we used is in Turkish, all the generated listings in the on-line user interface are in Turkish. Fig. 2 shows a screenshot of the graphical user interface of the auditing tool. As seen in Fig. 2, the user first needs to input the prescription number as well as the age and sex of the patient. Then, in the box below, the user enters the prescribed drug and the corresponding diagnosis with the add button. The drug and diagnosis list boxes are populated with the Turkish drug names and diagnosis lists of the database, which are outputs of the off-line fraud detection code. The user can check whether the input is correct with the view prescription button. If the prescription input is correctly specified, the user might choose to add the prescription directly to the database. This is achieved by fetching the corresponding rows of the incidence and risk matrices and updating them with the specifications of the incoming prescription entered through the on-line code. Alternatively, the user might want to audit the prescription directly. In that case, the input is not used to update the incidence and risk matrices permanently. This is preferable because, if the incoming prescription is fraudulent, updating the incidence and risk matrices with this input would slightly affect the performance of the code. This is because increasing the number of outliers in a database would eventually turn the outliers into common transactions, which would hinder the tool from detecting those fraudulent transactions. As a result, the user should add the incoming prescription to the database only if it is not fraudulent, perhaps after the auditing process. On pushing the audit button, the user instantly receives a message indicating each level of fraud risk regarding the prescription. Lastly, the new prescription button enables the user to enter a new prescription right after auditing another one.
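The difference between auditing a prescription and adding it to the database can be sketched as two separate operations on the stored incidence rows. The sketch is in Python, not the original Matlab tool; the class and method names, the counts, and the diagnosis name "Rhinitis" are hypothetical. Risks are computed with the categorical metric of Eq. (1) in Section 3.5.1, and treating a medicine that has never been seen as maximally risky is an assumption.

```python
import math


class OnlineAuditor:
    """Keeps one incidence row per medicine: {medicine: {trait: count}}."""

    def __init__(self, incidence):
        # Copy the rows so the caller's off-line matrices stay untouched.
        self.incidence = {med: dict(row) for med, row in incidence.items()}

    def risk(self, medicine, trait):
        row = self.incidence.get(medicine, {})
        row_max = max(row.values(), default=0)
        if row_max == 0:
            return 1.0  # assumption: unseen medicine gets the highest risk
        ratio = row.get(trait, 0) / row_max
        return (math.exp(-ratio) - math.exp(-1)) / (1 - math.exp(-1))

    def audit(self, pairs):
        """Read-only: score (medicine, trait) pairs without storing them."""
        return [(med, trait, self.risk(med, trait)) for med, trait in pairs]

    def add(self, pairs):
        """Update only the affected rows; no batch re-processing needed."""
        for med, trait in pairs:
            row = self.incidence.setdefault(med, {})
            row[trait] = row.get(trait, 0) + 1


# Hypothetical counts for two medicines.
auditor = OnlineAuditor({"Cosopt": {"Glaukoma": 40}, "Iliadin": {"Rhinitis": 25}})
audited = auditor.audit([("Iliadin", "Glaukoma"), ("Cosopt", "Glaukoma")])
```

Keeping `audit` free of side effects mirrors the recommendation above: a suspicious prescription is scored first and is added only once it is judged legitimate, so that outliers do not gradually become common entries.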
3.5. Risk assessment
We introduce the risk assessment formulas, which consist of calculating risks given the incidence matrices. As stated previously, incidence matrices hold the information regarding the number of times an instance shows up in the data set.

3.5.1. Risk metric for categorical features
Sex, diagnosis, and prescription medicines are the un-ordered categorical features in the data set. The incidence matrix entry (i, j) is the number of times the medicine i is issued to the corresponding un-ordered categorical entry j. The Medicine–Sex (MS), Medicine–Diagnosis (MD), and Medicine–Medicine (MM) incidence matrices are the categorical matrices.
Let us denote the maximum incidence entry of the ith medicine of an incidence matrix MF by Max_MF(i), where F represents the feature domain. Max_MF(i) is the number of times the medicine i is issued to the trait to which it is issued most often.
At this point we introduce a risk estimation function, hereafter denoted risk_MF(i, j), that represents the likelihood of fraud when the ith medicine is prescribed for the jth trait. We require this function to return a real value between 0 and 1. Here, the risk value 1 represents the highest possible risk of fraud, whereas the value 0 represents the lowest possible risk. The highest risk value is obtained when MF(i, j) has the lowest value, that is, the rarest case. Further, we want the risk function to drop exponentially as MF(i, j) increases, and to reach the value 0 when MF(i, j) is equal to Max_MF(i), the most common case. Having tried many risk functions that satisfy these criteria, we found that the risk function in Eq. (1) was the most successful one.
\[ \mathrm{risk}_{MF}(i,j) = \frac{e^{-\left(MF(i,j)/\mathrm{Max}_{MF}(i)\right)} - e^{-1}}{1 - e^{-1}} \qquad (1) \]
Then, the risk matrix of the Medicine and a feature domain F can be defined as: MFR(i, j) = risk_MF(i, j).
[Fig. 1 flowchart boxes. OFF-LINE SYSTEM: Historical Prescription Database; Pre-processing; Generate Incidence Matrices; Incidence Matrices; Compute the Prescription Risks; Prescription Risks; Generate Report for the Given Thresholds; Fraudulent Prescriptions Report. ON-LINE SYSTEM: New Prescription; Insert the Prescription to P.A. Tool; Compute the Prescription Risks; Legal (Yes/No); Allow the transaction; Add to database; Update Incidence Matrices; Alarm for Investigation.]
Fig. 1 – A schematic view of the flowchart model of the proposed system. P.A:

The risk function in Eq. (1) employs an exponential function in order to achieve a steep trend, since we preferred high values of fraud risk only for very small values of MF(i, j)/Max_MF(i). That is, the sensitivity of the risk function to detect fraud should increase as the ratio MF(i, j)/Max_MF(i) becomes smaller, since the magnitude of the derivative of e^{-x} increases as x gets smaller. We then normalize the value of e^{-(MF(i,j)/Max_MF(i))} by subtracting e^{-1} and dividing by 1 − e^{-1} in order to get risk values between 0 and 1 for a straightforward interpretation of the risk levels. Note that here e^{-1} and 1 − e^{-1} are constant values.
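Eq. (1) can be evaluated directly from one row of a categorical incidence matrix. The following Python sketch is an illustration, not the original Matlab code; the function names and the counts in the example row are hypothetical.

```python
import math

E_INV = math.exp(-1.0)  # the constant e^{-1} used for normalization


def categorical_risk(count, row_max):
    """Eq. (1): normalized exponential risk of one incidence entry.
    count   -- MF(i, j), times medicine i was issued to trait j
    row_max -- Max_MF(i), largest entry in the row of medicine i"""
    if row_max <= 0:
        raise ValueError("the medicine must occur at least once")
    ratio = count / row_max
    return (math.exp(-ratio) - E_INV) / (1.0 - E_INV)


def risk_row(incidence_row):
    """Apply Eq. (1) to every trait of one medicine: {trait: count} -> {trait: risk}."""
    row_max = max(incidence_row.values())
    return {trait: categorical_risk(count, row_max)
            for trait, count in incidence_row.items()}


# Hypothetical Medicine-Sex row: a medicine issued 950 times to women, 12 to men.
ms_risks = risk_row({"F": 950, "M": 12})
```

The most common trait always receives risk 0 and a trait that was never seen receives risk 1. A threshold such as 0.96 is exceeded only when the count is below roughly 2.6% of the row maximum, which shows how sharply the metric concentrates high risk on rare combinations.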
3.5.2. Risk metric for ordered features
Ordered features are features over which we can make a magnitude comparison. They are often called continuous features. Here, we define the refined formulations for the Age and Cost ordered features of our database. Consider the Medicine–Age incidence matrix, denoted by MA. Let Max(i) and Min(i) denote the maximum and minimum of the ages that the medicine i is prescribed to, respectively. In other words, Max(i) = {j : Max_MA(i) = MA(i, j)} and Min(i) = {j : Min_MA(i) = MA(i, j)}. Then the age range of medicine i is r_i = Max(i) − Min(i). The modified risk metric is:
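\[ \mathrm{risk}_{MA}(i,j) = \frac{e^{-\left(MA(i,j)/\mathrm{Max}_{MA}(i)\right) \times \left(1 - d_i(j)/r_i\right)} - e^{-1}}{1 - e^{-1}} \qquad (2) \]
where
\[ V_i = \frac{\sum_k k \times MA(i,k)}{\sum_k MA(i,k)} \]
(centroid age for the ith medicine), and
\[ d_i(j) = |j - V_i| \]
(distance of the jth age to the centroid age of the ith medicine).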
Then, the risk matrix of the Medicine and Age domain is defined as MAR(i, j) = risk_MA(i, j). For the Diagnosis–Cost domain, the formulation is analogous, except that we define the entry DC(i, j) as the number of times the diagnosis i is prescribed medicines of total cost falling into the interval j.
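Eq. (2) can be sketched for a single medicine as follows. The Python code is an illustration, not the original Matlab implementation. It assumes that the range r_i is the difference between the largest and smallest ages at which the medicine was actually prescribed (the prose definition above), and the handling of a medicine seen at a single age, where r_i = 0, is also an assumption. The function name and the example row are hypothetical.

```python
import math

E_INV = math.exp(-1.0)


def ordered_risk(age_row, age):
    """Eq. (2) for one medicine.
    age_row -- {age: count}, the row MA(i, .) of the medicine
    age     -- the age j being audited"""
    seen = [k for k, count in age_row.items() if count > 0]
    if not seen:
        raise ValueError("the medicine must occur at least once")
    row_max = max(age_row.values())                        # Max_MA(i)
    total = sum(age_row.values())
    centroid = sum(k * c for k, c in age_row.items()) / total   # V_i
    age_range = max(seen) - min(seen)                      # r_i (assumed definition)
    distance = abs(age - centroid)                         # d_i(j)
    if age_range > 0:
        closeness = 1.0 - distance / age_range
    else:
        # Assumption: with a single observed age, only that age counts as close.
        closeness = 1.0 if distance == 0 else 0.0
    exponent = (age_row.get(age, 0) / row_max) * closeness
    return (math.exp(-exponent) - E_INV) / (1.0 - E_INV)


# Hypothetical pediatric medicine prescribed at ages 1 to 3.
pediatric_row = {1: 10, 2: 30, 3: 10}
risk_at_centroid = ordered_risk(pediatric_row, 2)
risk_at_edge = ordered_risk(pediatric_row, 1)
risk_for_adult = ordered_risk(pediatric_row, 57)
```

Compared with Eq. (1), an age receives a low risk only when it is both frequent and close to the centroid age, so an age at the edge of the observed range is scored as riskier than its raw frequency alone would suggest.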
4. Computational results
We developed the code of the proposed framework in the Matlab 2008a release. In this system, the user can set any threshold for any of the risk matrices, keeping in mind that there is a tradeoff between the true positive rate and the human expert screening time. Given the thresholds, the code outputs the fraudulent prescriptions, indicating which types of fraud are seen within each prescription. That way, the human expert has the chance to review the reported prescriptions, which saves time and money in auditing large databases. The on-line prescription fraud detection tool is an interactive tool that has a graphical user interface. This interface is designed to enable the user to insert new prescriptions into the database and to audit a new prescription without the need to re-run the off-line code. We ran the off-line code on the database of 87,785 prescribed drugs. The tests were run on a PC with a 64-bit Core 2 Duo (3 GHz) processor. The code takes 414 seconds to process the whole data set. As stated above, a run requires the user to specify a riskiness threshold for each kind of confirmation check procedure. The code reveals the prescriptions which possess higher risks than the thresholds. We performed several runs in order to refine the preferable threshold for each of the domains.
The results indicate that the sensitivity levels of the criteria differ. The reason is that the sizes of the incidence matrices differ from each other, and thus so do their sparseness and intensity characteristics. That is to say, the row maxima and the rows themselves change from matrix to matrix for each medicine, leading to different sets of risk indicators for the corresponding features. Thus, each threshold needs a separate refinement. Knowledge inferred needs to be validated and refined by human experts [35]. We carried out this refinement under the supervision of a medical doctor who assessed the significance levels of the outputs, since we are interested in building a system that produces outputs meaningful to the human expert fraud auditors, who are medical doctors in Turkey. The refined model for each auditing task uses the following threshold values:
• Medicine–Diagnosis Domain: 0.85.
• Medicine–Age Domain: 0.90.
• Medicine–Sex Domain: 0.96.
• Medicine–Medicine Domain: 0.95.
• Diagnosis–Cost Domain: 0.85.
We consider the false positive, false negative, and true positive rates as well as the agreement rate as performance indicators for our system. A medical doctor labeled the fraudulent prescriptions in a random sample of 249 prescriptions taken from the database. The comparison between the human expert labeling and the proposed system yielded 17 false positives, 19 false negatives, 72 true positives, and 141 true negatives. The results are summarized in Table 3. The AUC (Area Under the ROC Curve) is 85.7%.
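The performance indicators follow from the four counts using the definitions listed in Table 3, in which the false positive and false negative rates are taken relative to the total number of instances. The Python sketch below uses a hypothetical function name and simply applies those definitions to the counts stated above; values recomputed this way need not coincide with every printed figure in Table 3.

```python
def audit_metrics(tp, fp, fn, tn):
    """Performance indicators as defined in Table 3."""
    total = tp + fp + fn + tn
    return {
        "false_positive_rate": fp / total,       # false positives / all instances
        "false_negative_rate": fn / total,       # false negatives / all instances
        "true_positive_rate": tp / (tp + fn),    # true positives / real positives
        "agreement_rate": (tp + tn) / total,     # accuracy
    }


# Counts from the comparison with the medical doctor's labels.
metrics = audit_metrics(tp=72, fp=17, fn=19, tn=141)
```

The agreement rate, for instance, is (72 + 141) / 249, which matches the reported accuracy when rounded to two decimals.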
We have compared our system with two existing methods. EFD [36] performed worse, with a true positive rate of 26.4%, a false positive rate of 5.9%, and an AUC of 60%. The medical claim fraud/abuse detection system proposed by Ortega et al. [33] achieved a true positive rate of 71% and a false positive rate of 6%, with an AUC of 82.5%.
An interesting observation about the audit results is that the prescriptions labeled as fraudulent tend to have multiple reasons for risk. For example, let us consider prescription 1592467, whose database values are given in Table 4.
The output for this prescription is as follows:
Prescription Number: 1592467
• Incompatibility between Medicine: Iliadin Diagnosis: Glaukoma,Risk:0.96.
• Incompatibility between Medicine: Coraspin Diagnosis: Glaukoma,Risk:0.92.
• IncompatibilitybetweenDiagnosis:GlaukomaCost(TL):70, Risk:0.87.
Cosopt, being an ophthalmic suspension, is a legitimate item in the prescription. Nonetheless, Iliadin is a nasal spray and Coraspin contains acetylsalicylic acid. This might be an indicator that fraudsters tend to add several fraudulent items to a prescription that could have been legitimate without them.

Table 3 – Performance indicators.
Performance indicator        Explanation                                                                          Performance
False positive rate          Number of false positives / Total number of instances                                6.09%
False negative rate          Number of false negatives / Total number of instances                                7.63%
True positive rate           Number of true positives / Number of real positives                                  77.4%
Agreement rate (accuracy)    (Number of true positives + number of true negatives) / Total number of instances    85.54%

Table 4 – Prescription 1592467.
Prescription no.   Drug       Age   Sex   Diagnosis   Price (TL)
1592467            Iliadin    57    M     Glaukoma     4.59
1592467            Cosopt     57    M     Glaukoma    30.80
1592467            Cosopt     57    M     Glaukoma    30.80
1592467            Coraspin   57    M     Glaukoma     2.40

Fig. 3 – Inserting a prescription to the prescription auditing tool.
The on-line code can be run once the off-line processing is done. To illustrate the effectiveness of the on-line fraud detection tool, let us consider a prescription given to a 55-year-old woman. Note that the database we work with is in Turkish, which means that the listings in the on-line tool are in Turkish. She is diagnosed with an upper respiratory tract infection and is given the medicines Sudafed Syrup, Otrivine Pediatric Spray and Stafine Pomade. The initial user interface after inputting the prescription is as seen in Fig. 3. If the user chooses to view the prescription, a message box appears as in Fig. 4. After validating the prescription input, the user might choose to add the prescription to the database. If so, a message box appears as in Fig. 5.
When the user chooses to audit the prescription, a message box appears as in Fig. 6. Here, the Medicine and Age non-conformity risk assessments are stated in the input order of the medicines, as are those for Medicine and Sex non-conformity. Considering the diagnoses, the Medicine and Diagnosis risks are shown on the screen in the order in which the medicine and diagnosis couples appear in the prescription. Lastly, we see one value for the Diagnosis and Cost non-conformity risk, since there is only one diagnosis in the prescription.
Considering the prescription, where the diagnosis is upper respiratory tract infection and the prescribed medicines are Sudafed Syrup, Stafine Pomade and Otrivine Pediatric Spray, we can state that the tool is effective: it calculates no risk in the medicine and diagnosis domain for the first and the last medicines, and a high risk (0.85) for the second, which is expected since Stafine Pomade is a skin care medicine. There is no risk associated with the sex of the patient and the medicines. Nonetheless, both Sudafed Syrup and Otrivine Pediatric Spray are pediatric medicines. Thus, the tool identifies high risks regarding the age of the patient: 0.97 for Sudafed Syrup and 0.99 for Otrivine Pediatric Spray.

Fig. 5 – Database update notification.

Fig. 6 – Risk assessment screen.
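The risk values reported for this example can be checked against per-domain thresholds to see which items would be brought to an auditor's attention. The following is a minimal sketch under stated assumptions: "no risk" is encoded as 0.0, values the text does not report are left out, and the threshold of 0.8 as well as the names `reported_risks`, `thresholds` and `find_alarms` are hypothetical and not part of the tool.

```python
# Risk values reported in the text for the example prescription.
# Assumption: "no risk" is encoded as 0.0. Values that the text does not
# report (e.g. the age risk of Stafine Pomade) are left out.
reported_risks = {
    "medicine_diagnosis": {
        "Sudafed Syrup": 0.0,
        "Stafine Pomade": 0.85,
        "Otrivine Pediatric Spray": 0.0,
    },
    "medicine_sex": {
        "Sudafed Syrup": 0.0,
        "Stafine Pomade": 0.0,
        "Otrivine Pediatric Spray": 0.0,
    },
    "medicine_age": {
        "Sudafed Syrup": 0.97,
        "Otrivine Pediatric Spray": 0.99,
    },
}

# Hypothetical per-domain thresholds; the actual values are set by the user.
thresholds = {
    "medicine_diagnosis": 0.8,
    "medicine_sex": 0.8,
    "medicine_age": 0.8,
}


def find_alarms(risks, limits):
    """Return (domain, medicine, risk) triples whose risk exceeds the limit."""
    flagged = []
    for domain, per_medicine in risks.items():
        for medicine, risk in per_medicine.items():
            if risk > limits[domain]:
                flagged.append((domain, medicine, risk))
    return flagged


alarms = find_alarms(reported_risks, thresholds)
for domain, medicine, risk in alarms:
    print(f"{domain}: {medicine} (risk {risk:.2f})")
```

Under these assumptions, the skin care medicine is flagged in the medicine and diagnosis domain and the two pediatric medicines are flagged in the medicine and age domain, which matches the reading of the risk assessment screen given above.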
5. Concluding remarks and further research direction
We conclude by proposing a novel model for detecting cases of prescription fraud, intended to provide efficient and user-friendly platforms and to save financial resources at the institutional and national levels. Our methodology proposes dividing the 6-dimensional feature domain into several 2-dimensional sub-domains, considering the interaction levels between the features. The methodology consists of populating incidence matrices for each of these sub-domains and then incorporating a distance-based data-mining approach. The risk metrics employed in this data-mining approach return a risk measure for each of the sub-domains. This risk measure is scaled to lie between 0 and 1 in order to give a straightforward definition of the risk level. For each of the domains, the user can specify a threshold. That way, the program raises an alarm only for those prescriptions with risk levels higher than the thresholds.
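The idea of an incidence matrix for one 2-dimensional sub-domain can be sketched as follows. This is an illustration only: the class `IncidenceMatrix`, the toy records and, in particular, the risk formula are assumptions. The formula is a simple frequency-based stand-in that happens to lie between 0 and 1; it is not the distance-based risk metric of the proposed model.

```python
from collections import defaultdict


class IncidenceMatrix:
    """Sparse co-occurrence counts for one 2-dimensional sub-domain,
    e.g. (medicine, diagnosis)."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def add(self, row_value, col_value):
        """Record one observed pair, e.g. from an audited prescription."""
        self.counts[row_value][col_value] += 1

    def risk(self, row_value, col_value):
        """Stand-in risk in [0, 1]: 0 for the most frequent partner of
        row_value, 1 for a pair (or a row value) never seen before."""
        row = self.counts.get(row_value)
        if not row:
            return 1.0
        return 1.0 - row.get(col_value, 0) / max(row.values())


# Toy history with abstract labels; these are not records from the study.
history = (
    [("drug_A", "dx_1")] * 8
    + [("drug_A", "dx_2")] * 2
    + [("drug_B", "dx_2")] * 5
)

medicine_diagnosis = IncidenceMatrix()
for medicine, diagnosis in history:
    medicine_diagnosis.add(medicine, diagnosis)

for pair in [("drug_A", "dx_1"), ("drug_A", "dx_2"), ("drug_B", "dx_1")]:
    print(pair, medicine_diagnosis.risk(*pair))
```

Adding a validated prescription to the database corresponds to further calls to `add`, so the risk values shift as the data shift. One such matrix would be kept per sub-domain.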
The automated fraud detection methodology gives results that are largely compatible with human-expert auditing. The system is flexible enough for an integrated on-line/on-time user interface, and its on-line incorporation is computationally inexpensive. It presents a novel and easy way to keep track of healthcare transactions in incidence matrices for auditing.
The approach proposed here is able to handle both categorical and ordered features. The output of the system is easy for human users to understand and interpret. Besides, the system can learn and process accordingly as the input data shifts. Finally, its core methodology is adaptable to many other areas in healthcare and possibly in other industries.
Given the performance measurements, with a true positive rate of 77.4% and a false positive rate of 6%, we can conclude that the proposed system works reasonably well for the prescription fraud detection problem. Nonetheless, further refinement of the tool would require scaling the risk outputs across all domains. This would mean that incorporating different parameters for different domains would lead to the same risk measurements across all domains. Besides, a tool can be built in which the user can specify the domains to work on. Efforts must be undertaken to promote cost-effective fraud detection models for other healthcare practices and interventions that may have an impact on the quality of healthcare.
Conflicts of interest
There is no undisclosed ethical problem or conflict of interest related to this paper.
Acknowledgements
We thank Cagdas Baran for assistance in preparation of the types of fraud and abuse in medical practice and Murat Kurtcephe for various discussions about the topic of this paper.
r
e
f
e
r
e
n
c
e
s
[1] M.Levi,M.Burrows,Measuringtheimpactoffraudinthe UK:aconceptualandempiricaljourney,BritishJournalof Criminology48(3)(2008)293–318.
[2] A.S.Kesselheim,D.M.Studdert,M.M.Mello,
Whistle-blowers’experiencesinfraudlitigationagainst pharmaceuticalcompanies,NewEnglandJournalof Medicine362(19)(2010)1832–1839.
[3] R.Wheeler,S.Aitken,Multiplealgorithmsforfraud detection,Knowledge-BasedSystems13(2–3)(2000)93–99. [4] S.Viaene,R.A.Derrig,B.Baesens,G.Dedene,Acomparison
ofstate-of-the-artclassificationtechniquesforexpert automobileinsuranceclaimfrauddetection,JournalofRisk andInsurance69(3)(2002)373–421.
[5] C.S.Hilas,P.A.Mastorocostas,Anapplicationofsupervised andunsupervisedlearningapproachesto
telecommunicationsfrauddetection,Knowledge-Based Systems21(7)(2008)721–726.
[6] TurkishHealthCareSyndicate2008HealthCareReport,2008 (Sa ˘glıkta2008Raporu,TürkSa ˘glıkSen).
[7] J.Li,K.Huang,J.Jin,J.Shi,Asurveyonstatisticalmethods forhealthcarefrauddetection,JournalofHealthCare ManagementScience11(3)(2008)275–287.
[8] USA’sNationalHealthCareAnti-FraudAssociationWeb Page,2009,http://www.nhcaa.org/eweb/StartPage.aspx. [9] X.Weng,J.Shen,Detectingoutliersamplesinmultivariate
timeseriesdataset,Knowledge-BasedSystems21(8)(2008) 807–812.
[10] H.Kim,S.Pang,H.Je,D.Kim,S.Bang,Constructingsupport vectormachineensemble,PatternRecognition36(2003) 2757–2767.
[11] E.Barse,H.Kvarnstrom,E.Jonsson,Synthesizingtestdata forfrauddetectionsystems,in:Proceedingsofthe19th AnnualComputerSecurityApplicationsConference,2003, pp.384–395.
[12] M.Syeda,Y.Zhang,Y.Pan,Parallelgranularneuralnetworks forfastcreditcardfrauddetection,in:Proceedingsofthe 2002IEEEInternationalConferenceonFuzzySystems,2002. [13] R.Ghosh,D.Reilly,Creditcardfrauddetectionwitha
neural-network,in:ProceedingsoftheTwenty-Seventh AnnualHawaiiInternationalConferenceonSystem Sciences,1994.
[14] S.Maes,K.Tuyls,B.Vanschoenwinkel,B.Manderick,Credit cardfrauddetectionusingBayesianandneuralnetworks, in:Proceedingsofthe1stInternationalNAISOCongresson NeuroFuzzyTechnologies,2002.
[15] K.Ezawa,S.Norton,ConstructingBayesiannetworksto predictuncollectibletelecommunicationsaccounts,IEEE Expert11(5)(1996)45–51.
[16] G.Metan,I.Sabuncuoglu,H.Pierreval,Realtimeselectionof schedulingrulesandknowledgeextractionviadynamically controlleddatamining,InternationalJournalofProduction Research48(23)(2010)6909–6938.
[17] D.Foster,R.Stine,Variableselectionindatamining:building apredictivemodelforbankruptcy,JournalofAmerican StatisticalAssociation99(466)(2004)303–313.
[18] E.Belhadji,G.Dionne,F.Tarkhani,Amodelforthedetection ofinsurancefraud,TheGenevaPapersonRiskand
Insurance25(4)(2000)517–538.
[19] M.Pejic-Bach,Profilingintelligentsystemsapplicationsin frauddetectionandprevention:surveyofresearcharticles, in:ProceedingsofInternationalConferenceonIntelligent Systems,ModellingandSimulation,2010,pp.80–85. [20] C.Cortes,D.Pregibon,Signature-basedmethodsfordata
streams,DataMiningandKnowledgeDiscovery5(2001) 167–182.
[21] K.Yamanishi,J.Takeuchi,G.Williams,P.Milne,On-line unsupervisedoutlierdetectionusingfinitemixtureswith discountinglearningalgorithms,DataMiningand KnowledgeDiscovery8(2004)275–300.
[22] P.L.Brockett,R.A.Derrig,L.L.Golden,A.Levine,M.Alpert, Fraudclassificationusingprincipalcomponentanalysisof RIDITs,JournalofRiskandInsurance69(3)(2002)341–371. [23] C.L.Chan,C.H.Lan,Adataminingtechniquecombining
fuzzysetstheoryandBayesianclassifier– anapplicationof auditingthehealthinsurancefee,in:H.R.Arabnia(Ed.), ProceedingsoftheInternationalConferenceonArtificial IntelligenceIC-AI’2001,2001,pp.402–408.
[24] M.Kim,T.Kim,Aneuralclassifierwithfrauddensitymap foreffectivecreditcardfrauddetection,in:Proceedingsof IDEAL2002,2002,pp.378–383.
[25] H.He,W.Graco,X.Yao,Applicationofgeneticalgorithms andk-nearestneighbourmethodinmedicalfrauddetection, in:ProceedingsofSEAL1998,1999,pp.74–81.
[26] C.Cortes,D.Pregibon,C.Volinsky,Computationalmethods fordynamicgraphs,JournalofComputationalandGraphical Statistics12(4)(2003)950–970.
[27] M.Cahill,F.Chen,D.Lambert,J.Pinheiro,D.Sun,Detecting fraudintherealworld,in:HandbookofMassiveDatasets, 2002,pp.911–930.
[28] Y.Moreau,E.Lerouge,H.Verrelst,J.Vandewalle,C. Stormann,P.Burge,BRUTUS:ahybridsystemforfraud detectioninmobilecommunications,in:Proceedingsof EuropeanSymposiumonArtificialNeuralNetworks,1999, pp.447–454.
[29] G.Williams,Z.Huang,Miningtheknowledgemine:thehot spotsmethodologyformininglargerealworlddatabases, LectureNotesinComputerScience(1997)340–348. [30] G.Williams,Evolutionaryhotspotsdatamining:an
architectureforexploringforinterestingdiscoveries,in: ProceedingsofPAKDD99,1999.
[31] R.Brause,T.Langsdorf,M.Hepp,Neuraldataminingfor creditcardfrauddetection,in:Proceedingsof11thIEEE InternationalConferenceonToolswithArtificial Intelligence,1999.
[32] T.Ormerod,N.Morley,L.Ball,C.Langley,C.Spenser,Using ethnographytodesignaMassDetectionTool(MDT)forthe earlydiscoveryofinsurancefraud,in:CHI’03Extended AbstractsonHumanFactorsinComputingSystems,2003, pp.650–651.
[33] P.Ortega,C.Figueroa,G.Ru,Amedicalclaimfraud/abuse detectionsystembasedondatamining:acasestudyin Chile,in:ProceedingsofDMIN’06,2006,pp.224–231. [34] S.H.Huang,L.R.Wulsin,L.Hua,J.Guo,Dimensionality
reductionforknowledgediscoveryinmedicalclaims database:applicationtoantidepressantmedication utilizationstudy,ComputerMethodsandProgramsin Biomedicine93(2)(2009)115–123.
[35] T.Aydın,H.A.Güvenir,Modelinginterestingnessof streamingassociationrulesasabenefit-maximizing classificationproblem,KnowledgeBasedSystems37(2) (2009)1713–1718.
[36] A.J.Major,D.R.Riedinger,EFD:ahybrid
knowledge/statistical-basedsystemforthedetectionof fraud,JournalofRiskandInsurance69(3)(2002)309–324.