Predicting informative spatio-temporal neurodevelopmental windows and gene risk for autism spectrum disorder

(1)

PREDICTING INFORMATIVE

SPATIO-TEMPORAL

NEURODEVELOPMENTAL WINDOWS AND

GENE RISK FOR AUTISM SPECTRUM

DISORDER.

a thesis submitted to

the graduate school of engineering and science

of bilkent university

in partial fulfillment of the requirements for

the degree of

master of science

in

computer engineering

By

O˘

guzhan Karakahya

October 2020

(2)

Predicting informative spatio-temporal neurodevelopmental windows and gene risk for autism spectrum disorder.

By O˘guzhan Karakahya October 2020

We certify that we have read this thesis and that in our opinion it is fully adequate, in scope and in quality, as a thesis for the degree of Master of Science.

A. Erc¨ument C¸ i¸cek(Advisor)

Can Alkan

Tunca Do˘gan

Approved for the Graduate School of Engineering and Science:

(3)

ABSTRACT

PREDICTING INFORMATIVE SPATIO-TEMPORAL

NEURODEVELOPMENTAL WINDOWS AND GENE

RISK FOR AUTISM SPECTRUM DISORDER.

O˘guzhan Karakahya M.S. in Computer Engineering

Advisor: A. Erc¨ument C¸ i¸cek October 2020

Autism Spectrum Disorder (ASD) is a complex neurodevelopmental disorder with a strong genetic basis. Due to its intricate nature, only a fraction of the risk genes were identified despite the effort spent on large-scale sequencing studies. To perceive underlying mechanisms of ASD and predict new risk genes, a deep learning architecture is designed which processes mutational burden of genes and gene co-expression networks using graph convolutional networks. In addition, a mixture of experts model is employed to detect specific neurodevelopmental periods that are of particular importance for the etiology of the disorder. This end-to-end trainable model produces a posterior ASD risk probability for each gene and learns the importance of each network for this prediction. The results of our approach show that the ASD gene risk prediction power is improved compared to the state-of-the-art models. We identify mediodorsal nucleus of thalamus and cerebellum brain region and neonatal & early infancy to middle & late childhood period (0 month - 12 years) as the most informative neurodevelopmental window for prediction. Top predicted risk genes are found to be highly enriched in ASD-associated pathways and transcription factor targets. We pinpoint several new candidate risk genes in CNV regions associated with ASD. We also investigate confident false-positives and false negatives of the method and point to studies which support the predictions of our method.

Keywords: Autism Spectrum Disorder, Graph Convolutional Networks, Deep Learning.

(4)

¨

OZET

OT˙IZM SPEKTRUM BOZUKLU ˘

GU ˙IC

¸ ˙IN B˙ILG˙I

VER˙IC˙I ZAMAN-UZAMSAL S˙IN˙IR GEL˙IS

¸ ˙IM ARALI ˘

GI

VE GEN R˙ISK˙I TAHM˙IN˙I

O˘guzhan Karakahya

Bilgisayar Mühendisli˘gi, Yüksek Lisans Tez Danı¸smanı: A. Ercüment Ç i¸cek

Ekim 2020

Otizm Spektrum Bozuklu˘gu (OSB), genetik sebeplerle ortaya ¸cıkabilen, zihin-sel geli¸simi olumsuz etkileyen karma¸sık bir hastalıktır. Karma¸sık do˘gasından dolayı, bu hastalı˘ga sebep olan risk genlerinin sadece kü¸cük bir yüzdesi, gen dizileme ¸calı¸smaları sayesinde tespit edilebilmi¸stir. Bu hastalı˘ga sebep olan et-menleri anlamak i¸cin, mutasyon yükü verisini gen ortak ifade ¸cizgeleri üzerinde kullanabilen bir derin ö˘grenme mimarisi tasarlandı. Ek olarak, hastalık i¸cin önem arz eden sinirsel geli¸sim periyotlarını tespit edebilmek i¸cin derin ö˘grenme mod-eline uzmanların karı¸sımı modeli de eklendi. Bu u¸ctan uca e˘gitilebilen sistem ¸cizge ba¸sına bir a˘gırlık ö˘grenerek bütün genler i¸cin bir olasılık atayabilmekte-dir. Modelimizin sonu¸cları, otizm geni risk tahmin gücünün en geli¸smi¸s mod-ellere kıyasla arttı˘gını göstermektedir. En yüksek risk penceresi olarak talamus ve serebellum beyin bölgesinin mediyodorsal ¸cekirde˘gini ve yenido˘gan/erken be-beklikten orta/ge¸c ¸cocukluk dönemine kadar olan periyot (0 ay - 12 ya¸s) belir-lenmi¸stir. Sonu¸clarımız, otizm ile alakalı bilinen anahtar biyolojik yollar ve gen hedefleri i¸cin iyi bir zenginle¸smeye de i¸saret etmektedir. OSB ile ili¸skili bir etiketi olmayan kopya sayısı de˘gi¸sikli˘gi bölgelerinde risk geni olmaya aday birka¸c gen gözlemlenmi¸stir. Yalancı-pozitif kesin referans genler, etiketlenmemi¸s olmasına ra˘gmen OSB ile ili¸skili olma olasılı˘gı yüksek genler ve yüksek sıralamalı yalancı-negatif kesin referans genler incelenmi¸stir.

Anahtar sözcükler : Otizm Spektrum Bozuklu˘gu, Ç izge Evri¸simsel A˘glar, Derin ¨

(5)

Acknowledgement

First of all, I would like to thank my advisor Asst. Prof. Erc¨ument C¸ i¸cek for his understanding and assistance throughout my study. It wouldn’t be possible for me to neither conduct this study, nor becoming a researcher without his guidance and patience. Therefore, i am very grateful for his continuous support.

I am also thankful to my jury members X and Y for reading my thesis and for accepting being in my thesis committee. I thank to Simons Foundation Autism Research Initiative for funding this research via the SFARI 640935 pilot grant awarded to Erc¨ument C¸ i¸cek.

I would like to thank ˙Ilayda Beyreli for her support on this work. She helped me to design the architecture and implement the source code for this study. She also shared her invaluable feedbacks with me during the whole process.

I feel very lucky for being in the community of Bilkent University for 7 years. During my studies, I earned invaluable friendships and collected wonderful mem-ories. I would like to thank Do˘gukan, Arda, Yusuf, Ozan and Muammer for their friendship for our good memories. I am also very grateful for all the good time we had with Ayberk, Sezernaz and Ya˘gmur, and for their precious friendship. I also would like to thank my colleague and friend ˙Ilayda for all of her support, feedbacks and efforts. She has a great amount of contribution to this thesis. I also thank Mustafa for his friendship and introducing me to the life-long hobby of board games.

Finally, I am endlessly thankful for all the efforts of my parents. They were always on my side, supporting me all the time. Without their love, support and belief, I wouldn’t be where I am now in the first place. I cannot pay back for all of their efforts whatever I do in return. I will be doing my best to make them proud.

(6)

List of Figures

4.1 The architectural model of DeepASD for genome-wide ASD gene

risk assessment. The model uses TADA features as well as 52 gene co-expression networks extracted from BrainSpan dataset. The feature set includes de novo loss of function (and missense) mu-tation counts, transmitted mumu-tation counts for control and case groups, pLI value, de novo mutation frequency values and protein truncating variant counts. TADA dataset is used by all GCN mod-ules and the gating network. This whole system produces a single probability value ˆy for each of the 25,825 genes and it is end-to-end trainable. . . 21

5.1 ROC and precision-recall curve distribution comparison between DeepASD and Krishnan et al. (a) Area under ROC curve distri-bution between the two methods. (b) Area under precision-recall curve distributions comparing the same methods as in (a). Both in (a) and (b), outlier points are depicted. The solid center line depicts the median, dashed line depicts the mean value for each panel. Box limits demonstrates lower and upper quartiles and whiskers denote 1.5 interquartile range. . . 22

(10)

LIST OF FIGURES x

5.2 Process of smoothing SVM output values. Using ten-fold cross-validation, 10 different isotonic regression model is fit. Then, we find the knots in each model and combine them to obtain (a). The transitions are not smooth and further smoothing is required. (b) Isotonic regression is applied once more on this knot collection to obtain a smoother curve. (c) Linear interpolation is used to obtain final mapping. . . 25 5.3 Probability value scatter plot of DeepASD and Krishnan et al.

(a) Probability values of E1 genes against all other genes for both methods. (b) Non-mental-health related genes compared to all other genes. Both panels contain the same 25,825 genes. y = x line (gray) is also drawn for visual aid. . . 26

5.4 Precision-recall curves for DeepASD, DAWN (PFC-MSC3-5 and

PFC-MSC4-6), Krishnan et al. and Zhang and Shen DAMAGES score. (a) The curve for E1 plus non-mental-health genes. (b) E1 + E2 genes are used. (c), all gold standard genes are used. All precision-recall curves have a cutoff rank of 2000. (c) has starting rank value of 5 whereas (a) and (b) have starting rank value of 1. 27

5.5 Enrichment analysis for WNT and MAPK pathways, CHD8,

RB-FOX (splice and peak), FMRP and TOP1 target genes, post-synaptic density complex and histone modification processes. P values are calculated using Binomial test. . . 29 5.6 The genes spanned by 16p11.2 and 15q11-13 CNVs. For each gene,

DeepASD rank is provided along with its evidence level if it is labeled. . . 34 5.7 The genes spanned by 15q13.3 and 1q21.1 CNVs. For each gene,

DeepASD rank is provided along with its evidence level if it is labeled. . . 35

(11)

LIST OF FIGURES xi

5.8 The genes spanned by 22q11 CNV. For each gene, DeepASD rank

is provided along with its evidence level if it is labeled. . . 36 5.9 The heatmap for average posterior probabilities assigned by each

GCN module of DeepASD to E1 genes. The heatmap shows 4 brain regions with 13 neurodevelopmental windows to illustrate average posterior probability for all 52 co-expression networks. . . 38 5.10 Heatmap illustrating the weights assigned by gating network to

each GCN module (network). 52 networks are plotted as 4 × 13 matrix for each brain region and for each neurodevelopmental period. 38 5.11 Tissue-specific frontal cortex PPI network from DifferentialNet

database constructed by using NetworkAnalyst system. Node sizes indicate betweenness centrality of nodes (larger nodes have higher betweenness centrality) and node colors map node degrees, red indicating higher degree. . . 43

(12)

List of Tables

4.1 Neurodevelopmental time periods proposed by Willsey et al.

(2013). Since the neurodevelopment mainly occurs during preg-nancy period, higher precision for this period is provided. Sliding window approach uses these 15 periods to create 13 windows. The window size is 3. Thus, the windows are constructed as [1-3], [2-4], [3-5], ..., [13-15]. . . 15 4.2 Brain region clusters used to construct 52 co-expression network

using BrainSpan data. These clusters are formed by using hierar-chical clustering based on transcriptional similarity of these brain regions (Willsey et al., 2013). . . 16

5.1 P value comparison for the rankings of DeepASD, Krishnan et al., DAWN and DAMAGES score using different set of gold standard genes for evaluation. Each P value is calculated using two-sided Wilcoxon rank-sum test. . . 26 5.2 DeepASD top 1% gene enrichment in BioPlanet 2019 dataset. Top

29 pathways are included out of 560 pathways. Enrichr tool is used to calculate enrichment values. Pathways are sorted in increasing order with respect to P values. . . 31

(13)

LIST OF TABLES xiii

5.3 Edge case genes with their 2-hop distance neighbor statistics. 2 genes per group described in section 5.5 are provided. CHD8 is also added to table as a reference since it is ranked as 1st _{by TADA} 3rd_{by DeepASD, and it is a well-established ASD risk gene. TADA} rank describes the ranking of a gene in TADA with respect to its q-value. E1, E2, E3/E4 and negative genes describe the count of such genes in 2-hop neighborhood. TADA genes column is the count of unlabeled genes within 2-hop neighborhood with their TADA q-value rank < 1000. . . 41

A.1 46 E1 ground truth gene list. These genes are the most confident ASD risk genes and also used for performance calculation along with non-mental-health related genes. Evidence weight of 1.0 is used while training with these genes. . . 60 A.2 67 E2 ground truth genes. Their evidence weight is set to 0.5 and

they are excluded while calculating performance metrics. These genes are the most confident ASD genes after E1 genes. . . 60 A.3 525 E3-E4 ground truth genes with evidence weight 0.25. They are

the positive ground truth genes with the lowest confidence. They are not included in performance metric calculation. . . 61 A.4 First 600 non-mental-health related genes. All non-mental-health

related genes have evidence weight of 1.0. They are used along with E1 genes for performance metric calculation. Remaining genes are given in Table A.5 . . . 62 A.5 Remaining 585 non-mental-health related genes. First 600 genes

are given in Table A.4 . . . 63 A.6 Gene list for WNT signaling process. This gene set is used to

(14)

LIST OF TABLES xiv

A.7 Gene list for MAPK signaling process. This set is used to perform enrichment analysis using DeepASD ranking for MAPK signaling. 65 A.8 CHD8 target gene list part 1 (of 3 parts) containing 750 genes. This

gene set is used for enrichment calculations of DeepASD ranking with respect to CHD8 targets. . . 66 A.9 Part 2 (out of 3) of CHD8 target gene list containing second 750

genes. . . 67 A.10 Part 3 (out of 3) of CHD8 target gene list containing last 584 genes. 68 A.11 RBFOX (splice) target genes used for calculating DeepASD

rank-ing enrichment. . . 69 A.12 Part 1 (out of 2) of RBFOX (peak) target gene list (first 588 genes)

that is used for DeepASD enrichment analysis. . . 70 A.13 Part 2 (out of 2) of RBFOX (peak) target gene list containing last

397 genes. . . 71 A.14 Part 1 (out of 2) of gene list for FMRP targets containing first

588 genes. These genes are used to calculate enrichment score of DeepASD for FMRP targets. . . 72 A.15 Part 2 (out of 2) of gene list for FMRP targets containing last 207

genes. . . 73 A.16 TOP1 target gene list that is used to calculate DeepASD

enrich-ment for TOP1 targets. . . 74 A.17 Part 1 (out of 3) of post synaptic density (PSD) complex gene

set containing first 588 genes. These genes are used to calculate enrichment of DeepASD ranking on PSD complex gene set. . . 75

(15)

LIST OF TABLES xv

A.18 Part 2 (out of 3) of PSD complex gene set containing second 588 genes. . . 76 A.19 Part 3 (out of 3) of post synaptic density (PSD) complex gene set

containing last 282 genes . . . 77 A.20 Part 1 (out of 2) of histone modifier gene list containing first 588

genes. These genes are used to calculate enrichment for DeepASD ranking on histone modifiers. . . 78 A.21 Part 2 (out of 2) of histone modifier gene list containing last 129

genes. . . 79 A.22 First 132 gene rankings and posterior probabilities generated by

DeepASD. Probabilities are obtained by averaging results from 200 epochs, excluding results coming from training data. . . 80 A.23 Genes with ranks between 133 to 258 and their posterior

probabil-ities generated by DeepASD. Probabilprobabil-ities are obtained by averag-ing results from 200 epochs, excludaverag-ing results comaverag-ing from trainaverag-ing data. . . 81

(16)

Chapter 1 Introduction

Autism Spectrum Disorder (ASD) is a neurodevelopmental genetic disorder with approximately a thousand genes are involved in the etiology [1]. Due to its complexity and high frequency among the population, many large scale exome [2, 3, 4, 5, 6, 7, 8] and genome [9, 10, 11, 12, 13, 14] sequencing studies are conducted to understand the cellular, functional and genetic structure of ASD. Despite all of the effort, the most comprehensive and recent study found strong evidence for 102 risk genes (F DR ≤ 0.1) after inspecting ∼ 36k samples [15]. Therefore, 90% of the risk genes are yet to be discovered and validated. The reason for this huge information gap stems from the complexity of disorder mechanics and the insufficiency of samples (i.e., trios). One of the most valuable information we have for gene risk assessment is de novo loss-of-function (dnLoF) mutations. These mutations are observed on various set of genes and they are scarce. The scarcity and variety of such mutations avert discovery of important signals. As a result, number of identified risk genes are low compared to the count of samples inspected. dnLoF mutations and their significance in ASD gene risk assessment are also described in Section 1 of Chapter 2. The mutation burden data is sparse since most of the genes in human genome do not have observed dnLoF mutation. Moreover, most of the dnLoF mutations are observed once in a gene. Although dnLoF mutations generally provide useful information, single observed dnLoF mutation might be insufficient to identify a gene as a risk gene, and might be

(17)

noise. Therefore, the data is sparse and have intrinsic noise, which makes it essential to design algorithms that is robust against this noise and capable of discovering missing risk genes. This leads to a broad literature on gene risk prediction methods.

The first type of gene risk assessment methods calculate risk using statistical methods on genetic burden data obtained from case-control and family studies [1]. These methods can also work with multiple traits thanks to the recent im-provements in these methods [16]. The second type is rather a gene discovery approach. In this line of work, gene disruptive mutations and de novo gene dis-ruptive mutation counts are used as prior risk to obtain posterior gene interaction network-adjusted gene risk values. These methods use guilt by association princi-ple and (i) provide a genome-wide ranking by propagating available data through the network and perform prediction for unlabeled genes and (ii) infer biological pathways, molecular functions and gene subnetworks that are associated with the disorder [1, 17, 18, 19, 20, 21, 22, 23, 24]. These methods generally utilize a sin-gle source of network data such as gene co-expression networks, protein-protein interaction networks, or brain-specific gene interaction data. Several methods combine a portion of these data, however, they are not capable of utilizing all of them simultaneously [19, 21, 25, 26], which reduces the overall prediction capabil-ity since many sources of information are discarded. For instance, gene expression values measured from various brain regions can be used to construct gene inter-action networks that model neurodevelopment. However, since these methods cannot process multiple networks at once, gene risk assessment per brain region and per developmental window cannot be performed. Moreover, it is not possi-ble to pinpoint important brain regions and neurodevelopmental time points for diesaese etiology. A common practice is to perform such an analysis indepen-dently from gene risk assessment as a separate downstream task [26, 27]. This is unfortunate as if the model was able to analyze all networks simultaneously, this task would have been embedded in the genomewide risk assessment task which then (i) would improve the prediction power by utilizing a diverse set of gene in-teraction information, and (ii) would let the model learn the importance of each brain region and neurodevelopmental period. This way, it is possible to obtain

(18)

important biological insights while possibly having a better ranking of the risk genes.

To address the issues stated above, we introduce a novel deep learning model (Deep Autism Spectrum Disorder algorithm - DeepASD ). Our model uses graph convolution and analyzes multiple gene co-expression networks to learn an em-bedding for each gene on each network. Then, it uses a gating network (mixture of experts) to learn the importance of each network for risk prediction. Hence, the model is interpretable in this sense. Our results show that mediodorsal nucleus of the thalamus and cerebellar cortex brain region and neonatal/early infancy to middle & late childhood period (0 month - 12 years), is the most important neu-rodevelopmental window for ASD gene risk prediction. We compare DeepASD with other state-of-the-art methods in terms of gene risk prediction performance. Additionally, we also show that the top percentile genes predicted by DeepASD are highly enriched in pathways that are known to be associated to ASD. We also inspect CNV regions that are associated to ASD and genes within these regions to show that DeepASD can prioritize risk genes that are inside these regions. We also discuss the candidacy of several unlabeled genes inside these CNV regions that are ranked high by DeepASD. Finally, we discuss new findings and confident false positive/negative predictions of the method: top-ranked negatively-labeled genes and bottom-ranked positively labeled ground truth genes.

The subsequent chapters have the following organization. Chapter 2 con-tains more background information such as descriptions of technical terms and concepts. In Chapter 3, previous studies and their findings on ASD gene risk prediction are discussed. In Chapter 4, all the methods and algorithms we have used such as graph convolutional networks and mixture of experts are described. In Chapter 5, we present our results. Finally, we conclude this thesis in Chapter 6 with a brief summary of the findings of this study.

(19)

Chapter 2 Background Information

2.1 De Novo Gene Disrupting Mutation

A gene is defined as a sequence of chromosomal DNA which can be translated to either functional or non-functional RNA, which then can be used to produce proteins [28]. Genes are the hereditary building blocks. Since humans have two copies of each chromosome, there exists at least two copies of each gene, which are called as alleles. Allele genes determine the phenotype, the physical traits, of a person. Although human DNAs are %99.9 similar to each other, the actual phenotypes could vary a lot as a result of the variation in the DNA. Most of the variants are harmless. However, some can cause harm when it alters the produced protein by that gene. Thus, the mutation location bears the most importance. There are multiple mutation types such as missense, nonsense, silent, frameshift etc. that may or may not affect the final protein of a gene. If, after a mutation, a gene cannot produce its functional protein, that gene is disrupted, or knocked out.

The term de novo (literally means ’of new’) could be used for mutations. When a variant is present on child’s DNA, but not present on parents’ DNAs, it is called a de novo variant. This happens when one of the parent’s germ cell is mutated,

(20)

or the fertilized egg is mutated during the early embryogenesis period.

De novo gene disrupting mutations are proved to be essential for ASD gene risk assessment [21]. When both parents are healthy and the child has ASD, it is likely that one of the de novo mutations caused disruption in one or more genes, and as a result the child has ASD. Therefore, genes with de novo gene disrupting mutations on ASD probands are good candidates for being risk genes.

2.2 Biological Pathways and DNA/RNA

Bind-ing Proteins

A biological pathway is defined as successive processes within a cell that alter the cell structure or produce a certain product. Biological pathways control various kind of processes within the body, and can be disrupted by mutations occurred in at least one of the genes involved.

Recent studies showed that there are many different pathways that are dis-rupted commonly among the affected ASD probands [5, 29, 30, 31]. These path-ways include Wnt-β cathenin, mTOR, and MAPK pathpath-ways. It is important to identify such pathways since all genes involved within a pathway become candi-date risk genes, which helps to narrow down the list of suspected genes.

Apart from the biological pathways, we are also concerned about some partic-ular target genes of a specific protein. A protein is called RNA-binding protein (RBP) if it is responsible for regulating mRNA translation and its maintenance. Similarly, there are also proteins that binds to DNA, which are called DNA-binding proteins (DBPs). DBPs and RBPs regulate production of other proteins. Some RBPs and DBPs are shown to be associated to ASD [32, 33, 34, 35]. We also inspect RBPs and DBPs related to ASD to find out possible risk genes, or use known RBP and DBP target genes to validate our findings.

(21)

show that the ranking generated by DeepASD is in fact sensible. The target gene lists of RBPs and DBPs, the genes involved in a biological pathway or the genes involved in a molecular process can be used to perform enrichment analysis. First, we divide gene ranking produced by DeepASD into deciles. Then, we calculate the number of genes that are in a pathway, target gene list etc. for a specific decile. Then, we plot a graph that shows the number of genes in a gene list for each decile. If the top decile contains more genes compared to other deciles, then we can infer that our ranking is in fact sensible. To quantify, we calculate a P -value indicating the probability of randomly obtaining such an enrichment profile. This analysis is performed and the results are described clearly in Chapter 5 (Results).

2.3 Copy Number Variation

After the Human Genome Project has been completed, it is become evident that some people have more (or less) copies in their genome, and this is quite common among the population [36]. This phenomenon is called copy number variation (CNV). CNVs could occur anywhere within genome by having multiple insertions or deletions of a region (or complex rearrangements of regions). CNV regions could span multiple genes, and can include more than a thousand nucleotides up to millions of nucleotides [37]. CNVs could be transmitted, but they could also occur de novo. CNVs could be harmless, or could cause several disorders such as ASD, Schizophrenia and Parkinson Disease [38].

Similar to de novo mutation analysis, researches show that there are particular CNVs that are associated to ASD [37, 39, 40, 41, 42, 43, 44]. Sebat et al. [45] identified de novo CNVs in 10% of the case group, whereas they identified 1% de novo CNVs in control groups, which shows us that similar to de novo gene disrupting mutations, de novo CNVs are also important for identifying risk genes for ASD. As an example, Chung et al. [43] point that maternal 15q11-13 CNV is associated to ASD. Maternal 15q11-13 location indicates the region in the longer arm of maternal chromosome 15. 11-13 denotes the band within the chromosome.

(22)

We inspect the CNVs occurred on 16p11.2, 15q11-13, 15q13.3, 1q21.1 and 22q11 locations which are pointed to be associated with ASD [26]. Since these CNVs span multiple genes, we identified the genes that lie inside the boundaries of these regions, and then plotted the rankings of these genes in Chapter 5 (Results).

2.4 Gene Co-expression Network

In human body, approximately 400.000 proteins are being produced [46]. Differ-ent cells across our body need to produce particular sets of proteins. Therefore, it is not necessary for all of our genome to be active in every cell of our body. Gene expression mechanism regulates this by adjusting the type and amounts of mRNA in a particular cell. We can utilize this information to identify genes that are actively working in a specific tissue, or we can calculate their co-expression.

Co-expression is calculated using correlation matrix between genes for a gene expression sample set. Once the gene expression data is given, we can calcu-late Pearson correlation from gene expression matrix to obtain correlation matrix which we can name as co-expression matrix. This matrix contains values between −1 and 1 that indicates the co-expression value of two genes. To obtain a net-work from a co-expression matrix, we apply a threshold on absolute co-expression values to convert them to boolean values. This newly obtained matrix will be the adjacency matrix for our gene co-expression network. A typical threshold value could be 0.7, which is often enough to capture strong co-expression between genes. This threshold value could be fine-tuned to obtain a network of desired size. A higher threshold yields smaller networks with stronger connections whereas a smaller threshold yields bigger networks with loose connections.

For ASD gene risk assessment, we focus on gene expression in various brain regions. BrainSpan dataset from Allen Brain Atlas [47, 48] provides 57 samples from 15 brain regions with ages varying from 8 postconception week to 40 years. We cluster these samples with respect to brain regions and then create a gene co-expression network for each brain region and neurodevelopmental period. These

(23)

spatio-temporal networks allow us to feed the neurodevelopmental process into our model. More detail on constructing these networks are provided in Chapter 4 (Methods).

2.5 Protein-Protein Interaction Networks

Protein-protein interaction (PPI) networks denote biochemical interactions of proteins. These interactions are often physical and they represent a specific func-tion within an organism. There are multiple types of protein-protein interacfunc-tions such as transient interactions, permanent interactions, covalent interactions, non-covalent interactions etc. Similarly, we can also categorize PPIs depending on the region of interest. Cell PPIs focus interactions occurring within a cell whereas tissue-specific PPIs include interactions specific to a tissue. There are many PPI databases such as BioGrid [49], HPRD [50] and STRING [51]. These databases are compiled using thousands of academic studies and provide various types of PPIs including the types we have described above. In the scope of ASD risk assessment, PPIs can be used to utilize functional similarity information among genes. A gene could be a risk gene provided that it is functionally similar to other risk genes. However, we didn’t use PPI networks during the training process of DeepASD. Instead, we used the top 1% genes selected by DeepASD as a gene filter for tissue-specific PPI networks and then we identified genes with the most edges in these filtered tissue-specific PPI networks. Such genes are candidate risk genes since they are functionally similar to many other candidate/known risk genes.

2.6 TADA Framework

He et al. [1] propose a statistical framework called as Transmission and De Novo Association (TADA) that uses transmitted and de novo gene disrupting mutation data to obtain an ASD gene risk P -value for each gene. This framework uses a

(24)

Hierarchical Bayes framework, which pools all mutation information to improve overall prediction and still keeps scores gene specific. This model assumes that some genes are disorder-related and the others are not, then it learns the param-eters of each group by performing likelihood maximization and then calculates the score afterwards [1]. This approach improves the previous ASD gene risk prediction power and creates a basis for our mutation burden dataset.

TADA framework is extended with more trios and more data to improve ASD gene risk prediction performance [15]. This new dataset (extTADA) is the most comprehensive to date and it is the dataset we used for our study. In addition to the mutation counts for control and case groups, the dataset also include pLI, intolerance of genes to mutations [52]. pLI values vary between 0 and 1. The value of 1 indicates that the gene is totally intolerant to mutations, i.e., a person with such a mutated gene cannot survive. On the contrary, pLI value of 0 indicates that a gene is totally tolerant to mutations and henceforth different versions of such a gene exist among the population. Our aim for our model is to make it learn the relationships between these values (mutation counts and pLI) and ASD gene risk association. Thus, we feed all these variables to our model instead of simply using TADA scores.

(25)

Chapter 3 Related Work

In this chapter, we discuss and explain related methods in the field.

3.1 DAWN

Detecting Association with Networks (DAWN) is a network-driven Hidden Markov Random Field (HMRF) based ASD gene risk detection algorithm [25, 53]. The DAWN algorithm consists of two main parts: Partial Neighbourhood Selec-tion (PNS) and HMRF parameter estimaSelec-tion. They use TADA scores for ASD as well as gene co-expression networks extracted from BrainSpan transcriptome dataset [48]. Among 4 brain regions and 13 neurodevelopmental periods, they have chosen prefrontal cortex (PFC) region and early fetal to early mid-fetal period (3-5) and PFC region and early mid-fetal to late mid-fetal (4-6) period networks since these networks are labeled as the most critical networks for ASD by Willsey et al. [27]. Since the algorithm is designed to operate on a single network, they provided results using these networks separately.

PNS step of the algorithm eliminates nodes with P -values higher than a thresh-old. Then, it applies a threshold based on the correlation between nodes. First order neighbors of selected nodes are taken and finally the network is constructed

(26)

using a special regression-based approach. This partial network concentrates on the known ASD risk genes along with their first order neighbors. Parameter es-timation part of the algorithm assumes two clusters. First cluster consists of the risk genes and second cluster consists of other genes. Assuming these clusters following a Gaussian mixture distribution, the model iteratively constructs these two clusters. The final version of the first cluster contains the risk genes, ranked by their FDR.

Although the risk genes selected by DAWN are shown to be associated with ASD, the method lacks a few features. DAWN cannot utilize more than one network, and it cannot provide a genome-wide ranking since it can only rank the selected genes.

3.2 Genome-wide Ranking by SVM-based

Clas-sifier

Krishnan et al. [26] propose an SVM based method for ASD gene risk assess-ment. They obtain a large gene interaction dataset by combining thousands of large genomics data and using a naive Bayes classifier. This dataset contains gene expressions, molecular functions, pathway data and various other genomics data. They perform SVM ten times with 5-fold cross-validation and obtain their genome-wide prediction values. While training, they assign evidence weights on input genes. These evidence weights are calculated using the information obtained from multiple sources. Highest evidence genes (E1) are curated from category 1 and 2 genes from Simons Foundation Autism Research Initiative (SFARI) [54] and genes from Online Mendelian Inheritance in Man (OMIM) [55]. Second evi-dence level genes (E2) are curated from category 3 SFARI genes. Third evievi-dence level genes (E3) are obtained from HUGE [56] and GAD [57] databases. Finally, E4 genes are collected from Gene2MeSH (http://gene2mesh.ncibi.org), SFARI category 4 genes and DGA [58] database. In their model, E1 genes are given a weight of 1.0, E2 genes are given a weight of 0.5 and E3 and E4 genes have 0.25

(27)

as their weights. Since there are more E3 and E4 genes compared to combined sum of E1 and E1 + E2 genes, it is important to reduce the overall importance of E3 and E4 genes because they contain considerable amount of noise. We use the same evidence weights in DeepASD since we use the same ground truth set Krishnan et al. curated.

An important analysis performed by Krishnan et al. is spatio-temporal brain region and neurodevelopmental period enrichment. Using the gene expression values for each brain region and for each neurodevelopmental time period, they obtained a heatmap showing the important brain regions and neurodevelopmen-tal periods. According to their findings, early feneurodevelopmen-tal to late midfeneurodevelopmen-tal periods and almost all of the brain regions are active in terms of gene expressions. Their find-ings indicate that genes responsible for neurodevelopmental processes are more active during early fetal to midfetal periods. In this study, we performed a similar analysis as well by letting our model to learn weights per each network during the training process. The explanation and discussion on that analysis could be found in Chapter 5 (Results).

3.3 DAMAGES Score

DAMAGES method uses gene expression data from central nervous system cells of 24 mice, coming from 6 brain regions [59]. 76% of the genes within the database are orthologous to human genes. Likely gene disrupting (LGD) mutation data are used to calculate DAMAGES score by a 2-step process. First, PCA is performed on log2 transformed expression intensity values, then, LASSO regularization is applied on the principal components obtained before with respect to mutation sources (probands versus siblings) to evaluate the contribution of each principal component. After this process, a DAMAGES score indicating ASD risk for each gene is obtained. Furthermore, the score is then converted to an ensemble score by applying logistic regression on (i) the combined dataset of ExAC metrics (pLI and mis-Z) and (ii) the original DAMAGES scores. In this study, instead of creating a network out of gene expression data, they have used the actual gene expression

(28)

data itself. This study provides improved results compared to Krishnan et al. and DAWN.

3.4 ST-Steiner

ST-Steiner is a method based on prize collecting steiner trees [20]. Similar to other studies in this area, ST-Steiner also uses gene co-expression networks. The algorithm picks the optimum tree that maximizes the profit where each node has a prize based on its TADA q-value and each edge has a penalty inversely correlated with co-expression value of two connecting nodes. ST-Steiner optimizes the following equation.

oF(F ) = X e∈EF c(e) + β X v /∈VF p(v) + ωκF, β ≥ 0, ω ≥ 0 (3.1)

β value in this equation is a multiplier for each node prize, c(e) is the cost function for edges and p(v) is the prize function. κF denotes the number of connected subgraphs (trees). Thus, this algorithm can find more than one tree. However, it performs on a single gene interaction network. Another feature of this algorithm is that it can model spatio-temporal development of brain and can incorporate multiple networks as a spatio-temporal cascade. The results of the algorithm on networks that belong to earlier time windows can be used to boost the node prizes of the genes in the following (later) time windows. Thus, the algorithm has a higher chance of selecting these nodes (genes). This way, spatio-temporal development of the brain can be modelled. However, the output tree size adjustment is time consuming, and this method cannot provide a genome-wide ranking since it only selects a subset of genes as candidate risk genes.

(29)

Chapter 4 Methods

4.1 Construction of Gene Co-Expression

Net-works

Using the BrainSpan data published by Allen Brain Atlas [48, 60], we have con-structed 52 networks based on 4 brain region clusters and 13 neurodevelopmental windows. BrainSpan data contain samples from 57 postmortem brains and 16 brain regions. Ages of these samples range between 8 postconception week (pcw) to 40 years, which is modeled using 13 neurodevelopmental windows constructed by applying a sliding window approach on 15 neurodevelopmental time periods. The sliding windows and the original time periods are given in Table 4.1. This spatio-temporal system of networks helps us to model human brain development. We generalized 16 brain regions into 4 clusters based on their transcriptional similarity as Willsey et al. [27] proposed. These 4 brain regions are (i) PFC-MSC (prefrontal cortex and primary motor-somatosensory cortex), (ii) MDCBC (mediodorsal nucleus of the thalamus and cerebellum), (iii) V1C-STC (primary visual cortex and superior temporal cortex) and (iv) SHA (striatum, hippocam-pus and amygdala). The brain regions inside each cluster is given in Table 4.2.

(30)

Each of the 52 networks are constructed after filtering samples according to their brain regions and ages. After filtering, each network is constructed by applying Pearson correlation among samples and then applying a threshold r = 0.8 on this absolute threshold value. We use 25,825 nodes in our model, so each network must contain exactly 25,825 nodes. However, after applying a threshold, some nodes are removed since they are isolated. We add self-loops to these isolated nodes to obtain the required node count. Each network is named after its brain region and time window. For instance, MDCBC9-11 network is constructed by using samples from mediodorsal nucleus of the thalamus and cerebellar cortex, and samples that are in age periods between 9 and 11 (inclusive).

Table 4.1: Neurodevelopmental time periods proposed by Willsey et al. (2013). Since the neurodevelopment mainly occurs during pregnancy period, higher pre-cision for this period is provided. Sliding window approach uses these 15 periods to create 13 windows. The window size is 3. Thus, the windows are constructed as [1-3], [2-4], [3-5], ..., [13-15].

Time window Age interval

1 Embryonic (4-8 pcw) 2 Early fetal (8-10 pcw) 3 Early fetal 2 (10-13 pcw) 4 Early mid-fetal (13-16 pcw) 5 Early mid-fetal 2 (16-19 pcw) 6 Late mid-fetal (19-24 pcw) 7 Late fetal (24-38 pcw)

8 Neonatal & early infancy (0 - 6 months)

9 Late infancy (6 - 12 months)

10 Early childhood (1 - 6 years)

11 Middle and late childhood (6 - 12 years)

12 Adolescence (12 - 20 years)

13 Young adulthood (20 - 40 years)

14 Middle adulthood (40 - 60 years)

(31)

Table 4.2: Brain region clusters used to construct 52 co-expression network using BrainSpan data. These clusters are formed by using hierarchical clustering based on transcriptional similarity of these brain regions (Willsey et al., 2013).

Cluster Name Brain Regions in Cluster

PFC-MSC

M1C (Primary motor cortex), S1C (Primary somatosensory cortex), VFC (ventral prefrontal cortex), MFC (medial prefrontal cortex), DFC (dorsal prefrontal cortex), OFC (orbital prefrontal cortex)

MDCBC MD (mediodorsal nucleus of the thalamus), CBC (cerebellar cortex)

V1C-STC

V1C (primary visual cortex), ITC (inferior temporal cortex), IPC (posterior inferior parietal cortex), A1C (primary auditory cortex), STC (superior temporal cortex)

SHA S (striatum), H (hippocampus), A (amygdala)

4.2 Ground Truth Gene Sets

DeepASD works in a semi-supervised setting. A subset of genes with various evidence levels are used as ground truth genes. There are 1,823 labeled genes aout of 25,825 genes. Using the networks and mutation burden (TADA) data, a score for each gene is calculated so that we can obtain a genome-wide ranking. We use the same labels (both positive and negative) provided by Krishnan et al. (2016) to provide a fair comparison.

Our positive ground truth set consists of 638 genes with 4 different evidence lev-els (E1, E2, E3 and E4). These genes are collected from SFARI [54], OMIM [55], DGA [58], GAD [57], HUGE [56] and Gene2Mesh (http://gene2mesh.ncbi.org/). E1 genes indicate the highest evidence and evidence levels decrease towards E4 genes. There are 46 E1 (with evidence weight 1.0) genes provided in Table A.1, 67 E2 (with evidence weight 0.5) genes provided in Table A.2 and 525 E3 and E4 genes (with evidence weight 0.25) provided in Table A.3. Similarly, there are 1,185 non-mental health related genes with a negative label provided in Table A.4 and Table A.5. The evidence weight for all non-mental health related genes is 1.0, which is the same as E1 genes. Performance metrics are calculated using E1 and non-mental health related genes as it is also done by Krishnan et al. (2016).

(32)

4.3 DeepASD Model

Parts of the DeepASD model and its training process are described in the follow-ing sections. DeepASD model is illustrated in Figure 4.1.

4.3.1 Graph Convolutional Network

Computer vision field achieved a milestone after convolutional neural networks (CNN) are first proposed [61]. Utilizing the spatial information within images us-ing filters (kernels) have proven successful over traditional computer vision tech-niques. Applying CNNs to graphs is not possible due to the irregular structure of graphs. Thus, spectral and spatial graph convolution algorithms are proposed to apply the same concepts on graphs, which are proven to be successful [62, 63, 64]. Nevertheless, these algorithms were slow. Kipf and Welling proposed the graph convolutional network (GCN) algorithm that estimates convolutional filters as the Chebsyhev expansion of the graph Laplacian [65, 66]. This algorithm uses message passing to perform convolution on first order neighbors. GCN learns a weight for each of its layer neurons. Graph convolution uses the formula below

Hk[i] = σ ˆD−0.5E ˆˆD−0.5Hk−1[i]Wk−1

(4.1) In the equation above, Hk[i] denotes the output of the single graph convolution operation of layer k for gene i, ˆD denotes the diagonal degree matrix, ˆE is the normalized adjacency matrix with self-loops, and Wk is the trainable weight ma-trix, and finally, σ is an activation function. We use Rectified Linear Unit (ReLU) as our activation function. Kipf and Welling proposed that cascading 2 layers of GCN yields the best results, and hence we apply the same. At the end of the sec-ond layer, we use Sof tmax as our activation function instead of ReLU to obtain probability values as a result. For DeepASD, we have d1 = 4 for our first layer hidden neuron count, and we have d2 = 2 for our final layer output neuron count. Each of the 52 networks are fed into a 2-layer GCN module, and they produce 52 probability values (for each network). Layer neuron counts are fixed for all

(33)

52 GCN modules. Probability values produced by each GCN module are then combined using mixture of experts model.

4.3.2 Early Stopping

Early stopping is a widely used technique in deep learning that provides an im-plicit regularization on some convex problems [67, 68]. DeepASD performs early stopping during cross-validation to prevent overfitting. As the complexity of a model increases, the model becomes more prone against overfitting. To avoid this, we generally use generalization methods or perform early stopping. In DeepASD, early stopping is performed by monitoring validation loss during the training process. A sliding window of size 7 is used to monitor the validation loss. If the validation loss decreases during the entire window, training process halts immediately.

4.3.3 Weight Decay

Weight decay is a simple regularization method that reduces the model weights by the provided amount after each epoch [69]. Square of all parameters are multiplied with weight decay parameter so that we penalize bigger weight values. This forces model to keep weights as small as possible while optimizing the loss function. During backpropagation, derivative of the term with weight decay becomes 2 × wd× w, wd being the weight decay parameter and w is the actual weight matrix, and it is subtracted from previous weights. Thus, parameters are ’decaying’ over time. We use the weight decay value of 1e − 4 for our model.

4.3.4 Dropout Regularization

Dropout is another method of regularization that is widely used. The main idea of dropout is to drop random neurons during training process [70]. This method

(34)

uses a parameter called p, which indicates the probability of dropping a neuron. Lower values of p makes the dropping seldom, whereas higher values of p could cause higher dropping rate. It is shown that using p = 0.2 for input neurons and p = 0.5 for hidden neurons generally gives the optimal results [70].

Randomly dropping neurons forces the model to perform using a different set of neurons each time. This way, we prohibit a couple of neurons becoming the main source of information, and the model becomes capable of performing with any subset of neurons. For DeepASD, we use dropout between the first and second layers of GCNs and p = 0.5 value is used for all GCN modules.

4.3.5 Mixture of Experts

Mixture of experts (MoE) is a method for combining multiple weak learners and dividing and sharing the feature space among these weak learners [71]. MoE con-sists of multiple learners and a gating network. In our case, the gating network is a single layer neural network without any hidden layers, and it acts as a su-pervisor that assigns a part of feature space to each learner. It takes the feature matrix X as input, and then outputs a weight vector ~v, which contains a weight per weak learner. The gating network applies Sof tmax to its outputs so that all the weights assigned are summed up to 1.

DeepASD uses MoE to combine outputs produced by each of the 52 networks. It takes the output of each GCN module and multiplies it with the weight assigned by gating network. Let ~H[i] be the output vector of all 52 networks for gene i.

~

v[i] = M oE(X[i]) is the weight vector produced by MoE for gene i. Final output of DeepASD for gene i then becomes y[i] = ~H[i] · ~v[i].

4.3.6 Cross-validation Setting

Cross-validation is a popular method used in machine learning for hyper-parameter optimization. We have used it to optimize hyper-parameters for DeepASD

(35)

as well. Additionally, we are using cross-validation to depict our results and com-pensate for the low E1 gene count. DeepASD uses an adjusted version of k-fold cross-validation with k = 5. All genes with labels are uniformly and randomly divided among 5 folds, and 1 fold is selected as test fold, whereas an other fold is selected as validation fold, and the rest are our training set. During an iteration, the model is trained until the model meets early stopping condition on validation fold, and then we record the performance on the test fold. Then, the validation fold changes as the test fold is fixed. This process is repeated until all folds but the fixed test fold becomes the validation fold. Then, the test fold changes and the whole process repeats until all folds become test fold at least once. With k folds, this process runs for k × (k − 1) folds, which is 20 for k = 5. We run this algorithm 10 times, and therefore obtain 200 results.

4.3.7 Optimizer and Loss Function

DeepASD uses Adam optimizer to perform training. Adam optimizer is an adap-tive momentum estimating algorithm [72]. It is developed to utilize advantages of two other optimizers: Adagrad and RMSProp. Adam optimizer uses a sep-arate learning rate for each parameter, as like Adagrad and RMSProp, and it uses both average first and second moments unlike the other optimizers. We use Adam optimizer with the default parameters provided by PyTorch. The learning rate we use in DeepASD is 7 × 10−4.

Categorical cross entropy loss with evidence weights are used in DeepASD since we have two classes and our task is to perform classification. We tried hinge loss, L1, L2 regression losses and focal loss, and cross entropy loss provided the best results among all [73].

(36)

Figure 4.1: The architectural model of DeepASD for genome-wide ASD gene risk assessment. The model uses TADA features as well as 52 gene co-expression networks extracted from BrainSpan dataset. The feature set includes de novo loss of function (and missense) mutation counts, transmitted mutation counts for control and case groups, pLI value, de novo mutation frequency values and protein truncating variant counts. TADA dataset is used by all GCN modules and the gating network. This whole system produces a single probability value ˆy for each of the 25,825 genes and it is end-to-end trainable.

(37)

Chapter 5 Results

5.1 Comparison

against

the

State-of-the-art

Methods

0 0.2 0.4 0.6 0.8 1

ASD

Area under precision-recall curve

0.75 0.8 0.85 0.9 0.95 1

Krishnan et al. DeepASD

Area under ROC curve

a b

Figure 5.1: ROC and precision-recall curve distribution comparison between DeepASD and Krishnan et al. (a) Area under ROC curve distribution between the two methods. (b) Area under precision-recall curve distributions comparing the same methods as in (a). Both in (a) and (b), outlier points are depicted. The solid center line depicts the median, dashed line depicts the mean value for each panel. Box limits demonstrates lower and upper quartiles and whiskers denote 1.5 interquartile range.

(38)

algorithm of Krishnan et al. (2016), which is run 10 times with a 5-fold cross-validation setting, obtaining 50 values of area under AUC and area un-der precision-recall (AUPRC) [26]. DeepASD uses a modified version of 5-fold cross validation and is also run 10 times to obtain results from 200 epochs. Deep-ASD achieves median AUC value of 0.9127 and mean AUC value of 0.9113. SVM based method of Krishnan et al. achieves median AUC of 0.8983 and mean AUC of 0.8961. DeepASD improves this state-of-the-art performance by 1.4%. Simi-larly, DeepASD provides 26% improvement in terms of AUPRC with its median AUPRC value of 0.6304 (mean: 0.6396) compared to 0.3695 median AUPRC value (mean: 0.377) of Krishnan et al. All of these performances are calculated using only E1 and non-mental-health related genes, as it is also done by Krishnan et al. The comparison between DeepASD and Krishnan et al. is illustrated in Figure 5.1.

We also compare our method to SVM based algorithm of Krishnan et al. in terms of the distribution of probability values calculated by each algorithm. Since DeepASD uses Sof tmax layers, it produces probability values. However, SVM approach of Krishnan et al. gives distance to the decision boundary. As suggested in their paper, these distance values are converted to probability values by apply-ing isotonic regression on these values. Then, they detect the ’knots’ (group of points with same value) and use cubic hermite spline to interpolate these ’knots’. This method gives a smooth transition and maps distance values to [0, 1] range. Such a method can be applied to any result coming from SVM. As an example, we have converted SVM output values to probabilities for intellectual disability (ID) using isotonic regression twice. Firstly, we have applied ten-folds cross-validation on original data to obtain knots for each of them. Then, we observed that it does not yield a smooth curve for whichever interpolation method we use. Thus, we applied the isotonic regression for a second time. Resulting knots are connected via linear interpolation (Figure 5.2). In Figure 5.3, probability values of Deep-ASD and Krishnan et al. are illustrated by scatter plots. Figure 5.3a depicts the probabilities assigned to E1 genes (red) vs all other genes (gray). Points above the x = y line favors DeepASD, meaning higher probabilities are assigned to those by DeepASD. Among 46 E1 genes, 33 of them are above the x = y line. This

(39)

clearly shows that DeepASD assigns higher probability to E1 genes compared to SVM method of Krishnan et al. The vertical bands are formed since SVM based algorithm of Krishnan et al. assigns a fixed probability to a group of genes due to the ’knots’ of isotonic regression. In Fig 5.3b, non-mental-health related genes are highlighted against all other genes. In this case, points below the x = y line are favorable for DeepASD since it indicates that Krishnan et al. assigns a higher probability value to these negatively labeled genes compared to DeepASD. 953 of the 1185 non-mental-health related genes are below the x = y line, meaning that DeepASD assigns a lower probability of being a risk gene to these genes. These probability values are calculated by taking the mean of 200 epochs, excluding the results coming from the training set. Thus, we can conclude that the model learns to assign high probability to known ASD risk genes by looking other similar known ASD risk genes, and the other way around for non-mental-health related genes.

Finally, we compare DeepASD with Krishnan et al., Zhang and Shen’s DAMA-GAES score [59] and DAWN [25] (with PFC-MSC3-5 and PFC-MSC4-6) in terms of precision-recall (Figure 5.4). Since all these methods produce a final ranking, we could utilize them to obtain precision-recall curves. Three plots are created based on the precision-recall values for E1 genes, E1 + E2 genes and all positive genes, respectively. We can observe that for the first two settings, DeepASD has a better precision-recall plot. For the last one, the performance gain between DeepASD and the other methods degrades. This degradation stems from the fact that E3 and E4 genes are relatively low evidence genes, and therefore they might not be good indicators for ASD etiology. In addition, count of E3 + E4 genes is almost 5 times the count of E1 + E2 genes. Thus, they dominate the resulting precision-recall plot. Still, DeepASD manages to produce better precision-recall curves compared to other methods when considering E1 and E2 genes as the ground truth. To see if the improvement od DeepASD is statistically better, we provide P values for each test with each gold standard combination. in Table 5.1. Table 5.1 verifies that DeepASD produces significantly better results for E1 and E2 genes as the ground truth genes, and Krishnan et al. produce better re-sults for all positive gold standard genes. DeepASD still yields improved rere-sults

(40)

a b

c

Figure 5.2: Process of smoothing SVM output values. Using ten-fold cross-validation, 10 different isotonic regression model is fit. Then, we find the knots in each model and combine them to obtain (a). The transitions are not smooth and further smoothing is required. (b) Isotonic regression is applied once more on this knot collection to obtain a smoother curve. (c) Linear interpolation is used to obtain final mapping.

compared to the remaining state-of-the-art methods for all positive gold standard genes. In Figure 5.4, all plots are shown for the top 2,000 genes since after that rank, all recall values become 1 and precision values get close to 0. Similarly, we start plotting the graph from the first rank except Figure 5.4c, where the starting rank is set to 5 to obtain a smoother precision-recall curve and avoid zig-zags for recall values around 0.

(41)

a

b

Figure 5.3: Probability value scatter plot of DeepASD and Krishnan et al. (a) Probability values of E1 genes against all other genes for both methods. (b) Non-mental-health related genes compared to all other genes. Both panels contain the same 25,825 genes. y = x line (gray) is also drawn for visual aid.

Table 5.1: P value comparison for the rankings of DeepASD, Krishnan et al., DAWN and DAMAGES score using different set of gold standard genes for eval-uation. Each P value is calculated using two-sided Wilcoxon rank-sum test.

P Value

Evidence Sets DeepASD Krishnan et al. DAMAGES score DAWN PFC-MSC3-5 DAWN PFC-MSC4-6 E1 7.33 × 10−22 _{5.07 × 10}−20 _{3.61 × 10}−16 _{2.23 × 10}−01 _{5.57 × 10}−01 E1+E2 3.7 × 10−30 _{1.11 × 10}−25 _{4.75 × 10}−21 _{1.73 × 10}−01 _{2.56 × 10}−01 E1+E2+E3+E4 7.69 × 10−37 _{2.97 × 10}−52 _{1.4 × 10}−13 _{1.82 × 10}−21 _{4.38 × 10}−17

(42)

0 0.2 0.4 0.6 0.8 1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Krishnan et al. DAWN PFC-MSC4-6 DAWN PFC-MSC3-5 Zhang and Shen DeepASD Recall P re ci si on 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Recall P re ci si on 0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1 Recall P re ci si on

a

bb

c

E1 genes

E1 + E2 genes

E1 + E2 + E3 + E4 genes

Figure 5.4: Precision-recall curves for DeepASD, DAWN (MSC3-5 and PFC-MSC4-6), Krishnan et al. and Zhang and Shen DAMAGES score. (a) The curve for E1 plus non-mental-health genes. (b) E1 + E2 genes are used. (c), all gold standard genes are used. All precision-recall curves have a cutoff rank of 2000. (c) has starting rank value of 5 whereas (a) and (b) have starting rank value of 1.

(43)

5.2 Enrichment Analysis

We use the average (final) ranking produced by DeepASD to perform enrichment analyses on various biological pathways related to ASD such as WNT (Table A.6) [74] and MAPK (Table A.7) [75] signaling, target gene lists of transcrip-tion regulators like CHD8 (Table A.8, Table A.9 and Table A.10) [76], RBFOX (Table A.11, Table A.12 and Table A.13) [77, 78], FMRP (Table A.14 and Table A.15) [79, 80] and TOP1 (Table A.16) [35] and biological processes and molecular functions such as post-synaptic density complex (Table A.17, Table A.18 and Ta-ble A.19) [81, 82] and histone modifications (TaTa-ble A.20 and TaTa-ble A.21) [2, 83]. Although these gene sets are not the gold standard, they are shown to be asso-ciated with the etiology. Thus, we aim to show that DeepASD gene ranking is enriched in these gene sets. We divided the whole gene set into deciles (2,582 genes in each decile except the last decile) and we perform enrichment analysis by calculating the overlap percentage of a particular gene set for each decile. Since the top-ranked genes are in the first decile, we expect to observe a high overlap percentage of the first decile compared to other deciles and we expect a decreasing overlap percentage towards the last decile.

Figure 5.5 shows enrichment results for the specified gene sets. We observe that the first decile is highly enriched in all of these gene lists, which is also supported by the calculated P values using the Binomial test. While calculating the P value, we assert our null hypothesis by assuming that the gene ranking is determined randomly, therefore each decile is evenly enriched. Among all of the pathways, target genes and biological processes we have inspected, DeepASD ranking is enriched in FMRP the most, followed by post-synaptic density genes and CHD8 target genes in terms of statistical significance. We also expect to see a decreasing enrichment pattern towards the last decile, however, such a pattern does not exist for any of the gene lists except FMRP target genes. This is because the remaining genes (after the first decile) are not ranked high by DeepASD. Either these genes are not ASD related or DeepASD cannot find sufficient evidence to assign higher risk probability to these genes. But still, having a high enrichment for all these sets show the ranking produced by DeepASD is sensible.

(44)

Figure 5.5: Enrichment analysis for WNT and MAPK pathways, CHD8, RBFOX (splice and peak), FMRP and TOP1 target genes, post-synaptic density complex and histone modification processes. P values are calculated using Binomial test.

(45)

In addition to the enrichment analysis described above, we also take the top 1% (258) genes (Table A.22 and Table A.23) from DeepASD rank-ing and checked their enrichment against other pathways usrank-ing Enrichr (https://amp.pharm.mssm.edu/Enrichr/). Enrichr found out that DeepASD’s top 1% genes are highly enriched in developmental biology, neuronal system, transmission accross chemical synapses and axon guidance pathways of BioPlanet 2019 dataset [84]. BioPlanet 2019 dataset contains 560 pathways. Table 5.2 con-tains top 29 pathways filtered taking samples with P < 10−4. We can observe that the top genes predicted by DeepASD regulates neurodevelopmental system and biological development. In addition, we can also observe that the circadian rhythm pathway is also enriched. Pinato et al. and Zuculo et al. suggest that ASD could cause dysregulations in circadian cycle, which could cause sleep disor-ders that exacerbate ASD associated behavioral disordisor-ders [85, 86]. The pathway with name ”RORA activates circadian expression” from BioPlanet 2019 dataset is also enriched, which supports the circadian rhythm disruption relation with ASD. A mutation in the top 258 genes predicted by DeepASD could cause disruption in the nerve growth factor (NGF) signaling pathway. NGFs are responsible for neuron growth and maintenance, therefore they have a crucial role in human neurodevelopment. Consequently, most of the pathways in Table 5.2 are related with neurodevelopment and nervous system. To compare, we can check the path-ways with the highest P values. Bottom 5 pathpath-ways with highest P values are as follows: Olfactory transduction, metabolism, GPCR ligand binding, class A GPCRs (rhodopsin-like) and carbohydrate metabolism. None of these pathways are known to be associated with ASD, which supports the fact that the pathways with the lowest P values are in fact good candidates for further research on ASD to reveal more about its complex nature.

(46)

Table 5.2: DeepASD top 1% gene enrichment in BioPlanet 2019 dataset. Top 29 pathways are included out of 560 pathways. Enrichr tool is used to calculate enrichment values. Pathways are sorted in increasing order with respect to P values.

Pathway Overlap P-value Odds Ratio

Developmental biology 27/420 8.21E-12 4.96

Neuronal system 22/283 2.01E-11 6.00

Transmission across chemical synapses 17/190 4.80E-10 6.91

Axon guidance 18/325 2.67E-7 4.28

Circadian rhythm 8/62 1.31E-6 9.96

Multi-step regulation of transcription by PITX2 6/28 1.32E-6 16.6

Interaction between L1-type proteins and ankyrins 5/26 1.85E-5 14.9

Signaling by NGF 12/221 3.30E-5 4.19

Long-term potentiation 7/70 3.37E-5 7.72

Presenilin action in Notch and Wnt signaling 6/48 3.46E-5 9.65

Integration of energy metabolism 9/125 3.69E-5 5.56

CRMPs in Sema3A signaling 4/16 4.42E-5 19.3

NGF signaling via TRKA from the plasma membrane 9/136 7.14E-5 5.11

Lipid metabolism regulation by peroxisome proliferator-activated receptor alpha (PPAR-alpha) 8/112 1.06E-4 5.52

L1CAM interactions 7/94 2.21E-4 5.75

Wnt signaling pathway 11/231 2.23E-4 3.68

YAP1- and WWTR1 (TAZ)-stimulated gene expression 4/26 3.28E-4 11.88

Energy metabolism 5/48 3.81E-4 8.04

RORA activates circadian expression 4/27 3.81E-4 11.4

Control of gene expression by vitamin D receptor 4/27 3.81E-4 11.4

GABA (A) receptor activation 3/12 4.33E-4 19.3

Opening of calcium channels triggered by depolarization of the presynaptic terminal 3/12 4.33E-4 19.3 Transcriptional regulation of white adipocyte differentiation 6/77 4.89E-4 6.02

Reelin signaling pathway 4/29 5.06E-4 10.7

CDO in myogenesis 4/29 5.06E-4 10.7

Inactivation of GSK3 by Akt causes accumulation of beta-catenin in alveolar macrophages 4/30 5.78E-4 10.3

GABA A and B receptor activation 5/53 6.06E-4 7.29

SMAD2/3 nuclear pathway 6/82 6.85E-4 5.65

Calcium regulation in the cardiac cell 8/149 7.35E-4 4.15

(47)

5.3 CNV Region Analysis

Certain copy number variants (CNVs) are associated with ASD. These CNVs are large regions that contain multiple genes. We inspect 5 CNVs that are particu-larly associated to ASD. These 5 CNVs are 16p11.2, 15q11-13, 15q13.3, 1q21.1 and 22q11 [26]. In this section, we consider novel genes that are within these CNVs and (i) not labeled or have weak confidence, (ii) assigned high posterior probability by DeepASD and (iii) assigned relatively lower posterior probability by other methods. Genes with evidence levels E1 and E2 are expected to be as-signed higher posterior probability since they also have strong prior information. Such genes are excluded from this analysis. Genes spanned by these CNVs are likely risk genes but still, they may not necessarily be ASD related. In Figure 5.6, Figure 5.7 and Figure 5.8, gene ranks for each CNV is given.

NIPA1 gene inside 15q11-13 location is ranked 619th _{although it is labeled as} non-mental health related. DAWN does not have a ranking for this gene since it is not co-expressed inside the network used by DAWN algorithm. DAMAGES score rank of NIPA1 is 5185 and Krishnan et al. ranks it as 3267th. TADA q-value for NIPA1 gene is 0.718 which indicates no significant association. This shows us that DeepASD could differentiate important genes even if they have a negative label and/or they have no significant prior knowledge. NIPA1 gene is not involved within any pathway or target gene lists we have discussed in section 5.2. According to NCBI entry of NIPA1, it encodes magnesium transporter in various epithelial and neuronal cells (https://www.ncbi.nlm.nih.gov/gene/123606). This gene is believed to be playing a role in neuron development and maintenance. Therefore, disruption of this gene may damage neurodevelopment and cause ASD as a result.

SLC25A1 is another gene of interest in 22q11 since it is not labeled and ranked as 430th _{by DeepASD. SLC25A1 is ranked as 2643}rd_{, 8333}rd _{and 12689}th _by Kr-ishnan et al. DAWN PFC-MSC4-6 and DAMAGES score, respectively. DAWN PFC-MSC3-5 does not have a ranking for SLC25A1. SLC25A1 has TADA q-value of 0.1819 and it is inside post-synaptic density complex. SLC25A1 protein

(48)

regulates the movement of citrate inside of mitochondria, and it is indicated that the dysregulation of mitochondrial metabolism may lead to neurodevelopmen-tal issues [87]. Therefore, mutations of this gene might be important for ASD symptomatology.

CORO1A gene, which is contained within 16p11.2 location, is another candi-date ASD risk gene that is ranked as 293th _{by DeepASD. CORO1A is ranked as} 19729th_{, 3745}th_{, 8863}rd_{and 408}th_{by Krishnan et al., DAWN (with both networks)} and DAMAGES score, respectively. CORO1A gene is unlabeled and its TADA q-value is 0.0922. Its prior information is relatively significant and it is inside post-synaptic density complex gene list. According to SFARI, CORO1A is a strong evidence ASD gene due to the detected two de novo gene disrupting mutation on CORO1A, and it is labeled as a risk gene by SFARI with 0.05 < F DR < 0.1. Our results, along with DAMAGES score ranking, also supports the scoring provided by SFARI for this gene.

(49)

0 10k 20k MAGEL2 (E1) UBE3A (E3-E4) SNRPN (E3-E4) SNURF-SNRPN (E3-E4) NDN (E3-E4) ATP10A (E2) MKRN3 NIPA1 (Neg.) SNURF HERC2P3 GOLGA6L2 NBEAP1 LOC727924 PAR5 GOLGA8EP OR4N4 NIPA2 OR4M2 CYFIP1 NPAP1 HERC2P2 GOLGA6L6 GOLGA6L1 OR4N3P CXADRP2 LOC440243 GOLGA8DP TUBGCP5 GOLGA8I 15q11-13 Rank 0 10k 20k ALDOA (E3-E4) SEZ6L2 (E3-E4) PPP4C (E3-E4) MAZ (E3-E4) DOC2A (E3-E4) PRRT2 TAOK2 (E3-E4) ASPHD1 KCTD13 (E3-E4) FAM57B YPEL3 MAPK3 (E3-E4) CORO1A C16orf92 MVP ZG16 SPN INO80E SLC7A5P1 GDPD3 C16orf54 PAGR1 TBX6 HIRIP3 (E3-E4) CDIPT TMEM219 QPRT KIF22 16p11.2 Rank

Figure 5.6: The genes spanned by 16p11.2 and 15q11-13 CNVs. For each gene, DeepASD rank is provided along with its evidence level if it is labeled.

(50)

5k 10k 15k 20k 25k BCL9 PRKAB2 GJA5 GJA8 FMO5 NBPF8 NBPF24 PDIA3P CHD1L ACP6 GPR89B GPR89C 1q21.1 Rank 5k 10k 15k 20k 25k TRPM1 (Neg.) CHRNA7 (E2) KLF13 OTUD7A MTMR10 ARHGAP11B HERC2P10 LOC283710 FAN1 15q13.3 Rank

Figure 5.7: The genes spanned by 15q13.3 and 1q21.1 CNVs. For each gene, DeepASD rank is provided along with its evidence level if it is labeled.

Predicting informative spatio-temporal neurodevelopmental windows and gene risk for autism spectrum disorder

PREDICTING INFORMATIVE

SPATIO-TEMPORAL

NEURODEVELOPMENTAL WINDOWS AND

GENE RISK FOR AUTISM SPECTRUM

DISORDER.

a thesis submitted to

the graduate school of engineering and science

of bilkent university

in partial fulfillment of the requirements for

the degree of

master of science

in

computer engineering

By

O˘

guzhan Karakahya

October 2020

ABSTRACT

PREDICTING INFORMATIVE SPATIO-TEMPORAL

NEURODEVELOPMENTAL WINDOWS AND GENE

RISK FOR AUTISM SPECTRUM DISORDER.

¨

OZET

OT˙IZM SPEKTRUM BOZUKLU ˘

GU ˙IC

¸ ˙IN B˙ILG˙I

VER˙IC˙I ZAMAN-UZAMSAL S˙IN˙IR GEL˙IS

¸ ˙IM ARALI ˘

GI

VE GEN R˙ISK˙I TAHM˙IN˙I

Acknowledgement

Contents

List of Figures

List of Tables

Chapter 1

Introduction

Chapter 2

Background Information

2.1

De Novo Gene Disrupting Mutation

2.2

Biological Pathways and DNA/RNA

Bind-ing Proteins

2.3

Copy Number Variation

2.4

Gene Co-expression Network

2.5

Protein-Protein Interaction Networks

2.6

TADA Framework

Chapter 3

Related Work

3.1

DAWN

3.2

Genome-wide Ranking by SVM-based

Clas-sifier

3.3

DAMAGES Score

3.4

ST-Steiner

Chapter 4

Methods

4.1

Construction of Gene Co-Expression

Net-works

4.2

Ground Truth Gene Sets

4.3

DeepASD Model

4.3.1

Graph Convolutional Network

4.3.2

Early Stopping

4.3.3

Weight Decay

4.3.4

Dropout Regularization