Analysis using top‐k skyline query of protein‐protein interaction reveals alpha‐synuclein as the most important protein in Parkinson’s disease

Parkinson’s disease is the second‐most‐common neurodegenerative disorder and can reduce patients’ quality of life. The disease is caused by abnormalities in dopaminergic neurons, such as reactive oxygen species (ROS) imbalance leading to programmed cell death, protein misfolding, and vesicle trafficking. Protein‐protein interaction (PPI) analysis has been shown to improve the understanding of candidate proteins that might contribute to multifactorial neurodegenerative diseases, particularly Parkinson’s disease. PPI data can be obtained from experiments and computational predictions. However, experimental data are often limited in interactome coverage. Therefore, additional computational prediction methods are required to provide more comprehensive PPI information. PPI can be represented as protein‐protein networks and analyzed based on centrality measures. A previous study has shown that the top‐k skyline query, a method that applies dominance rules to centrality measures, reveals important protein candidates in Parkinson’s disease. This study applied the top‐k skyline query to PPIs containing experiment and prediction data to find important proteins in Parkinson’s disease. The result shows that alpha‐synuclein (SNCA) is the most important protein and is expected to be a potential biomarker candidate for Parkinson’s disease.


Introduction
Parkinson's disease (PD) is recognized by several symptoms, such as decreased motor function, autonomic dysfunction, hallucinations, and depression (DeMaagd and Philip 2015). As the disease may worsen and cause pneumonia, it can threaten patients' lives. Furthermore, the disease can lower patients' quality of life and impact their families and society (DeMaagd and Philip 2015). The disease burden was estimated to rise from 4.1 to 4.6 million in 2005 to 8.3 to 9.3 million in 2030 (Dorsey et al. 2007), which may broadly impact populous nations, particularly several Asian countries such as China, India, and Indonesia. Currently, PD is known as one of the most common neurodegenerative disorders, with an incidence ranging from 16 to 19 per 100,000 people per year (Twelves et al. 2003; Lebouvier et al. 2009; WHO 2004), and is expected to overtake cancer as the second most common cause of death in 2040. Furthermore, the economic burden of PD, covering the direct and indirect costs of treatment, reached US$ 1,100 million worldwide (Twelves et al. 2003; WHO 2004).
PD is a disease of neuron dysfunction. It mainly impacts dopaminergic receptors due to several factors, such as reactive oxygen species (ROS)-induced cell death (Dias et al. 2013), protein misfolding (Tan et al. 2009), or changes in proteins that are responsible for vesicle trafficking (Esposito et al. 2012) and G protein activation (Odagaki and Toyoshima 2006), along with many other proteins that should be examined carefully. Proteins interact with each other in carrying out their functions, which is called protein-protein interaction (Chang et al. 2016). Protein-protein interaction (PPI) is a good representation for unraveling protein functions, disease-disease associations, and disease-gene associations (Liu et al. 2015; Chang et al. 2016). Therefore, PPI analysis to predict significant protein candidates that play a role during disease progression provides a better understanding of multifactorial degenerative diseases, including PD. Currently, many databases store PPI information, such as STRING. The STRING database (string-db.org) is a PPI database with the largest number of organisms and proteins (Szklarczyk et al. 2018). The database provides two types of interaction. The first is experimental data obtained from experiments. The second is predicted interaction data obtained from many methods, including co-expression analysis, detection of shared selective signals across genomes, text-mining, and computational transfer of knowledge based on gene ontology (Szklarczyk et al. 2018). STRING's experimental protein interaction information was collected from other databases such as BIND, DIP, GRID, HPRD, IntAct, MINT, and PID.
PPI analysis is often limited by interactome coverage, where the interactome is the set of PPIs that can occur inside a cell (Yu and Fotouhi 2006). Interactome coverage is the ratio between the PPIs observed inside the cell and the full interactome, often stated as a percentage (%). For example, humans are predicted to have 650,000 PPIs (Stumpf et al. 2005). However, the Human Protein Reference Database (HPRD) (https://hprd.org), accessed in December 2019, only has information on 41,327 PPIs, covering 6.3% of the interactome. Experimental data can be combined with prediction data to improve interactome coverage (Jansen et al. 2002; Liu et al. 2015). The PPI network can be represented as a graph with proteins as nodes and interactions as edges. Centrality measures can be applied to find subnetworks and even the importance of a node in a network. Thus, the data can be transformed from a graph into objects with centrality measures as attributes. However, there are many centrality measures with different characteristics, which has led to debate among researchers about which centrality measures are better (Raman et al. 2014).
In PPI analysis, clustering is frequently used to predict protein function (Hao et al. 2016). Previously, several studies focused on centrality measures and machine learning were conducted to reveal PPI subnetworks that have an important role in certain diseases, such as diabetes (Usman et al. 2019). In this study, we try to better understand which proteins play a significant role in PD. Previously, Diansyah et al. used the Skyline Query to predict PPI in PD (Diansyah et al. 2019). In this study, we performed the Skyline Query, an algorithm for finding non-dominated data, along with centrality measures to find significant proteins of PD. Skyline query (SQ) is an algorithm that finds the optimal solutions for a problem with multiple criteria based on dominance rules (Borzsonyi et al. 2001). This algorithm was developed based on the maximal vector problem in mathematics. The result of SQ is a set of non-dominated objects called skyline objects. An object dominates another object only if it has the same or a better score in all attributes and a better score in at least one attribute (Borzsonyi et al. 2001). Commonly, SQ is used to find the optimal object, for instance, a hotel or restaurant, that meets multiple conflicting criteria.
In this study, we employed SQ to find the significant proteins that have essential roles in the regulation of PD. The logic of finding skyline objects is in line with finding significant proteins, that is, proteins for which no other protein has attribute values that are all at least as high and at least one attribute value that is higher. We employed top-k SQ, one of the variants of SQ, to overcome the weakness of SQ, which is not robust against an increasing number of attributes. We used seven centrality measures as attributes, namely degree, betweenness, closeness, eigenvector, eccentricity, radiality, and bridging. We combined experiment data and prediction data to improve interactome coverage (Jansen et al. 2002).

Materials and Methods
We conducted this research in four stages. First, we collected the necessary data for this research. Second, we performed data pre-processing. This step included removing duplicate data, deleting unconnected networks, and transforming the network into centrality measures. Third, we applied the Top-k Skyline Query to find the significant proteins. Finally, we analyzed the results by conducting a literature review to determine whether the Top-k Skyline Query could be used to find the significant proteins. Figure 1 shows the flow chart of this research.

Dataset
We collected datasets from OMIM (https://omim.org/) and the STRING database (https://string-db.org/) on March 11th, 2020. The OMIM database was used to find proteins associated with PD, and the STRING database was used to find the protein interactions associated with PD. The first step was to find proteins associated with PD from OMIM. The query at OMIM was conducted using "+" as a prefix for every word, which was used to get precise results. The query for this study in OMIM was "+Parkinson +Disease".
The second step was to find the PPIs in STRING for the proteins obtained from OMIM. For each protein associated with PD, there was a separate interaction file, so we needed to combine the data into one file. We developed a program (scraper) in Python 3.7 to automate this step. Figure 2 shows the pseudocode of data scraping. In this study, we used the combination of the experimental dataset and the prediction dataset from STRING.

FIGURE 2
Pseudocode of data scraping.
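The merging part of this step can be illustrated with a minimal sketch. It is not the authors' scraper: it assumes each per-protein interaction file is a tab-separated table, and the column names `protein_a` and `protein_b`, the function name, and the toy rows are hypothetical. Downloading the files from STRING is left out.

```python
import csv
import io

def merge_interactions(files):
    """Merge per-protein interaction tables into one sorted edge list.

    `files` is an iterable of text streams, each a tab-separated table with
    the (hypothetical) columns `protein_a` and `protein_b`.
    """
    merged = set()
    for handle in files:
        for row in csv.DictReader(handle, delimiter="\t"):
            # An undirected interaction A-B equals B-A, so store it sorted;
            # the set then keeps each interaction only once.
            merged.add(tuple(sorted((row["protein_a"], row["protein_b"]))))
    return sorted(merged)

# Two small in-memory tables standing in for downloaded interaction files (toy rows).
file_1 = io.StringIO("protein_a\tprotein_b\nSNCA\tPARK2\nSNCA\tTH\n")
file_2 = io.StringIO("protein_a\tprotein_b\nPARK2\tSNCA\nPARK2\tHSPA8\n")
edges = merge_interactions([file_1, file_2])
print(edges)
```

Storing each pair in sorted order means an interaction that appears in two per-protein files is kept once, which is one possible way to combine the files before the cleaning step in Cytoscape.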

Data pre-process
We used Cytoscape (https://cytoscape.org/) for data pre-processing, which consisted of two main steps: data cleaning and data transformation. First, we visualized the PPI data to find any unconnected networks. A network that was not connected to the main (biggest) network was removed. We assumed that the significant proteins are located in the back bound network, a collection of nodes with a high number of members and a high density. Thus, the networks unconnected to the back bound were removed. Next, we omitted the duplicate interaction data. The last step was to transform the data from the protein network into centrality measures. This process was done by using CentiScaPe 2.2 in the Cytoscape application (Scardoni et al. 2009; Scardoni and Lau 2012). After the data transformation was completed, proteins with seven centrality measures were exported into a comma-separated value (CSV) file. The output was then processed in further steps.
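The cleaning described above was carried out interactively in Cytoscape. Purely as an illustration of the same idea, the following sketch keeps only the largest connected network and drops duplicate interactions using a breadth-first search. The function name and the toy protein labels are hypothetical.

```python
from collections import defaultdict, deque

def largest_component(edges):
    """Return the de-duplicated edges that lie in the largest connected component."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from `start`.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    # Sorting each pair makes A-B and B-A identical, so duplicates collapse.
    return sorted({tuple(sorted(e)) for e in edges if e[0] in best})

# Toy network: a triangle with one duplicated edge, plus an unconnected pair.
toy_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P1"), ("P2", "P1"), ("P8", "P9")]
print(largest_component(toy_edges))
```

In this toy input the pair P8-P9 is not connected to the main network and is removed, and the duplicated P1-P2 interaction is kept once.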

Centrality Measures
Centrality measures quantify the importance of a node in an interaction network and have been widely used for the analysis of biological networks. Many centrality measures can be used to measure the importance of a node. In this study, seven centrality measures were used, namely degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, radiality, eccentricity, and bridging centrality.
Degree centrality is the simplest centrality measure. It is obtained by counting the number of edges connected to the node. The higher the degree centrality of a protein, the more likely it is to be a center of regulation (Scardoni and Lau 2012).
Betweenness centrality of a node is obtained by taking, for each pair of other nodes, the number of shortest paths that pass through the node divided by the total number of shortest paths between that pair, and adding up these fractions. The greater the betweenness centrality value, the more often the node is traversed in communication between proteins, and the more relevant it is as a regulatory protein (Scardoni and Lau 2012).
The calculation of closeness centrality is based on the lengths of the shortest paths from one node to every other node. The sum of these shortest-path lengths is used as the divisor of 1. Thus, the greater the value of closeness centrality, the more central the position of the protein is. Therefore, it can become a regulatory protein for other proteins in the network (Scardoni and Lau 2012).
Eigenvector centrality is calculated based on the concept that if a node-i is connected to another node with a high score, node-i will also have a high score (Scardoni et al. 2009). The initial step in finding eigenvector centrality is to find the largest eigenvalue of the network; the eigenvector that corresponds to this eigenvalue is then obtained. The eigenvector centrality value of a node is its entry in this eigenvector after normalization. A greater eigenvector centrality value indicates that the node interacts with other important proteins and can act as a regulatory center for them (Scardoni et al. 2009).
Radiality is also based on the shortest paths from one node to every other node. Each shortest-path length is subtracted from (∆G + 1), where ∆G is the largest shortest path in the graph, and the resulting values are added up. The higher the radiality value of a node, the more functionally relevant it is to other nodes. High values of radiality, eccentricity, and closeness centrality together indicate that a node is consistently at the center of the network (Scardoni and Lau 2012).
Eccentricity is calculated by finding the largest of the shortest paths from one node to every other node. According to Scardoni and Lau (2012), in biological terms, eccentricity indicates how easily a protein can be reached by other proteins in the network. A greater eccentricity value suggests that the protein can easily influence other proteins in the network.
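As an illustration of the distance-based measures described above, the following minimal sketch computes degree, closeness, eccentricity, and radiality with breadth-first search on a toy graph. It is not the CentiScaPe implementation. The function names and toy protein labels are hypothetical, and the formulas are assumptions: closeness is taken as 1 divided by the sum of shortest-path lengths, eccentricity as 1 divided by the largest shortest-path length, and radiality as the sum of (∆G + 1 - distance) divided by (n - 1). The exact normalization should be checked against the CentiScaPe documentation.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path lengths (in edges) from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def distance_centralities(adjacency):
    """Degree, closeness, eccentricity and radiality for a connected graph."""
    n = len(adjacency)
    all_dist = {v: bfs_distances(adjacency, v) for v in adjacency}
    # The diameter is the largest shortest path in the whole graph.
    diameter = max(max(d.values()) for d in all_dist.values())
    scores = {}
    for v, dist in all_dist.items():
        others = [d for w, d in dist.items() if w != v]
        scores[v] = {
            "degree": len(adjacency[v]),
            "closeness": 1 / sum(others),
            "eccentricity": 1 / max(others),
            "radiality": sum(diameter + 1 - d for d in others) / (n - 1),
        }
    return scores

# Toy path graph P1 - P2 - P3 (hypothetical protein names).
toy = {"P1": {"P2"}, "P2": {"P1", "P3"}, "P3": {"P2"}}
scores = distance_centralities(toy)
print(scores["P2"])
```

Under these assumed definitions the middle node P2 scores highest on all four measures, which matches the intuition that it is the center of this small network.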
Bridging centrality is an extension of betweenness centrality. The bridging centrality value is obtained by multiplying the betweenness centrality by the bridging coefficient. A node with a high bridging centrality value connects high-degree nodes and thereby links clusters in the interaction network (Scardoni et al. 2009).
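The multiplication described above can be sketched as follows. The sketch assumes the commonly used definition of the bridging coefficient, namely the inverse degree of a node divided by the sum of the inverse degrees of its neighbours; this definition, the function names, and the toy graph are assumptions and not taken from CentiScaPe. The betweenness values are supplied by hand as unnormalized counts of node pairs whose shortest paths pass through each node.

```python
def bridging_coefficient(adjacency, node):
    """Inverse degree of `node` divided by the summed inverse degrees of its neighbours."""
    neighbours = adjacency[node]
    return (1 / len(neighbours)) / sum(1 / len(adjacency[i]) for i in neighbours)

def bridging_centrality(adjacency, betweenness):
    """Betweenness multiplied by the bridging coefficient, for every node."""
    return {v: betweenness[v] * bridging_coefficient(adjacency, v) for v in adjacency}

# Toy graph: two triangles (a, b, c) and (e, f, g) joined through node d.
adjacency = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d", "f", "g"}, "f": {"e", "g"}, "g": {"e", "f"},
}
# Hand-counted betweenness: d lies between 3 x 3 node pairs, c and e between 2 x 4 each.
betweenness = {"a": 0, "b": 0, "c": 8, "d": 9, "e": 8, "f": 0, "g": 0}

bridging = bridging_centrality(adjacency, betweenness)
print(max(bridging, key=bridging.get))
```

Node d has a low degree but sits between two high-degree nodes, so its bridging coefficient is large and it receives the highest bridging centrality in this toy graph.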
All the centrality measures described above were used as attributes for each protein. These data were then used in the following process, which selects interesting objects based on the seven centrality measures by using the Skyline Query.

Skyline Query
Skyline query (SQ) is a method for finding non-dominated objects; the algorithm selects interesting objects from a dataset. An object is categorized as interesting if it is not dominated by any other object (Borzsonyi et al. 2001). Object A dominates object B if A has the same or a better score than B in all attributes and a better score in at least one attribute (Borzsonyi et al. 2001). This rule is called the dominance rule.
In this study, a higher score in any centrality measure means a higher chance of the protein being important. So, the dominance rule for Table 1 prefers the highest scores in degree centrality and closeness centrality. The result of applying SQ to Table 1 is objects A and C. Object B has the same score as object A in degree centrality but a lower score in closeness centrality, so object A dominates object B. Object D is dominated by object A because it has a lower score than object A in every attribute. Since no other object dominates objects A and C, they are the skyline objects for Table 1. However, SQ has a weakness: the more attributes are used, the more skyline objects are returned, so that the results are no longer relevant (Kontaki et al. 2008). This study therefore used an extension of SQ called top-k skyline query (top-k SQ). Top-k SQ ranks the skyline results to find the most important data among the skyline objects. The ranking is done by searching for the most dominant data: the method counts how many data each skyline object dominates, and the most dominant object is placed at the top of the result. This study used centrality measures as attributes and top-k SQ to analyze PPI.
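The dominance rule and the Table 1 walk-through can be illustrated with a minimal sketch. The (degree, closeness) scores below are hypothetical values chosen only to reproduce the relations described in the text, not the actual contents of Table 1, and the naive pairwise comparison is one simple way to compute a skyline rather than the authors' program.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` in every attribute and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def skyline(objects):
    """Names of the objects that no other object dominates (naive pairwise check)."""
    return [
        name for name, scores in objects.items()
        if not any(dominates(other, scores)
                   for other_name, other in objects.items() if other_name != name)
    ]

# Hypothetical (degree, closeness) scores consistent with the description of Table 1.
table1 = {"A": (3, 0.8), "B": (3, 0.5), "C": (5, 0.5), "D": (1, 0.6)}
print(skyline(table1))  # ['A', 'C']
```

With these values, A dominates B and D, C dominates B, and A and C are incomparable, so the skyline consists of A and C.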
Based on the concept of top-k SQ, an important protein is a protein that is not dominated by any other protein, ranked by how many proteins it dominates. The top-ranked result of top-k SQ is a candidate for important proteins related to the disease, which was later cross-checked. Since there are many centrality measures, this study only used the basic centrality measures and two additional centrality measures. The basic centrality measures in graph theory are degree, betweenness, closeness, eigenvector, and eccentricity (Sharma et al. 2016). Besides these basic centrality measures, we used radiality and bridging centrality.
We used top-k SQ to find important proteins of PD using seven centrality measures (degree, betweenness, closeness, eigenvector, eccentricity, radiality, and bridging). There are two types of interaction data based on their sources: experiment data and experiment+prediction data. We used the experimental data to determine whether the interactome coverage in PD is good enough for PPI analysis. This study used SQ, an algorithm for finding non-dominated data, together with centrality measures to find important proteins of PD.

Top-k Skyline Query
Top-k representative skyline query (top-k RSP) is a top-k SQ algorithm that maximizes the number of data dominated by k skyline objects (Lin et al. 2007). The complexity of top-k RSP is O(kn^2 + kn), where n is the total number of data. This study chose a basic top-k SQ because the data are relatively small and the process is done only once. Figure 3 shows the pseudocode of top-k RSP. Using SQ, the skyline objects of the data in Table 1 are object A and object C. Object D is dominated by object A, which has a better score than object D in all dimensions. Object B is dominated by object A because it has the same degree centrality score but a lower closeness centrality score. Object C dominates object B because it has the same closeness score and a better degree score. Objects A and C are incomparable because object A has a better closeness centrality score, whereas object C has a better degree centrality score. No other data dominate objects A and C, so objects A and C are the skyline objects.

FIGURE 4 Visualization for top-k skyline query
Top-k SQ ranks the skyline objects by how many data they dominate. As shown in Figure 4, object A dominates two data (D and B), while object C dominates only one (B). Object A has the highest rank in top-k SQ because it dominates the most. Therefore, the top-1 SQ for Table 1 is A, and the top-2 SQ for Table 1 is A and C.
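The ranking just described can be sketched in a few lines. This is a minimal illustration of ranking skyline objects by their dominance count, not the top-k RSP algorithm of Lin et al. and not the authors' program. The function names are hypothetical, and the (degree, closeness) scores are hypothetical values chosen to match the relations described for Table 1.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` in every attribute and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def top_k_skyline(objects, k):
    """Rank skyline objects by how many objects each one dominates; return the first k."""
    counts = {}
    for name, scores in objects.items():
        others = [s for n, s in objects.items() if n != name]
        if any(dominates(o, scores) for o in others):
            continue  # dominated objects are not part of the skyline
        counts[name] = sum(dominates(scores, o) for o in others)
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical (degree, closeness) scores consistent with the description of Table 1.
table1 = {"A": (3, 0.8), "B": (3, 0.5), "C": (5, 0.5), "D": (1, 0.6)}
print(top_k_skyline(table1, 1))  # [('A', 2)]
print(top_k_skyline(table1, 2))  # [('A', 2), ('C', 1)]
```

Each object is compared with every other object, so this sketch needs on the order of n^2 dominance checks, which is acceptable for a dataset of a few thousand proteins processed once.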

Data Analysis
The objective of this step is to analyze the result of top-k SQ. The relation of the proteins to PD, particularly for the highest-ranked skyline object obtained from top-k SQ, was cross-checked with the experimental data. Further analysis would determine whether experiment and experiment+prediction data can be used in the PD PPI analysis. We expected to see the effect of interactome coverage.

Results and Discussion
There were 271 proteins related to PD obtained from OMIM, but only 252 proteins have interaction information in STRING. Therefore, proteins associated with PD that were not found in STRING were excluded from this study. The 252 protein interaction files were merged into one file for each interaction source. Table 2 shows the results from STRING after merging the interaction files.
From Table 2, there are 1,553 proteins with 4,868 interactions when the interaction source is experiment only. Meanwhile, there are 1,848 proteins with 8,577 interactions from the experiment and prediction interaction sources. Visualization of the experimental data can be seen in Figure 5, which shows many unconnected networks. Networks that were not connected to the main graph were deleted. After the deletion of the unconnected graphs, duplicate data were removed as well. Figure 6 shows the visualization for the experiment data after data cleaning. In Figure 5 and Figure 6, the red nodes represent proteins associated with PD that we obtained from OMIM. Meanwhile, the green nodes represent proteins without a direct association with PD (interacting proteins from STRING). However, deleting duplicate data and unconnected networks decreases the number of proteins and interactions. Table 3 shows the number of proteins and interactions before the protein networks were transformed into centrality measures. From Table 3, only 1,269 proteins and 4,198 interactions remain for the experiment interaction source, and 1,682 proteins with 7,894 interactions remain for the experiment+prediction source. After data cleaning, the PPI networks were transformed into centrality measures using CentiScaPe 2.2. There are two default outputs: a name and a shared name. Since both contain the same protein name, the shared name was omitted. The transformation results are proteins with one protein name and seven centrality measures as attributes (name, degree centrality, betweenness centrality, closeness centrality, eccentricity, eigenvector centrality, bridging centrality, and radiality). The transformed data were then exported into a comma-separated value (CSV) file as the input for top-k SQ.

FIGURE 5
Experiment data visualization before data cleaning.

FIGURE 6
Experiment data visualization after data cleaning.
Data from the experiment interaction source were processed first. The maximum k for top-k SQ is 21 since there are only 23 skyline objects resulting from SQ. Proteins included in the top-21 SQ include SNCA (alpha-synuclein) and PARK2. Table 4 shows the biological and experimental associations of genes with PD. However, the top-1 SQ is SNCA since SNCA dominates the most data. The result of the top-3 SQ for experimental data can be seen in Table 5. SNCA is the most important protein because it dominates 1,217 other proteins. Meanwhile, each of the other skyline proteins dominates only 0 to 14 proteins.
The next process used experiment and prediction data as the interaction sources. The maximum k for top-k SQ for experiment+prediction data is ten since there were only ten skyline objects. Among the ten skyline objects, the most important protein is SNCA. SNCA is the result of the top-1 SQ, which means that SNCA dominates the most proteins. Table 3 shows the result for top-3 SQ with experiment+prediction data as the interaction source. Based on Table 5, SNCA dominates 1,663 other proteins, so it becomes the most important skyline object based on top-k SQ.
TABLE 4
Proteins | Association to Parkinson's disease (PD)
GPR37 | Highly expressed in neuronal progenitor cells, in particular in Wnt-dependent neurogenesis (Berger et al. 2017)
GNAI2 | Expression is increased during stress and plays an important role in inhibiting adenylate cyclase and modulating cAMP-mediated responses to beta-adrenergic stimuli (Tsolakidou et al. 2010)
SNCA (alpha-synuclein) | Located in presynaptic terminals and critical to regulating neurotransmitter release and vesicle trafficking (Mata et al. 2010). Commonly detected in Lewy bodies, which are known as pathologic features of PD (Siddiqui et al. 2016)
PARK2 | Controls programmed cell death and apoptosis (Konovalova et al. 2015). PARK2 germline mutations lead to neuron dysfunctions (Veeriah et al. 2010). Mutations cause an imbalance of programmed cell death and increase apoptosis (Konovalova et al. 2015)
TH (tyrosine hydroxylase) | An enzyme in dopamine biosynthesis. TH expression is related to the occurrence of PD (Chen et al. 2017)
HSPA8 | Decreases during aging and is postulated to be linked to PD, which may affect the autophagy process that responds to ER stress caused by protein unfolding (Loeffler et al. 2016)
TRAF2 | Overexpression of TRAF2/6 may be induced by chronic inflammation and is hypothesized to be a reason for the occurrence of PD (Chung et al. 2013)

The execution time for the Python program is 0.2532 s for experimental data and 0.1508 s for experiment+prediction interaction data. Since both data types return the same protein, SNCA, as the important protein, there is only one candidate for important protein. Among those proteins, at least five proteins were found to be related to PD. One of the most important proteins was alpha-synuclein (SNCA). Biologically, SNCA is located in presynaptic terminals and is critical to regulating neurotransmitter release and vesicle trafficking (Mata et al. 2010). In addition, alpha-synuclein is commonly detected in Lewy bodies, which are known as pathologic features of PD (Siddiqui et al. 2016). Besides SNCA, several proteins are important in disease progression; for instance, PARK2 controls programmed cell death and apoptosis (Chen et al. 2017). PARK2 germline mutations are the leading cause of neuron dysfunctions (Chi et al. 2018). PARK2 mutations caused an imbalance of programmed cell death and increased apoptosis (Konovalova et al. 2015). In the GPCR class, the GPR37 gene is highly expressed in neuronal progenitor cells, particularly in Wnt-dependent neurogenesis (Berger et al. 2017). GNAI2 expression increases during stress and plays an important role in inhibiting adenylate cyclase, modulating cAMP, and mediating responses to beta-adrenergic stimuli (Tsolakidou et al. 2010). Lastly, tyrosine hydroxylase (TH) is an enzyme in dopamine biosynthesis, and since PD is related to dopaminergic neurons, TH expression is also related to the occurrence of PD (Chen et al. 2017).
SNCA was the first gene linked to PD. SNCA itself is thought to have an essential role in synaptic transmission (Mata et al. 2010). It has also been given the identifier PARK1 to show that SNCA is linked to PD and plays an important role (Klein and Westenberger 2012). SNCA is considered a major causative gene involved in the early onset of familial Parkinson's disease (FPD). Five point mutations in SNCA that cause autosomal dominant Parkinson's disease have been identified (Siddiqui et al. 2016). A study by Diansyah et al. (2019) found 14 proteins resulting from a skyline query in PD, and SNCA is one of them. However, that study did not identify which protein is the most important for PD. This study shows that SNCA is the most important protein for PD.
Experiment and experiment+prediction data give the same result, which supports its significance. This shows that the method can use experimental data in the PPI analysis for PD. It also suggests that interactome coverage in PD is good enough for PPI analysis, since the experiment data alone yield the same important protein. However, extended research is needed to confirm that interactome coverage in PD is sufficient.

Conclusions
Based on the results of this study, it can be concluded that the top-k skyline query can be used to find important proteins in Parkinson's disease (PD). Both the experiment and the experiment+prediction interaction data sources for PD can be used in PPI analysis with this method. The important protein for PD based on this study is alpha-synuclein (SNCA), which has been shown to have a significant role in this disease.