The Use of the Partner Surveillance Scale in Instagram: Psychometric Evaluation Based on the Graded Response Model

The use of social media, especially Instagram, has become an increasingly prominent part of daily life. Social media also affects romantic relationships, because people in a relationship can use it to monitor their partner's behavior. This study analyzes the psychometric properties of the Indonesian version of the Partner Surveillance Scale, which contains 15 items in a 4-point Likert format. The study recruited 214 female university students aged 17-23 years who used Instagram. The data were analyzed with the Graded Response Model (GRM). The Indonesian version of the Partner Surveillance Scale was found to have good psychometric properties and a good fit to the GRM. All assumptions of the GRM were met and the scale had high reliability. However, some items did not fit the model well. The results also illustrate the use of the GRM as an alternative to Confirmatory Factor Analysis (CFA) for analyzing polytomous data. This study concluded that the psychometric properties of the Partner Surveillance Scale were good.

Address for correspondence: muhammad.dwirifqi@gmail.com

In the modern era, social media cannot be separated from everyday life. This phenomenon has attracted the attention of established psychological societies, for example the British Psychological Society (2012) and the American Psychological Association (APA). The British Psychological Society (2012) published the paper "Guidance of the use of a social media by clinical psychologists". Three years later the APA published the article "Social Media: A Contextual Framework to Guide Research and Practice" (McFarland & Ployhart, 2015). More recently, Alhabash and Ma (2017) found that the millennial generation chooses social media because it makes it easy to communicate with family and friends, to seek information, and to interact with other people.
Some studies have shown that Instagram is one of the social media platforms that is particularly appealing to millennials (Instagram, 2018; Pew Research Center, 2018; Rainie, Brenner, & Purcell, 2012). According to a survey conducted by Instagram (2018), Instagram has 800 million users in total, most of them aged between 18 and 24 years. Research from the Pew Research Center (2018) showed that women tend to use social media more than men. Rainie et al. (2012) suggested that the ease of uploading photos and videos is one of the features that make Instagram attractive to social media users.
According to Farrugia (2013), partner surveillance is the tendency of a person to monitor the partner's activities on social media, for example how the partner interacts with other people and how they comment on a user's post on Instagram. The aim of this behavior may be to determine whether the partner is trustworthy, or it may simply stem from jealousy (Muise, Christofides, & Desmarais, 2014). The motive may also be simply to gauge the level of romance between partners (Serafinelli, 2017). However, monitoring is not always effective, since users can adopt anonymous accounts, which makes it difficult to know who exactly is being monitored (Elphinston & Noller, 2011; Marshall, Bejanyan, Di Castro, & Lee, 2013; Tokunaga, 2011). Research in psychology has shown that partner surveillance is a type of behavior associated with trust and anxious attachment in intimate relationships (Rodriguez, DiBello, Overup, & Neighbors, 2015) and is a development of negative relational maintenance theory and the investment model (Tokunaga, 2016). Partner surveillance has been linked to negative outcomes, including intimate partner violence (Rodriguez et al., 2015), depression and negative emotion (Marshall, 2012), and low-quality relationships (Tokunaga, 2016). Research on partner surveillance continues to this day, and one factor that supports its continued growth is the availability of measurement instruments.
The literature review showed that two measures can be used to assess attitudes toward partner surveillance on social media, namely the Partner Surveillance Scale (PSS; Farrugia, 2013) and the Interpersonal Electronic Surveillance Scale (TIESS; Fox & Tokunaga, 2015). Although both measures have the same purpose, which is to observe the behavior of social media users, there are some differences. TIESS cannot be used for social media other than Facebook, since its content refers to features found only in Facebook and not in Instagram. The PSS, however, is flexible enough to be adapted to the context of Instagram users. In addition, the PSS uses a rating format, often known as a typical performance test. This form of measurement does not have correct or incorrect answers because the goal is to describe tendencies toward a specific trait (Hubley & Zumbo, 2013).
Likert scales are used to measure psychological traits, and they have become an important part of the application of advanced statistical methods, namely item response theory or IRT (Adams, Wu, & Wilson, 2012; Forero & Maydeu-Olivares, 2009). IRT consists of a family of statistical models that define the relationship between unobserved individual characteristics and item characteristics in order to predict a specific response to an item in a scale (Baker & Kim, 2004). In applying IRT, the focus of the analysis is on the item level and not on the overall scale. This focus makes it possible for the researcher to design, revise, and optimize the scale to fit specific goals (de Ayala, 2009).

Although there exists research on the development of the Partner Surveillance Scale (Farrugia, 2013), there has not been any research in Indonesia to produce a similar scale that reflects the characteristics, norms, and local culture of Instagram users in Indonesia. This research was therefore conducted to evaluate the psychometric properties of the Partner Surveillance Scale, which contains 15 items in a 4-point Likert format. The study was conducted among women who use Instagram, and the data were analyzed with the GRM.
Although the GRM is not new and is available in numerous software packages, its application in psychological research in Indonesia is very limited compared with the application of CFA or the RSM. This research is expected to introduce Indonesian researchers to the procedures for conducting a GRM analysis of psychological traits and interpreting its results. This research also produced an adapted scale that can be used in future research to test a range of related variables.

Participants
The research participants were students from Universitas Islam Negeri (UIN) Syarif Hidayatullah Jakarta. A total of 214 students aged 17-23 years were recruited through non-probability sampling. This sampling technique was used because of the time constraints involved in recruiting students who were active on their Instagram accounts. The other eligibility criteria were being female, actively using Instagram, being a first- to fourth-year student in a faculty of religion or general science, and being an active student. All participation in this research was voluntary. Ethical clearance was obtained from the Institute of Research and Community Service (Lembaga Penelitian dan Pengabdian Kepada Masyarakat, LP2M), UIN Syarif Hidayatullah Jakarta.

Measuring instruments
The Partner Surveillance Scale (Farrugia, 2013) was developed at the Rochester Institute of Technology, United States of America. The instrument was developed with a student sample and had good reliability (α = 0.84). It was then translated into Bahasa Indonesia with the help of a professional translator using an online system. The instrument consisted of 15 items in a 4-point Likert format with the following responses: Absolutely Disagree, Disagree, Agree, and Absolutely Agree, which is an adaptation of the original 5-point format. This adjustment was made on the basis of suggestions from previous research to avoid disordered thresholds (Adams, Wu, & Wilson, 2012), especially for GRM models (García-Pérez, 2017). A disordered threshold occurs when the average response for a higher category (e.g., Absolutely Agree) is lower than that for the category below it (e.g., Agree). This would violate the assumption that a higher level of the trait corresponds to a tendency to respond in the higher categories of the scale (García-Pérez, 2017).
Data collection also covered demographic characteristics of the respondents through additional questions on the total number of Instagram followers, age, income, and relationship status: (a) currently in a relationship, (b) previously but not currently in a relationship, and (c) never been in a relationship. Four respondents who were not in a relationship were excluded from the analysis to minimize the number of cases irrelevant to the context of the study. The items of the Partner Surveillance Scale can be seen in Appendix A. Following data collection, respondents with missing data were excluded from further analyses to avoid the more complex computation needed to handle missing data (for example, modifying the estimation method).

Analysis procedure
The analysis was conducted using the Graded Response Model or GRM (Samejima, 1969). The GRM is an IRT model used when the item scores are ordinal, as in Likert scales (Muraki, 1990; Samejima, 2016). Three assumptions must be met in the GRM, namely unidimensionality, local independence (Embretson & Reise, 2000), and monotonicity (de Ayala, 2009). This research tested all three assumptions. Unidimensionality means that only one construct is being measured. Local independence means that, conditional on the trait level, a respondent's response to one item is statistically independent of the responses to the other items in the test. Monotonicity means that higher scores correspond to higher levels of the measured trait (de Ayala, 2009; Embretson & Reise, 2000).
In this research, the GRM was estimated with marginal maximum likelihood (MML) using the program IRTPRO 3 (Cai, Thissen, & du Toit, 2015a). Basically, the MML estimation method is used to estimate the standardized GRM model. The GRM equation is as follows:

$$P^{*}_{ik}(\theta) = \frac{\exp[a_i(\theta - b_{ik})]}{1 + \exp[a_i(\theta - b_{ik})]} \qquad (1)$$

In equation (1), $P^{*}_{ik}(\theta)$ refers to the cumulative probability of responding in option k or higher on item i, $a_i$ refers to the discrimination parameter of item i, $b_{ik}$ refers to the threshold parameter for option k on item i, and $\theta$ refers to the estimated trait level of an individual. The GRM is also known as an indirect IRT model because, unlike in the dichotomous model, the probability of choosing one response category cannot be obtained directly from formula (1). The probability of each response category is instead calculated with formula (2):

$$P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta) \qquad (2)$$

where the cumulative probability of the lowest option is 1 and the cumulative probability beyond the highest option is 0.

According to Standard 3.9 of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999), the overall model fit index or the item-level fit must be reported when IRT is used. The overall model fit indices used were $M_2$ and $RMSEA_2$. If $M_2$ has p > 0.05, the unidimensional model fits the data (Maydeu-Olivares & Joe, 2006). If $RMSEA_2$ has a value < 0.04, the model fits the data (Huggins-Manley & Han, 2017). Reliability in IRT is expressed by the coefficient of marginal reliability (Green, Bock, Humphreys, Linn, & Reckase, 1984), which is analogous to the alpha coefficient in classical test theory (Reise, 1999). If its value is higher than 0.80, the instrument has good internal consistency (Petscher, Mitchell, & Foorman, 2015).
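The two-step calculation of response probabilities in the GRM described above (cumulative probabilities first, category probabilities by subtraction) can be sketched in a few lines of Python. This is only an illustration: the function names and the item parameters are hypothetical and are not the estimates obtained in this study, and the analysis itself was run in IRTPRO, not with this code.

```python
import math

def grm_cumulative(theta, a, b):
    """Cumulative probability of responding at or above the category with threshold b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def grm_category_probs(theta, a, thresholds):
    """Category probabilities for one item under the GRM.

    thresholds must be ordered from lowest to highest; an item with
    m thresholds has m + 1 response categories scored 0..m.
    """
    # Pad the cumulative curves with 1 (lowest category) and 0 (beyond the highest).
    cum = [1.0] + [grm_cumulative(theta, a, b) for b in thresholds] + [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# Hypothetical parameters for a 4-category item (assumption, not study results).
probs = grm_category_probs(theta=0.0, a=2.0, thresholds=[-2.0, -0.5, 1.0])
most_likely = probs.index(max(probs))
```

With ordered thresholds the four probabilities are all positive and sum to 1, which is the property that a disordered threshold would break.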
After testing the fit of the overall model, we tested model fit at the item level with the $S$-$\chi^2$ method. Items were said to have good fit when the p-value of $S$-$\chi^2$ was > 0.05 (Kang & Chen, 2008), while other authors suggest p > 0.01 (Stover, McLeod, Langer, Chen, & Reeve, 2019). If an item had poor fit, we tested the assumption of local independence using the LD $\chi^2$ method to identify the source of the item's problem.
An LD $\chi^2$ value between 3 and 5 indicates a low level of local dependence between a pair of items (Chen & Thissen, 1997), while a value > 10 indicates a major violation of the assumption, which might warrant modification of the overall model (Cai, Thissen, & du Toit, 2015b). If the overall model has good fit but there remains evidence of local dependence, the content of the items needs to be reevaluated, and the authors may consider dropping these items in future research (Depaoli, Tiemensma, & Felt, 2018).
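The decision rule for the LD $\chi^2$ statistic can be written down as a small helper, shown here as a sketch. The cutoffs 3, 5, and 10 come from the criteria cited above. The labels, the treatment of values between 5 and 10 (which the cited criteria do not classify), and the example values are assumptions made purely for illustration.

```python
def classify_ld(ld_chi2):
    """Label a standardized LD chi-square value for one item pair."""
    if ld_chi2 > 10:
        return "major"          # may warrant modifying the whole model
    if ld_chi2 > 5:
        return "unclassified"   # assumption: range not covered by the cited criteria
    if ld_chi2 >= 3:
        return "low"            # low-level local dependence
    return "none"

# Hypothetical LD values keyed by item pair (not the values from this study).
pairs = {(6, 11): 3.4, (6, 13): 4.1, (3, 13): 3.8, (1, 2): 0.7}
flagged = {pair: classify_ld(v) for pair, v in pairs.items()
           if classify_ld(v) != "none"}
```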

Results
The analysis of the 15 items of the Partner Surveillance Scale (PSS) showed that the overall model fit was good ($M_2$ = 75.55, df = 60, p = 0.0848, $RMSEA_2$ = 0.03, marginal reliability = 0.94). This finding showed that the assumption of unidimensionality was met (p-value from $M_2$ = 0.08 > 0.05; $RMSEA_2$ = 0.03 < 0.04). The marginal reliability of the PSS was 0.94, which indicates high internal consistency. Having established good fit to the data, we proceed with an interpretation of the item parameters (see Table 1). Table 1 contains the discrimination (slope) and threshold parameters and the item fit index for each item. For all items, the thresholds are ordered from lowest to highest. This pattern shows that the assumption of monotonicity was met. None of the item discriminations was negative, which shows that the items functioned well in differentiating people with a high trait level from those with a low trait level. The slopes were large for every item, given that slopes often range from 0.5 to 2.5. However, this was not a major issue, since only values larger than 4.00 would indicate a serious problem (Edelen & Reeve, 2007), which was not the case in this research.

Although the overall fit of the GRM was good, the information presented in Table 2 shows that some items fit the model poorly according to the $S$-$\chi^2$ test, namely items 6 and 13, which had p < 0.05 and p < 0.01. However, in contrast to CFA, poorly fitting items are not excluded from the analysis, because doing so can have negative implications for the IRT model (Crişan, Tendeiro, & Meijer, 2017). This reflects the different philosophy between the GRM and models based on Rasch measurement (see Andrich, 2004; Linacre, 2010).
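The reported p-value of the overall fit statistic can be checked by hand from the statistic and its degrees of freedom. The sketch below uses the closed-form upper-tail probability of the chi-square distribution, which exists only for even degrees of freedom (df = 60 here). The function name is hypothetical, and the code assumes that the statistic is referred to a chi-square distribution with the reported df.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form is valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0   # (x/2)^j / j! for j = 0
    total = 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

# Overall fit statistic and df as reported in the text.
p_value = chi2_sf_even_df(75.55, 60)
fits_at_05 = p_value > 0.05
```

The computed probability is close to 0.085, in line with the reported p = 0.0848, so the unidimensional model is retained at the 0.05 level.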
The test of the local independence assumption revealed local dependence, which is reported here so that readers can understand why these items did not meet the criteria for good fit and to inform future research (see Table 1). Based on the information in Table 2, the local independence assumption was violated for the two items with poor fit. Item 6 paired with items 11, 13, and 14, and item 13 paired with items 3, 6, 9, and 11, all showed LD $\chi^2$ > 3, which exceeded the criterion used. Violations of local independence between item pairs are related to the unidimensionality assumption. However, the violation of local independence in this research was not strong enough to violate the unidimensionality assumption, because that assumption is most likely to be violated when LD $\chi^2$ > 10, which would require a modification of the whole model (Cai, Thissen, & du Toit, 2015b).

Figure 1 presents the ICC for each item and visually represents the characteristics of the items. Category 1 functions very well on all items except item 14, where its curve is covered by the curves of the adjacent categories (options 0 and 2). For item 14, category 1 lies on the continuum between disagree and absolutely agree, and participants tended not to choose this response compared with the other responses. The other categories, namely 0, 2, and 3, are functioning well. This means that respondents with a low trait level tend to choose response category 0 and respondents with a high trait level tend to choose response category 3.

The GRM also produces the total information curve (TIC). The TIC shows the estimated test information function at each level of partner surveillance. The x-axis shows the trait level of the respondent, while the y-axis shows the overall information value and the standard error (S.E.). Information is inversely related to the S.E. (S.E. = 1/√information), so when the S.E. is low the information is high (see Figure 2).
As seen in Figure 2, the test information was very high over the trait range from -3 logits to +1 logit. This can be seen from the S.E., which was below 0.30, meaning that the instrument is very informative over that range of partner surveillance. However, we must note that the test information curve was multimodal because it showed several peaks. A multimodal test information curve needs specific attention from a statistical viewpoint before further interpretation. In general, the test information curve shows that the PSS gives maximum information when it is used to measure respondents who exhibit low partner surveillance (-1 logit).
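The link between item parameters, test information, and the S.E. can be illustrated with a short sketch. It uses the standard GRM item information function (the sum over categories of the squared derivative of the category probability divided by that probability) and S.E. = 1/√information. The function names and the three items below are hypothetical and are not the estimates from this study, and the sketch ignores any prior contribution that software may add to the plotted curve.

```python
import math

def boundary(theta, a, b):
    """Cumulative (boundary) response curve of the GRM."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, thresholds):
    """GRM item information: sum over categories of P'_k(theta)^2 / P_k(theta)."""
    cum = [1.0] + [boundary(theta, a, b) for b in thresholds] + [0.0]
    info = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]
        # Derivative of the category probability with respect to theta.
        d_k = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p_k > 0:
            info += d_k ** 2 / p_k
    return info

def standard_error(information):
    """S.E. of the trait estimate implied by a given amount of information."""
    return 1.0 / math.sqrt(information)

# Hypothetical (slope, thresholds) pairs; assumptions for illustration only.
items = [(2.0, [-2.5, -1.0, 0.5]), (2.5, [-2.0, -0.8, 0.9]), (1.8, [-3.0, -1.2, 0.3])]
test_info = sum(item_information(-1.0, a, b) for a, b in items)
se_at_minus_one = standard_error(test_info)
```

Under this relation, an S.E. below 0.30 corresponds to test information above roughly 11.1 (1/0.30²).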

Discussion
This study aimed to evaluate the psychometric properties of the PSS. The results of the GRM analysis showed that the PSS has a unidimensional factor structure, which means that the instrument measures a single construct, partner surveillance on Instagram. Had the model not fit well, the GRM would have needed to be replaced by another model, for example the multidimensional GRM (de Ayala, 1994), which accommodates different dimensions in the data.
The results of the GRM analysis showed that 2 of the 15 PSS items fit the GRM poorly. These two items were further analyzed for violations of local independence to identify the core issue behind the poor fit. Nevertheless, the response categories were ordered from low to high, which showed that the assumption of monotonicity was met for the PSS.
After further investigation, we found that several item pairs had LD values larger than the criterion, namely item 6 with items 11, 13, and 14, and item 13 with items 3, 6, 9, and 11. The dependence that was observed may have been caused by the wording of the items (similar sentences used across items), or it could be due to statistical error. In this research, we attribute the dependence to statistical artifacts, since the wording of the items did not show a pattern from which we could draw a firm conclusion.
Although local dependence did not have a significant effect on the parameter estimates (Chen & Thissen, 1997), future research can explore this issue further with statistical methods that can accommodate covariates (e.g., relationship status and the partner's account ownership), for example IRT-C (Tay, Vermunt, & Wang, 2013).
Heterogeneity of the population was not considered or controlled in this research. In addition, future research could compare models, for example the unidimensional GRM with the multidimensional GRM (de Ayala, 1994) or the bifactor GRM (Cai, Yang, & Hansen, 2011), which accommodate the possibility of more than one dimension or of items that belong to more than one dimension. In this study, no comparison between models was conducted, even though this is advised by past research (Depaoli et al., 2018), and this is a limitation of the current research.
Furthermore, all response categories were ordered from low to high, which means that the assumption of monotonicity was met and threshold disordering did not occur (see Adams, Wu, & Wilson, 2012). However, in item 14 the category Disagree was rarely chosen compared with the other three categories, as can be seen in the ICC of the item. This can be caused by the small number of respondents who answered in this category (García-Pérez, 2017). It was also found that, for most of the items, the response category "Absolutely Disagree" was located beyond -3 on the trait scale (very low), which is related to the low probability of this response category being chosen by the respondents.
Further investigation of the test information curve showed that the curve was multimodal (it had more than one peak). The test information function is very important in IRT, and if the test information is estimated inaccurately, the interpretation of the test will be in error (Zhang, 2012).
Although the findings showed that overall there was high test information for both low and high levels of partner surveillance, the multimodal curve suggests that test information was affected by other factors that were not accounted for in the current research. These factors may include the sample size, which may have been too small to minimize bias in the item parameter estimates (Reeve & Fayers, 2005; Zhang, 2012), as well as the high discrimination power of the items (Hambleton & Jones, 1994). The latter is in line with this research, which showed that all items had high discrimination power, meaning that the combination of high discrimination power and thresholds creates variation across trait levels in the computed test information.
Many studies relate the test information function to reliability. However, caution is needed when using such an approach, since the concept of reliability in IRT differs from that in classical approaches (Umar, 2012, 2014). Computing reliability for the GRM requires modification and specific computation (Samejima, 1994), and given the complexity of the computation required, this approach could not be applied in the current study. Reliability can nevertheless be obtained through other methods, namely marginal reliability, whose value of 0.94 indicates very good internal consistency.
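As a rough illustration of how marginal reliability summarizes measurement error, the sketch below computes it as the trait variance minus the average squared S.E., divided by the trait variance. It assumes a trait variance of 1 (the usual standardized metric) and uses hypothetical S.E. values; it is not the computation performed by IRTPRO for this study.

```python
def marginal_reliability(standard_errors, trait_variance=1.0):
    """Marginal reliability from a set of S.E. values.

    Assumes the S.E. values are representative of the trait distribution
    and that the trait variance is known (1.0 in the standardized metric).
    """
    mean_error_variance = sum(se ** 2 for se in standard_errors) / len(standard_errors)
    return (trait_variance - mean_error_variance) / trait_variance

# Hypothetical S.E. values at several trait levels (not taken from this study).
hypothetical_se = [0.22, 0.25, 0.24, 0.30, 0.21]
rho = marginal_reliability(hypothetical_se)
```

Under these assumptions, an average error variance of about 0.06 would correspond to a marginal reliability of about 0.94.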
younger generation in Indonesia. In addition, this research contributes to the understanding of the IRT model fit indices that must be reported in analyses using IRT (Maydeu-Olivares, 2013, 2015). IRT models based on statistical modeling, like the GRM, have different philosophical leanings from Rasch-based models such as the RSM, whose operationalization is easier to apply (see Andrich, 2004; Linacre, 2010). Both have their benefits and limitations. In addition to this comparison, this research shows that the GRM has assumptions that must be tested, which can be difficult for the researcher when the model or the items do not fit well. The approach to dealing with a poorly fitting GRM differs from that in CFA, where misfit can be handled using the modification index (Sörbom, 1989). No such tool is available for IRT models.

Conclusion
Based on the results of the research, it can be concluded that the Partner Surveillance Scale has good psychometric properties and can be used to measure partner surveillance. The scale is unidimensional, with a good model fit index and very good marginal reliability, although there were some violations of the assumption of local independence and two items fit the model poorly. These issues need to be explored further in future research. Overall, the model can be applied in future research, and this study provides a technical description of what should be done in similar analyses. This research can serve as a reference for researchers in psychology who wish to conduct analyses using the GRM.

Suggestion
Future research can test the Partner Surveillance Scale on samples that are not exclusively female, in order to describe how the items function across genders. Future research could also investigate the basic psychological variables related to partner surveillance behavior on Instagram, which were not covered in the current research.

Acknowledgments
The researchers extend their gratitude to the editors and two anonymous reviewers who gave valuable feedback on this article.

Funding
The researchers did not receive funding in any form for the research or its publication.

Author's contribution
BS was responsible for the theoretical foundation of the paper, data collection, and writing the manuscript. MDKP was responsible for the data analyses, drawing conclusions, and writing the article.

Conflict of interest
The authors declare that there is no conflict of interest in this research.