Cloning and in silico analysis revealed a genetic variation in osmotin‐encoding genes in an Indonesian local cacao cultivar

Theobroma cacao L. is an important Indonesian estate crop that suffers from biotic and abiotic stresses. TcOSM, which encodes osmotin in response to pathogens and environmental stresses, is therefore the focus of this research, which aimed to characterize TcOSM in an Indonesian local cacao cultivar. Bioinformatics queries for putative TcOSM were performed against the reference genome of a Criollo‐type cacao cultivar. Based on nucleotide sequence determination, our results revealed two genes, TcOSM1 and TcOSM2, which have the highest similarity (≥ 90%) to the cacao reference genes. Heterozygosity was detected in the TcOSM1‐encoding gene, which showed two overlapping peaks in Sanger‐sequencing chromatograms. One of the alleles resulted from a single nucleotide change (G to A), a same‐sense (synonymous) mutation that left the corresponding alanine residue unchanged. Homology modeling using Phyre2 and structural alignment (superimposition) were conducted to examine the influence of genetic variations in TcOSM sequences upon the global protein structures. The results showed no significant changes (RMSD ≤ 0.206 Å, TM‐score > 0.5) in tertiary protein structures. Altogether, this research succeeded in characterizing TcOSM while providing a fundamental study for future cacao biotechnology endeavors.


Introduction
Cacao (Theobroma cacao L.) is native to the Amazon basin (Souza et al. 2018), and its cultivation dates back to the ancient Mayan and Aztec civilizations. It spread throughout the world through domestication (Zhang and Motilal 2016) and has become one of the most valuable estate crop commodities of Indonesia (Mithöfer et al. 2017). Cacao product derivatives are used as foods, liquors, and additives for food flavoring and coloring. Unfortunately, the productivity of cacao in Indonesia suffers from pathogen attacks and drought. Cacao plantations have been reported to suffer greatly from pathogenic fungal infections, i.e. (i) Witches' Broom disease caused by Moniliophthora perniciosa (syn. Crinipellis perniciosa) (Farquharson 2014); (ii) Black Pod Rot caused by Phytophthora spp. (Ali et al. 2017); (iii) Frosty Pod Rot disease caused by Moniliophthora roreri (Bailey et al. 2018); and (iv) Vascular-Streak Dieback disease caused by Ceratobasidium theobromae (syn. Oncobasidium theobromae) (Ali et al. 2019).
Conventional breeding has successfully produced cacao cultivars with better fungal resistance, productivity, and performance amidst environmental stressors (Wickramasuriya and Dunwell 2018), albeit over a relatively long time. Plant molecular breeding can address the slow turnover of conventional breeding by implementing known plant molecular trait markers. However, detailed information on the trait markers, such as Quantitative Trait Loci (QTL) or Simple Sequence Repeats (SSR), is required as the basis for engineering plant phenotypes. Moreover, the full sequence of a known resistance gene, its biochemical properties, its expression, and its working mechanism are also needed in the effort to create a cacao cultivar with better resistance against a pathogen.
Cacao, like other plant species, can defend itself against fungal invasion using sentinel plant systems. One of the defense mechanisms among higher plants is the production of pathogenesis-related (PR) family proteins. The expression of PR-5 proteins, such as osmotin, is activated by abiotic environmental stresses (van Loon et al. 2006). Researchers have succeeded in expressing the osmotin-encoding gene from tobacco (Nicotiana tabacum L.) (Tzou et al. 2011) and nightshade (Solanum nigrum L. var. americanum) (Campos et al. 2002), employing heterologous expression of the full or truncated osmotin-encoding gene (Campos et al. 2008). A previous study reported a partial gene encoding osmotin from cacao (Theobroma cacao L.) and engineered peptide derivatives, which showed a significant inhibitory effect on pathogenic fungal strains (Falcao et al. 2016). However, osmotins from Indonesian local cacao cultivars, which are derived from Criollo, Trinitario, or Forastero cultivars, remain unexplored.
This research aimed to characterize the osmotin proteins from an Indonesian local cacao cultivar using combined bioinformatics and molecular biology approaches. The bioinformatic study included gene determination using BLAST, multiple sequence alignment, and protein homology modeling. Furthermore, the targeted genes were cloned and confirmed by Sanger DNA sequencing analysis. This research succeeded in determining and characterizing two osmotin proteins from a local Indonesian cacao cultivar. Understanding the profile of osmotins will provide new insight for selection in breeding strategies based on the existing pools of genetic variation. This insight is valuable for overcoming issues in cacao plantations and for other endeavors.

Sample collection and preparation
Ripe cacao fruits, indicated by a purple to brownish pod color and harvested from healthy mature trees (three to five years of age after planting), were bought from a local trader (PD Petani Kakao Lampung), Pringsewu Region, Lampung Province, Sumatera, Indonesia. Samples were wrapped in plastic, transported to the lab, and stored in a -20°C freezer until use.

Bioinformatic queries of TcOSM in the public database
Bioinformatic queries were conducted using the osmotin (AP24) amino acid sequence of N. tabacum (AAB23375.1). The retrieved N. tabacum osmotin amino acid sequence was then used for protein queries with BLASTP against the available sequence of T. cacao L. cultivar Criollo (GCF_000208745.1). The results were filtered using a percent identity value ≥ 60% and a low E-value (≤ 1.0E-90). Domain queries against the Conserved Domain Database (CDD v3.17) were also performed to confirm the BLAST hits. Putative osmotins from the Indonesian local cacao cultivar were validated when they contained the Thaumatin-Like Protein (TLP) motif.
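The hit-filtering step can be sketched in Python as follows. This is only an illustration, not the pipeline used in the study: it assumes the BLASTP results were exported in the standard 12-column tabular format (-outfmt 6) and that the E-value cutoff is 1.0E-90. The hit identifiers and numbers in SAMPLE are invented for demonstration and are not results from this work.

```python
from typing import List, NamedTuple


class BlastHit(NamedTuple):
    subject_id: str
    percent_identity: float
    evalue: float


def parse_tabular_hits(text: str) -> List[BlastHit]:
    """Parse BLAST tabular output (12 tab-separated columns per hit)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        # Column 2 = subject ID, column 3 = percent identity, column 11 = E-value
        hits.append(BlastHit(cols[1], float(cols[2]), float(cols[10])))
    return hits


def filter_hits(hits: List[BlastHit],
                min_identity: float = 60.0,
                max_evalue: float = 1.0e-90) -> List[BlastHit]:
    """Keep hits with identity >= min_identity and E-value <= max_evalue."""
    return [h for h in hits
            if h.percent_identity >= min_identity and h.evalue <= max_evalue]


# Hypothetical example rows (made-up values, for illustration only).
SAMPLE = "\n".join([
    "\t".join(["query", "hypothetical_hit_A", "75.0", "225", "56", "0",
               "1", "225", "1", "225", "1e-120", "340"]),
    "\t".join(["query", "hypothetical_hit_B", "55.0", "225", "101", "0",
               "1", "225", "1", "225", "1e-95", "280"]),
    "\t".join(["query", "hypothetical_hit_C", "62.0", "225", "85", "0",
               "1", "225", "1", "225", "1e-40", "150"]),
])

if __name__ == "__main__":
    for hit in filter_hits(parse_tabular_hits(SAMPLE)):
        print(hit.subject_id, hit.percent_identity, hit.evalue)
```

Only the first hypothetical hit passes both thresholds; the second fails the identity cutoff and the third fails the E-value cutoff.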

Genomic DNA isolation from cacao beans
Genomic DNA was isolated using the Plant Genomic DNA Mini Kit (Geneaid, Taiwan) as per the manufacturer's instructions, using 100 mg of fresh plant tissue. Cacao beans were extracted from thawed pods. Cotyledons were separated from the testa and endosperms. Cacao cotyledons were weighed up to 100 mg and lysed by mechanical force in a sterile mortar and pestle containing lysis buffer. The genomic DNA was further purified using the provided column-based protocol to remove the mucilage from the cacao beans' testa, which facilitates downstream treatments such as PCR amplification. The genomic DNA was eluted in 30 µl of sterile ddH2O following the manufacturer's protocol. The intact genomic DNA was visualized using 0.8% agarose gel electrophoresis (Chiong et al. 2017).

TcOSM amplification, ligation, cloning and sequencing
The primers (Table 1) were designed based on the predicted cacao osmotin mRNA sequences (XM_018118608.1 and XM_007040100.2) retrieved from the bioinformatic queries described above. The primer design followed the protocol described by Nugroho and Handayani (2016). The amplifications were conducted using MyTaq™ HS Red Mix (Bioline, USA) with an initial denaturation at 98°C for five min, followed by 35 cycles of denaturation at 98°C for 30 s, annealing at 55°C for 30 s, and extension at 72°C for one min, and a final extension at 72°C for five min. Direct Sanger DNA sequencing of the PCR amplicons was done to determine the sequences of the TcOSM genes.
Cloning was conducted by inserting TcOSM1 and TcOSM2 into the pTA2 cloning vector (Toyobo, Japan), with ligation using T4 DNA Ligase (Toyobo, Japan) at 14°C overnight. The ligation mix was transformed into E. coli DH5α competent cells (Takara, Japan). Luria-Bertani medium supplemented with ampicillin was used to screen for positive clones. The plasmid from each positive clone was selected, reamplified using the PCR procedure mentioned above, and subjected to sequencing at 1st Base, Selangor, Malaysia. DNA sequencing was repeated three times from three independently sampled positive clones to validate the findings.

Sequence analyses and unrooted Neighbour-Joining (NJ) tree reconstruction
Nucleotide sequences were analyzed using Geneious Prime v2019.1.1 (Biomatters Ltd., New Zealand). Forward and reverse reads from Sanger sequencing were assembled into contigs. Consensus sequences from several contigs were aligned using MUSCLE (Edgar 2004) against the mRNA reference sequences (XM_018118608.1 and XM_007040100.2), osmotins from other taxa, and thaumatins. Multiple sequence alignment was done to assist the sequence analyses in identifying genetic variation and pinpointing residue conservation among the studied proteins.
Open reading frames for TcOSM1 and TcOSM2 were deduced to build correct translations, which were then submitted to BLASTP. The twenty highest BLASTP hits, including TcOSM1 and TcOSM2, were downloaded and subjected to multiple sequence alignment using MUSCLE in MEGA X (Kumar et al. 2018). An unrooted Neighbour-Joining tree was reconstructed using the Jones-Taylor-Thornton matrix to compute evolutionary distances, with 1,000 bootstrap resamplings.

TcOSM protein structures homology modeling
In silico modeling was conducted to simulate the structural changes caused by amino acid variations in TcOSM1 and TcOSM2. The amino acid sequences of NtOSM (AAB23375.1), "wild-type" TcOSM1 (XP_017974097.1), TcOSM1, and TcOSM2, without signal peptide sequences, were submitted to the Phyre2 server (Kelley et al. 2015). The protein structures were retrieved as PDB files and analyzed using PyMOL v2.3.2 (Schrödinger Inc., USA) to align the models, with a 120° rotation on the y-axis. The images were captured using the [set ray_opaque_background, 0] command followed by [png File name, dpi=600]. Structural similarity was measured using RMSD (Root-Mean-Square Deviation) values and TM-scores (Xu and Zhang 2010).

Cacao genome contains two TcOSM paralogs with highly similar amino acid sequences
The NCBI+blastp (protein BLAST) search succeeded in identifying two paralogs of osmotin in the genome of T. cacao cultivar Criollo (Table 2). Based on alignment with the N. tabacum osmotin amino acid sequence (NtOSM, AAB23375.1), two putative osmotin sequences of T. cacao, i.e. XP_017974097.1 and XP_007040162.2, were identified, with XM_018118608.1 and XM_007040100.2 as their respective mRNA sequence counterparts. The mRNA sequences of the osmotin paralogs from T. cacao showed single coding exons with no detected introns.
Intronless osmotin genes have been elucidated not only in cacao but also in other plant species such as strawberry (Wu et al. 2001). The single-exon mRNA of the cacao osmotins brought confidence and ease to the primer design for their amplification from genomic DNA samples. Our strategy is in agreement with the experiment conducted by Chowdhury et al. (2015), who successfully cloned the osmotin-like gene from Solanum nigrum L. using a genomic DNA template and a similar primer design rationale. The primers used in our study were designed to flank the osmotin ORF; therefore, the start codon (ATG) was not included in the primer, as it was located downstream of the priming area (Figure 1). The absence of introns in osmotin made amplification of the full ORF from the genomic DNA template straightforward.
The identified osmotin paralogs were highly similar to one another, with ≥ 90% identity (Figure 2a). Additionally, the two paralogs have relatively high identity to NtOSM (≥ 60%) and were also confirmed to have the TLP domain according to the CDD hits. The CDD has been used successfully to assign an unknown protein to a known subfamily. The CDD works on the basis of the unique hallmark or amino acid sequence that characterizes a specific domain or motif. A similarity-based CDD search can be used to infer a protein's function and its evolutionary aspects (Fong and Marchler-Bauer 2008). CDD was also used to screen and validate a plant gene called COBRA using the designated domain or motif signature (Putranto et al. 2017). Therefore, the identified paralogs were confidently assigned as osmotin-encoding genes because they contained the conserved TLP domain or motif.
The paralogs also showed the highest similarity to the Thaumatin-Like Protein (TLP) of Durio zibethinus (XP_022773064.1), as they converged on the same branch (Figure 2b) of the reconstructed unrooted NJ tree. The node was well supported, as it displayed a ≥ 70% bootstrap value (Soltis and Soltis 2003). The branch also clustered with other TLPs of Gossypium spp. This branch represents the common ancestry of osmotin orthologs within the Malvaceae, since T. cacao, D. zibethinus, and Gossypium spp. all belong to the same family.
Figure 1. The primers annealing (priming) areas for TcOSMs amplification.

TcOSM1 of Indonesian local cacao cultivar shows a genetic variation
The genomic DNA isolation from cacao cotyledons was challenging due to the mucilaginous nature of the cacao bean surface, which can hinder DNA precipitation. Using the column to remove the mucilage from the homogenate proved successful in retrieving intact genomic DNA from cacao beans (Figure 3a, Lane 1), which in turn improved the likelihood of success in the target gene amplification.
The genomic DNA was used as the template for TcOSM1 and TcOSM2 amplification. Successful amplification was visualized using 0.8% agarose gel electrophoresis, showing PCR amplicons of 800 bp (Figure 3a, Lanes 2 and 3). The successful amplification of osmotin genes from a genomic DNA template in other studies (Chowdhury et al. 2015) demonstrates how straightforward the amplification of an intronless eukaryotic gene can be. The multiple sequence analyses demonstrated the conservation of essential amino acid residues of TcOSM1 and TcOSM2 compared with the reference sequences. The TcOSM1 sequence showed genetic variation in the form of heterozygosity when compared with the predicted gene sequence from the genome of the Criollo-type cacao used as reference (Figure 3b and c). Meanwhile, the TcOSM2 sequence was identical to the reference sequence (XP_007040162.2).
The finding in TcOSM1 was confirmed using paired-end (forward and reverse read) Sanger sequencing, which detected two peaks, each corresponding to a different nucleotide. In general, Sanger sequencing produces regularly spaced peaks of similar height, which allows efficient base calling (Tenney et al. 2007). However, two peaks or double traces can be detected where base substitutions or indels cause heterozygosity (Hill et al. 2014). Here, the heterozygosity was supported by clear reads in the sequencing chromatograms at both the left and right flanking bases, suggesting high confidence in the base calling results (Figure 3b and c).
The cloning procedures were conducted to circumvent the limitation of direct Sanger sequencing of PCR amplicons when double peaks are found. The cloning succeeded in discerning a nucleotide difference between the two allelic copies of the gene, which are otherwise detected as double peaks in a heterozygote. Independent PCR amplification of each cloned copy enabled us to distinguish between the nucleotide variants. Each amplified copy was independently cloned into a vector, and its nucleotide sequence was determined. A similar procedure has been implemented to screen heterozygous and homozygous mutations caused by indels introduced during gene editing by CRISPR/Cas9 technology (Lawrenson et al. 2015; Ma et al. 2015). Therefore, the heterozygosity obtained in this study is well confirmed (Figure 4).
One of the allelic versions in the TcOSM1 heterozygosity was considered a mutant allele, as it differed from the reference sequence (XP_017974097.1). The mutant allele contained a nucleotide change from G to A. This mutation occurred in the codon GCG (wild-type), changing it to GCA (mutant). However, the mutation did not change the corresponding alanine residue (Figure 3d). Nevertheless, further analysis using homology modeling was conducted to predict the influence of the genetic variation in TcOSM1 and TcOSM2 upon the global protein structures, and thus to conjecture about their functionalities.
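The same-sense nature of the GCG to GCA change can be checked with a short Python sketch. It builds the standard genetic code (NCBI translation table 1) and compares the amino acids encoded by the two codons; the function name is_synonymous is introduced here for illustration only.

```python
# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    first + second + third: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, first in enumerate(BASES)
    for j, second in enumerate(BASES)
    for k, third in enumerate(BASES)
}


def is_synonymous(wild_type_codon: str, mutant_codon: str) -> bool:
    """Return True if both DNA codons encode the same amino acid."""
    return (CODON_TABLE[wild_type_codon.upper()]
            == CODON_TABLE[mutant_codon.upper()])


if __name__ == "__main__":
    print(CODON_TABLE["GCG"], CODON_TABLE["GCA"])  # both encode alanine (A)
    print(is_synonymous("GCG", "GCA"))
```

Because all four GCN codons encode alanine, any change at the third position of this codon leaves the residue unchanged.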

TcOSM1 and TcOSM2 have conserved essential amino acid residues belonging to the TLP family
The sequence analyses of TcOSM1 and TcOSM2 revealed other interesting aspects (Figure 3d). TcOSM1 and TcOSM2 retained the N-terminal domain, which directs the peptide to the secretory pathway (Campos et al. 2002). Like several PR family proteins, TcOSM1 and TcOSM2 lack the carboxyl-end signal peptide, which assists the transport of osmotins to the vacuole. This analysis predicts that TcOSM1 and TcOSM2 are functionally expressed extracellularly or in the apoplastic regions. Furthermore, the possibility of eight disulfide bridges in TcOSM1 and TcOSM2 was retained, since all sixteen cysteine residues were intact. Moreover, the sequence analysis detected a Proline (P)-Glutamic Acid (E)-Serine (S)-Threonine (T)-rich sequence (PEST motif) within TcOSM1 and TcOSM2 (Figure 3d). The PEST motif is a tag for protein degradation, such as through the ubiquitin-proteasome complex (Zhai et al. 2017). Compared with other PR family proteins, TcOSM1 and TcOSM2 were predicted to have no conserved residue that could be glycosylated. Therefore, they are presumably not post-translationally glycosylated before secretion.
A further comparison with Thaumatin-I and Thaumatin-II identified a conserved TLP motif among the osmotins from N. tabacum and T. cacao. The TLP motif was first identified in the sweet-tasting protein of Thaumatococcus daniellii (Benn.) Benth. ex B.D. Jacks, a plant that originated in Africa. It has a unique motif of GxGxCxTGDCGGL/VLxC, with x corresponding to any amino acid (Liu et al. 2010). This result agreed with the domain query hits from CDD (Table 2), which stated that NtOSM, TcOSM1, and TcOSM2 possessed the conserved TLP motif (Accession ID: cl02511).
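A motif of this kind can be searched for with a regular expression, as in the Python sketch below. Two assumptions are made here: "x" is read as any single residue, and "L/V" is read as one position holding either leucine or valine. The test sequence is a synthetic string built to contain the motif; it is not a TcOSM or NtOSM sequence.

```python
import re
from typing import List, Tuple

# GxGxCxTGDCGGL/VLxC, with x = any residue and L/V = either L or V
# (this reading of the written motif is an assumption).
TLP_MOTIF = re.compile(r"G.G.C.TGDCGG[LV]L.C")


def find_tlp_motif(protein_sequence: str) -> List[Tuple[int, str]]:
    """Return (0-based start position, matched text) for each motif match."""
    return [(m.start(), m.group())
            for m in TLP_MOTIF.finditer(protein_sequence.upper())]


# Synthetic peptide for demonstration only.
SYNTHETIC_PEPTIDE = "AAGAGACATGDCGGLLQCAA"

if __name__ == "__main__":
    print(find_tlp_motif(SYNTHETIC_PEPTIDE))
    print(find_tlp_motif("AAAAAAAAAA"))  # no motif present
```

A pattern search like this only flags candidate sequences; the domain assignment in this study relied on the CDD query.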

In silico modeling revealed implications of the TcOSMs amino acid variations to the protein structures
The global RMSD (Root-Mean-Square Deviation) values were retrieved after structural alignments (superimpositions) of paired TcOSMs (Figure 5e, f, g, and h). The values showed insignificant differences between the aligned (superimposed) models (RMSD ≤ 0.206 Å). These values describe the global deviation over all aligned atom-to-atom pairs between the two models. An RMSD value ≤ 0.206 Å reflects highly similar structures, even though the figures show several unaligned protrusions among the superimposed models. It is not surprising that the calculated RMSD values were low, as the proteins themselves had high identity (≥ 90%). However, interpretation based on RMSD values can be slightly overestimated due to significant errors (Kufareva and Abagyan 2011). Therefore, it was accompanied by analyses based on TM-score calculation, which includes statistical consideration (Xu and Zhang 2010).
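For reference, the quantity reported as RMSD can be written out as a short Python sketch. It assumes the two models have already been superimposed and that their atoms are paired one-to-one; it does not perform the superimposition itself, which PyMOL handled in this study. The coordinates in the example are arbitrary numbers chosen for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(model_a: Sequence[Point], model_b: Sequence[Point]) -> float:
    """Root-mean-square deviation over paired, already superimposed atoms."""
    if len(model_a) != len(model_b) or len(model_a) == 0:
        raise ValueError("models must contain the same non-zero number of atoms")
    squared_sum = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(model_a, model_b)
    )
    return math.sqrt(squared_sum / len(model_a))


if __name__ == "__main__":
    # Arbitrary toy coordinates (in Å): model_b is model_a shifted 0.2 Å along x.
    model_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    model_b = [(x + 0.2, y, z) for x, y, z in model_a]
    print(round(rmsd(model_a, model_b), 3))
```

Because every squared distance is averaged with equal weight, a few badly placed atoms can dominate the value, which is one reason a length-normalized score such as the TM-score is a useful complement.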
A TM-score > 0.5 between two superimposed protein structures indicates that they have the same fold (Zhang and Skolnick 2004). Xu and Zhang (2010) mentioned that a TM-score of 0.5 corresponds to the probability of finding one matched fold among 1.8 million random superimposed protein pairs. In other words, to reach a TM-score > 0.5 by chance, that is, a significant similarity of structural topology between two superimposed structures, one would need to consider more than 1.8 million random pairs. According to the same calculation, a TM-score of 0.6 means that more than 90% of superimposed structures fall into the same fold (Xu and Zhang 2010). Therefore, it is safe to assume that all structures studied here have a probability of more than 90% of belonging to the same fold. Altogether, we can predict that the variations in amino acid sequences did not meaningfully influence the global structures of the TcOSM models.
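The TM-score itself can be sketched in Python using the formula of Zhang and Skolnick (2004): each aligned residue pair contributes 1 / (1 + (d / d0)^2), where d is the distance between the pair after superimposition and d0 = 1.24 (L - 15)^(1/3) - 1.8 depends on the target length L. The sketch scores one fixed superimposition only, whereas the published score is the maximum over superimpositions. The length guard is an assumption of this sketch, added to keep d0 positive.

```python
from typing import Sequence


def d0_scale(target_length: int) -> float:
    """Length-dependent distance scale d0 (in Å) used by the TM-score."""
    if target_length <= 21:
        # Guard specific to this sketch: for very short chains the formula
        # gives a very small or negative d0.
        raise ValueError("target_length must be greater than 21")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score for one fixed superimposition.

    distances: distance in Å for each aligned residue pair.
    target_length: number of residues in the target (reference) protein.
    """
    d0 = d0_scale(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


if __name__ == "__main__":
    length = 200  # arbitrary chain length chosen for illustration
    print(tm_score([0.0] * length, length))               # perfect match: 1.0
    print(tm_score([d0_scale(length)] * length, length))  # every pair at d0: 0.5
```

Dividing by the target length rather than by the number of aligned pairs is what makes the score comparable across proteins of different sizes.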
The function of a protein can be inferred rationally from a structure with known function, since differences among structures reflect the uniqueness of their functions. The structure of NtOSM (AAB23375.1), which is supported by an X-ray crystallography study and other assays (Min et al. 2003), was used as a reference for structure and function. The TcOSM models did not differ significantly in structure from NtOSM, as the RMSD value was 0.516 Å, while the TM-score remained above 0.5.
The functionality of the TcOSMs could also be predicted, since the clefts (green arrows), which presumably function in binding to receptors or ligands of the fungal counterparts, were intact among the structures. The cleft is a signature of osmotins that forms during protein folding and is located between Domain I (blue) and Domain II (red) (Figure 5a and b) (Ghosh and Chakrabarti 2008). The cleft of osmotin is characterized by amino acid residues with acidic R-groups. The residues within the binding cleft/pocket, i.e. arginine, glutamic acid, and three aspartic acids (Liu et al. 2010), were highly conserved in NtOSM and TcOSMs but varied in thaumatins (see Figure 3d and Figure 5c and d, where the residues are highlighted in purple and with asterisks). These differences between thaumatins and osmotins reflect their functionality and presumably their evolution. Moreover, two residues, the aspartic acids Asp 101 and Asp 182, are located within the junction loops, possibly offering dynamic interactions in osmotin binding. Meanwhile, the other residues are located within the beta-sheet secondary structures in Domain I. The residues are also predicted to support the global three-dimensional structure of osmotin via two distinct polar contacts: Asp 182 to the protein backbone, and two parallel residues, Glu 83 to Asp 96 (see Figure 5d, yellow dashed lines).
Osmotin is a protein exhibiting a wide spectrum of antifungal activity, suggesting a specific target on the surface of the plasma membrane of the target cell. The binding mode of osmotin to its target, mediated by the cleft with its five highly conserved amino acid residues, remains elusive. The pathway of osmotin-exerted cytotoxicity involves apoptosis in the yeast model through activation of Mitogen-Activated Protein Kinase (MAPK), particularly the mating integration modules: Ste4, Ste5, Ste7, Fus3, Ste11, Ste12, Ste18, Ste20, and Kss1 (Yun et al. 1998). Interestingly, the fungal Ste7 was rapidly phosphorylated after osmotin exposure (Yun et al. 1998), usually at its threonine and tyrosine residues, and later phosphorylated its partner proteins at the same sites (Bardwell 2006). However, the active receptor of osmotin remains to be determined, since osmotin can modulate the regulatory fungal mating elements without activating the pheromone receptor or its associated G proteins.
Osmotin action has been reported through the identification of ORE20/PHO36 as the receptor of osmotin, which promotes a signal cascade pathway (Narasimhan et al. 2005). ORE20/PHO36 is a surface protein with seven transmembrane domains that regulates lipid and phosphate metabolism in yeast (Yamauchi et al. 2003; Karpichev et al. 2002). The binding of osmotin to ORE20/PHO36 activates the RAS2-cAMP/PKA pathway (Miele et al. 2011), which in turn suppresses the expression of stress response genes (Narasimhan et al. 2001). The suppression of stress response genes, such as Msn2 and Msn4, leads to increased susceptibility to oxidative stress caused by reactive oxygen species (ROS), which under metabolic burden will induce apoptosis/necrosis (Nakazawa et al. 2018). Although many of the concerted elements remain to be determined, as does the more precise mechanism of osmotin binding to ORE20/PHO36 via the five highly conserved residues within the cleft, osmotin can affect the yeast's cellular signaling, which in turn kills the yeast itself.
The bioinformatics approaches and molecular biology protocols have uncovered the characteristics of osmotins. The evidence produced by this study sufficiently supports the predicted functionality of the TcOSMs as antifungal proteins. In addition, the mutation in one of the TcOSM1 alleles was considered harmless, since it did not change the corresponding amino acid residue and was located outside the essential regions of osmotin.
Careful interpretations were also made after the in silico analyses in predicting the intact activity of TcOSMs based on the structural changes. The TM-score (Xu and Zhang 2010), in conjunction with RMSD analysis, increased the confidence of the interpretations made in this study. In silico modeling produces idealized models based on assumptions, which can differ from the actual conditions in vivo. Therefore, better models or structures deduced using biophysical analyses, as well as other methods of analyzing structures such as the contact-based method (Ding et al. 2018), should be applied in the future. Further studies using biochemical assays and heterologous expression in other model organisms are also warranted.
This research will open the way for further research on producing elite cacao cultivars. Moreover, it will also promote other research and applications in human health, since osmotin also affects human fungal pathogens (Viktorova et al. 2017). Furthermore, osmotin resembles adiponectin, which exhibits antitumor activity, inhibits endothelial proliferation and migration, and reduces atherosclerosis (Anil Kumar et al. 2015).

Conclusions
This research succeeded in determining two osmotin proteins, TcOSM1 and TcOSM2, from a local cacao cultivar. Moreover, it provided a preliminary study that opens opportunities to establish elite cacao cultivars using plant molecular engineering in Indonesia, as well as other applications in human health.