Theoretical and Natural Science

- The Open Access Proceedings Series for Conferences


Vol. 35, 26 April 2024


Open Access | Article

Optimization method of protein coding region identification based on IHHO-CNN-LSTM

Siyuan Wu 1, Tingting Yang * 2
1 Pomfret School
2 University of Wisconsin-Madison

* Author to whom correspondence should be addressed.

Theoretical and Natural Science, Vol. 35, 28-37
Published 26 April 2024. © 2023 The Author(s). Published by EWA Publishing
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Citation: Siyuan Wu, Tingting Yang. Optimization method of protein coding region identification based on IHHO-CNN-LSTM. TNS (2024) Vol. 35: 28-37. DOI: 10.54254/2753-8818/35/20240862.

Abstract

To address the insufficient accuracy of current methods for identifying coding regions in DNA sequences, this study proposes a protein coding region identification method based on IHHO-CNN-LSTM. First, DNA sequences are preprocessed and transformed into feature vectors, and a protein coding region identification model based on CNN-LSTM is established. To address the limitations of parameter selection in CNN-LSTM, a Harris Hawk Optimization (HHO) algorithm improved with a hybrid strategy (IHHO) is introduced to search the CNN-LSTM parameters adaptively, yielding the IHHO-CNN-LSTM optimization model for protein coding region identification. The improved model is used to distinguish coding from non-coding regions. Two benchmark datasets, HMR195 and BG570, are evaluated with five-fold cross-validation. The results show that the proposed model achieves AUC values of 0.9854 and 0.9895 and identification accuracies of 0.9527 and 0.9645, respectively, which are significantly better than those of other models; it also has a significant advantage in computational efficiency. The proposed method can identify protein coding regions efficiently and accurately, which can help promote related research in genetic engineering.
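
The preprocessing and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which encoding turns DNA sequences into feature vectors, so the one-hot scheme below is an assumption, and the helper names `encode_sequence` and `five_fold_indices` are hypothetical. The sketch only shows how a sequence might become a numeric vector and how samples might be partitioned for five-fold cross-validation; it is not the authors' implementation.

```python
import random

# Assumed encoding: one-hot over A, C, G, T. The paper's actual
# feature representation is not given in the abstract.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}


def encode_sequence(seq):
    """Flatten a DNA string into a one-hot vector; unknown bases map to zeros."""
    vector = []
    for base in seq.upper():
        vector.extend(ONE_HOT.get(base, (0, 0, 0, 0)))
    return vector


def five_fold_indices(n_samples, seed=0):
    """Yield (train_idx, test_idx) index lists for five-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into five roughly equal folds.
    folds = [indices[i::5] for i in range(5)]
    for k in range(5):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx
```

Each fold serves once as the held-out test set while the other four are used for training, so every sample is evaluated exactly once across the five rounds.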

Keywords

protein coding region identification; CNN-LSTM; Harris Hawk Optimization algorithm; hybrid strategy

Data Availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Authors who publish this series agree to the following terms:

1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.

2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.

3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open Access Instruction).

Volume Title: Proceedings of the 2nd International Conference on Modern Medicine and Global Health
ISBN (Print): 978-1-83558-395-1
ISBN (Online): 978-1-83558-396-8
Published Date: 26 April 2024
Series: Theoretical and Natural Science
ISSN (Print): 2753-8818
ISSN (Online): 2753-8826
DOI: 10.54254/2753-8818/35/20240862
Copyright: 26 April 2024

Copyright © 2023 EWA Publishing. Unless Otherwise Stated