Theoretical and Natural Science

Vol. 5, 25 May 2023


Open Access | Article

FineTuning-based BERT for Spam Emails Identification

Qingyao Meng * 1
1 Department of Computer Science, University of California, Davis, US.

* Author to whom correspondence should be addressed.

Theoretical and Natural Science, Vol. 5, 210-213
Published 25 May 2023. © 2023 The Author(s). Published by EWA Publishing
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Citation: Qingyao Meng. FineTuning-based BERT for Spam Emails Identification. TNS (2023) Vol. 5: 210-213. DOI: 10.54254/2753-8818/5/20230406.

Abstract

In an age of information explosion, people need more effective ways to filter useful information out of vast amounts of data. Email, one of the most frequently used forms of communication, carries important messages, but also fake news, misinformation, and scams, collectively known as spam. Manually separating spam from non-spam emails costs considerable time, money, and other human and material resources. To address this, deep learning, and natural language processing (NLP) models in particular, is used to categorize emails faster and more cheaply. The NLP model used here is Bidirectional Encoder Representations from Transformers (BERT). Because BERT is already pre-trained, the main task is to fine-tune it on a dataset of around 5000 emails (85% spam and 15% non-spam). The fine-tuned model is then tested on a group of 5 emails: 3 commercial/spam emails and 2 non-spam emails. The results show that the model separates the two groups, giving the commercial emails scores closer to 1 (ranging from 0.5 to 0.7) and the non-spam emails scores close to 0 (ranging from 0 to 0.1). It can therefore be concluded that the model works on small sets of data.
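
As a rough illustration of how the reported scores could be turned into spam/non-spam labels, the sketch below places a decision threshold halfway between the two reported score ranges (0 to 0.1 for non-spam, 0.5 to 0.7 for spam). The abstract does not state a decision rule, so the midpoint threshold, the function names, and the five example scores are all assumptions made for illustration; they are not the author's results or method.

```python
from typing import List, Tuple

# Score ranges as reported in the abstract.
NON_SPAM_RANGE: Tuple[float, float] = (0.0, 0.1)
SPAM_RANGE: Tuple[float, float] = (0.5, 0.7)


def midpoint_threshold(non_spam: Tuple[float, float],
                       spam: Tuple[float, float]) -> float:
    """Cut halfway between the highest non-spam score and the lowest spam score.

    This rule is an assumption; the abstract does not specify a threshold.
    """
    return (non_spam[1] + spam[0]) / 2.0


def label_scores(scores: List[float], threshold: float) -> List[str]:
    """Label a score as spam when it reaches the threshold."""
    return ["spam" if s >= threshold else "non-spam" for s in scores]


# Hypothetical scores for five test emails, chosen only to fall inside the
# reported ranges (3 spam-like, 2 non-spam-like). Not measured values.
example_scores = [0.62, 0.55, 0.70, 0.03, 0.08]

threshold = midpoint_threshold(NON_SPAM_RANGE, SPAM_RANGE)
labels = label_scores(example_scores, threshold)

for score, label in zip(example_scores, labels):
    print(f"score={score:.2f} -> {label}")
```

Because the reported spam scores start at 0.5, a conventional 0.5 cut-off would sit right at the edge of the spam range; a threshold inside the gap between the two ranges leaves more margin on both sides, though a five-email test is too small to tune it reliably.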

Keywords

BERT, Spam Emails Identification.

Data Availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Volume Title: Proceedings of the 2nd International Conference on Computing Innovation and Applied Physics (CONF-CIAP 2023)
ISBN (Print): 978-1-915371-53-9
ISBN (Online): 978-1-915371-54-6
Published Date: 25 May 2023
Series: Theoretical and Natural Science
ISSN (Print): 2753-8818
ISSN (Online): 2753-8826
DOI: 10.54254/2753-8818/5/20230406
Copyright: 25 May 2023