Series Vol. 5, 25 May 2023
* Author to whom correspondence should be addressed.
In an age of information overload, people need more effective ways to filter useful information out of enormous volumes of data. Email, one of the most frequently used forms of communication, carries important messages, but it also carries fake news, misinformation and scams, collectively known as spam. Manually separating spam from non-spam emails costs considerable time, money and other human and material resources. To address this, deep learning, and natural language processing (NLP) models in particular, can be used to categorize emails faster and more cheaply. The NLP model used here is Bidirectional Encoder Representations from Transformers (BERT). Because BERT is already pre-trained, the main task is to fine-tune it on a dataset of around 5000 emails (85% spam and 15% non-spam). The fine-tuned model is then tested on a group of 5 emails: 3 commercial/spam emails and 2 non-spam emails. The results show that the model separates the two groups, giving the commercial emails scores closer to 1 (ranging from 0.5 to 0.7) and the non-spam emails scores close to 0 (ranging from 0 to 0.1). It can therefore be concluded that the model works on small sets of data.
BERT, Spam Emails Identification.
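The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes the fine-tuned model emits one sigmoid-style score in [0, 1] per email and that a cut-off of 0.5 separates the two classes. The abstract does not state a threshold, so `SPAM_THRESHOLD`, the function `label_scores`, and the example scores are all hypothetical. The scores were chosen to fall inside the reported ranges and are not the study's actual outputs.

```python
from typing import List

# Assumed cut-off; the abstract reports score ranges but no explicit threshold.
SPAM_THRESHOLD = 0.5


def label_scores(scores: List[float], threshold: float = SPAM_THRESHOLD) -> List[str]:
    """Map per-email scores in [0, 1] to 'spam' or 'non-spam' labels."""
    labels = []
    for score in scores:
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score {score} is outside [0, 1]")
        # Scores at or above the threshold are treated as spam.
        labels.append("spam" if score >= threshold else "non-spam")
    return labels


# Hypothetical scores inside the reported ranges:
# three commercial/spam emails (0.5 to 0.7) and two non-spam emails (0 to 0.1).
example_scores = [0.62, 0.55, 0.70, 0.03, 0.08]
example_labels = label_scores(example_scores)
print(example_labels)
```

With a cut-off of 0.5, spam scores at the lower end of the reported range sit right on the decision boundary, so in practice the threshold would likely need to be tuned on a held-out set rather than fixed in advance.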
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.