Theoretical and Natural Science
- The Open Access Proceedings Series for Conferences
Series Vol. 5, 25 May 2023
In an era of information explosion, people need more effective ways to filter useful information from massive amounts of data. Email, one of the most frequently used forms of communication, carries important messages, but it also carries fake news, misinformation, and scams, collectively known as spam. Manually separating spam from non-spam emails costs considerable time, money, and human and material resources. To address this, deep learning, and natural language processing (NLP) models in particular, are used to categorize emails faster and more cheaply. The NLP model used here is Bidirectional Encoder Representations from Transformers (BERT). Because BERT is already pre-trained, the main task is to fine-tune it on a dataset of around 5000 emails (85% spam and 15% non-spam). The fine-tuned model is then tested on a group of 5 emails: 3 commercial/spam emails and 2 non-spam emails. The results show that the model separates the two groups, giving the commercial emails scores closer to 1 (ranging from 0.5 to 0.7) and the non-spam emails scores close to 0 (ranging from 0 to 0.1). It can therefore be concluded that the model works on small sets of data.
BERT, Spam Emails Identification.
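The scoring step described in the abstract can be illustrated with a minimal sketch. It assumes the fine-tuned classifier ends in a single sigmoid output and that a score of 0.5 or higher is read as spam; neither detail is stated in the abstract. The logits below are hypothetical values chosen only so that the resulting scores fall inside the reported ranges; they are not the paper's measurements.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw classifier logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def label_from_score(score: float, threshold: float = 0.5) -> str:
    """Return 'spam' when the score reaches the threshold, else 'non-spam'.

    The 0.5 threshold is an assumption, not a value taken from the paper.
    """
    return "spam" if score >= threshold else "non-spam"


# Hypothetical logits standing in for the output of a fine-tuned BERT head:
# the first three play the role of spam emails, the last two of non-spam emails.
example_logits = [0.4, 0.8, 0.2, -3.0, -4.5]

scores = [sigmoid(z) for z in example_logits]
labels = [label_from_score(s) for s in scores]

for z, s, lab in zip(example_logits, scores, labels):
    print(f"logit={z:+.1f}  score={s:.3f}  label={lab}")
```

Because the reported spam scores lie only slightly above 0.5, the chosen threshold matters: a stricter cut-off such as 0.7 would relabel most of these spam examples as non-spam.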
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.