The article examines modern methods for classifying news texts, a pressing task given the large volume of information available to users. News classification plays a key role in optimizing information retrieval, supports the creation of personalized content, and helps analyze social trends, which is especially important in the era of digitalization. The work covers the main concepts and principles of text processing and analysis, including text preprocessing, dictionary (vocabulary) compilation, tokenization, the creation of batches from text sequences, and text classification. Special attention is paid to the architectures of recurrent neural networks (RNNs) and to their features, advantages, and disadvantages for text classification. Recurrent neural networks are a powerful tool for processing sequential data such as text, and they allow classification to take context into account. Experiments were conducted with several recurrent models, and parameters were selected to achieve high classification accuracy on news texts. The best model identified is GRU_model512_2layers_dropout_epoch10, which consists of two recurrent GRU layers with 512 hidden units each, uses a dropout of 20%, and is trained for 10 epochs. Because the GRU architecture has a simpler structure, this model occupies 10 MB less memory than an LSTM model with the same parameters and also trains faster (by 17 s per epoch). It achieves an accuracy of 91.6%, which is higher than that of models with simpler architectures, which are prone to overfitting.
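As an illustration of the configuration described above, the following is a minimal PyTorch sketch of a classifier with two GRU layers, 512 hidden units per layer, and 20% dropout. The class name `GRUClassifier`, the embedding size of 128, the vocabulary size of 1000, and the use of the last layer's final hidden state are assumptions made for the example and are not taken from the article. The sketch also compares the parameter counts of `nn.GRU` and `nn.LSTM` to show why a GRU with the same settings is smaller: it has three gate blocks per layer where an LSTM has four.

```python
import torch
from torch import nn


class GRUClassifier(nn.Module):
    """Hypothetical two-layer GRU text classifier (illustrative sketch only)."""

    def __init__(self, vocab_size, embed_dim=128, hidden_size=512,
                 num_layers=2, dropout=0.2, num_classes=4, pad_idx=0):
        super().__init__()
        # embed_dim and pad_idx are assumed values, not reported in the article.
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        # nn.GRU applies dropout between stacked layers when num_layers > 1.
        self.gru = nn.GRU(embed_dim, hidden_size, num_layers=num_layers,
                          dropout=dropout, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, hidden = self.gru(embedded)         # (num_layers, batch, hidden_size)
        # Classify from the final hidden state of the top GRU layer.
        # Padded positions are not masked or packed in this sketch.
        return self.fc(self.dropout(hidden[-1]))


def count_parameters(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


model = GRUClassifier(vocab_size=1000)
logits = model(torch.randint(1, 1000, (8, 20)))  # 8 texts of 20 token ids each

# Recurrent layers only, same input size, hidden size, and depth.
gru_params = count_parameters(nn.GRU(128, 512, num_layers=2))
lstm_params = count_parameters(nn.LSTM(128, 512, num_layers=2))
print(tuple(logits.shape), gru_params, lstm_params)
```

With identical input size, hidden size, and depth, the GRU layers hold three quarters as many parameters as the LSTM layers. The absolute difference in megabytes depends on the embedding size and vocabulary, which are assumed here.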
The news text classification algorithm is implemented in Python using the open-source PyTorch machine learning framework and the NLTK natural language processing library. A news text is classified in the following sequence: the text is loaded, preprocessed, and classified, and the category to which it belongs is output. A dataset containing samples from four categories of news texts is used to train the models and verify the results.
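The preprocessing stages named in the abstract (tokenization, dictionary compilation, and the creation of batches from text sequences) can be sketched in plain Python as follows. This is an illustration under stated assumptions: a simple regular-expression tokenizer stands in for NLTK, and the function names, the `<pad>` and `<unk>` tokens, and the sample texts are hypothetical rather than taken from the article.

```python
import re
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (regex stand-in for NLTK)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(texts, min_freq=1):
    """Map each token seen at least min_freq times to an integer id."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = {PAD: 0, UNK: 1}
    for tok, freq in counts.most_common():
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert a text to token ids, using the UNK id for unseen tokens."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokenize(text)]


def make_batch(texts, vocab):
    """Encode the texts and right-pad them to the length of the longest one."""
    encoded = [encode(text, vocab) for text in texts]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [vocab[PAD]] * (max_len - len(seq)) for seq in encoded]


texts = ["Stocks rally as markets open", "Team wins the final"]  # made-up samples
vocab = build_vocab(texts)
batch = make_batch(texts, vocab)
print(batch)
```

The resulting lists of equal length can be converted to a tensor and passed to a recurrent model; the predicted category is then the index of the largest output score.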
NEURAL NETWORKS, TEXT CLASSIFICATION, NATURAL LANGUAGE PROCESSING, TOKENIZATION, RECURRENT NEURAL NETWORKS, TEXT PREPROCESSING