Business text classification with imbalanced data and moderately large label spaces for digital transformation


decision-making (Kim et al. 2006; Ur-Rahman and Harding 2012). Traditionally, text classification has relied on the company's taxonomy, which organizes business concepts of interest in a hierarchical manner (Arslan and Cruz 2022, 2023a). In this taxonomy-based approach, the aim is to identify relevant business concepts from the taxonomy within the business articles. The relevance of an article is then determined based on the frequency of occurrences of taxonomy concepts within it. However, manual text classification methods pose challenges, particularly when dealing with large volumes of business information, often leading to errors and inefficiencies.
In the contemporary landscape of data abundance, digital transformation (Arslan and Cruz 2023b) has become imperative for businesses striving not just to survive but to thrive in an intensely competitive environment. Digital transformation, which involves strategically integrating technology to streamline operations, improve decision-making, and unlock value on a large scale (Read et al. 2011), has emerged as the driving force behind reshaping how organizations operate, communicate, and innovate. A crucial aspect of digital transformation lies in automating tasks that were previously performed manually. This shift holds particular significance for enterprises grappling with vast amounts of textual data (Trincado-Munoz et al. 2023; He and Sun 2023; Kiener et al. 2023). Transitioning from manual text classification to automated machine-based methods marks a substantial stride towards harnessing data for actionable insights. This shift not only enhances the efficiency of business text classification but also minimizes the inherent risks of errors often associated with manual text classification.
To effectively drive the digital transformation process, transitioning from manual to machine-based text classification requires robust methods. Existing text classification methods involve selecting relevant features from the data to assign target categories or labels, typically classified into two forms: single-label and multi-label classification. This paper focuses solely on multi-label classification, a task where one text document can be assigned one or more labels (Read et al. 2011). After conducting a literature review, we identified BERT (González-Carvajal et al. 2020) and Problem Transformation approaches (Liu et al. 2017) as widely used models for multi-label text classification. These machine learning models are adept at handling extensive amounts of text, uncovering significant features embedded within the text, and classifying it into various categories. Given these observations, we believed that these methods could be beneficial for our case study. However, their performance in scenarios where the dataset is imbalanced and features moderately large label spaces remains unexplored.
To bridge this gap, our study evaluates the effectiveness of BERT (González-Carvajal et al. 2020) and Problem Transformation approaches, including Binary Relevance, Classifier Chains, and Label Powerset, in classifying business texts using an imbalanced dataset with a moderately large label space (a total of 80 distinct labels in our case). Our evaluation methodology involves several critical stages. Initially, we prepare the data, striving to reduce the issue of class imbalance inherent in the dataset. Subsequently, we proceed to model training, wherein each method undergoes training on the preprocessed dataset, with BERT fine-tuned for optimal performance. Finally, we conduct model evaluation, assessing the performance of each model using metrics such as Accuracy, Precision, Recall, and F1-score. The paper's structure is outlined as follows: the "Background" Section offers a review of the multi-label text classification approaches used in this article. The "Analyzing text classification models for business-related text" Section introduces the proposed work. In the "Results" Section, we delve into the results. The "Discussion" Section presents the discussion, and lastly, the "Conclusion" Section concludes the paper.

Background
Problem Transformation approaches constitute a versatile family of techniques employed in multi-label classification tasks. Their primary objective is to systematically convert the inherent complexity of the original multi-label problem into one or more simpler, more tractable classification tasks (Spolaôr et al. 2013). These approaches prove invaluable when dealing with multi-label problems that encompass an extensive array of potential labels. In such scenarios, the sheer number of possible labels can render the multi-label classification problem computationally expensive and particularly challenging.
Among the Problem Transformation techniques, several have gained prominence due to their effectiveness and adaptability. Three of the most widely employed are Binary Relevance, Classifier Chains, and Label Powerset (Luaces et al. 2012). Each of these techniques offers distinct advantages in handling multi-label classification challenges, and their selection often depends on the specific characteristics of the dataset and the nature of the problem at hand.
In the Binary Relevance method (Read et al. 2021), a separate binary classifier is trained for each label, and each classifier predicts whether the input belongs to that particular label. The main advantage of the Binary Relevance method is its simplicity and flexibility. It can work with any binary classifier, and the classifiers can be trained independently, making it easy to add or remove labels without affecting the performance of other classifiers. However, the method does not consider any correlations between the labels, which may affect the overall accuracy of the multi-label classification task.
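The transformation behind Binary Relevance can be sketched in a few lines: a multi-hot label matrix is split into one independent binary target vector per label. This is a minimal illustration of the decomposition only; in practice each resulting target vector is paired with a binary classifier of one's choosing.

```python
# Binary Relevance: decompose a multi-label problem into one binary
# problem per label. Y is a multi-hot matrix: rows are documents,
# columns are labels.
Y = [
    [1, 0, 1],  # doc 0: labels A and C
    [0, 1, 0],  # doc 1: label B
    [1, 1, 0],  # doc 2: labels A and B
]

num_labels = len(Y[0])

# One binary target vector per label; each would be paired with its
# own classifier, trained independently of the others.
binary_problems = {
    label: [row[label] for row in Y] for label in range(num_labels)
}

print(binary_problems[0])  # targets for label A: [1, 0, 1]
```

Because the per-label problems are fully independent, adding or dropping a label only adds or removes one entry in `binary_problems`, which is exactly why the method is easy to extend but blind to label correlations.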
The Classifier Chain method (Read et al. 2021) uses a chain of binary classifiers to predict the labels. In this method, the labels are treated as a sequence, and the classifiers are trained in the order of the label sequence. The main advantage of the Classifier Chain method is its ability to model the correlations between labels, which can lead to improved accuracy in the multi-label classification task. However, the method can be computationally expensive, especially if there are many labels in the dataset.
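The chaining idea is easiest to see in how the training inputs are built: the classifier for the j-th label sees the original features plus the values of the labels that precede it in the chain. The sketch below shows only this feature augmentation, not a full trained chain, with made-up feature vectors.

```python
# Classifier Chains: the j-th classifier is trained on the original
# features augmented with the preceding labels in the chain, which
# lets it exploit label correlations.
X = [[0.2, 0.7], [0.9, 0.1]]   # toy feature vectors
Y = [[1, 0, 1], [0, 1, 1]]     # multi-hot labels

def chain_inputs(X, Y, j):
    """Training inputs for the classifier of label j: the features
    concatenated with labels 0..j-1 (true labels during training)."""
    return [x + y[:j] for x, y in zip(X, Y)]

# The classifier for the first label sees only the features...
print(chain_inputs(X, Y, 0))  # [[0.2, 0.7], [0.9, 0.1]]
# ...while the third also sees the first two labels.
print(chain_inputs(X, Y, 2))  # [[0.2, 0.7, 1, 0], [0.9, 0.1, 0, 1]]
```

At prediction time, the true preceding labels are replaced by the chain's own earlier predictions, which is why errors can propagate down the chain and why the method becomes costly with many labels.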
The Label Powerset method (Read et al. 2014) involves transforming the multi-label problem into a multiclass problem. In this method, each unique combination of labels is treated as a separate class, and a multiclass classifier is trained to predict the class for each input. The main advantage of the Label Powerset method is its ability to handle any number of labels, and it can capture complex dependencies between labels. However, the method suffers from the curse of dimensionality, as the number of classes grows exponentially with the number of labels in the dataset.
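The Label Powerset transformation amounts to mapping each distinct label combination to its own class id. A minimal sketch with a toy label matrix:

```python
# Label Powerset: each unique label combination becomes one class of
# an ordinary multiclass problem.
Y = [
    (1, 0, 1),
    (0, 1, 0),
    (1, 0, 1),  # same combination as doc 0 -> same class
]

combo_to_class = {}
multiclass_targets = []
for combo in Y:
    # Assign a fresh class id the first time a combination appears.
    if combo not in combo_to_class:
        combo_to_class[combo] = len(combo_to_class)
    multiclass_targets.append(combo_to_class[combo])

print(multiclass_targets)   # [0, 1, 0]
print(len(combo_to_class))  # 2 distinct classes observed
```

Note that the mapping can only ever contain combinations seen in the training data, which is the root of the method's difficulty on datasets with many labels and sparse combinations.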
In addition to Problem Transformation approaches, fine-tuning a pre-existing BERT model has gained significant traction as a popular and effective strategy in the existing literature (Lee et al. 2020). The process of fine-tuning a pre-trained BERT model for multi-label text classification involves training the model on a specific dataset, providing both labels and corresponding text inputs. During this training phase, the weights of the BERT model are iteratively adjusted to optimize its performance on the designated multi-label text classification task. Numerous studies have delved into sophisticated techniques for multi-label classification (Bogatinovski et al. 2022; Haghighian Roudsari et al. 2022; Huang et al. 2023; Zeng et al. 2024; Lefebvre et al. 2024). Nevertheless, their adaptability to imbalanced business-related datasets with moderately large label spaces remains unexplored in the literature. To bridge this gap, this article endeavors to present a comprehensive comparative analysis of four distinct techniques using a business-related dataset. This study seeks to furnish valuable insights into the efficacy of these methods for multi-label classification tasks within a business context, with the aim of informing and inspiring future research in this vital domain.

Analyzing text classification models for business-related text
In this section, we will present an in-depth overview of the dataset selected for our multi-label classification tasks. We will explore the dataset's structure, composition, and the preprocessing steps we conducted to ensure its suitability for our analysis. Subsequently, this dataset will serve as the foundation for training, fine-tuning, and the rigorous evaluation of multiple multi-label classification models. Through this exploration, we aim to provide a comprehensive understanding of the dataset's role in our study and offer transparency regarding our methodology for business text classification.

Dataset
As a critical initial step in our analysis of various multi-label classification models, we used a business dataset sourced from the French company FirstECO. This dataset underscores our dedication to utilizing real-world data to ensure the integrity of our study. Constructed by extracting business news from various online sources spanning the period from 2017 to 2022, it comprises 28,941 texts, each potentially corresponding to one or more of 80 distinct labels. These labels encompass diverse aspects of the business domain, including Intangible Development, Activities, Products, Material Investment, Increased Standby, Financial Development, Company Life, Geographical Development, and Public Finances, among others. To maintain confidentiality, we cannot disclose the entire list of labels, and therefore present only a small subset of them. Table 1 showcases text examples from the business dataset; each of the texts shown is tagged with two or more labels.
Each label in the business dataset is represented by a varying number of text examples, ranging from 25 to over 4,000 (as illustrated in Fig. 1). This variance highlights the dataset's imbalance, where some labels contain significantly more text examples than others. To address this issue, we set a minimum threshold of 50-100 text examples per label, ensuring that each label has sufficient representation in the dataset and reducing the effects of the imbalanced label distribution.

(The examples in Table 1 include a text on the privatization of the Pau hospital's 1,450-space parking lot, with a tender expected before the end of 2017 and a 10-year management lease, tagged "Public equipment" and "Contract"; and a text on the Ardèche-based company Ekibio, a subsidiary of the holding company Compagnie Biodiversité specializing in organic and fair-trade food products, which acquired the dietary food brand Pléniday in late February 2020 and intends to expand internationally and establish subsidiaries in the medium term, tagged "Sale", "Acquisition", and "Establishment of units abroad".)

Text preprocessing is an essential stage in the data pipeline, serving the critical purpose of cleansing and formatting text data to make it suitable for text classification. This multifaceted process takes raw text data as input and applies a series of essential preprocessing steps to each text entry. To begin, it eliminates punctuation and numeric characters from the text using regular expressions. This initial cleaning step ensures that extraneous symbols and digits do not introduce noise into the subsequent analysis. Following the initial cleanup, the text is converted to lowercase, ensuring uniformity of case, which is vital for text analysis. Concurrently, the text is divided into distinct tokens using a tokenizer, segmenting it into manageable units for further processing. Once tokenized, stop words are removed from the text. These stop words, drawn from a predetermined set, are commonly occurring words like "the", "and", and "in" that often carry little meaningful information. Their removal streamlines the text and enhances classification accuracy in subsequent analysis. Finally, the preprocessed words, now refined and devoid of unnecessary clutter, are rejoined into a coherent string, and the processed text is appended to a new list that serves as a repository of the clean and formatted text data.
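The preprocessing steps described above can be sketched with the standard library alone. The stop-word set here is a small illustrative subset; a real pipeline would use a full list for the target language.

```python
import re

# Small illustrative stop-word set; a production pipeline would use a
# complete list (e.g. NLTK's) for the document language.
STOP_WORDS = {"the", "and", "in", "of", "a", "to", "is"}

def preprocess(text):
    # 1. Strip punctuation and digits with a regular expression.
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    # 2. Lowercase for uniformity, then tokenize on whitespace.
    tokens = text.lower().split()
    # 3. Remove stop words from the token list.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Rejoin into a clean, normalized string.
    return " ".join(tokens)

cleaned = [preprocess(t) for t in
           ["The hospital in Pau launched 2 tenders!",
            "Ekibio acquires a dietary food brand."]]
print(cleaned)
```

Each cleaned string is then appended to a list of processed texts, mirroring the repository of clean, formatted data described above.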

Implementation
The implementation centers around the execution of Problem Transformation techniques and the fine-tuning of the BERT model. This endeavor is not merely an exercise in technical prowess, but a strategic demonstration of which model can serve as the most practical choice for revolutionizing the landscape of business text classification. It caters specifically to companies poised to embark on digital transformation strategies, offering them a glimpse into the cutting-edge tools that can redefine how they interpret and utilize textual data in their evolving business ecosystems.

Problem transformation approaches
The process starts by importing the necessary modules for data preparation, model training, and evaluation. The imported modules include GaussianNB and MultinomialNB for Naive Bayes classification, accuracy_score for evaluation metrics, train_test_split for splitting the data into training and testing sets, and TfidfVectorizer for transforming the text data into feature vectors. The scikit-multilearn library (Szymanski and Kajdanowicz 2019) is also imported to support multi-label classification problems. The process then creates an instance of TfidfVectorizer to convert the text data into feature vectors; the TfidfVectorizer is set to use inverse document frequency and normalization. Next, the process creates an instance of MultiLabelBinarizer (Pedregosa et al. 2011) and applies it to the labels. The MultiLabelBinarizer transforms the list of labels into a binary matrix where each row corresponds to an instance and each column corresponds to a unique label. The data is then split into training and testing sets using train_test_split, with a test size of 20%. The process returns X_train, X_test, Y_train, and Y_test, which are the feature matrices and label matrices for the training and testing sets, respectively. These matrices are used to train and evaluate multi-label classification models based on the Problem Transformation approaches, namely Binary Relevance, Classifier Chains, and Label Powerset in our case. Finally, the performance of these approaches is evaluated on the testing dataset using metrics such as Accuracy, Precision, Recall, and F1-score (see Table 2).
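The data-preparation steps above can be sketched with scikit-learn as follows. The four-document corpus and its labels are hypothetical stand-ins for the FirstECO data, used only to make the pipeline runnable end to end.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical toy corpus standing in for the business dataset.
texts = ["factory investment in pau",
         "brand acquisition by ekibio",
         "parking privatization tender",
         "subsidiary established abroad"]
labels = [["Material Investment"],
          ["Acquisition", "Sale"],
          ["Contract"],
          ["Establishment of units abroad"]]

# TF-IDF features with IDF weighting and normalization enabled.
vectorizer = TfidfVectorizer(use_idf=True, norm="l2")
X = vectorizer.fit_transform(texts)

# Multi-hot encode the label lists: one column per unique label.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# 80/20 train/test split, as in the experiments.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42)

print(X_train.shape[0], X_test.shape[0])  # 3 1
print(Y.shape[1])  # 5 distinct labels in this toy example
```

The resulting matrices can then be passed to skmultilearn's BinaryRelevance, ClassifierChain, or LabelPowerset wrappers around a Naive Bayes base classifier, exactly as described above.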

Fine-tuning BERT
To fine-tune an existing BERT-based model for text classification, the model "bert-base-multilingual-cased" (Devlin et al. 2018) is chosen as it supports multiple languages. The fine-tuning process starts by importing the necessary libraries such as NumPy (Oliphant 2006), Pandas (Reback et al. 2020; McKinney 2012), Scikit-learn (Kramer and Kramer 2016), PyTorch (Imambi et al. 2021), and Transformers (Wolf et al. 2020). Then, a number of hyperparameters are set, including Max_Len, which is set to 80 and represents the maximum length of input sequences. Train_Batch_Size is set to 16, and Valid_Batch_Size is set to 8. The process also specifies the number of Epochs to train the model, which is set to 5, and sets the learning rate to 1e−05. Additionally, a pre-trained BERT tokenizer is loaded using the BertTokenizer class from the Transformers library.
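Although not spelled out above, multi-label fine-tuning typically pairs BERT's classification head with an independent sigmoid per label and a binary cross-entropy loss, rather than a softmax over labels. A minimal sketch of the resulting decision step, using made-up logits for one document:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for one document over 5 labels, as a
# classification head on top of BERT's pooled output might produce.
logits = [2.3, -1.1, 0.4, -3.0, 1.7]

# Independent sigmoid per label, thresholded at 0.5: a document can
# receive several labels at once, unlike single-label softmax
# classification.
probs = [sigmoid(z) for z in logits]
predicted = [1 if p >= 0.5 else 0 for p in probs]
print(predicted)  # [1, 0, 1, 0, 1]
```

This per-label independence at the output layer is what lets a single fine-tuned BERT model emit multi-hot predictions comparable to those of the Problem Transformation approaches.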
Furthermore, the data is split into training and testing datasets using a Train_Size of 0.8. The training dataset is created by randomly sampling 80% of the data from the original dataset, while the testing dataset is created by dropping the samples in the training dataset from the original dataset. The final training dataset has 23,153 samples, while the testing dataset has 5,788 samples. The BERT model is trained on the training dataset by feeding batches of input sequences to the model, computing the loss, and optimizing the weights using backpropagation. Finally, the model's performance is evaluated on the testing dataset using Accuracy, Precision, Recall, and F1-score (see Table 2).
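The sample-then-drop split described above can be sketched with pandas. The ten-row frame is a toy stand-in for the 28,941-row dataset; the random seed is an arbitrary choice for reproducibility.

```python
import pandas as pd

# Toy stand-in for the 28,941-row business dataset.
df = pd.DataFrame({"text": [f"doc {i}" for i in range(10)],
                   "labels": [["A"]] * 10})

TRAIN_SIZE = 0.8

# Randomly sample 80% of the rows for training...
train_df = df.sample(frac=TRAIN_SIZE, random_state=200)
# ...and obtain the test set by dropping the sampled rows.
test_df = df.drop(train_df.index)

print(len(train_df), len(test_df))  # 8 2
```

Because the test set is built by index difference rather than a second random draw, every row lands in exactly one of the two splits, which is what yields the 23,153 / 5,788 partition on the real data.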
Accuracy, F1-score, Precision, and Recall are commonly used performance metrics for evaluating the effectiveness of a classifier. Accuracy measures the fraction of instances that are correctly classified by the classifier. Precision measures the fraction of correctly identified positive instances among all instances predicted as positive, while Recall measures the fraction of correctly identified positive instances among all positive instances in the data. Using these definitions, we can compute Accuracy, Precision, and Recall as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

where true positives (TP) are instances that are positive and are correctly classified as positive by the classifier, false positives (FP) are instances that are negative but are incorrectly classified as positive, true negatives (TN) are instances that are negative and are correctly classified as negative, and false negatives (FN) are instances that are positive but are incorrectly classified as negative. However, Accuracy may not be a suitable metric when the classes are imbalanced, because a classifier that simply predicts the majority class for all instances would achieve high Accuracy even if it performs poorly on the minority class. To address this problem, we can use the F1-score, the harmonic mean of Precision and Recall, which combines both into a single metric that balances the trade-off between them:

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

The F1-score ranges between 0 and 1, with a value of 1 indicating perfect Precision and Recall. Note that Precision measures the accuracy of the positive predictions made by the classifier, while Recall measures their completeness.
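These definitions translate directly into code. The confusion-matrix counts below are made up purely for illustration:

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall and F1-score from the four
    confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Worked example with hypothetical counts.
acc, prec, rec, f1 = classification_metrics(tp=80, fp=10, tn=95, fn=15)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
# 0.875 0.889 0.842 0.865
```

Note how the F1-score (0.865) sits between Precision and Recall but closer to the lower of the two, which is exactly the property that makes it more informative than Accuracy on imbalanced data.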

Results
Table 2 displays the performance of four different methods for multi-label text classification on a dataset with 80 possible labels. The first method, Binary Relevance, achieves an Accuracy of 0.730, F1-score of 0.936, Precision of 0.952, and Recall of 0.922. This method creates a separate binary classifier for each label and assigns each label to a text independently. The second method, Classifier Chains, achieves an Accuracy of 0.103, F1-score of 0.539, Precision of 0.590, and Recall of 0.495. This method builds a chain of classifiers where each classifier considers the predictions of the previous classifiers in the chain. The third method, Label Powerset, achieves an Accuracy of 0.143, F1-score of 0.278, Precision of 0.350, and Recall of 0.230. This method transforms the multi-label classification problem into a multi-class classification problem by assigning each unique combination of labels to a single class. The fourth method, fine-tuned BERT, achieves the highest Accuracy of 0.895, F1-score of 0.978, Precision of 0.948, and Recall of 0.988.
The lowest F1-score (0.278), obtained by the Label Powerset method, can be attributed to several factors (Read et al. 2014). Firstly, Label Powerset can only predict label combinations observed during training; with 80 labels, many combinations in the test data are rare or entirely unseen, making them impossible to predict correctly. Moreover, the curse of dimensionality worsens the problem: the number of classes grows exponentially with the number of labels, leaving each class supported by very few training examples and resulting in overfitting and lower generalization capability. Finally, the computational complexity of training a classifier over such a large class space can lead to inadequate training, further depressing the F1-score.
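The combinatorial pressure on Label Powerset is easy to quantify: with 80 labels there are 2^80 possible label subsets, so almost every combination is necessarily absent from a training set of roughly 29,000 documents.

```python
from math import comb

num_labels = 80

# Total number of possible label subsets over 80 labels.
possible_combinations = 2 ** num_labels
print(possible_combinations)  # 1208925819614629174706176 (about 1.2e24)

# Even restricting attention to documents with exactly 3 labels,
# the space of combinations dwarfs a ~29,000-text training set.
print(comb(num_labels, 3))  # 82160
```

Against 28,941 training texts, even the three-label slice of the combination space (82,160 possibilities) cannot be densely covered, which is consistent with the method's weak Recall of 0.230.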

Discussion
The study delves into the multi-label classification task, characterized by an imbalanced dataset and moderately large label spaces. While existing literature offers numerous studies on multi-label classification, this paper's contribution lies in its focus on text classification within a business-related dataset. The distinction between an ordinary dataset and a business dataset representing business opportunities lies in their specific focus, content, and purpose. Whereas ordinary datasets cover a wide range of information across various domains, business datasets are meticulously curated to capture data pertinent to potential business endeavors. However, classifying text within business datasets presents unique challenges. Unlike ordinary datasets where features or keywords are typically explicit and easily discernible, business texts may lack clearly defined keywords associated with specific concepts. For example, a text on hiring practices might not explicitly mention keywords like "recruitment". This inherent ambiguity poses a challenge for classification models in accurately tagging texts with associated taxonomy concepts. Consequently, text classification in the business domain necessitates more advanced methodologies to address implicit features and contextual intricacies, ensuring precise categorization and thorough analysis of business-related texts.
The business dataset supplied by the company served as the foundation for our experiment, where we compared the performance of four methods: Binary Relevance, Classifier Chains, Label Powerset, and fine-tuned BERT. We evaluated their effectiveness using metrics such as Accuracy, F1-score, Precision, and Recall. Our analysis revealed that the fine-tuned BERT method outshone the other three, boasting high scores across all metrics. While Binary Relevance also demonstrated strong performance, Classifier Chains and Label Powerset lagged, particularly on this dataset. These results underscore the advantage of fine-tuning the pre-trained BERT model, as it allows for adaptation to specific applications. Despite BERT's pre-training on extensive text corpora, which grants it a deep understanding of language nuances, fine-tuning tailors the model to the task at hand by training it on a smaller, task-specific dataset. To replicate the results of a fine-tuned BERT model, one must adhere to the same pre-processing steps, architecture, and hyperparameters as the original experiment. Using identical evaluation metrics for comparison is also crucial. However, the choice of dataset for fine-tuning the BERT model heavily influences its performance. Different tasks necessitate distinct datasets, and the quality and size of the dataset greatly impact the model's generalization ability. Hence, selecting an appropriate dataset tailored to the specific task is vital for achieving optimal results.

Conclusion
Digital transformation necessitates reorganizing to maximize technology's potential and seamlessly integrating it across business operations. Through our case study, we showcased how leveraging machine learning models can automate text classification in business contexts. Despite encountering challenges related to imbalanced data and moderately large label spaces, our evaluation unveiled the superior performance of fine-tuned BERT compared to the other methods. Additionally, the Binary Relevance classifier demonstrated good performance. This paper serves as a beacon, illuminating the path towards enhanced text classification and understanding within the realm of business-oriented applications. In the age of digital transformation, where the efficient processing and comprehension of vast volumes of textual data are paramount, this paper provides a strategic solution. By embracing the principles of fine-tuning BERT models or employing traditional Binary Relevance classifiers, companies can harness the power of existing models to accurately classify their textual datasets. This precision empowers them to extract valuable insights from classified business data, automate decision-making processes, and retain a competitive edge in a swiftly evolving business environment.

Table 1
Sample extracts from the business dataset

Table 2
Comparative analysis of different multi-label classification approaches