Introduction
The Babel Sentiment Project uses a simple 3-label codebook with the classical negative, neutral, and positive labels, corresponding to the information available at poltextlab/xlm-roberta-large-pooled-MORES. For results in Slovak or Czech, the corresponding information is available at visegradmedia-emotion/Emotion_RoBERTa_pooled_V4.
The sentiment model is a consolidated version of the emotion model, where emotions like anger, fear, disgust, and sadness are categorized as negative, while joy is classified as positive. The emotion model, however, does not include a neutral label; instead, it assigns none of the emotions when more than one might be present. Consequently, the sentiment model returns either the label "no sentiment" or "both positive and negative", rather than a distinct neutral category. While the model can classify sentences as a whole, breaking longer sentences down into individual clauses can yield more accurate results.
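The consolidation described above can be illustrated with a minimal sketch. It is only one possible reading of the description, not the project's actual implementation: the function name, the lowercase label strings, and the handling of mixed emotions are assumptions made for illustration.

```python
# Hypothetical label sets, taken from the emotions named in the text above.
NEGATIVE_EMOTIONS = {"anger", "fear", "disgust", "sadness"}
POSITIVE_EMOTIONS = {"joy"}


def emotions_to_sentiment(emotions):
    """Collapse a collection of emotion labels into a single sentiment label."""
    found = {emotion.lower() for emotion in emotions}
    has_negative = bool(found & NEGATIVE_EMOTIONS)
    has_positive = bool(found & POSITIVE_EMOTIONS)
    if has_negative and has_positive:
        return "both positive and negative"
    if has_negative:
        return "negative"
    if has_positive:
        return "positive"
    # No recognised emotion at all: there is no distinct neutral category.
    return "no sentiment"


print(emotions_to_sentiment(["fear", "sadness"]))
print(emotions_to_sentiment([]))
```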
The model was prepared to make predictions on sentence-level data, so for optimal performance you should provide input that is segmented into sentences. If you also intend to submit data to the NER Babel Machine, we recommend submitting it there first: the processed file will be split into sentences, making it a suitable dataset for the Sentiment Babel Machine. The Sentiment Babel Machine currently uses a pooled model that was trained on data in the following languages: Czech, English, French, German, Hungarian, Polish and Slovak. We nevertheless encourage you to submit datasets in languages not on this list, as results may be useful for additional languages due to the nature of large language models.
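If you prefer to segment your text yourself, the sketch below shows one naive way to turn a document into sentence-level rows. It is not the method used by the NER Babel Machine: the regular expression simply splits after ".", "!" or "?" followed by whitespace, so it will mis-split abbreviations, and the function names and the id scheme are assumptions made for illustration.

```python
import re


def split_into_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def to_rows(doc_id, text):
    """Build one row per sentence, with a hypothetical '<doc_id>_<n>' id."""
    sentences = split_into_sentences(text)
    return [
        {"id": f"{doc_id}_{number}", "text": sentence}
        for number, sentence in enumerate(sentences, start=1)
    ]


for row in to_rows("speech1", "Taxes went up. Nobody was pleased! Why now?"):
    print(row)
```

For real datasets, a language-aware sentence splitter is likely to give better results than this rule.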
You can upload your datasets here for automated sentiment coding. If you wish to submit multiple datasets one after another, please wait 5-10 minutes between submissions. There are two upload options: pre-coded datasets or non-coded datasets. The explanation of the form and the dataset requirements is available here.
To upload, please fill in the following form with metadata about the dataset. Please upload your dataset and, in the case of a pre-coded dataset, attach the codebook used alongside it if available.
Non-coded datasets must contain an id and a text column. The column names must be in row 1. You are free to add supplementary variables in the columns following the compulsory ones.
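Before uploading, you may want to check the header row locally. The following is a minimal sketch that assumes the dataset is a comma-separated CSV file; the function name is hypothetical and the check is not the Babel Machine's own validation.

```python
import csv
import io

# Compulsory columns for a non-coded dataset, as listed above.
REQUIRED_COLUMNS = ("id", "text")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from row 1."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Supplementary columns after the compulsory ones are allowed.
sample = io.StringIO("id,text,source\n1,The budget was approved.,web\n")
print(missing_columns(sample))
```

To check a file on disk, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the function.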
Pre-coded datasets must contain the following columns: id, text, label. The column names must be in row 1. Uploading a pre-coded sample is optional, but it can help us calculate performance metrics and fine-tune the language model behind the Sentiment Babel Machine. The detailed validation rules are available here. The label column must be numeric (integer), coded as follows:
- 0: Negative
- 1: No sentiment or Neutral sentiment
- 2: Positive
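The label coding above can be checked locally before upload. This is a minimal sketch under the assumption that the pre-coded dataset is a comma-separated CSV file without multi-line fields; the function name is hypothetical and the official validation rules may be stricter.

```python
import csv
import io

# Label codes as listed above.
VALID_LABELS = {0: "Negative", 1: "No sentiment or Neutral sentiment", 2: "Positive"}
VALID_LABEL_STRINGS = {str(code) for code in VALID_LABELS}


def invalid_label_rows(csv_file):
    """Return the spreadsheet row numbers whose label is not 0, 1 or 2."""
    reader = csv.DictReader(csv_file)
    bad_rows = []
    # Row 1 holds the column names, so data starts at row 2.
    for row_number, row in enumerate(reader, start=2):
        raw = (row.get("label") or "").strip()
        if raw not in VALID_LABEL_STRINGS:
            bad_rows.append(row_number)
    return bad_rows


sample = io.StringIO("id,text,label\n1,Great news,2\n2,Awful result,negative\n")
print(invalid_label_rows(sample))
```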
After you upload your dataset and your file is successfully processed, you will receive the sentiment-coded dataset and a file (in CSV format) that includes the predictions by the poltextlab/xlm-roberta-large-pooled-MORES model.
If the files you would like to upload are larger than 1 GB, we suggest that you split your dataset into multiple parts.
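One way to split a large dataset is to cut it into parts with a fixed number of rows, repeating the column names in row 1 of every part. The sketch below assumes a comma-separated CSV file; the function names, the output file naming, and the choice of a row limit (rather than a size limit in bytes) are assumptions made for illustration.

```python
import csv


def split_rows(rows, max_rows):
    """Yield parts that each start with the header and hold at most max_rows data rows."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:  # empty input: nothing to split
        return
    chunk = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == max_rows:
            yield [header] + chunk
            chunk = []
    if chunk:
        yield [header] + chunk


def split_csv(path, max_rows, prefix="part"):
    """Write the parts of the CSV at `path` to '<prefix>_<n>.csv' and return the file names."""
    written = []
    with open(path, newline="", encoding="utf-8") as source:
        parts = split_rows(csv.reader(source), max_rows)
        for number, part in enumerate(parts, start=1):
            out_path = f"{prefix}_{number}.csv"
            with open(out_path, "w", newline="", encoding="utf-8") as target:
                csv.writer(target).writerows(part)
            written.append(out_path)
    return written
```

Choose max_rows so that each resulting part stays comfortably below the size limit, and remember to wait between the submissions of the individual parts.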
If you have any questions or feedback regarding the Babel Machine, please let us know using our contact form. Please keep in mind that we can only get back to you on Hungarian business days.
Submit a dataset:
The research was supported by the Ministry of Innovation and Technology NRDI Office within the RRF-2.3.1-21-2022-00004 Artificial Intelligence National Laboratory project and received additional funding from the European Union's Horizon 2020 program under grant agreement no. 101008468. We also thank the Babel Machine project and HUN-REN Cloud (Héder et al. 2022; https://science-cloud.hu) for their support. We used the machine learning service of the Slices RI infrastructure (https://www.slices-ri.eu/).
HOW TO CITE: If you use the Babel Machine for your work or research, please cite this paper:
Sebők, M., Máté, Á., Ring, O., Kovács, V., & Lehoczki, R. (2024). Leveraging Open Large Language Models for Multilingual Policy Topic Classification: The Babel Machine Approach. Social Science Computer Review, 0(0). https://doi.org/10.1177/08944393241259434