In a previous blog about How to discover the main topics in your blogs through NLP, we explored the importance of document classification through tags or keywords. Furthermore, we addressed the question of how to select appropriate tags and the strategic advantage they offer from a company's marketing and SEO point of view, emphasizing that these tags should be chosen by the company's administrators and not by the authors themselves. We then demonstrated how to semi-automate this process by using LDA topic modeling for tag selection, followed by manual tag curation.
Now, we will describe how to utilize OpenAI's GPT API, specifically GPT-4, to perform blog classification and assign relevant tags to Rootstrap’s blog collection. This collection currently contains over 250 posts covering a wide range of topics, including technical, social, and marketing subjects.
Blog classification with OpenAI API
Now it's classification time! The idea is to assign one or more of the curated tags to each blog post, and to create a tool that infers or suggests tags for new incoming documents. One option was to use the model trained with LDA to do the classification. The LDA model can report the most probable topic and the most important keywords for each document, so in principle it could be used to classify new documents. Unfortunately, it cannot perform the classification task automatically: since the tags were manually curated after topic modeling, the LDA results would still require manual verification after each classification to adjust the assigned tags accordingly.
That is when we came up with the idea of giving OpenAI's language models a try. Although GPT's ultimate objective is simply to predict the next word in a sentence, from that simple premise it achieves remarkable results in a wide variety of tasks. It has demonstrated excellent abilities in coding (code analysis, code generation, and test writing) and in writing-related tasks (summarization, text correction, style transfer), while showing slightly less accuracy on information retrieval.
Therefore, we decided to try using GPT-4 to classify each blog into the different tags that we had already preselected. To do this, we used OpenAI's Chat Completion API with the GPT-4 model. Before feeding the blog's body to the model, we cleaned the text to remove noise such as HTML tags.
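As an illustration, a minimal version of that cleanup could look like the sketch below; the use of BeautifulSoup and the helper name are assumptions for illustration, since the exact preprocessing code is not part of this post.

```python
# A minimal sketch of the cleaning step, assuming the blog body arrives as HTML.
# BeautifulSoup and the helper name are illustrative choices, not the exact pipeline we used.
import re

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def clean_blog_body(html_body: str) -> str:
    """Strip HTML tags and collapse extra whitespace from a blog body."""
    text = BeautifulSoup(html_body, "html.parser").get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()
```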
Using the OpenAI API
Using the OpenAI API in Python is straightforward. The first step is installing the openai library through pip:
CODE: https://gist.github.com/santit96/63223588a5bb684a8539b53d987f57cf.js?file=install_openai.sh
Once installed, you can make use of the OpenAI API by calling the chat completion method and providing the necessary parameters: the instructions or system prompt (which can set the model’s behavior and personality), the user or content prompt (the input or request for the model to respond to), the desired model (in this case, GPT-4), and any additional parameters as required. Here is an example of what the code may look like:
CODE: https://gist.github.com/santit96/63223588a5bb684a8539b53d987f57cf.js?file=openai_api.py
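A minimal version of such a call, assuming the openai Python SDK version 1.0 or later, could look like this sketch (the prompt contents and variable names are placeholders rather than the exact code from the gist):

```python
# A minimal sketch of a chat completion call (openai SDK >= 1.0 assumed).
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

system_prompt = "You are a blog classification assistant..."  # placeholder instructions
blog_text = "..."  # placeholder: the cleaned blog body

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": blog_text},
    ],
    temperature=0,  # optional: make the output more deterministic
)
print(response.choices[0].message.content)
```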
It's important to note that to use the OpenAI API, you must have an API key and credits in your OpenAI account. It is also advisable to handle the exceptions that the API may raise. A list of these exceptions and detailed information on how to use the API can be found in the official OpenAI documentation.
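As an illustration, the sketch below wraps an API call with a simple retry policy; the exception classes assume the openai SDK version 1.0 or later, and the backoff strategy is only one possible choice.

```python
# A sketch of defensive error handling around an OpenAI API call
# (exception classes from the openai SDK >= 1.0; the retry policy is illustrative).
import time

import openai

def with_retries(api_call, retries: int = 3):
    """Run an OpenAI API call, retrying transient failures with exponential backoff."""
    for attempt in range(retries):
        try:
            return api_call()
        except (openai.RateLimitError, openai.APIConnectionError):
            time.sleep(2 ** attempt)  # transient failure: back off and retry
        except openai.APIError as error:
            print(f"OpenAI API error: {error}")  # non-transient: give up
            break
    return None

# Usage with the call from the previous example:
# content = with_retries(
#     lambda: client.chat.completions.create(model="gpt-4", messages=messages)
#     .choices[0].message.content
# )
```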
Prompt building
The most challenging aspect of our project was constructing the system prompt. The objective was to instruct the model to generate tags for each blog, chosen from a predetermined list of tags. At the beginning, the large language model (LLM) produced accurate tags for the blogs, but some of those tags were not included in the provided list: the model simply invented them! To address this issue, we emphasized in the prompt that tags had to be selected exclusively from the provided list.
After several iterations of refining the prompt, we finally succeeded in guiding GPT-4 to generate the desired output. Our solution involved instructing the model to return all the tags from the list, along with the probability that each tag belongs to the document. To achieve this, we provided the model with an example of the desired output and instructed it to incorporate its own probabilities while copying the provided output format. Below is the final prompt we used:
CODE: https://gist.github.com/santit96/63223588a5bb684a8539b53d987f57cf.js?file=prompt.md
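The exact wording is in the gist above; purely as an illustration, assembling a system prompt of that shape in Python might look like the sketch below (the tag list, wording, and example output format are hypothetical stand-ins, not the prompt we actually used):

```python
# A hypothetical sketch of how a prompt of this kind could be assembled.
# The tag list, wording, and example output format are illustrative only.
CURATED_TAGS = ["Artificial Intelligence", "Web Development", "Marketing", "Culture"]

EXAMPLE_OUTPUT = "\n".join(f"{tag}: 0.0" for tag in CURATED_TAGS)

def build_system_prompt(tags: list[str]) -> str:
    return (
        "You are a blog classification assistant.\n"
        f"Classify the blog into these tags, and ONLY these tags: {', '.join(tags)}.\n"
        "Return every tag in the list together with the probability (between 0 and 1) "
        "that it applies to the blog, one tag per line, copying this format exactly:\n"
        f"{EXAMPLE_OUTPUT}"
    )

system_prompt = build_system_prompt(CURATED_TAGS)
```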
Results
For each blog, GPT-4 correctly output the list of all the tags with their probabilities, without missing any tags and without adding extra ones.
For example, for an AI-related blog, this is the (cropped) output of the OpenAI API:
CODE: https://gist.github.com/santit96/63223588a5bb684a8539b53d987f57cf.js?file=api_response.md
After obtaining the output for all the blogs, we applied a threshold on the probabilities to filter the tags, so that only the tags with a high probability were assigned to each document.
However, in some cases the assigned probabilities were low even for the highest-ranked tags, despite instructing the model to include at least one tag with a high probability. We had to set a special threshold to handle these particular cases.
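A simple way to implement that filtering is sketched below. It assumes the API response contains one "tag: probability" line per tag, as in the cropped response above, and the threshold values (including the lower fallback used for the special cases) are illustrative rather than the ones we actually used.

```python
# A sketch of the threshold-based filtering step.
# Assumes one "tag: probability" line per tag in the model's response;
# the threshold values are illustrative.
def parse_tag_probabilities(api_output: str) -> dict[str, float]:
    """Parse lines of the form 'Tag Name: 0.85' into a {tag: probability} dict."""
    tag_probs = {}
    for line in api_output.strip().splitlines():
        name, sep, prob = line.rpartition(":")
        if not sep:
            continue  # skip lines without a tag/probability pair
        try:
            tag_probs[name.strip()] = float(prob)
        except ValueError:
            continue
    return tag_probs

def select_tags(tag_probs: dict[str, float], threshold: float = 0.7, fallback: float = 0.4) -> list[str]:
    """Keep high-probability tags; fall back to a lower bar when none pass."""
    selected = [tag for tag, p in tag_probs.items() if p >= threshold]
    if not selected:  # the special case where even the top tags score low
        selected = [tag for tag, p in tag_probs.items() if p >= fallback]
    return selected
```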
Summary
In this blog, we demonstrated how to perform blog classification into tags using GPT-4. The process involved topic modeling with LDA to identify the main topics and keywords within a collection of blogs. We manually curated the tags based on the topics obtained and prepared the data for blog classification. We then utilized the OpenAI API, specifically GPT-4, to generate tags for each blog post. Building an effective prompt was a significant challenge, as we needed to ensure that the model only generated tags from the provided list. After refining the prompt and instructing the model to include all tags with their respective probabilities, we obtained the desired output.
In conclusion, GPT-4 is a valuable tool for various text-related tasks, including classification. By constructing the right prompt, we were able to create an effective blog classification application with GPT-4 without training a task-specific model. This demonstrates its versatility: it can handle diverse machine learning tasks beyond its primary objective of language generation.