Summary
I tested one of the latest LLMs (GPT-4o-mini) on the TweetQA dataset that I worked on before. With some prompt engineering, I can send multiple tweet-question pairs to the model and obtain all the predictions in a single request. On the first 120 predictions, the base model achieved an average score of 74% across the BLEU-1, METEOR, and ROUGE-L evaluation metrics. The next step is to test more samples once I can feed thousands of tweet-question pairs to the model smoothly.
Introduction
About two and a half years ago, I was doing a capstone project with three other students to complete my software engineering degree program at Penn State World Campus. A significant part of the project was to train/fine-tune a machine-learning model capable of answering questions based on tweets from the TweetQA dataset. You can find my backup copy of the project report here. Using a fine-tuned BERT base model, we achieved an average score of 0.71 (71%) across the BLEU-1, METEOR, and ROUGE-L evaluation metrics. For comparison, the best model developed by others at that time (2022) could achieve an average score of 80%.
Now (2024), with the hype around the latest Large Language Models (LLMs), I decided to try one of the most popular LLMs (ChatGPT) on the TweetQA dataset, to check how far the foundation models have improved. I started with the cheapest model, gpt-4o-mini, wrapped with the LangChain orchestration framework. To obtain the most accurate and repeatable results, I set the model's temperature to 0. You can find the reference notebook here.
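The basic setup is only a few lines with LangChain. Here is a minimal sketch (assuming the langchain-openai package is installed and OPENAI_API_KEY is set in the environment):

```python
# Minimal LangChain setup for gpt-4o-mini with deterministic output.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0,  # temperature 0 for the most repeatable answers
)
```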
Prompt Engineering? Experiment
My first goal was to build a prompt template with ChatPromptTemplate so that the GPT model would understand that I was trying to get an answer for a given tweet and question pair. The next image shows the initial prompt setup and the response obtained.
I was getting a pretty good response, except that I wanted a precise answer instead of a full sentence. I'm not sure whether what I am doing can be called prompt engineering, but I added some additional context to the initial prompt to improve the model's response.
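The refined prompt is roughly of this shape (a sketch only; the exact wording is in the screenshot, and the `llm` object comes from the setup above):

```python
from langchain_core.prompts import ChatPromptTemplate

# The system message carries the extra context: answer concisely, not in full sentences.
prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You answer a question using only the given tweet. "
     "Reply with the shortest exact answer, not a full sentence."),
    ("human", "Tweet: {tweet}\nQuestion: {question}"),
])

chain = prompt | llm
response = chain.invoke({"tweet": "...", "question": "..."})
print(response.content)
```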
The response looks pretty good now, but I ran into new problems.
I am currently still on the free tier of the OpenAI API, where my rate limit for the gpt-4o-mini model is 3 requests per minute and 200 requests per day. With more than a thousand tweet-question pairs that I would like to test, there is no way I can finish the testing quickly one pair at a time.
However, given that one API request can carry up to 60,000 tokens (based on the rate limit), instead of asking one tweet and one question per request, I might be able to ask one hundred tweets and one hundred questions per request.
The challenge now is: how do I make the model (GPT) understand that I am asking 100 questions about 100 tweets and expect exactly 100 answers back, separated by a delimiter (say "#"), for further processing?
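The packing side of the idea looks roughly like the sketch below (the wording and numbering scheme are illustrative, not my exact template):

```python
def build_batch_prompt(pairs, delimiter="#"):
    """Pack many (tweet, question) pairs into a single prompt body."""
    lines = [
        f"There are {len(pairs)} tweet-question pairs. "
        f"Answer each one with the shortest exact answer. "
        f"Return only the answers, in order, separated by '{delimiter}', "
        f"with no numbering and no extra text."
    ]
    for i, (tweet, question) in enumerate(pairs, start=1):
        lines.append(f"Pair {i}:\nTweet: {tweet}\nQuestion: {question}")
    return "\n\n".join(lines)
```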
I started with a simple trial, as shown in the next image:
The initial response looked good when I asked three tweet-question pairs in a single request. However, when I tried to combine more tweet-question pairs into a single request, I would always get fewer or more answers than expected. It seems GPT cannot keep track of how many tweet-question pairs I asked and return exactly that number of answers. After some experimenting with the initial prompt, I ended up with the following:
This prompt template worked well when I sent up to 40-50 tweet-question pairs combined in one request. However, when I tried to send more pairs, I would get fewer answers than expected, or the answers would come back numbered (like 1.xxx, 2.xxx) instead of separated by the delimiter "#".
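On the parsing side, the response can be split on the delimiter and the count checked, so a drifting batch is at least detected instead of silently misaligning the predictions. A minimal sketch (the exact handling in my notebook may differ):

```python
def parse_batch_answers(text, expected, delimiter="#"):
    """Split a delimiter-separated response and verify one answer per pair."""
    answers = [a.strip() for a in text.split(delimiter) if a.strip()]
    if len(answers) != expected:
        # The model drifted (numbered list, missing answers, etc.);
        # flag the batch so it can be re-sent in smaller chunks.
        raise ValueError(f"expected {expected} answers, got {len(answers)}")
    return answers
```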
Initial Results and Next Plan
With the 120 prediction results from GPT, the initial BLEU-1, METEOR, and ROUGE-L scores average 0.74 (74%). This certainly looks promising: with further prompt optimization and feedback, the model may be able to surpass the highest score achieved before.
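For scoring, the three metrics can be computed with the Hugging Face evaluate library, as in the sketch below (the data shown is a made-up toy example, and my notebook may use different metric implementations, so exact numbers can vary slightly):

```python
import evaluate

# Toy data: one model prediction per gold answer (hypothetical values).
predictions = ["the funeral", "kobe bryant"]
references = ["the funeral", "kobe"]

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

# BLEU-1 uses unigram precision only, hence max_order=1.
bleu1 = bleu.compute(predictions=predictions,
                     references=[[r] for r in references],
                     max_order=1)["bleu"]
meteor_score = meteor.compute(predictions=predictions, references=references)["meteor"]
rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]

print((bleu1 + meteor_score + rouge_l) / 3)  # average of the three metrics
```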
However, I suspect the scores could be lower than they should be, because the answers returned by GPT can come back out of order when I ask too many tweet-question pairs at once. The next step is to test the full validation dataset (more than 1,000 data points) once OpenAI upgrades me to the Tier 1 usage plan.