Fine-Tuning an LLM to Extract Dynamically Specified Attributes

Hello, I am @andre, a machine learning engineer on the AI/LLM team at Mercari.

In a previous article, we discussed how our team utilized commercial LLM APIs to build an initial feature to support our customers and improve the platform’s selling experience.

This article describes one of our past experiments in fine-tuning a 2-billion-parameter large language model (LLM) with QLoRA to extract dynamically specified attributes from user-generated content, and compares its performance with GPT-3.5 Turbo, a much larger model. The results show that the fine-tuned model outperforms the bigger model in extraction quality while being significantly smaller and less costly. We hope this article provides valuable insights into what it takes to fine-tune an LLM effectively.

Background

In a Japanese customer-to-customer (C2C) marketplace, the specific details included in a listing description strongly influence its quality. However, understanding the precise details in a user-generated listing description can be tricky, due to several challenges:

  • Wide variety of user-generated content: Each seller describes their listings differently.
  • Category specificity: What’s essential varies from one category to another.
  • Time sensitivity: User-generated content continuously evolves.

By accurately extracting the key attributes present in listing descriptions, we can gain a deeper understanding of the content written by our customers, in this case the sellers. Figure 1 below illustrates an example of a listing description and the extracted values. For the purpose of this article, the illustration shows a listing written in English; however, most listings on Mercari are written in Japanese. Such insight can also help us guide our customers to enhance their listings, making them more appealing and effective.

Figure 1. Illustration of the extracted attributes from a sample listing description

Why not just use lightweight, conventional, non-LLM models?

  • Dynamic and varied attributes: The way attributes are described can change frequently, leading to high maintenance requirements and continuous model re-training. A model that can handle dynamically specified attributes would go a long way toward reducing this burden.
  • Generalization capability: Large language models (LLMs) have the potential to generalize far better than conventional ML models with much less training data, even for handling out-of-distribution data.
  • Multi-linguality: Most listings on Mercari are written in Japanese; however, given the huge variety of goods being exchanged, there are also listings in other languages, such as English and Chinese. The multilingual capabilities of recent LLMs are expected to handle this variety better than conventional ML models.

On the other hand, why not just use existing commercial LLM APIs?

  • Cost of commercial APIs: Although commercial LLM APIs are becoming more affordable, at the time of writing, the sheer number of requests in a production environment would still make them prohibitively expensive.
  • Control over hallucinations: It’s more difficult to manage and minimize hallucinations purely through prompt engineering with commercial APIs.

Given these considerations, we decided to experiment with fine-tuning our own model. For this experiment, we used a GCP VM instance (a2-ultragpu-1g) with a single 80 GB A100 GPU to fine-tune a large language model using QLoRA. Our short-term goal was to see whether we could build a model that achieves similar or better performance than GPT-3.5 Turbo while being significantly smaller and cheaper to run in production.

Dataset and Base Model

To tackle our task, we first defined the input and output requirements for the model; a code sketch of a single example follows the list below:

  • Input: A text description of the listing and a list of attribute keys to extract. For example:
    • Listing description: A Mercari T-shirt size M, blue. Used only once and kept in a clean wardrobe after.
    • Attribute keys: size, color, original retail price
  • Output: The extracted attributes and their values. For example:
    • Size: M
    • Color: Blue
    • Original retail price: NONE
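
As a concrete illustration, a single training example could be represented roughly as follows before being rendered into a prompt. The field names (description, attr_names, output) are illustrative, not our exact schema:

# Hypothetical structure of one training example (field names are illustrative).
example = {
    "description": (
        "A Mercari T-shirt size M, blue. "
        "Used only once and kept in a clean wardrobe after."
    ),
    "attr_names": ["size", "color", "original retail price"],
    # Attributes missing from the description map to the literal string "NONE".
    "output": "size: M\ncolor: Blue\noriginal retail price: NONE",
}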

To build our dataset, we gathered historical listing descriptions along with their attributes. Since attribute keys can vary across item categories, we started by focusing on the 20 categories with the most listings on our platform.

We structured the data into inputs and outputs and combined these pairs with specific prompts, which were then used to fine-tune the LLMs. We experimented with various prompts written in English and Japanese; however, the prompts generally contained the following components:

  • An initial prompt sentence, telling the model that it will receive an instruction below and instructing it to respond accordingly.
  • The instruction, stating that the model will be given a description text in the context of an online marketplace listing and asking it to extract the values for a given list of attribute keys from that text. It also tells the model to respond in a specific format.
  • The input text, containing the listing description text from which we want to extract attributes.
  • The output text, containing the response text with the attribute keys and the extracted values.

Below is an example of the prompt templates we experimented with, written in Japanese (an English rendering follows):

以下に、あるタスクを説明する指示があり、それに付随する入力が更なる文脈を提供しています。
リクエストを適切に完了するための回答を記述してください。

### 指示:
次の文章はオンラインマーケットプレイスに投稿されているリスティングの情報です。
その文章から{attr_names}の情報を探し出してください。
妥当な情報が存在したら「{attr_name}: <内容>」で応答してください。逆に存在しない場合はかならず「{attr_name}: NONE」で応答してください。

### 入力(文章):
{input}

### 応答:
{output}

In English, the template reads roughly as follows:

Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
The following text is a listing posted on an online marketplace.
Find the information for {attr_names} in the text.
If valid information exists, respond with "{attr_name}: <value>". Conversely, if it does not exist, always respond with "{attr_name}: NONE".

### Input (text):
{input}

### Response:
{output}
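
When preparing the fine-tuning data, this template is filled in programmatically. Below is a minimal sketch of what the create_prompt formatting function (passed to SFTTrainer as formatting_func later in this post) could look like; the record field names match the illustrative example above and are assumptions rather than our exact schema:

def create_prompt(example):
    # Assumed record fields: "description", "attr_names" (list of keys), "output".
    # With packing=True, SFTTrainer calls this once per example and expects a single string.
    attr_names = "、".join(example["attr_names"])
    return (
        "以下に、あるタスクを説明する指示があり、それに付随する入力が更なる文脈を提供しています。\n"
        "リクエストを適切に完了するための回答を記述してください。\n\n"
        "### 指示:\n"
        "次の文章はオンラインマーケットプレイスに投稿されているリスティングの情報です。\n"
        f"その文章から{attr_names}の情報を探し出してください。\n"
        # NOTE: the {attr_name} placeholder is kept verbatim from the template above;
        # the original post does not show how it is expanded per attribute.
        "妥当な情報が存在したら「{attr_name}: <内容>」で応答してください。"
        "逆に存在しない場合はかならず「{attr_name}: NONE」で応答してください。\n\n"
        "### 入力(文章):\n"
        f"{example['description']}\n\n"
        "### 応答:\n"
        f"{example['output']}"
    )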

Once the dataset was ready, our next step was identifying potential LLMs for fine-tuning. The Nejumi Leaderboard for Japanese LMs, curated by the Weights and Biases Japan team, was one of our primary resources. It comprehensively evaluates various large language models’ capabilities in handling Japanese text. After testing and experimenting with several models, we decided to move forward with the gemma-2b-it model provided by the team at Google (paper, HF).

Parameter-efficient fine-tuning with QLoRA

To embark on our fine-tuning journey, we used QLoRA, an approach known for its memory-efficient fine-tuning. As stated in the original paper, QLoRA significantly reduces memory usage, allowing one to fine-tune a 65B-parameter model on a single 48 GB GPU while preserving full 16-bit fine-tuning task performance. The image below illustrates how QLoRA compares to full fine-tuning and LoRA.

Figure 2. Illustration of how fine-tuning with QLoRA works under the hood (adapted from the original figure on QLoRA: Efficient Finetuning of Quantized LLMs)

Now, let’s dive into the fine-tuning process!

Initially, we load the pre-processed dataset previously stored as W&B artifacts into memory.

...
# Download the pre-processed train/test split stored as a W&B artifact
with wandb.init(entity=ENTITY_NAME, project=PROJECT_NAME, job_type=JOB_TYPE_NAME, tags=["hf_sft"]):
    artifact = wandb.use_artifact(ENTITY_NAME+'/'+PROJECT_NAME+'/train_test_split:latest', type='dataset')
    artifact_dir = artifact.download()

# Load the downloaded JSON files with the datasets library
loaded_dataset = load_dataset("json", data_dir=artifact_dir)
train_data = loaded_dataset["train"]
eval_data = loaded_dataset["test"]
...

Then, we define the LoRA configurations (hyperparameters) and target modules. One example of the modules and configurations that we experimented with is as follows:

...
# Apply LoRA to the attention and MLP projection layers, plus the LM head
target_modules = ['q_proj','k_proj','v_proj','o_proj','gate_proj','down_proj','up_proj','lm_head']

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    target_modules=target_modules,
    task_type="CAUSAL_LM",
)
...

Next, we define the fine-tuning hyperparameters and the quantization configuration. The following is an example of the settings we experimented with:

...
training_args = TrainingArguments(
    output_dir=base_dir,
    report_to="wandb",
    save_strategy="epoch",
    evaluation_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    learning_rate=2e-4,
    fp16=True,
    max_grad_norm=0.3,
    warmup_ratio=0.1,
    group_by_length=True,
    lr_scheduler_type="linear",
)

# Load the base model weights in 4-bit NF4 precision with double quantization
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
...

Once the above are set up, we then load the base model and tokenizer from HuggingFace:

...
model_path = "google/gemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_path, add_eos_token=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map='auto', quantization_config=nf4_config,
)
...

We then use the SFTTrainer from HuggingFace to begin fine-tuning:

...
trainer = SFTTrainer(
    model,
    train_dataset=train_data,
    eval_dataset=eval_data,
    packing=True,
    max_seq_length=1024,
    args=training_args,
    formatting_func=create_prompt,
)

# Upcast layer norms to float32 for stability
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

run = wandb.init(entity=ENTITY_NAME, project=PROJECT_NAME, job_type="start_finetuning", config=config)
st = time.time()
trainer.train()
elapsed = time.time() - st
run.log({"elapsed_time (seconds)": elapsed})
run.finish()
...

Finally, we merge and save the fine-tuned model:

...
new_model = NEW_MODEL_PATH_AND_NAME
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)

# Reload the base model in half precision and merge the LoRA adapters into it
base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
)
merged_model = PeftModel.from_pretrained(base_model, new_model)
merged_model = merged_model.merge_and_unload()

# Configure the tokenizer for padding before saving it alongside the merged model
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

merged_model.save_pretrained(new_model + "-merged", safe_serialization=True)
tokenizer.save_pretrained(new_model + "-merged")
...

Post-training Quantization and Model Evaluation

Post-training quantization aims to further shrink the model while maintaining satisfactory performance. For this, we used llama.cpp, an open-source library that enables post-training model quantization and fast LLM inference in C/C++.

Here’s an overview of the steps we followed using llama.cpp for model conversion and quantization. Note that some steps might be outdated by the time of publication, so we recommend referring to the llama.cpp repository for the latest information:

  1. Clone the Repository: Clone the llama.cpp GitHub repository and run the build commands using the appropriate settings. Detailed instructions can be found here.
    • Note: Since support for Gemma models was added around the end of February 2024, ensure you use the correct version of llama.cpp.
  2. Convert the Model: Convert the fine-tuned model, previously stored in the HuggingFace format, to a format compatible with llama.cpp.
  3. Select Quantization Method: Choose the quantization method and start the quantization process. The 4-bit precision method (q4_k_m) worked well for our use case.
  4. Convert and Quantize: Running the conversion and quantization produces a model stored in the GGUF format (a rough sketch of these steps is shown after this list).
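
As a rough illustration of steps 2–4 only, and assuming a llama.cpp checkout from around the time Gemma support was added (the script and binary names have since been renamed in newer versions), the conversion and quantization could be driven from Python roughly as follows; all paths and file names are placeholders:

import subprocess

# Placeholder paths; adjust to your environment. Script and binary names
# reflect an early-2024 llama.cpp checkout and may differ in newer versions.
LLAMA_CPP_DIR = "path/to/llama.cpp"
MERGED_MODEL_DIR = "path/to/finetuned-gemma-2b-it-merged"  # HF-format merged model
F16_GGUF = "finetuned-gemma-2b-it-f16.gguf"
Q4_GGUF = "finetuned-gemma-2b-it-q4_k_m.gguf"

# Step 2: convert the HuggingFace-format model to a 16-bit GGUF file.
subprocess.run(
    ["python", f"{LLAMA_CPP_DIR}/convert-hf-to-gguf.py", MERGED_MODEL_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Steps 3-4: quantize the GGUF file to 4-bit precision (q4_k_m).
subprocess.run(
    [f"{LLAMA_CPP_DIR}/quantize", F16_GGUF, Q4_GGUF, "q4_k_m"],
    check=True,
)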

After the post-training quantization finished, we evaluated the resulting GGUF model and compared its performance. At the time of our experiment, GPT-4o (including the mini variant) was not yet available. Therefore, considering its cost and latency advantages, we chose GPT-3.5 Turbo (specifically, gpt-3.5-turbo-0125) as the baseline model for performance comparison.
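
For running the quantized model during evaluation, one option is the llama-cpp-python bindings. The sketch below is illustrative only; the file name is a placeholder, and create_prompt refers to the formatting-function sketch shown earlier:

from llama_cpp import Llama

# Load the 4-bit GGUF model (file name is a placeholder).
llm = Llama(model_path="finetuned-gemma-2b-it-q4_k_m.gguf", n_ctx=1024)

# Render a listing into the same prompt template used for fine-tuning,
# leaving the response section empty for the model to complete.
prompt = create_prompt({
    "description": "A Mercari T-shirt size M, blue. Used only once.",
    "attr_names": ["size", "color", "original retail price"],
    "output": "",
})

result = llm(prompt, max_tokens=128, temperature=0.0)
print(result["choices"][0]["text"])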

Some key metrics for the evaluation:

  • BLEU Score: This score provided insight into the quality of the extracted attribute values compared to the reference values (a minimal scoring sketch follows this list).
  • Model Size and Latency: We also checked the resulting model size and latency to assess cost-efficiency and readiness for production use.
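
As an illustration of the BLEU computation (not necessarily the exact setup we used), a corpus-level score can be obtained with a library such as sacrebleu; the strings below are toy examples:

import sacrebleu

# Toy model outputs and reference outputs for the same listings.
predictions = [
    "size: M\ncolor: Blue\noriginal retail price: NONE",
    "size: NONE\ncolor: Black\noriginal retail price: 5000 yen",
]
references = [
    "size: M\ncolor: Blue\noriginal retail price: NONE",
    "size: S\ncolor: Black\noriginal retail price: 5000 yen",
]

# corpus_bleu expects a list of hypotheses and a list of reference lists.
# For Japanese outputs, a Japanese-aware tokenizer (e.g. tokenize="ja-mecab")
# would be more appropriate than the default.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")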

Here are some key findings from our quick experiment:

  • The final 4-bit precision GGUF model (q4_k_m) is a QLoRA fine-tuned version of the gemma-2b-it model.
  • The model is approximately 95% smaller than the gemma-2b-it base model downloaded from HuggingFace.
  • The model achieved a BLEU score slightly more than five percentage points higher than gpt-3.5-turbo-0125.
  • Additionally, a rough initial estimate at the time of the experiment showed that using the fine-tuned model could reduce costs by a factor of more than 14 compared to gpt-3.5-turbo-0125. However, given the rapidly changing pricing of commercial models, this figure should be taken with a grain of salt.

In summary, the final model is approximately 95% smaller than the original base model from HuggingFace and achieves a higher BLEU score than gpt-3.5-turbo-0125.

Conclusion

This experiment demonstrates that fine-tuning our own LLM for attribute value extraction from user-generated content is a practical and effective alternative to commercial LLM APIs. By utilizing QLoRA, we managed to fine-tune the gemma-2b-it model efficiently, reducing its size by around 95% compared to the original base model. Despite this significant size reduction, our fine-tuned model still outperformed gpt-3.5-turbo-0125 by achieving a higher BLEU score, validating the approach in terms of both performance and resource efficiency.

Besides the performance improvements and cost savings, this hands-on approach gave us better control over the model's behavior, helping to mitigate issues like hallucinations more effectively than prompt engineering alone. We hope this article offers valuable insights and practical guidance for those looking to fine-tune their own models and move away from expensive, less controllable commercial APIs. By leveraging advancements in large language models and techniques like QLoRA, there are significant opportunities for further development and optimization.