Portfolio

Code Review

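This code review looks at 신아현's code, which ran into problems with its dataset.

Problems with the previous code

An instruction dataset was used as the sample data, and it was reported not to work.

This may come from structural differences in the dataset combined with the choice of model. Text models also tend to need a lot of VRAM, so that has to be taken into account when running on Colab.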
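
To make the VRAM point concrete, here is a rough back-of-the-envelope sketch. It assumes full fine-tuning in fp32 with an Adam-style optimizer (weights, gradients, and two optimizer states per parameter) and ignores activations entirely, so it is only a lower bound. The function name and the parameter counts are illustrative assumptions, not measurements from this notebook.

```python
def estimate_training_memory_gib(n_params, bytes_per_param=4, optimizer_states=2):
    """Rough lower bound on training memory, ignoring activations.

    Counts one copy each for weights and gradients, plus `optimizer_states`
    extra copies (2 for Adam-style optimizers). All values are assumptions.
    """
    copies = 1 + 1 + optimizer_states
    return n_params * bytes_per_param * copies / 1024**3


# Assumed parameter counts, for illustration only
for name, n_params in [("~66M-parameter model", 66_000_000),
                       ("~7B-parameter model", 7_000_000_000)]:
    gib = estimate_training_memory_gib(n_params)
    print(f"{name}: at least {gib:.1f} GiB before activations")
```

Even this crude estimate shows why a small encoder fits comfortably on a Colab GPU while a multi-billion-parameter text model does not without tricks such as LoRA or quantization.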

Revised code


import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer, AutoModelForSequenceClassification
from datasets import load_dataset, Dataset  # Dataset is needed for Dataset.from_dict below
from sklearn.model_selection import train_test_split
from peft import get_peft_model, LoraConfig, TaskType

# Print the version of each library.
import transformers
import accelerate
import evaluate
import peft


print(f"Transformers version: {transformers.__version__}")
print(f"Accelerate version: {accelerate.__version__}")
print(f"PEFT version: {peft.__version__}")

Transformers version: 4.42.4
Accelerate version: 0.32.1
PEFT version: 0.12.0


dataset = load_dataset("MBZUAI/LaMini-instruction", split="train[:5000]")

#############################################################################
# Use the code below when only part of the dataset is loaded via split
# Print the dataset's features
print("Features:")
for feature, dtype in dataset.features.items():
    print(f" - {feature}: {dtype}")

# Print the first sample
print("First sample:")
print(dataset[0])

# Print the number of unique values in each column
for feature in dataset.features.keys():
    unique_values = dataset.unique(feature)
    print(f" - {feature} has {len(unique_values)} unique values.")
#############################################################################


# #############################################################################
# # Use this code when calling load_dataset without a split argument.
# # Inspect each split of the dataset.
# for split in dataset.keys():
#     print(f"Split: {split}")
#     # Print the dataset's features
#     print("Features:")
#     for feature, dtype in dataset[split].features.items():
#         print(f" - {feature}: {dtype}")

#     # Print the first sample
#     print("First sample:")
#     print(dataset[split][0])

#     # Print the number of unique values in each column
#     for feature in dataset[split].features.keys():
#         unique_values = dataset[split].unique(feature)
#         print(f" - {feature} has {len(unique_values)} unique values.")
#     print("\n")
# #############################################################################

Features:
 - instruction: Value(dtype='string', id=None)
 - response: Value(dtype='string', id=None)
 - instruction_source: Value(dtype='string', id=None)
First sample:
{'instruction': 'List 5 reasons why someone should learn to code', 'response': '1. High demand for coding skills in the job market\n2. Increased problem-solving and analytical skills\n3. Ability to develop new products and technologies\n4. Potentially higher earning potential\n5. Opportunity to work remotely and/or freelance', 'instruction_source': 'alpaca'}
 - instruction has 5000 unique values.
 - response has 4961 unique values.
 - instruction_source has 1 unique values.
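
The unique-value counts matter for what follows: the code further down turns every distinct `response` string into its own class label, so 4961 unique responses out of 5000 rows means almost every class has a single example. After a random 90/10 split, most held-out labels will probably never have been seen in training. The sketch below measures that on hypothetical toy data; the function name and the sample strings are assumptions made for illustration.

```python
def unseen_label_fraction(train_labels, eval_labels):
    """Fraction of eval examples whose label never occurs in the training labels."""
    if not eval_labels:
        return 0.0
    seen = set(train_labels)
    unseen = sum(1 for label in eval_labels if label not in seen)
    return unseen / len(eval_labels)


# Hypothetical toy data standing in for the 'output' column
train_labels = ["resp A", "resp B", "resp C", "resp A"]
eval_labels = ["resp A", "resp D", "resp E"]

print(f"unseen eval labels: {unseen_label_fraction(train_labels, eval_labels):.0%}")
```

Running the same check on the real `train_dataset['output']` and `eval_dataset['output']` columns would show how much of the eval set is learnable at all under this framing.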


# Load and preprocess the dataset (using split)
dataset = load_dataset("MBZUAI/LaMini-instruction", split="train[:5000]")
dataset = dataset.map(lambda x: {'input': x['instruction'], 'output': x['response']})
dataset = dataset.remove_columns([col for col in dataset.column_names if col not in ['input', 'output']])
train_data, eval_data = train_test_split(dataset, test_size=0.1)

# Convert the dicts back into Dataset objects
train_dataset = Dataset.from_dict(train_data)
eval_dataset = Dataset.from_dict(eval_data)

# The code below is an example for the case where split is not used.
# # Load and preprocess the dataset (without split)
# dataset = dataset['train'].map(lambda x: {'input': x['instruction'], 'output': x['response']})
# dataset = dataset.remove_columns([col for col in dataset.column_names if col not in ['input', 'output']])
# train_data, eval_data = train_test_split(dataset, test_size=0.1)
#
# # Convert the dicts back into Dataset objects
# train_dataset = Dataset.from_dict(train_data)
# eval_dataset = Dataset.from_dict(eval_data)


# Map each label to a unique integer
unique_labels = list(set(train_dataset['output']) | set(eval_dataset['output']))
label_to_id = {label: idx for idx, label in enumerate(unique_labels)}

# Load the tokenizer
# Important: ViT is an image classification model, so it cannot classify text
model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

def preprocess_function(examples):
    inputs = examples['input']
    targets = examples['output']
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    model_inputs['labels'] = [label_to_id[label] for label in targets]  # convert labels to integers
    return model_inputs

train_dataset = train_dataset.map(preprocess_function, batched=True)
eval_dataset = eval_dataset.map(preprocess_function, batched=True)

# Load the model and configure LoRA
num_labels = len(unique_labels)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)
config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,
    lora_alpha=16,
    target_modules=["classifier"],
    lora_dropout=0.1,
    bias="none"
)
lora_model = get_peft_model(model, config)

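# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",  # deprecated name; the FutureWarning in the output below suggests `eval_strategy`
    save_strategy="epoch",
    learning_rate=5e-4,
    ######################################
    # Smaller batch size to avoid running out of memory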
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    per_device_eval_batch_size=4,
    ######################################
    num_train_epochs=3,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    fp16=True
)

import evaluate
import numpy as np
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

torch.cuda.empty_cache()

# Define the Trainer and train
trainer = Trainer(
    model=lora_model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()


Map: 100% 4500/4500 [00:01<00:00, 2999.98 examples/s]

Map: 100% 500/500 [00:00<00:00, 2558.59 examples/s]

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/usr/local/lib/python3.10/dist-packages/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
  
[1686/1686 01:21, Epoch 2/3]

| Epoch | Training Loss | Validation Loss | Accuracy |
| --- | --- | --- | --- |
| 0 | 8.619600 | 8.621735 | 0.000000 |
| 2 | 8.446500 | 8.899992 | 0.002000 |

`TrainOutput(global_step=1686, training_loss=8.507365922486654, metrics={'train_runtime': 82.1928, 'train_samples_per_second': 164.248, 'train_steps_per_second': 20.513, 'total_flos': 532769187373056.0, 'train_loss': 8.507365922486654, 'epoch': 2.997333333333333})`
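
Two quick sanity checks help read this output. First, the step count: with 4500 training rows, a per-device batch of 4, and gradient accumulation of 2, the reported `global_step=1686` is what one would expect over 3 epochs. Second, the loss: if `num_labels` equals the 4961 unique responses reported earlier (an assumption, though it follows from how `unique_labels` is built), then a model guessing uniformly would have a cross-entropy of about ln(4961), which is very close to the training loss shown. The helper names below are hypothetical, and the step formula assumes a single device with the per-epoch count rounded down.

```python
import math


def optimizer_steps(n_train, per_device_batch, grad_accum, epochs):
    """Expected optimizer updates, assuming one device and floor division per epoch."""
    batches_per_epoch = math.ceil(n_train / per_device_batch)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


def uniform_guess_loss(num_classes):
    """Cross-entropy of a classifier that assigns equal probability to every class."""
    return math.log(num_classes)


print("expected optimizer steps:", optimizer_steps(4500, 4, 2, 3))
print("loss of a uniform guess over 4961 classes:", round(uniform_guess_loss(4961), 3))
```

Read this way, the run looks like it learned essentially nothing beyond chance, which fits the near-zero accuracy in the table.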


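# Save the model and tokenizer locally
# Note: `model` is the base model that get_peft_model wrapped in place, so this
# checkpoint stores PEFT-style parameter names (see the load warnings further down).
model.save_pretrained("./trained_model")
tokenizer.save_pretrained("./trained_model")

print("The model and tokenizer have been saved locally.")

The model and tokenizer have been saved locally.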


from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import os

# Log test
print("Log test - if this message is printed, logging is working correctly.")

# Load the model and tokenizer
try:
    print("Loading model...")
    model_path = "./trained_model"
    if not os.path.exists(model_path):
        print("Model path does not exist:", model_path)
        raise FileNotFoundError(f"Model path does not exist: {model_path}")

    model = AutoModelForSequenceClassification.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    print("Model and tokenizer loaded")
except Exception as e:
    print("Failed to load model:", e)
    raise e

# Example text input
input_text = "This is a test sentence."

# Tokenize and preprocess the text
try:
    print("Tokenizing text...")
    encoding = tokenizer(input_text, return_tensors="pt")
    print("Text tokenized")
except Exception as e:
    print("Failed to tokenize text:", e)
    raise e

# Move the model and the inputs to the same device
try:
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Device: {device}")
    model.to(device)
    encoding = {k: v.to(device) for k, v in encoding.items()}
    print("Encoding moved to device:", encoding)
except Exception as e:
    print("Failed to move to device:", e)
    raise e

# Run the model prediction
try:
    print("Running model prediction...")
    model.eval()
    with torch.no_grad():
        outputs = model(**encoding)
    logits = outputs.logits
    print("Prediction done")
except Exception as e:
    print("Model prediction failed:", e)
    raise e

# Print the predicted class
try:
    predicted_class_idx = logits.argmax(-1).item()
    predicted_class = model.config.id2label[predicted_class_idx]
    print("Predicted class:", predicted_class)
except Exception as e:
    print("Failed to print predicted class:", e)
    raise e


Log test - if this message is printed, logging is working correctly.
Loading model...

Some weights of the model checkpoint at ./trained_model were not used when initializing DistilBertForSequenceClassification: ['classifier.modules_to_save.default.base_layer.bias', 'classifier.modules_to_save.default.base_layer.weight', 'classifier.modules_to_save.default.lora_A.default.weight', 'classifier.modules_to_save.default.lora_B.default.weight', 'classifier.original_module.base_layer.bias', 'classifier.original_module.base_layer.weight', 'classifier.original_module.lora_A.default.weight', 'classifier.original_module.lora_B.default.weight', 'pre_classifier.modules_to_save.default.bias', 'pre_classifier.modules_to_save.default.weight', 'pre_classifier.original_module.bias', 'pre_classifier.original_module.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at ./trained_model and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Model and tokenizer loaded
Tokenizing text...
Text tokenized
Device: cuda
Encoding moved to device: {'input_ids': tensor([[ 101, 2023, 2003, 1037, 3231, 6251, 1012,  102]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}
Running model prediction...
Prediction done
Predicted class: LABEL_615
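
The two load warnings above are worth reading together. The checkpoint holds parameter names that carry PEFT wrappers (`modules_to_save`, `original_module`, `lora_A`, ...), while a plain `DistilBertForSequenceClassification` expects names such as `classifier.weight`. Since nothing matched, the classification head appears to have been freshly initialized, so the prediction above most likely comes from an untrained head. The sketch below reproduces that mismatch with a plain set difference; the key lists are abbreviated from the warnings and the helper name is hypothetical.

```python
def diff_state_dict_keys(saved_keys, expected_keys):
    """Return (unused, missing): keys only in the checkpoint, keys only in the model."""
    saved, expected = set(saved_keys), set(expected_keys)
    return sorted(saved - expected), sorted(expected - saved)


# Abbreviated from the warnings above (weights only, biases omitted)
saved_keys = [
    "classifier.modules_to_save.default.base_layer.weight",
    "classifier.original_module.base_layer.weight",
    "pre_classifier.modules_to_save.default.weight",
]
expected_keys = ["classifier.weight", "pre_classifier.weight"]

unused, missing = diff_state_dict_keys(saved_keys, expected_keys)
print("in checkpoint but unused:", unused)
print("expected but newly initialized:", missing)
```

A likely remedy, untested here, is to save through the PEFT wrapper instead of the base model, for example `lora_model.save_pretrained(...)`, or to call `lora_model.merge_and_unload()` before saving so the checkpoint uses the plain parameter names.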

Attached notebook: TextModel_LoRa.ipynb (45.8KB)

The model answered with LABEL_615, but I could not work out exactly what that means.
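
For what it's worth, `LABEL_<n>` is the default name transformers assigns to class index n when the model config has no custom `id2label`, so `LABEL_615` just means "class index 615". In this notebook that index points at whichever response string `label_to_id` mapped to 615 during training. Because `unique_labels` was built from a `set` of strings, the ordering can differ between Python sessions, so the mapping would have to be saved alongside the model to decode predictions later. A minimal sketch with hypothetical helper names and toy labels, using `sorted` to make the mapping reproducible:

```python
def build_label_maps(labels):
    """Build a reproducible label <-> id mapping (sorted instead of raw set order)."""
    unique = sorted(set(labels))
    label_to_id = {label: idx for idx, label in enumerate(unique)}
    id_to_label = {idx: label for label, idx in label_to_id.items()}
    return label_to_id, id_to_label


def decode_prediction(predicted_name, id_to_label):
    """Turn a default class name like 'LABEL_2' back into the original label text."""
    idx = int(predicted_name.rsplit("_", 1)[1])
    return id_to_label.get(idx)  # None if the index is not in the mapping


# Hypothetical toy labels standing in for the 'output' column
label_to_id, id_to_label = build_label_maps(["resp B", "resp A", "resp C", "resp A"])
print(decode_prediction("LABEL_2", id_to_label))
```

Passing `id2label=id_to_label` and `label2id=label_to_id` to `from_pretrained` when creating the model should also store the mapping in the config, so the prediction prints the response text directly.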

2024/8/4 - Code Review
