Portfolio

Building a Dataset

1. Converting train.json, test.json, and val.json into amusement.json … fear.json

When push_to_hub() is used, the data ends up uploaded in a folder layout like the one shown below.

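For image and label to be paired the way they are in a typical Hugging Face dataset (pictured below), the contents of these JSON files have to be added to dataset_info.json and state.json.

To get there, the original JSON files first need to be reshaped.

The first script removes every field from each record except the image path, the emotion label, and the image ID: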


import json

def convert_json(input_file, output_file):
    """Keep only the image path, label, and image ID of every record."""
    with open(input_file, 'r', encoding='utf-8') as f:
        data = json.load(f)

    # List that collects the converted records
    converted_data = []

    # Drive prefix to strip from every image path
    prefix = "E:/EmoSet-118K/"

    for item in data:
        # Normalize Windows separators first so the prefix matches either slash style
        image_path = item["image_file"].replace("\\", "/")
        if image_path.startswith(prefix):
            image_path = image_path[len(prefix):]

        # Convert to the new structure
        new_item = {
            "image": image_path,          # path relative to the dataset root
            "label": item["label"],       # emotion label
            "image_id": item["image_id"]  # image identifier
        }
        converted_data.append(new_item)

    # Save as a new JSON file
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(converted_data, f, ensure_ascii=False, indent=4)

# Convert the JSON files
convert_json('train.json', 'new_train.json')
convert_json('val.json', 'new_val.json')
convert_json('test.json', 'new_test.json')
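
To make the transformation concrete, here is a small sketch of what happens to a single record. The sample record, its `brightness` field, and the `convert_record` helper are hypothetical and only for illustration; the sketch also assumes that the stored paths may mix `/` and `\`, so it normalizes the separators before stripping the drive prefix.

```python
def convert_record(item, prefix="E:/EmoSet-118K/"):
    """Return a record that keeps only image path, label, and image ID."""
    # Normalize Windows backslashes so the prefix check works for either style
    path = item["image_file"].replace("\\", "/")
    if path.startswith(prefix):
        path = path[len(prefix):]
    return {"image": path, "label": item["label"], "image_id": item["image_id"]}

# Hypothetical input record; "brightness" stands in for any extra metadata
sample = {
    "image_file": "E:/EmoSet-118K\\image\\amusement\\amusement_00001.jpg",
    "label": "amusement",
    "image_id": "amusement_00001",
    "brightness": 0.5,
}

converted = convert_record(sample)
print(converted)
```

Only the three keys that the later upload step reads survive; everything else in the record is dropped.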

The second script then groups the records by emotion:


import json
import os

# Converted files written by the first script (their records carry the "image" key
# that the upload step reads later)
json_files = ['new_train.json', 'new_val.json', 'new_test.json']

# Dictionary that collects the records for each label
label_data = {
    "amusement": [],
    "anger": [],
    "awe": [],
    "contentment": [],
    "disgust": [],
    "excitement": [],
    "fear": [],
    "sadness": []
}

# Walk through every JSON file and sort its records into the label buckets
for json_file in json_files:
    with open(json_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    for item in data:
        label = item['label']
        if label in label_data:
            label_data[label].append(item)

# Sort each label's records by the 5-digit number at the end of image_id
for label, items in label_data.items():
    label_data[label] = sorted(items, key=lambda x: int(x['image_id'].split('_')[-1]))

# Save the records of each label to their own JSON file
output_dir = './emotion_labeled'
os.makedirs(output_dir, exist_ok=True)

for label, items in label_data.items():
    output_file = os.path.join(output_dir, f'{label}.json')
    with open(output_file, 'w', encoding='utf-8') as f:
        json.dump(items, f, ensure_ascii=False, indent=4)

print("The JSON files have been split by label and sorted.")
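
The sort key in this script converts the trailing part of image_id to an integer. The sketch below, using hypothetical IDs, shows why that matters when the numbers are not zero-padded, and how a quick per-label count can confirm that the split looks reasonable. The `id_number` helper and the sample records are assumptions made for illustration.

```python
from collections import Counter

def id_number(image_id):
    """Numeric suffix of an ID such as 'fear_00010' -> 10."""
    return int(image_id.split("_")[-1])

# Hypothetical, deliberately unpadded IDs to show the difference
ids = ["fear_10", "fear_2", "fear_1"]
as_text = sorted(ids)                    # compares character by character
as_number = sorted(ids, key=id_number)   # compares the parsed integer

# Hypothetical records for a quick per-label count
records = [
    {"label": "fear", "image_id": "fear_00010"},
    {"label": "awe", "image_id": "awe_00002"},
    {"label": "fear", "image_id": "fear_00002"},
]
counts = Counter(r["label"] for r in records)

print(as_text)
print(as_number)
print(dict(counts))
```

With zero-padded 5-digit IDs both orderings agree, so the integer conversion is mainly a safeguard against IDs of uneven length.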

2. Converting to Parquet format and uploading with push_to_hub()

The push_to_hub() function is described below.

GitHub: diffusers/src/diffusers/utils/hub_utils.py at main · huggingface/diffusers



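push_to_hub() is provided by the Hugging Face libraries and pushes model, scheduler, or pipeline files to a repository on the Hugging Face Hub. It is useful for sharing models with the community and for managing model versions. The parameters below follow the linked diffusers source. Note that the dataset upload later in this post uses Dataset.push_to_hub() from the datasets library, which in recent versions accepts repo_id, commit_message, private, token, and create_pr as well, but has no safe_serialization or variant parameter.

repo_id (str):
Required. The name of the repository to push the files to. When pushing to an organization, the organization name must be included. A path to a local directory can also be given.

commit_message (str, optional):
The commit message to use for the push. If omitted, it defaults to "Upload {object}", where {object} is the item being uploaded.

private (bool, optional):
Determines whether the repository is private or public. When set to True, the created (or reused) repository is private and cannot be accessed by others without explicit permission.

token (str, optional):
An optional string used for HTTP bearer authorization. It is normally the token generated when logging in with huggingface-cli login. The docstring gives ~/.huggingface as its storage location; recent versions of huggingface_hub store it at ~/.cache/huggingface/token by default.

create_pr (bool, optional, default False):
Controls whether a Pull Request (PR) is created with the uploaded files. When set to True, a PR is opened instead of committing directly to the repository.

safe_serialization (bool, optional, default True):
Determines whether the model weights are converted to the safetensors format, a safer alternative to the PyTorch .pt format. Because the default is True, the weights are converted automatically unless specified otherwise.

variant (Optional):
Specifies a variant of the model or pipeline, for example "fp16" for the 16-bit floating-point version.

Overall, the function simplifies pushing a machine learning model to the Hugging Face Hub.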

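Of these, the arguments that really need to be set are repo_id and token; token can be left out if you have already logged in with huggingface-cli login.

In addition, Dataset.push_to_hub() takes max_shard_size, which caps the size of each uploaded shard and can make large uploads more stable!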
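
As a rough sketch of how these arguments fit together, the helper below collects them into one dictionary. `build_push_kwargs` and the repository name are hypothetical; the keys are the parameter names discussed above, and the token is read from the HF_TOKEN environment variable on the assumption that it has been set there rather than written into the script.

```python
import os

def build_push_kwargs(repo_id, private=False, max_shard_size="1GB", commit_message=None):
    """Collect keyword arguments for Dataset.push_to_hub() (hypothetical helper)."""
    kwargs = {
        "repo_id": repo_id,
        "private": private,
        "max_shard_size": max_shard_size,
        # None lets the library fall back to the token cached by `huggingface-cli login`
        "token": os.environ.get("HF_TOKEN"),
    }
    # Only pass a commit message when one was given, so the default stays in effect
    if commit_message is not None:
        kwargs["commit_message"] = commit_message
    return kwargs

kwargs = build_push_kwargs("your-username/your-dataset", max_shard_size="500MB")
# Print everything except the token so it never ends up in logs
print({k: v for k, v in kwargs.items() if k != "token"})
```

With a Dataset object in hand, the upload would then be `dataset.push_to_hub(**kwargs)`.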

Uploading with actual code!


import os
import json
from datasets import Dataset, load_dataset, concatenate_datasets, Features, Image, Value

# Hugging Face access token, read from the HF_TOKEN environment variable instead of
# being hard-coded (None falls back to the token cached by huggingface-cli login)
api_token = os.environ.get("HF_TOKEN")

# Emotions to upload
emotions = ["amusement", "anger", "awe", "contentment", "disgust", "excitement", "fear", "sadness"]

# Load the existing dataset, if the repository already holds one
repo_id = "xodhks/Emoset118K"
try:
    existing_dataset = load_dataset(repo_id, split='train')  # load only the 'train' split
except Exception:
    # Repository missing or empty: start from scratch
    existing_dataset = None

# Column schema shared by every emotion subset
features = Features({
    "image": Image(),
    "label": Value("string"),
    "image_id": Value("string")
})

# Add the data for each emotion
for emotion in emotions:
    json_path = f"E:/EmoSet-118K/image/{emotion}/{emotion}.json"
    image_folder = f"E:/EmoSet-118K/image/{emotion}"

    # Load the JSON file
    with open(json_path, 'r', encoding='utf-8') as f:
        emotion_data = json.load(f)

    # Combine the image file paths with the metadata read from the JSON
    data = {
        "image": [],
        "label": [],
        "image_id": []
    }

    for item in emotion_data:
        image_file = os.path.join(image_folder, os.path.basename(item["image"]))
        data["image"].append(image_file)
        data["label"].append(item["label"])
        data["image_id"].append(item["image_id"])

    # Create a new Dataset for this emotion
    new_dataset = Dataset.from_dict(data, features=features)

    # Merge with what has been collected so far
    if existing_dataset is not None:
        existing_dataset = concatenate_datasets([existing_dataset, new_dataset])
    else:
        existing_dataset = new_dataset

# Upload the merged dataset to the Hugging Face Hub
existing_dataset.push_to_hub(
    repo_id=repo_id,  # upload to the same repository
    token=api_token,
    max_shard_size="1GB"
)

print("All emotion datasets have been merged and uploaded successfully.")

This is the result of running it.

3. Checking that the upload reached Hugging Face

The upload can be confirmed!
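
Beyond looking at the Hub page, the uploaded rows can also be checked from code. The sketch below counts rows per label and fails on a missing column or an unknown label. `summarize_rows` and the sample rows are hypothetical; the column names and the eight labels are the ones used in the scripts above.

```python
from collections import Counter

EXPECTED_LABELS = {
    "amusement", "anger", "awe", "contentment",
    "disgust", "excitement", "fear", "sadness",
}
REQUIRED_COLUMNS = ("image", "label", "image_id")

def summarize_rows(rows):
    """Count rows per label; raise if a column is missing or a label is unexpected."""
    counts = Counter()
    for row in rows:
        missing = [c for c in REQUIRED_COLUMNS if c not in row]
        if missing:
            raise KeyError(f"row is missing columns: {missing}")
        if row["label"] not in EXPECTED_LABELS:
            raise ValueError(f"unexpected label: {row['label']!r}")
        counts[row["label"]] += 1
    return dict(counts)

# Hypothetical rows standing in for the downloaded dataset
sample_rows = [
    {"image": "amusement_00001.jpg", "label": "amusement", "image_id": "amusement_00001"},
    {"image": "fear_00001.jpg", "label": "fear", "image_id": "fear_00001"},
    {"image": "fear_00002.jpg", "label": "fear", "image_id": "fear_00002"},
]

summary = summarize_rows(sample_rows)
print(summary)
```

For the real check, the rows could come from `load_dataset("xodhks/Emoset118K", split="train")`, which yields one dict per row when iterated. Iterating decodes every image, so trying a small slice such as `split="train[:100]"` first is a sensible start.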

2024/09/02 - Building the training dataset, building the test dataset (upload planned)
