아빠는 개발자

[tensorflow 2] Text embedding A/B TEST - 1

father6019 2024. 8. 19. 21:50
TensorFlow embedding A/B test

Two TensorFlow embedding models are each indexed in two different ways, and the results are compared.

# Model API
A : https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3
B : https://tfhub.dev/google/universal-sentence-encoder-multilingual/3

Common settings:

  • Indexed as 512-dimensional dense vectors
  • Compared with cosineSimilarity
  • 2,919 product names with category data
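
The common settings above could be expressed as an Elasticsearch index body roughly like the one below. This is a minimal sketch, not the mapping actually used in this post: the field names `name`, `category`, and `name_vector` come from the search script in get_data.py, while the `text` types and the helper name `build_index_body` are assumptions.

```python
# Sketch of an index body for the 512-dimensional dense vectors.
# Assumption: "text" types for name/category; only dims=512 and the
# field names are taken from the post.

def build_index_body(dims=512):
    """Return a mapping with the product fields and one dense_vector field."""
    return {
        "mappings": {
            "properties": {
                "name": {"type": "text"},
                "category": {"type": "text"},
                "name_vector": {"type": "dense_vector", "dims": dims},
            }
        }
    }


index_body = build_index_body()
print(index_body["mappings"]["properties"]["name_vector"])
```

With the 7.x Python client, a body like this could be passed to `client.indices.create(index="products_a", body=index_body)` before bulk indexing.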

Candidate 1

 Extract vectors from name (product name) using model A.

Candidate 2

 Extract vectors from name (product name) using model B.

Candidate 3

Extract vectors from name (product name) combined with category, using model A.

Candidate 4

Extract vectors from name (product name) combined with category, using model B.
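
The four candidates differ only in the model and in the text that is fed to it. Below is a minimal sketch of the text-building step. It assumes name and category are joined with a single space (the post does not say how they are combined), and `build_embedding_text` is a hypothetical helper name.

```python
def build_embedding_text(product, use_category):
    """Return the string that would be embedded for one product.

    use_category=False -> candidates 1 and 2 (name only)
    use_category=True  -> candidates 3 and 4 (name + category)
    The single-space separator is an assumption.
    """
    name = product["name"].strip()
    if not use_category:
        return name
    return f"{name} {product['category'].strip()}"


# Sample product taken from the search results in this post.
sample = {
    "name": "아이폰 케이블 보호캡",
    "category": "디지털/가전 휴대폰액세서리 기타휴대폰액세서리",
}
print(build_embedding_text(sample, use_category=False))
print(build_embedding_text(sample, use_category=True))
```

The resulting string would then be passed to model A or model B to produce the vector stored in `name_vector`.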


Tournament method

Generate 100 queries by combining indexed words and categories, compare the top 3 results for each query, and pick a winner.

 

Data sample: 2,919 documents indexed

Product data

Hmm... what criterion should decide the winner?
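
One possible criterion, offered only as a sketch: because each test query is built from an indexed word plus a category, the expected category is known, so each side can be scored by how many of its top 3 hits fall in that category. The helper names (`category_hits`, `judge`, `run_tournament`) and the prefix-match rule are assumptions, not the method used in this post.

```python
from collections import Counter

TOP_K = 3  # the post compares up to the top 3 results


def category_hits(hits, expected_category, top_k=TOP_K):
    """Count hits in the top_k whose category starts with the expected one."""
    return sum(
        1 for hit in hits[:top_k]
        if hit["category"].startswith(expected_category)
    )


def judge(hits_a, hits_b, expected_category):
    """Return 'A', 'B', or 'draw' for a single query."""
    score_a = category_hits(hits_a, expected_category)
    score_b = category_hits(hits_b, expected_category)
    if score_a == score_b:
        return "draw"
    return "A" if score_a > score_b else "B"


def run_tournament(rounds):
    """Tally winners over (hits_a, hits_b, expected_category) tuples."""
    return Counter(judge(a, b, cat) for a, b, cat in rounds)


# Categories copied from the "에어팟 충전" results in this post; the expected
# category is chosen by hand for illustration.
hits_a = [
    {"category": "디지털/가전 휴대폰액세서리 휴대폰거치대"},
    {"category": "패션잡화 남성신발 운동화 러닝화"},
    {"category": "패션잡화 여성신발 운동화 러닝화"},
]
hits_b = [
    {"category": "디지털/가전 휴대폰액세서리 휴대폰거치대"},
    {"category": "디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스"},
    {"category": "디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스"},
]
result = run_tournament([(hits_a, hits_b, "디지털/가전 휴대폰액세서리")])
print(result)
```

A category-based rule like this is easy to automate over 100 queries, but it ignores ranking order within the top 3 and says nothing about relevance inside the right category, so it would most likely need a manual check alongside it.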

 

a VS b

검색어 : 아이폰 케이스 11 프로
CASE A :
name: 케이맥스 아이폰 11 프로용 클리어핏 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.8579223
name: 슈피겐 아이폰 11 프로용 씬핏에어 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.8412657
name: Apple 아이폰 11 프로 MAX용 레더 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 가죽케이스, score: 1.8381073

CASE B :
name: araree 아이폰 11 프로용 마하 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.9075938
name: 나하로 아이폰 11 프로용 리얼하이브리드 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.8805447
name: 베리어 아이폰 11 프로용 네오 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.8703146

 

검색어 : 아이폰 충전
CASE A :
name: 랩씨 3in1 애플워치 갤럭시워치 아이폰 무선 충전기, category: 디지털/가전 휴대폰액세서리 휴대폰충전기 충전기, score: 1.6187584
name: 베이드플러스 아이폰 애플워치 에어팟 3in1 충전 거치대, category: 디지털/가전 휴대폰액세서리 휴대폰거치대, score: 1.6004881
name: 아이폰 케이블 보호캡, category: 디지털/가전 휴대폰액세서리 기타휴대폰액세서리, score: 1.5367465

CASE B :
name: 아이폰 케이블 보호캡, category: 디지털/가전 휴대폰액세서리 기타휴대폰액세서리, score: 1.59999
name: 랩씨 3in1 애플워치 갤럭시워치 아이폰 무선 충전기, category: 디지털/가전 휴대폰액세서리 휴대폰충전기 충전기, score: 1.5793183
name: 베이드플러스 아이폰 애플워치 에어팟 3in1 충전 거치대, category: 디지털/가전 휴대폰액세서리 휴대폰거치대, score: 1.5563188

 

검색어 : 에어팟 충전
CASE A :
name: 베이드플러스 아이폰 애플워치 에어팟 3in1 충전 거치대, category: 디지털/가전 휴대폰액세서리 휴대폰거치대, score: 1.5089552
name: 해외나이키 에어 줌 펄스 CT1629-003, category: 패션잡화 남성신발 운동화 러닝화, score: 1.3815312
name: 나이키 에어 리프트 브리드 써밋 DJ4639-121, category: 패션잡화 여성신발 운동화 러닝화, score: 1.3518461

CASE B :
name: 베이드플러스 아이폰 애플워치 에어팟 3in1 충전 거치대, category: 디지털/가전 휴대폰액세서리 휴대폰거치대, score: 1.4476553
name: 베리어 아이폰 X용 에어백 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.3893787
name: 데일리어스 아이폰 11용 에어백 카드 범퍼 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.3526618

 

 

a2 VS b2

검색어 : 아이폰 케이스

CASE A :
name: 뷰에스피 아이폰 7용 올레포빅 액정보호필름, category: 디지털/가전 휴대폰액세서리 휴대폰보호필름 액정보호필름, score: 1.7229252
name: 오블릭 아이폰 X / 아이폰 XS용 K3월렛 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 가죽케이스, score: 1.688775
name: ISK 에디터 아이폰 5 / 아이폰 5S / 아이폰 SE용 크리스탈 필름, category: 디지털/가전 휴대폰액세서리 휴대폰보호필름 액정보호필름, score: 1.6880071

CASE B :
name: 나하로 아이폰 11용 리얼하이브리드 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.7272676
name: 뷰씨 아이폰XR 레인보우 글라스범퍼 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.7031344
name: 하미 패치웍스 아이폰 13 MINI용 레벨 ITG Plus 휴대폰 케이스, category: 디지털/가전 휴대폰액세서리 휴대폰케이스 기타케이스, score: 1.6911461

 

검색어 : 루이비통 여성 숄더백
CASE A :
name: 구찌 GG 마몽 미니 시퀸 숄더백 446744 9SYWP 1000, category: 패션잡화 여성가방 숄더백, score: 1.6179286
name: 구찌 1955 홀스빗 스몰 탑 핸들 백 621220 0YK0G 1000, category: 패션잡화 여성가방 토트백, score: 1.6179286
name: 구찌 주미 미디엄 탑 핸들 백 564714 1B90X 1000, category: 패션잡화 여성가방 토트백, score: 1.6179286

CASE B :
name: 프라다 파니에 사피아노 스몰백 17 2ERX F0LJ4 1BA217, category: 패션잡화 여성가방 토트백, score: 1.5838006
name: 루이 비통 팜 스프링스 미니 MY LV WORLD TOUR, category: 패션잡화 여성가방 백팩, score: 1.5672462
name: 루이 비통 LV ME 팔찌 알파벳 Y M67182, category: 패션잡화 주얼리 팔찌 패션팔찌, score: 1.5280777

 

검색어 : 프라다 숄더백
CASE A :
name: 해외프라다 에티켓 로고 스터드 레더 숄더백 PEO 1BD082 F0SCC, category: 패션잡화 여성가방 숄더백, score: 1.568241
name: 구찌 GG 마몽 미니 시퀸 숄더백 446744 9SYWP 1000, category: 패션잡화 여성가방 숄더백, score: 1.4379253
name: 구찌 1955 홀스빗 스몰 탑 핸들 백 621220 0YK0G 1000, category: 패션잡화 여성가방 토트백, score: 1.4379253

CASE B :
name: 해외프라다 에티켓 로고 스터드 레더 숄더백 PEO 1BD082 F0SCC, category: 패션잡화 여성가방 숄더백, score: 1.5346006
name: 해외아크네스튜디오 로고 오버핏 티셔츠 BL0198, category: 패션의류 남성의류 티셔츠, score: 1.4877313
name: 루이 비통 삭 쾨르 M58738, category: 패션잡화 여성가방 숄더백, score: 1.4624231


get_data.py

# -*- coding: utf-8 -*-
import time

from elasticsearch import Elasticsearch

import tensorflow_hub as hub
# Imported only for its side effect: it registers the text ops
# that the multilingual Universal Sentence Encoder models require.
import tensorflow_text  # noqa: F401


##### SEARCHING #####

def run_query_loop():
    while True:
        try:
            handle_query()
        except KeyboardInterrupt:
            return

def handle_query():
    query = input("Enter query: ")

    embedding_start = time.time()
    query_vector_a = embed_text_a([query])[0]
    query_vector_b = embed_text_b([query])[0]
    embedding_time = time.time() - embedding_start

    def build_script_query(query_vector):
        # Score every document by cosine similarity to the query vector.
        # Adding 1.0 keeps the score non-negative, as script_score requires.
        return {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.query_vector, doc['name_vector']) + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }

    script_query_a = build_script_query(query_vector_a)
    script_query_b = build_script_query(query_vector_b)

    search_start = time.time()
    response_a = client.search(
        index=INDEX_NAME_A,
        body={
            "size": SEARCH_SIZE,
            "query": script_query_a,
            "_source": {"includes": ["name", "category"]}
        }
    )

    response_b = client.search(
        index=INDEX_NAME_B,
        body={
            "size": SEARCH_SIZE,
            "query": script_query_b,
            "_source": {"includes": ["name", "category"]}
        }
    )
    search_time = time.time() - search_start


    # "검색어" means "search query"; the label is kept so the output
    # matches the results listed in this post.
    print("검색어 :", query)
    print()
    print("CASE A : ")
    for hit in response_a["hits"]["hits"]:
        source = hit["_source"]
        print(f"name: {source['name']}, category: {source['category']}, score: {hit['_score']}")
    print()
    print("CASE B : ")
    for hit in response_b["hits"]["hits"]:
        source = hit["_source"]
        print(f"name: {source['name']}, category: {source['category']}, score: {hit['_score']}")

##### EMBEDDING #####

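def embed_text_a(texts):
    # Model A: universal-sentence-encoder-multilingual-large
    vectors = embed_a(texts)
    return [vector.numpy().tolist() for vector in vectors]

def embed_text_b(texts):
    # Model B: universal-sentence-encoder-multilingual
    vectors = embed_b(texts)
    return [vector.numpy().tolist() for vector in vectors]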

##### MAIN SCRIPT #####

if __name__ == '__main__':
    INDEX_NAME_A = "products_a"
    INDEX_NAME_B = "products_b"

    SEARCH_SIZE = 3

    print("Downloading pre-trained embeddings from tensorflow hub...")
    embed_a = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
    embed_b = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

    client = Elasticsearch(http_auth=('elastic', 'dlengus'))

    run_query_loop()
    print("Done.")
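
The `+ 1.0` in the script source explains why the scores listed above fall between 0 and 2: Elasticsearch does not allow negative script scores, so the cosine similarity (range -1 to 1) is shifted up by one. The small sketch below reproduces that arithmetic in plain Python. The function names are illustrative, and the vectors are toy 3-dimensional values rather than real 512-dimensional embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def script_score(query_vector, doc_vector):
    """Mirror of the script: cosineSimilarity(...) + 1.0."""
    return cosine_similarity(query_vector, doc_vector) + 1.0


print(script_score([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))   # same direction -> 2.0
print(script_score([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))   # orthogonal -> 1.0
print(script_score([1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]))  # opposite -> 0.0
```

Read this way, a reported score of 1.8579223 corresponds to a cosine similarity of about 0.858.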
 
 