[Aqqle] analyzer TEST - doo-nori-posfilter

아빠는 개발자


father6019 2025. 2. 2. 19:47


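doo-nori-posfilter is supposed to remove the parts of speech listed below, but the result looks odd: tokens tagged NR and VV survive the filter. Note that the key in the definition below is spelled "stoptaags", while the option that nori_part_of_speech reads is "stoptags". If the index settings carry the same spelling, the filter is probably falling back to its default stop tags, which do not include NR or VV.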

"doo-nori-posfilter": {
  "type": "nori_part_of_speech",
  "stoptaags": [
    "E",
    "IC",
    "J",
    "MAG",
    "MM",
    "NA",
    "NR",
    "SC",
    "SE",
    "SF",
    "SH",
    "SL",
    "SN",
    "SP",
    "SSC",
    "SSO",
    "SY",
    "UNA",
    "UNKNOWN",
    "VA",
    "VCN",
    "VCP",
    "VSV",
    "VV",
    "VX",
    "XPN",
    "XR",
    "XSA",
    "XSN",
    "XSV"
  ]
}
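One way to catch a misspelled option before creating the index is to compare the keys of the filter definition with the keys the filter is expected to accept. The following is a minimal sketch, assuming that `type` and `stoptags` are the only keys this filter needs; `EXPECTED_KEYS` and `unexpected_keys` are hypothetical helper names, and the tag list is shortened for brevity.

```python
# Keys assumed to be valid for a nori_part_of_speech filter definition.
EXPECTED_KEYS = {"type", "stoptags"}


def unexpected_keys(filter_definition: dict) -> list[str]:
    """Return the keys in the definition that are not in EXPECTED_KEYS."""
    return sorted(set(filter_definition) - EXPECTED_KEYS)


# The definition as written in this post (tag list shortened).
doo_nori_posfilter = {
    "type": "nori_part_of_speech",
    "stoptaags": ["E", "IC", "J", "NR", "VV"],
}

problems = unexpected_keys(doo_nori_posfilter)
print(problems)  # any key listed here deserves a second look
```

A check like this could run wherever the index settings are generated, so that a typo is reported instead of being silently ignored.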

Test

"text": "여섯아이가 모두의 마블을 한다."
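The post does not show the full request, so the following is only a sketch of how an `_analyze` call with `explain` enabled might be assembled. The index name `my-index` is hypothetical, and the sketch assumes the filter is defined in that index's settings and is combined directly with `nori_tokenizer`, as the result below suggests. The code only builds the path and body; it does not contact a cluster.

```python
import json

# Hypothetical index name; the filter is assumed to live in this index's settings.
INDEX_NAME = "my-index"


def build_analyze_request(text: str) -> tuple[str, dict]:
    """Build the path and JSON body for an _analyze call with explain enabled."""
    path = f"/{INDEX_NAME}/_analyze"
    body = {
        "tokenizer": "nori_tokenizer",
        "filter": ["doo-nori-posfilter"],
        "text": text,
        "explain": True,  # return per-token attributes such as leftPOS
    }
    return path, body


path, body = build_analyze_request("여섯아이가 모두의 마블을 한다.")
print("GET", path)
print(json.dumps(body, ensure_ascii=False, indent=2))
```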

Result

explain

{
  "detail": {
    "custom_analyzer": true,
    "charfilters": [],
    "tokenizer": {
      "name": "nori_tokenizer",
      "tokens": [
        {
          "token": "여섯",
          "start_offset": 0,
          "end_offset": 2,
          "type": "word",
          "position": 0,
          "bytes": "[ec 97 ac ec 84 af]",
          "leftPOS": "NR(Numeral)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "NR(Numeral)",
          "termFrequency": 1
        },
        {
          "token": "아이",
          "start_offset": 2,
          "end_offset": 4,
          "type": "word",
          "position": 1,
          "bytes": "[ec 95 84 ec 9d b4]",
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "NNG(General Noun)",
          "termFrequency": 1
        },
        {
          "token": "가",
          "start_offset": 4,
          "end_offset": 5,
          "type": "word",
          "position": 2,
          "bytes": "[ea b0 80]",
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "J(Ending Particle)",
          "termFrequency": 1
        },
        {
          "token": "모두",
          "start_offset": 6,
          "end_offset": 8,
          "type": "word",
          "position": 3,
          "bytes": "[eb aa a8 eb 91 90]",
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "NNG(General Noun)",
          "termFrequency": 1
        },
        {
          "token": "의",
          "start_offset": 8,
          "end_offset": 9,
          "type": "word",
          "position": 4,
          "bytes": "[ec 9d 98]",
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "J(Ending Particle)",
          "termFrequency": 1
        },
        {
          "token": "마블",
          "start_offset": 10,
          "end_offset": 12,
          "type": "word",
          "position": 5,
          "bytes": "[eb a7 88 eb b8 94]",
          "leftPOS": "NNG(General Noun)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "NNG(General Noun)",
          "termFrequency": 1
        },
        {
          "token": "을",
          "start_offset": 12,
          "end_offset": 13,
          "type": "word",
          "position": 6,
          "bytes": "[ec 9d 84]",
          "leftPOS": "J(Ending Particle)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "J(Ending Particle)",
          "termFrequency": 1
        },
        {
          "token": "하",
          "start_offset": 14,
          "end_offset": 16,
          "type": "word",
          "position": 7,
          "bytes": "[ed 95 98]",
          "leftPOS": "VV(Verb)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "VV(Verb)",
          "termFrequency": 1
        },
        {
          "token": "ᆫ다",
          "start_offset": 14,
          "end_offset": 16,
          "type": "word",
          "position": 8,
          "bytes": "[e1 86 ab eb 8b a4]",
          "leftPOS": "E(Verbal endings)",
          "morphemes": null,
          "posType": "MORPHEME",
          "positionLength": 1,
          "reading": null,
          "rightPOS": "E(Verbal endings)",
          "termFrequency": 1
        }
      ]
    },
    "tokenfilters": [
      {
        "name": "doo-nori-posfilter",
        "tokens": [
          {
            "token": "여섯",
            "start_offset": 0,
            "end_offset": 2,
            "type": "word",
            "position": 0,
            "bytes": "[ec 97 ac ec 84 af]",
            "leftPOS": "NR(Numeral)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "NR(Numeral)",
            "termFrequency": 1
          },
          {
            "token": "아이",
            "start_offset": 2,
            "end_offset": 4,
            "type": "word",
            "position": 1,
            "bytes": "[ec 95 84 ec 9d b4]",
            "leftPOS": "NNG(General Noun)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "NNG(General Noun)",
            "termFrequency": 1
          },
          {
            "token": "모두",
            "start_offset": 6,
            "end_offset": 8,
            "type": "word",
            "position": 3,
            "bytes": "[eb aa a8 eb 91 90]",
            "leftPOS": "NNG(General Noun)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "NNG(General Noun)",
            "termFrequency": 1
          },
          {
            "token": "마블",
            "start_offset": 10,
            "end_offset": 12,
            "type": "word",
            "position": 5,
            "bytes": "[eb a7 88 eb b8 94]",
            "leftPOS": "NNG(General Noun)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "NNG(General Noun)",
            "termFrequency": 1
          },
          {
            "token": "하",
            "start_offset": 14,
            "end_offset": 16,
            "type": "word",
            "position": 7,
            "bytes": "[ed 95 98]",
            "leftPOS": "VV(Verb)",
            "morphemes": null,
            "posType": "MORPHEME",
            "positionLength": 1,
            "reading": null,
            "rightPOS": "VV(Verb)",
            "termFrequency": 1
          }
        ]
      }
    ]
  }
}
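To see at a glance what the filter did, the tokenizer output can be compared with the filter output by position. The sketch below hard-codes the tokens, positions, and POS tags from the result above; the variable names are illustrative only.

```python
# (token, position, leftPOS tag) copied from the tokenizer section of the result.
tokenizer_tokens = [
    ("여섯", 0, "NR"), ("아이", 1, "NNG"), ("가", 2, "J"),
    ("모두", 3, "NNG"), ("의", 4, "J"), ("마블", 5, "NNG"),
    ("을", 6, "J"), ("하", 7, "VV"), ("ᆫ다", 8, "E"),
]

# Positions still present after doo-nori-posfilter in the result.
filtered_positions = {0, 1, 3, 5, 7}

removed = [(tok, tag) for tok, pos, tag in tokenizer_tokens if pos not in filtered_positions]
kept = [(tok, tag) for tok, pos, tag in tokenizer_tokens if pos in filtered_positions]

print("removed:", removed)
print("kept:   ", kept)
```

Only J and E tokens were removed, while 여섯 (NR) and 하 (VV) were kept even though both tags appear in the list above.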

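Field descriptions (using the 마블 token)

token	The surface string of the analyzed morpheme (word). Example: "마블"
start_offset	Offset in the original text where the token starts. Example: 10
end_offset	Offset in the original text where the token ends (exclusive). Example: 12
type	Token type. word usually denotes an ordinary word.
position	Position of the token in the token stream, starting at 0. Example: 5
bytes	UTF-8 encoding of the token. Example: [eb a7 88 eb b8 94] (the UTF-8 bytes of 마블)
leftPOS	Part-of-speech (POS) tag on the left side of the token. Example: NNG(General Noun)
morphemes	Constituent morphemes when the token is a compound; null for a single morpheme.
posType	Morpheme type. MORPHEME usually denotes a plain single-morpheme analysis result.
positionLength	Number of positions the token spans. Usually 1.
reading	Reading (pronunciation), if available. Usually null for Hangul.
rightPOS	POS tag on the right side of the token. Example: NNG(General Noun)
termFrequency	Term frequency value attached to the token, 1 by default. It is not a count of how often the word occurs in the text. Example: 1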
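The offset and bytes fields can be checked directly against the test sentence. This sketch slices the original text with the offsets reported for 마블 and re-encodes the token as UTF-8. Python indexes strings by code point, which matches the reported offsets here because every character in the sentence is in the Basic Multilingual Plane.

```python
text = "여섯아이가 모두의 마블을 한다."

# Values copied from the result for the 마블 token.
token = {
    "token": "마블",
    "start_offset": 10,
    "end_offset": 12,
    "bytes": "[eb a7 88 eb b8 94]",
}

# end_offset is exclusive, so a plain slice recovers the surface form.
surface = text[token["start_offset"]:token["end_offset"]]

# Format the UTF-8 bytes the same way the explain output does.
utf8_hex = "[" + " ".join(f"{b:02x}" for b in token["token"].encode("utf-8")) + "]"

print(surface, utf8_hex)
```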

Analyzed token information

Token	Start Offset	End Offset	POS (leftPOS)	POS description
여섯	0	2	NR	Numeral
아이	2	4	NNG	General noun
가	4	5	J	Particle (Ending Particle)
모두	6	8	NNG	General noun
의	8	9	J	Particle (Ending Particle)
마블	10	12	NNG	General noun
을	12	13	J	Particle (Ending Particle)
하	14	16	VV	Verb
ᆫ다	14	16	E	Verbal ending

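Token details

Position: the token's place in the token stream. Removed tokens leave gaps, which is why the filter output above has positions 0, 1, 3, 5, 7.
POS tags:
- NR (Numeral). Example: "여섯"
- NNG (General Noun). Examples: "아이", "모두", "마블"
- J (Ending Particle), a particle. Examples: "가", "의", "을"
- VV (Verb). Example: "하"
- E (Verbal endings). Example: "ᆫ다"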
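To compare the intended behaviour with the observed one, the filter can be simulated on the tags listed above. The sketch below applies two tag sets to the tokenizer output: the list configured in this post, and the default stop tags of nori_part_of_speech as I understand them from the Elasticsearch documentation (an assumption worth verifying for your version). `apply_stoptags` is a hypothetical helper, not part of any library.

```python
# (token, leftPOS tag) pairs from the tokenizer output.
tokens = [
    ("여섯", "NR"), ("아이", "NNG"), ("가", "J"), ("모두", "NNG"), ("의", "J"),
    ("마블", "NNG"), ("을", "J"), ("하", "VV"), ("ᆫ다", "E"),
]

# Tags listed in the doo-nori-posfilter definition in this post.
CONFIGURED_STOPTAGS = {
    "E", "IC", "J", "MAG", "MM", "NA", "NR", "SC", "SE", "SF", "SH", "SL",
    "SN", "SP", "SSC", "SSO", "SY", "UNA", "UNKNOWN", "VA", "VCN", "VCP",
    "VSV", "VV", "VX", "XPN", "XR", "XSA", "XSN", "XSV",
}

# Assumed defaults of nori_part_of_speech; check the docs for your version.
DEFAULT_STOPTAGS = {
    "E", "IC", "J", "MAG", "MAJ", "MM", "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV", "UNA", "NA", "VSV",
}


def apply_stoptags(stoptags: set[str]) -> list[str]:
    """Keep only the tokens whose POS tag is not in stoptags."""
    return [tok for tok, tag in tokens if tag not in stoptags]


expected = apply_stoptags(CONFIGURED_STOPTAGS)
with_defaults = apply_stoptags(DEFAULT_STOPTAGS)
print("configured list:", expected)
print("default list:   ", with_defaults)
```

Under these assumptions the configured list would leave only 아이, 모두, and 마블, while the default list reproduces the output shown in the result, which is consistent with the configured tags not being applied.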

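Usage tips

Knowing the POS code assigned to each token helps when tuning a user dictionary or the stoptags list for more precise text analysis.
Values such as termFrequency and positionLength show how each token will be indexed, which is useful when reasoning about scoring and phrase matching.
A morphemes value of null means the token is a single morpheme with no constituent morphemes to report.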
