Portfolio

Hangul and English usage frequency analysis

Eng.

Ko.

[Information Design] Hangul jamo frequency and keyboard heatmap

https://story.pxd.co.kr/958

Text character frequency analysis: special symbol usage frequency

symbol.

Age distribution of QWERTY keyboard users

Corpus dataset analysis

Kakao dataset (women in their 20s)

Usage frequency analysis by symbol

A list of one-on-one conversations between women in their 20s (about 73,000 exchanges).


req,res
너 좋아하는 차 종류 있어?,무슨 차? 자동차? 마시는 차?
ㅋㅋ 마시는 차 말한 거야!,"아하 나 둥글레, 옥수수, 보리차 좋아해"
완전 곡물류 좋아하네 ㅋㅋ,야쓰 끓이기 귀찮아서 냉침해 먹어
그럼 오래 걸리지 않아?,끓이는 것보다는 훨씬 오래 걸리지 ㅠ
근데 냉침 하는 것도 귀찮겠다 ㅜㅠ,응! 그래서 매일은 안 먹고 가끔 마셔
그럼 엄청 귀찮지는 않겠네?,그치 매일 마시면 매일 해야 되잖아
음 생각해 보니깐 그렇긴 하네,언니는 무슨 차 좋아하는데?
나는 밀크티도 좋아하고 루이보스도 좋아해,오 고급져 나 페퍼민트도 좋아한다!
너 구해줘 홈즈 봐?,매번 보는 건 아니고 가끔! 와이?
나 지금 보는 중인데 서울은 진짜 비싸다 싶어서 ㅋㅋ,그치 ㅠ 근데 나오는 사람들 다 부자인가 봐
그니깐 웬만하면 다 돈 있더라;,내말이a 경제적으로 힘들다더니 다 거짓말
인정 나만 힘든 거였어,아냐 나도 힘들어 나도 추가해줘
진짜 어디 무인도 들어가서 살아야 하나 봐,힉 땅값은 싸겠지만 무서울 듯
정부는 대체 무엇을 하는 겁니까!,맞아요 부동산 정책 과연 실행할 수 있는 겁니까?
열심히 벌어서 내 집 마련을 꿈꿉니다 ㅜ,저도 그렇습니다 내 집 마련 완전 소망…
...

Code for analyzing symbol frequency


import re
from collections import Counter

import pandas as pd

df = pd.read_csv('KakaoData.csv')

l1 = []  # every run of special symbols found, duplicates included

for column in ('req', 'res'):
    for text in df[column]:
        # \W+ would also match whitespace, so strip whitespace first
        cleaned_text = re.sub(r'\s+', '', str(text))
        cleaned_text = cleaned_text.replace('/', '')
        l1.extend(re.findall(r'\W+', cleaned_text))

# Count how often each run of symbols occurs
d1 = Counter(l1)

# Print from most to least frequent
for key, value in d1.most_common():
    print(key, ':', value)

The results are as follows.


? : 39122
. : 16841
... : 14201
! : 9180
~ : 3307
, : 3194
...? : 790
… : 742
; : 484
^^ : 373
?! : 362
~! : 219
?~ : 119
...! : 105
' : 101
~? : 96
- : 88
,? : 80
% : 79
!? : 73
: : 72
?... : 66
...^^ : 63
+ : 60
.. : 57
] : 55
?? : 54
~^^ : 49
…? : 44
) : 39
[ : 37
...; : 36
( : 33
^^... : 29
!! : 28
…. : 27
:# : 24
,, : 21
^ : 19
;; : 18
" : 17
> : 15
...[ : 15
= : 14
:) : 13
., : 12
?^^ : 11
!~ : 11
\ : 11
˗ : 11
>< : 10
~~ : 10
...( : 10
...^^... : 10
?. : 9
.... : 8
.? : 8
꙼ : 8
^^, : 8
?\ : 7
^^? : 7
>? : 7
...?! : 6
,^^ : 6
& : 6
,... : 6
?; : 6
?, : 5
..? : 5
^^; : 5
...^^; : 5
,. : 5
^^~ : 5
,! : 4
,,^^ : 4
^-^ : 4
...~ : 4
^^! : 4
^^… : 4
̆ : 4
-! : 3
...... : 3
?!~ : 3
~!? : 3
-> : 3
.,. : 3
...?? : 3
< : 3
?!?! : 3
.^^ : 3
‘ : 3
’ : 3
,...? : 2
~?! : 2
?!... : 2
...\ : 2
?" : 2
...^ : 2
~| : 2
…^^ : 2
~!~!~! : 2
~~^^ : 2
?.? : 2
.! : 2
;( : 2
(?) : 2
…; : 2
...;; : 2
,^^, : 2
!?! : 2
~~! : 2
...!? : 2
?: : 2
...?^^ : 2
?> : 2
!' : 2
;? : 2
’? : 2
?[ : 2
!’ : 2
!, : 1
-. : 1
^^& : 1
!... : 1
^... : 1
,.,! : 1
,,, : 1
•••" : 1
!" : 1
-" : 1
} : 1
.;; : 1
~~~~~~~~~~ : 1
...,? : 1
~[ : 1
!~\ : 1
.?! : 1
!?!?! : 1
"" : 1
...!! : 1
,,? : 1
️ : 1
]‍[ : 1
]️ : 1
$ : 1
“ : 1
” : 1
^^...! : 1
……... : 1
!?!? : 1
~,^^,~? : 1
~?^^>< : 1
-.- : 1
` : 1
...~^^... : 1
^^...; : 1
^^...~ : 1
~^^... : 1
^^~! : 1
<~ : 1
,.; : 1
...' : 1
…! : 1
...~^^ : 1
…!? : 1
~, : 1
~… : 1
~; : 1
;, : 1
?… : 1
:? : 1
!!! : 1
...&^^& : 1
?.. : 1
...) : 1
]! : 1
?' : 1
\! : 1
,., : 1
>>? : 1
][ : 1
]? : 1
˗,? : 1
...^^:; : 1
....? : 1
??...; : 1
.' : 1
̑ : 1
^^> : 1
)[ : 1
^& : 1
.-; : 1
!) : 1
....[ : 1
~!! : 1
@@ : 1
.!? : 1
??;; : 1
?( : 1
,,?^^ : 1
,?! : 1
>>%^ : 1
^;^ : 1
….^^ : 1
%^^ : 1
%? : 1
...?☆ : 1
!. : 1
~' : 1
...?; : 1
~!@ : 1
,‽? : 1
^^( : 1
….? : 1
?- : 1
.; : 1
？ : 1
^,& : 1
...^^;; : 1
~. : 1
?!? : 1
° : 1
?!?!?! : 1
"? : 1
…... : 1

'?', '.', and '...' are the most frequent (39122, 16841, and 14201 occurrences respectively), followed by '!' with 9180.
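The tallies above count each run of consecutive symbols (for example '...?') as one token. The sketch below, which assumes the same two regular expressions as the code above, contrasts that with counting individual characters. The helper name `symbol_runs` and the last sample string are made up for illustration; the other strings come from the sample rows.

```python
import re
from collections import Counter


def symbol_runs(text):
    """Return the runs of non-word characters in text, ignoring whitespace and '/'."""
    cleaned = re.sub(r'\s+', '', str(text)).replace('/', '')
    return re.findall(r'\W+', cleaned)


samples = [
    "ㅋㅋ 마시는 차 말한 거야!",
    "아하 나 둥글레, 옥수수, 보리차 좋아해",
    "매번 보는 건 아니고 가끔! 와이?",
    "진짜...?!",  # hypothetical string, not taken from the dataset
]

runs = [run for text in samples for run in symbol_runs(text)]

run_counts = Counter(runs)                               # '...?!' stays one token
char_counts = Counter(ch for run in runs for ch in run)  # each character counted on its own

print(run_counts.most_common())
print(char_counts.most_common())
```

Counting by character folds '...?!' into the totals for '.', '?', and '!', which may be closer to what a per-key layout decision needs.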

Plotting this with the matplotlib library gives the following.


import re
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('KakaoData.csv')

l1 = []  # every run of special symbols found, duplicates included

for column in ('req', 'res'):
    for text in df[column]:
        # \W+ would also match whitespace, so strip whitespace first
        cleaned_text = re.sub(r'\s+', '', str(text))
        cleaned_text = cleaned_text.replace('/', '')
        l1.extend(re.findall(r'\W+', cleaned_text))

# Count each run and build a table sorted from most to least frequent
d1 = Counter(l1)
df_result = pd.DataFrame(d1.most_common(), columns=['Special Character', 'Frequency'])

# Draw the bar chart
plt.figure(figsize=(48, 24))
plt.bar(df_result['Special Character'], df_result['Frequency'])
plt.xlabel('Special Character')
plt.ylabel('Frequency')
plt.title('Frequency of Special Characters')
plt.xticks(rotation=90)  # rotate the x-axis labels 90 degrees so they are easier to read

plt.show()

Since that chart is hard to read, let's plot only the special symbols that were used 10 or more times.


import re
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('KakaoData.csv')

l1 = []  # every run of special symbols found, duplicates included

for column in ('req', 'res'):
    for text in df[column]:
        # \W+ would also match whitespace, so strip whitespace first
        cleaned_text = re.sub(r'\s+', '', str(text))
        cleaned_text = cleaned_text.replace('/', '')
        l1.extend(re.findall(r'\W+', cleaned_text))

d1 = Counter(l1)

# Keep only the runs that occur 10 or more times, most frequent first
filtered_items = [(key, value) for key, value in d1.most_common() if value >= 10]
df_result = pd.DataFrame(filtered_items, columns=['Special Character', 'Frequency'])

# Draw the bar chart
plt.figure(figsize=(48, 24))
plt.bar(df_result['Special Character'], df_result['Frequency'])
plt.xlabel('Special Character')
plt.ylabel('Frequency')
plt.title('Frequency of Special Characters')
plt.xticks(rotation=90)  # rotate the x-axis labels 90 degrees so they are easier to read

plt.show()

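Correlation analysis

The analysis above also shows how the special symbols relate to one another.

For example, '.' is very often used together with '?', and it also frequently appears together with '!'.

A keyboard layout could be designed based on these correlations.

The following code extracts the distinct special symbols that were used.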


import re

import pandas as pd

df = pd.read_csv('KakaoData.csv')

l1 = []  # every run of special symbols found, duplicates included

# Analyze the req column, then the res column
for column in ('req', 'res'):
    for text in df[column]:
        # \W+ would also match whitespace, so strip whitespace first
        cleaned_text = re.sub(r'\s+', '', str(text))
        cleaned_text = cleaned_text.replace('/', '')
        l1.extend(re.findall(r'\W+', cleaned_text))

# Collect each distinct character once, in order of first appearance
l2 = []
for run in l1:
    for ch in run:
        if ch not in l2:
            l2.append(ch)
print(l2)

The symbols used are listed below (45 distinct symbols in total).


['?', '!', ';', '~', '.', ',', '%', '+', '-', '^', "'", '>', '<', '(', ')', '&', '\\', ']', '…', ':', '"', '•', '[', '}', '|', '#', '꙼', '̈', '️', '\u200d', '=', '$', '“', '”', '`', '˗', '‘', '’', '̆', '̑', '@', '☆', '‽', '？', '°']
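Several entries in this list are invisible or easy to confuse on screen. As a small sketch, the standard `unicodedata` module can report the code point and official name of a symbol. The `describe` helper is a hypothetical name, and the four characters below are written as escapes for entries that appear in the list.

```python
import unicodedata


def describe(ch):
    """Return the code point (as U+XXXX) and the Unicode name of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")


# '…', '‽', '？' and the zero-width joiner, written as escapes
for ch in ['\u2026', '\u203d', '\uff1f', '\u200d']:
    code_point, name = describe(ch)
    print(code_point, name)
```

Listing the names this way makes it easier to decide which entries are real punctuation and which are combining marks or emoji residue that a keyboard layout can ignore.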

Based on the vector above, I built a 45×45 matrix to represent the correlations.

For example, whenever the sequence ?! is used, the code adds 1 to both [0][1] and [1][0].
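As a minimal sketch of this update rule, the snippet below builds the symmetric count matrix for a three-symbol vocabulary. The vocabulary, the example runs, and the choice to give every pair of distinct symbols within a run +1 exactly once (so '...?!' touches three pairs) are assumptions for illustration. Plain lists are used instead of NumPy to keep it dependency-free.

```python
from itertools import combinations

symbols = ['?', '!', '.']  # hypothetical vocabulary; the real one has 45 entries
index = {symbol: i for i, symbol in enumerate(symbols)}

runs = ['?!', '...?', '...?!', '?']  # made-up runs of consecutive symbols

size = len(symbols)
matrix = [[0] * size for _ in range(size)]

for run in runs:
    # Distinct symbols of the run, in order of first appearance
    distinct = list(dict.fromkeys(run))
    # Every unordered pair of distinct symbols gets +1 in both mirrored cells
    for a, b in combinations(distinct, 2):
        i, j = index[a], index[b]
        matrix[i][j] += 1
        matrix[j][i] += 1

for row in matrix:
    print(row)
```

The diagonal stays at zero because a symbol is never paired with itself, and a run with a single distinct symbol such as '?' contributes nothing.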

The code is as follows.


import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.read_csv('KakaoData.csv')

l1 = []  # every run of special symbols found, duplicates included

# Analyze the req column, then the res column
for column in ('req', 'res'):
    for text in df[column]:
        # \W+ would also match whitespace, so strip whitespace first
        cleaned_text = re.sub(r'\s+', '', str(text))
        cleaned_text = cleaned_text.replace('/', '')
        l1.extend(re.findall(r'\W+', cleaned_text))

l2 = []  # each distinct special symbol once, in order of first appearance
for run in l1:
    for ch in run:
        if ch not in l2:
            l2.append(ch)

# Correlation analysis: square matrix whose rows and columns are both l2
matrix = np.zeros((len(l2), len(l2)))

for run in l1:
    # Indices in l2 of the distinct symbols in this run, e.g. '...?' -> [4, 0]
    l3 = [l2.index(ch) for ch in dict.fromkeys(run)]
    # Add 1 to the matrix for the pairs in l3.
    # Note: with exactly two distinct symbols the pair gets +1 once. With three or
    # more, the outer loop repeats the pass n*(n-1)/2 times and range(k + 1, n - k)
    # stops early, so some pairs are counted several times and others not at all.
    n = len(l3)
    if n > 1:
        for _ in range(n * (n - 1) // 2):
            for k in range(n - 1):
                for e in range(k + 1, n - k):
                    matrix[l3[k]][l3[e]] += 1
                    matrix[l3[e]][l3[k]] += 1

np.set_printoptions(threshold=np.inf)  # print the full matrix without truncation

plt.figure(figsize=(12, 12))
plt.imshow(matrix, cmap='coolwarm', interpolation='nearest')
plt.colorbar()  # add a colour bar
plt.title('Correlation Matrix Heatmap')
plt.xticks(np.arange(len(matrix)), labels=l2)
plt.yticks(np.arange(len(matrix)), labels=l2)
plt.show()

print(matrix)

correlation_list = []
for i in range(len(matrix)):
    for j in range(i + 1, len(matrix)):  # upper triangle only, so no pair is listed twice
        if matrix[i][j] > 0:
            correlation_list.append(((l2[i], l2[j]), matrix[i][j]))

# Sort the list by correlation in descending order
correlation_list.sort(key=lambda x: x[1], reverse=True)

# Print the sorted list
for (item1, item2), correlation in correlation_list:
    print(f"{item1} and {item2} : {correlation}")

Drawing the correlations as a heatmap gives the following result.

As a matrix, it looks like this.


[[  0. 459.  12. 255. 939. 101.   1.   0.   1.  28.   1.  20.   0.   7.
    0.   0.   7.   1.  51.   3.   3.   0.   2.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   3.   0.   2.   0.   0.   0.   0.
    0.   0.   0.]
 [459.   0.   0. 256. 140.  11.   0.   0.   3.  10.   2.   0.   0.   0.
    1.   0.   4.   1.   4.   0.   1.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   2.   0.   0.   0.   0.
    0.   0.   0.]
 [ 12.   0.   0.   1.  70.   4.   0.   0.   0.   9.   0.   0.   0.   2.
    0.   0.   0.   0.   2.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [255. 256.   1.   0.  14.   7.   0.   0.   0.  81.   1.  10.  11.   0.
    0.   0.   0.   0.   1.   0.   0.   0.   1.   0.   2.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   3.   0.
    0.   0.   0.]
 [939. 140.  70.  14.   0.  42.   0.   0.   5. 156.   2.   0.   0.  10.
    1.   3.   2.   0.  35.   6.   0.   0.  16.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   3.
    0.   0.   0.]
 [101.  11.   4.   7.  42.   0.   0.   0.   0.  32.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   3.   0.   0.   0.   0.   0.   0.
    3.   0.   0.]
 [  1.   0.   0.   0.   0.   0.   0.   0.   0.   1.   0.   3.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  1.   3.   0.   0.   5.   0.   0.   0.   0.   4.   0.   3.   0.   0.
    0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [ 28.  10.   9.  81. 156.  32.   1.   0.   4.   0.   0.   4.   0.   1.
    0.   5.   0.   0.   9.   6.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  1.   2.   0.   1.   2.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [ 20.   0.   0.  10.   0.   0.   3.   0.   3.   4.   0.   0.  10.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.  11.   0.   0.   0.   0.   0.   0.   0.  10.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  7.   0.   2.   0.  10.   0.   0.   0.   0.   1.   0.   0.   0.   0.
    6.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   1.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   6.
    0.   0.   0.   0.   0.  13.   0.   0.   1.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   3.   0.   0.   0.   0.   5.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  7.   4.   0.   0.   2.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  1.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   4.   0.   0.   0.   0.   0.
    1.   3.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [ 51.   4.   2.   1.  35.   0.   0.   0.   0.   9.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  3.   0.   0.   0.   6.   0.   0.   0.   0.   6.   0.   0.   0.   0.
   13.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.  24.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  3.   1.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  2.   0.   0.   1.  16.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    1.   0.   0.   4.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   2.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.  24.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   8.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   8.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   4.   1.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   1.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   3.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  3.   0.   0.   0.   0.   3.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  2.   2.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   4.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   1.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   3.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   3.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   3.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]
 [  0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.   0.
    0.   0.   0.]]

Printed in the form 'symbol1 and symbol2 : number of co-occurrences', the result is as follows.


? and . : 939.0
? and ! : 459.0
! and ~ : 256.0
? and ~ : 255.0
. and ^ : 156.0
! and . : 140.0
? and , : 101.0
~ and ^ : 81.0
; and . : 70.0
? and … : 51.0
. and , : 42.0
. and … : 35.0
, and ^ : 32.0
? and ^ : 28.0
: and # : 24.0
? and > : 20.0
. and [ : 16.0
~ and . : 14.0
) and : : 13.0
? and ; : 12.0
! and , : 11.0
~ and < : 11.0
! and ^ : 10.0
~ and > : 10.0
. and ( : 10.0
> and < : 10.0
; and ^ : 9.0
^ and … : 9.0
꙼ and ̈ : 8.0
? and ( : 7.0
? and \ : 7.0
~ and , : 7.0
. and : : 6.0
^ and : : 6.0
( and ) : 6.0
. and - : 5.0
^ and & : 5.0
! and \ : 4.0
! and … : 4.0
; and , : 4.0
- and ^ : 4.0
^ and > : 4.0
] and [ : 4.0
̈ and ̆ : 4.0
? and : : 3.0
? and " : 3.0
? and ˗ : 3.0
! and - : 3.0
~ and @ : 3.0
. and & : 3.0
. and ☆ : 3.0
, and ˗ : 3.0
, and ‽ : 3.0
% and > : 3.0
- and > : 3.0
] and ‍ : 3.0
? and [ : 2.0
? and ’ : 2.0
! and ' : 2.0
! and ’ : 2.0
; and ( : 2.0
; and … : 2.0
~ and | : 2.0
. and ' : 2.0
. and \ : 2.0
? and % : 1.0
? and - : 1.0
? and ' : 1.0
? and ] : 1.0
! and ) : 1.0
! and ] : 1.0
! and " : 1.0
; and ~ : 1.0
~ and ' : 1.0
~ and … : 1.0
~ and [ : 1.0
. and ) : 1.0
% and ^ : 1.0
- and " : 1.0
^ and ( : 1.0
) and [ : 1.0
] and ️ : 1.0
" and • : 1.0
̈ and ̑ : 1.0

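One way such pair counts might feed a layout decision is to look up, for each symbol, the partners it appears with most often, so that those keys can be placed near each other. The sketch below uses only the six largest counts from the listing above; the function name `strongest_partners` and the `top_n` parameter are hypothetical.

```python
from collections import defaultdict

# The six largest pair counts, copied from the listing above
pair_counts = {
    ('?', '.'): 939,
    ('?', '!'): 459,
    ('!', '~'): 256,
    ('?', '~'): 255,
    ('.', '^'): 156,
    ('!', '.'): 140,
}


def strongest_partners(pair_counts, top_n=2):
    """Map each symbol to its top_n partners, ordered by descending pair count."""
    found = defaultdict(list)
    for (a, b), count in pair_counts.items():
        found[a].append((b, count))
        found[b].append((a, count))
    return {
        symbol: [partner for partner, _ in
                 sorted(partners, key=lambda item: item[1], reverse=True)[:top_n]]
        for symbol, partners in found.items()
    }


partners = strongest_partners(pair_counts)
for symbol, nearest in partners.items():
    print(symbol, '->', nearest)
```

This only ranks neighbours; turning the ranking into actual key positions would need further constraints that the analysis here does not cover.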
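Points to consider

1. Conversation style differs by age group, so data for each age group needs to be collected. The analysis should then be weighted by the share of each age group.

e.g. If QWERTY keyboard users are 10% teens, 30% in their 20s, and 60% in their 30s, take 10% of the data from teens, .. and compare.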
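One way to read this weighting is as a share-weighted average of each age group's relative symbol frequencies. The sketch below uses the shares from the example; the per-group frequencies are placeholders, not measured data, and `weighted_frequency` is a hypothetical helper name.

```python
def weighted_frequency(group_share, group_freq):
    """Combine per-group relative frequencies, weighting each group by its share."""
    combined = {}
    for group, share in group_share.items():
        for symbol, freq in group_freq[group].items():
            combined[symbol] = combined.get(symbol, 0.0) + share * freq
    return combined


# Shares of QWERTY keyboard users from the example
group_share = {'10s': 0.10, '20s': 0.30, '30s': 0.60}

# Hypothetical relative frequencies per age group (placeholders, not measured data)
group_freq = {
    '10s': {'?': 0.5, '!': 0.5},
    '20s': {'?': 0.6, '.': 0.4},
    '30s': {'?': 0.2, '.': 0.8},
}

combined = weighted_frequency(group_share, group_freq)
for symbol, freq in sorted(combined.items(), key=lambda item: item[1], reverse=True):
    print(symbol, ':', round(freq, 3))
```

Because the shares sum to 1 and each group's frequencies sum to 1, the combined values also sum to 1 and can be compared directly with a single-group result.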

#02. Usage frequency of each symbol