[NLP] Natural Language Processing Without NLP Libraries

Analysis Overview

  • Analyze text using only regular expressions and the collections.Counter module, without any NLP library

    Configuration

  • Rules: Do NOT use NLP libraries
import re
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
drive.mount('./MyDrive')

DATA_PATH = "/content/MyDrive/My Drive/data/"

def make_path(add_path):
    # Build the full path of a file inside DATA_PATH
    return DATA_PATH + add_path
Mounted at ./MyDrive
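# load data
# The 'UTF8' codec keeps the byte-order mark (\ufeff) at the start of the first line
with open(make_path('data1.csv'), 'r', encoding='UTF8') as f:
    data = f.readlines()  # the with block closes the file automatically
data[0:5]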
['\ufeffStately, plump Buck Mulligan came from the stairhead, bearing a bowl of\n',
 'lather on which a mirror and a razor lay crossed. A yellow\n',
 'dressinggown, ungirdled, was sustained gently behind him on the mild\n',
 'morning air. He held the bowl aloft and intoned:\n',
 '\n']

Part1. Counting / Frequency

word_list = []
for line in data:
    # \w+ matches runs of letters, digits, and underscores
    words = re.findall(r'\w+', line)
    word_list += words
word_list[0:7]
['Stately', 'plump', 'Buck', 'Mulligan', 'came', 'from', 'the']
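
As a quick check of what the `\w+` pattern treats as a word, here is a minimal sketch on a made-up sample string (the string is an assumption for illustration and is not taken from data1.csv). Because `\w` matches letters, digits, and the underscore, apostrophes and hyphens split a token in two.

```python
import re

# Hypothetical sample text, not from the dataset
sample = "Don't re-read chapter_2 in 1922!"

# Each maximal run of word characters becomes one token
tokens = re.findall(r'\w+', sample)
print(tokens)
```

Digits survive at this stage, which is why Part 2 strips them in a separate cleaning step.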

Q1. Frequency of the word “Dedalus”

freq_Dedalus = Counter(word_list)['Dedalus']
freq_Dedalus
174

Q2. Frequency of the word “mounted”

freq_mounted = Counter(word_list)['mounted']
freq_mounted
8

Q3. Frequency of the word “Eunsok”

freq_Eunsok = Counter(word_list)['Eunsok']
freq_Eunsok
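
A small sketch of how `Counter` handles a word that never occurs, using a toy list in place of `word_list` (the list and the name `word_counts` are assumptions for illustration). A missing key yields 0 instead of raising `KeyError`, and building the counter once avoids recounting the whole list for every question.

```python
from collections import Counter

# Hypothetical toy word list standing in for word_list
toy_words = ['Stately', 'plump', 'Buck', 'Mulligan', 'Buck']

# Build the counter once and reuse it for every lookup
word_counts = Counter(toy_words)

print(word_counts['Buck'])    # a word that occurs twice
print(word_counts['Eunsok'])  # a word that never occurs: 0, no KeyError
```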

Part2. Data Cleaning and Frequency

Q5. Data Cleaning

  • 1) Remove all empty lines.
  • 2) Remove all digits/numbers such as 1, 2, 3.
  • 3) Remove all special characters such as ?, !, ”, *, [, ], -, etc., except the period (.).
cleaned = []
for line in data:
    line = re.sub(r"[^a-zA-Z.]+", " ", line)
    cleaned.append(line.strip())
cleaned_txt = ' '.join(cleaned)
cleaned_txt
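
The loop above keeps each empty line as an empty string in `cleaned`. A minimal sketch of a variant that also drops those lines (step 1) is shown here; the helper name `clean_lines` and the sample input are assumptions for illustration.

```python
import re

def clean_lines(raw_lines):
    """Keep only letters and periods, and drop lines that end up empty."""
    result = []
    for line in raw_lines:
        # Replace every run of non-letter, non-period characters with one space
        line = re.sub(r"[^a-zA-Z.]+", " ", line).strip()
        if line:  # skip lines that are empty after cleaning
            result.append(line)
    return result

# Hypothetical input in the same shape as `data`
sample = ['\ufeffStately, plump Buck!\n', '\n', 'Page 12: a razor lay crossed.\n']
print(clean_lines(sample))
```

Dropping empty lines shifts the list indices, so slices such as `with_space[11:20]` would select different lines than the ones printed in this post.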

Q6. Add a single space before every period (.)

with_space = []
for line in cleaned:
    line = line.replace('.', ' .') # add a single space before each period
    with_space.append(line)
print(*with_space[11:20], sep='\n')
Solemnly he came forward and mounted the round gunrest . He faced about
and blessed gravely thrice the tower the surrounding land and the
awaking mountains . Then catching sight of Stephen Dedalus he bent
towards him and made rapid crosses in the air gurgling in his throat
and shaking his head . Stephen Dedalus displeased and sleepy leaned
his arms on the top of the staircase and looked coldly at the shaking
gurgling face that blessed him equine in its length and at the light
untonsured hair grained and hued like pale oak .

Q7. Total number of words

  • after Q6
with_space[0:4]
['Stately plump Buck Mulligan came from the stairhead bearing a bowl of',
 'lather on which a mirror and a razor lay crossed . A yellow',
 'dressinggown ungirdled was sustained gently behind him on the mild',
 'morning air . He held the bowl aloft and intoned']
words_list = []
for line in with_space:
    # Periods are not counted as words (confirmed by the professor); \w+ skips them
    words = re.findall(r'\w+', line)
    words_list += words
words_list[300:306]
['oval', 'jowl', 'recalled', 'a', 'prelate', 'patron']
len(words_list) # total number of words 
2867884

Q8. Top 10 most frequent words in Q6

Counter(words_list).most_common(10) # periods are not counted as words (confirmed by the professor)
[('the', 210843),
 ('and', 131625),
 ('of', 117072),
 ('to', 50935),
 ('in', 44873),
 ('that', 42239),
 ('And', 39389),
 ('he', 35160),
 ('a', 33676),
 ('I', 32542)]
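
The top 10 lists 'and' and 'And' as separate words because the count is case-sensitive. Whether the two should be merged is not stated in the assignment, so the following is only a sketch of the alternative, using a toy list (an assumption for illustration) and lowercasing each word before counting.

```python
from collections import Counter

# Hypothetical toy word list standing in for words_list
toy_words = ['And', 'the', 'LORD', 'said', 'and', 'the', 'people', 'and']

# Case-sensitive: 'And' and 'and' are different keys
case_sensitive = Counter(toy_words)

# Case-insensitive: lowercase every word before counting
case_insensitive = Counter(w.lower() for w in toy_words)

print(case_sensitive['and'], case_sensitive['And'])
print(case_insensitive.most_common(2))
```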

Part3. Histogram

Q9. Plot the histogram of all the words in Q6

  • X: word, Y: frequency (descending)
cnt_dict = {}
cnt_dict['word'] = list(Counter(words_list).keys())
cnt_dict['count'] = list(Counter(words_list).values())
df = pd.DataFrame(cnt_dict)
sorted_df = df.sort_values('count', ascending=False)
sorted_df.reset_index(drop=True, inplace=True)
sorted_df
       word        count
0      the        210843
1      and        131625
2      of         117072
3      to          50935
4      in          44873
...    ...           ...
43669  enamelled       1
43670  pegs            1
43671  clamped         1
43672  dyes            1
43673  Stately         1

43674 rows × 2 columns

# plot top 30
fig = plt.figure()
fig.set_size_inches(20, 5, forward=True)    
sns.barplot(x=sorted_df['word'][:30], y=sorted_df['count'][:30])
plt.show()

[Figure: bar plot of the 30 most frequent words]

# plot the top 2000 words
fig = plt.figure()
fig.set_size_inches(20, 5, forward=True)    
sns.barplot(x=sorted_df['word'][:2000], y=sorted_df['count'][:2000])
plt.show()

[Figure: bar plot of the 2000 most frequent words]

Q10. Top 10 most frequent two-word sequences in Q6

two_words = []
len_lst = len(words_list) # words_list: words extracted after Q6
for i in range(len_lst - 1): # stop one short so every slice holds two words
    two_words.append(" ".join(words_list[i : i+2]))
print(*two_words[0:20], sep=' | ')
Stately plump | plump Buck | Buck Mulligan | Mulligan came | came from | from the | the stairhead | stairhead bearing | bearing a | a bowl | bowl of | of lather | lather on | on which | which a | a mirror | mirror and | and a | a razor | razor lay
# Top 10 most frequent two-word sequences
Counter(two_words).most_common(10)
[('of the', 37120),
 ('the LORD', 17892),
 ('in the', 16890),
 ('and the', 13029),
 ('to the', 7739),
 ('shall be', 7436),
 ('And the', 6728),
 ('all the', 6644),
 ('unto the', 6063),
 ('I will', 6038)]

Q11. Plot the two-word sequences from Q6

  • X: two-word sequence
  • Y: frequency (descending)
cnt_dict_2 = {}
cnt_dict_2['two_words'] = list(Counter(two_words).keys())
cnt_dict_2['count'] = list(Counter(two_words).values())
df_2 = pd.DataFrame(cnt_dict_2)
sorted_df_2 = df_2.sort_values('count', ascending=False)
sorted_df_2.reset_index(drop=True, inplace=True)
sorted_df_2
        two_words             count
0       of the                37120
1       the LORD              17892
2       in the                16890
3       and the               13029
4       to the                 7739
...     ...                     ...
372272  educational careers       1
372273  their educational         1
372274  find their                1
372275  Rathgar Did               1
372276  you immortal              1

372277 rows × 2 columns

# plot top 30
fig = plt.figure()
fig.set_size_inches(30, 5, forward=True)    
sns.barplot(x=sorted_df_2['two_words'][:30], y=sorted_df_2['count'][:30])
plt.show()

[Figure: bar plot of the 30 most frequent two-word sequences]

# plot the top 2000 two-word sequences
fig = plt.figure()
fig.set_size_inches(30, 5, forward=True)    
sns.barplot(x=sorted_df_2['two_words'][:2000], y=sorted_df_2['count'][:2000])
plt.show()

[Figure: bar plot of the 2000 most frequent two-word sequences]

Extra Credit

Extra Credit 1: Top 10 most frequent three-word sequences in Q6

three_words = []
len_lst = len(words_list) # words_list: words extracted after Q6
for i in range(len_lst - 2): # stop two short so every slice holds three words
    three_words.append(" ".join(words_list[i : i+3]))
print(*three_words[0:5], sep=' | ')
Stately plump Buck | plump Buck Mulligan | Buck Mulligan came | Mulligan came from | came from the
# Top 10 most frequent three-word sequences
Counter(three_words).most_common(10)
[('of the LORD', 4878),
 ('the son of', 3911),
 ('the children of', 3767),
 ('the house of', 2733),
 ('out of the', 2519),
 ('children of Israel', 1941),
 ('the land of', 1878),
 ('saith the LORD', 1845),
 ('the sons of', 1521),
 ('unto the LORD', 1464)]
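
The two-word and three-word loops differ only in the window size, so they can be folded into one helper. This is a minimal sketch under that assumption; the function name `ngrams` and the toy word list are hypothetical and not part of the original notebook.

```python
from collections import Counter

def ngrams(words, n):
    """Return every run of n consecutive words, joined by single spaces."""
    # The last valid start index is len(words) - n, so no short sequence is produced
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Hypothetical toy word list standing in for words_list
toy_words = ['of', 'the', 'LORD', 'and', 'of', 'the', 'LORD']

print(ngrams(toy_words, 2))
print(Counter(ngrams(toy_words, 3)).most_common(1))
```

If the list holds fewer than `n` words, the range is empty and the helper returns an empty list.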

Extra Credit 2: What are the three truths that Michael learned?
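
The search code in this section iterates over a variable named `lines` whose definition is not shown in this post. The printed sentences have no trailing period, which is consistent with the raw text having been split on periods, but that is an assumption. A minimal sketch under that assumption, with hypothetical sample lines and the hypothetical name `sentences`:

```python
# Hypothetical raw lines in the same shape as `data` (one string per file line)
raw_lines = [
    'God sent me to learn three\n',
    'truths, and I have learnt them. One I learnt\n',
    'when your wife pitied me.\n',
]

# Join the file lines, turn newlines into spaces, then split on periods
text = ''.join(raw_lines).replace('\n', ' ')
sentences = text.split('.')

# A phrase broken across two file lines is still found after joining
for idx, sentence in enumerate(sentences):
    if 'three truths' in sentence:
        print(idx, " : ", sentence)
```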

# lines: list of sentences (its definition is not shown in this post)
for idx, sentence in enumerate(lines):
    if 'three truths' in sentence:
        print(idx, " : ", sentence)
28040  :   And I smiled three times, because God sent me to learn three truths, and I have learnt them
28044  :  "  And Simon said, "Tell me, Michael, what did God punish you for? and what were the three truths? that I, too, may know them
28061  :  ' And God said: 'Go-take the mother's soul, and learn three truths: Learn What dwells in man, What is not given to man, and What men live by
for idx, sentence in enumerate(lines):
    if 28040 <= idx < 28061:
        print(sentence)
 And I smiled three times, because God sent me to learn three truths, and I have learnt them
 One I learnt when your wife pitied me, and that is why I smiled the first time
 The second I learnt when the rich man ordered the boots, and then I smiled again
 And now, when I saw those little girls, I learn the third and last truth, and I smiled the third time
"  And Simon said, "Tell me, Michael, what did God punish you for? and what were the three truths? that I, too, may know them
"  And Michael answered: "God punished me for disobeying Him
 I was an angel in heaven and disobeyed God
 God sent me to fetch a woman's soul
 I flew to earth, and saw a sick woman lying alone, who had just given birth to twin girls
 They moved feebly at their mother's side, but she could not lift them to her breast
 When she saw me, she understood that God had sent me for her soul, and she wept and said: 'Angel of God! My husband has just been buried, killed by a falling tree
 I have neither sister, nor aunt, nor mother: no one to care for my orphans
 Do not take my soul! Let me nurse my babes, feed them, and set them on their feet before I die
 Children cannot live without father or mother
' And I hearkened to her
 I placed one child at her breast and gave the other into her arms, and returned to the Lord in heaven
 I flew to the Lord, and said: 'I could not take the soul of the mother
 Her husband was killed by a tree; the woman has twins, and prays that her soul may not be taken
 She says: "Let me nurse and feed my children, and set them on their feet
 Children cannot live without father or mother
" I have not taken her soul
  • The three truths named in sentence 28061 are: what dwells in man, what is not given to man, and what men live by.
  • Michael describes when he learned each one:
    • One I learnt when your wife pitied me, and that is why I smiled the first time
    • The second I learnt when the rich man ordered the boots, and then I smiled again
    • And now, when I saw those little girls, I learn the third and last truth, and I smiled the third time