[NLP] Natural Language Processing Without NLP Libraries
Analysis Overview
- Analyze text using only regular expressions and collections.Counter, without any NLP library
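As a minimal sketch of this approach (the sample sentence and the names `sample`, `tokens`, and `counts` are illustrative, not taken from the assignment data), `re.findall` extracts word tokens and `Counter` tallies them:

```python
import re
from collections import Counter

# Illustrative sample text (not from the assignment data).
sample = "The cat sat. The cat ran!"

# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped.
tokens = re.findall(r"\w+", sample)
counts = Counter(tokens)

print(tokens)
print(counts["cat"])
```

Everything that follows builds on these two calls.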
Configuration
- Rule: do NOT use NLP libraries
import re
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.colab import drive
drive.mount('./MyDrive')
DATA_PATH = "/content/MyDrive/My Drive/data/"
def make_path(add_path):
    path_list = [DATA_PATH]
    path_list.append(add_path)
    return ''.join(path_list)
Mounted at ./MyDrive
# load data
with open(make_path('data1.csv'), 'r', encoding='UTF8') as f:
    data = f.readlines()  # the with block closes the file, so no f.close() is needed
data[0:5]
['\ufeffStately, plump Buck Mulligan came from the stairhead, bearing a bowl of\n',
'lather on which a mirror and a razor lay crossed. A yellow\n',
'dressinggown, ungirdled, was sustained gently behind him on the mild\n',
'morning air. He held the bowl aloft and intoned:\n',
'\n']
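The '\ufeff' at the start of data[0] is a byte order mark (BOM). It does not affect the counts below, because `\w+` does not match it, but it can be stripped at load time by opening the file with the `utf-8-sig` codec. A small sketch using a temporary file (the file and its contents are made up for illustration):

```python
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM,
# mimicking the '\ufeff' seen at the start of data[0].
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("\ufeffStately, plump Buck Mulligan\n".encode("utf-8"))
    tmp_path = tmp.name

# Plain utf-8 keeps the BOM as the first character.
with open(tmp_path, "r", encoding="utf-8") as f:
    with_bom = f.readline()

# utf-8-sig drops a leading BOM if one is present.
with open(tmp_path, "r", encoding="utf-8-sig") as f:
    without_bom = f.readline()

os.remove(tmp_path)

print(repr(with_bom))
print(repr(without_bom))
```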
Part1. Counting / Frequency
word_list = []
for i in range(len(data)):
    words = re.findall(r'\w+', data[i])  # raw string avoids an invalid-escape warning
    word_list += words
word_list[0:7]
['Stately', 'plump', 'Buck', 'Mulligan', 'came', 'from', 'the']
Q1. Frequency of the word “Dedalus”
freq_Dedalus = Counter(word_list)['Dedalus']
freq_Dedalus
174
Q2. Frequency of the word “mounted”
freq_mounted = Counter(word_list)['mounted']
freq_mounted
8
Q3. Frequency of the word “Eunsok”
freq_Eunsok = Counter(word_list)['Eunsok']
freq_Eunsok
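A `Counter` returns 0 for a key it has never counted instead of raising `KeyError`, so looking up a word that is absent from the text is safe. A small sketch with made-up tokens:

```python
from collections import Counter

# Made-up tokens; "Eunsok" does not occur among them.
counts = Counter(["Stephen", "Dedalus", "Stephen"])

print(counts["Stephen"])
print(counts["Eunsok"])    # missing key: the lookup yields 0
print("Eunsok" in counts)  # the lookup does not add the key
```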
Part2. Data Cleaning and Frequency
Q5. Data Cleaning
- 1) Remove all empty lines
- 2) Remove all digits/numbers such as 1, 2, 3
- 3) Remove all special characters such as ?, !, ”, *, [, ], -, etc., except the period (.)
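cleaned = []
for line in data:
    # replace every run of non-letter, non-period characters with a single space
    line = re.sub(r"[^a-zA-Z.]+", " ", line)
    cleaned.append(line.strip())  # note: blank lines are kept as empty strings
cleaned_txt = ' '.join(cleaned)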
cleaned_txt
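The loop above keeps lines that become empty after cleaning as empty strings. That does not change the word counts, but it leaves rule 1 only partly applied. A sketch of a variant that drops them, assuming a hypothetical helper `clean_lines` and made-up sample lines (list indices would shift compared with the outputs shown in this post):

```python
import re

def clean_lines(raw_lines):
    """Hypothetical helper: apply the three Q5 rules to a list of raw lines."""
    result = []
    for line in raw_lines:
        # Rules 2 and 3: replace each run of characters other than
        # letters and periods with a single space.
        line = re.sub(r"[^a-zA-Z.]+", " ", line).strip()
        # Rule 1: skip lines that are empty after cleaning.
        if line:
            result.append(line)
    return result

# Made-up sample lines for illustration.
sample = ['Chapter 1: He said, "Come up!"\n', "\n", "  42  \n", "It was 8 o'clock.\n"]
print(clean_lines(sample))
```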
Q6. Add a single space before every period (.)
with_space = []
for line in cleaned:
    line = line.replace('.', ' .')  # add a single space before each period
    with_space.append(line)
print(*with_space[11:20], sep='\n')
Solemnly he came forward and mounted the round gunrest . He faced about
and blessed gravely thrice the tower the surrounding land and the
awaking mountains . Then catching sight of Stephen Dedalus he bent
towards him and made rapid crosses in the air gurgling in his throat
and shaking his head . Stephen Dedalus displeased and sleepy leaned
his arms on the top of the staircase and looked coldly at the shaking
gurgling face that blessed him equine in its length and at the light
untonsured hair grained and hued like pale oak .
Q7. Total number of words
- after Q6
with_space[0:4]
['Stately plump Buck Mulligan came from the stairhead bearing a bowl of',
'lather on which a mirror and a razor lay crossed . A yellow',
'dressinggown ungirdled was sustained gently behind him on the mild',
'morning air . He held the bowl aloft and intoned']
words_list = []
for line in with_space:
    words = re.findall(r'\w+', line)  # periods are not counted as words (per the professor)
    words_list += words
words_list[300:306]
['oval', 'jowl', 'recalled', 'a', 'prelate', 'patron']
len(words_list) # total number of words
2867884
Q8. Top 10 most frequent words in Q6
Counter(words_list).most_common(10)  # periods are not counted as words (per the professor)
[('the', 210843),
('and', 131625),
('of', 117072),
('to', 50935),
('in', 44873),
('that', 42239),
('And', 39389),
('he', 35160),
('a', 33676),
('I', 32542)]
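Note that 'and' and 'And' appear as separate entries because the count is case-sensitive. If case should be ignored, the tokens can be lowercased before counting. A sketch with made-up tokens standing in for words_list:

```python
from collections import Counter

# Made-up tokens standing in for words_list.
tokens = ["And", "the", "and", "The", "and", "LORD"]

# Case-sensitive: "And" and "and" are different keys.
case_sensitive = Counter(tokens)

# Case-insensitive: lowercase each token before counting.
case_insensitive = Counter(token.lower() for token in tokens)

print(case_sensitive.most_common(2))
print(case_insensitive.most_common(2))
```

Whether to fold case depends on the question being asked. Lowercasing would also merge 'LORD' with 'Lord' and 'lord'.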
Part3. Histogram
Q9. Plot the histogram of all the words in Q6
- X : word , Y : frequency(DESC)
cnt_dict = {}
cnt_dict['word'] = list(Counter(words_list).keys())
cnt_dict['count'] = list(Counter(words_list).values())
df = pd.DataFrame(cnt_dict)
sorted_df = df.sort_values('count', ascending=False)
sorted_df.reset_index(drop=True, inplace=True)
sorted_df
| | word | count |
|---|---|---|
| 0 | the | 210843 |
| 1 | and | 131625 |
| 2 | of | 117072 |
| 3 | to | 50935 |
| 4 | in | 44873 |
| ... | ... | ... |
| 43669 | enamelled | 1 |
| 43670 | pegs | 1 |
| 43671 | clamped | 1 |
| 43672 | dyes | 1 |
| 43673 | Stately | 1 |
43674 rows × 2 columns
# plot top 30
fig = plt.figure()
fig.set_size_inches(20, 5, forward=True)
sns.barplot(x=sorted_df['word'][:30], y=sorted_df['count'][:30])
plt.show()
# plot the 2000 most frequent words
fig = plt.figure()
fig.set_size_inches(20, 5, forward=True)
sns.barplot(x=sorted_df['word'][:2000], y=sorted_df['count'][:2000])
plt.show()
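The long tail in the 2000-word plot reflects how concentrated the counts are. As a rough check using only the figures reported in Q7 and Q8 (the variable names are illustrative), the ten most frequent words account for roughly a quarter of all tokens:

```python
# Counts copied from the Q7 and Q8 outputs in this post.
total_words = 2867884
top10 = [("the", 210843), ("and", 131625), ("of", 117072), ("to", 50935), ("in", 44873),
         ("that", 42239), ("And", 39389), ("he", 35160), ("a", 33676), ("I", 32542)]

# Sum the top-10 counts and express them as a share of all tokens.
top10_total = sum(count for _, count in top10)
share = top10_total / total_words

print(f"top-10 words: {top10_total} of {total_words} tokens ({share:.1%})")
```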
Q10. Top 10 most frequent two-word sequences in Q6
two_words = []
len_lst = len(words_list)  # words_list: words extracted after Q6
for i in range(len_lst - 1):  # stop one short so every slice holds two words
    two_words.append(" ".join(words_list[i : i+2]))
print(*two_words[0:20], sep=' | ')
Stately plump | plump Buck | Buck Mulligan | Mulligan came | came from | from the | the stairhead | stairhead bearing | bearing a | a bowl | bowl of | of lather | lather on | on which | which a | a mirror | mirror and | and a | a razor | razor lay
# Top 10 most frequent two word sequence
Counter(two_words).most_common()[:10]
[('of the', 37120),
('the LORD', 17892),
('in the', 16890),
('and the', 13029),
('to the', 7739),
('shall be', 7436),
('And the', 6728),
('all the', 6644),
('unto the', 6063),
('I will', 6038)]
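Because words_list was built with `\w+`, the periods are gone, so the pairs above also include the last word of one sentence followed by the first word of the next. If such pairs should be excluded, one option is to split on periods first and pair words only within each sentence. A sketch, assuming a hypothetical helper `bigrams_within_sentences` and a short sample in the Q6 format:

```python
import re
from collections import Counter

def bigrams_within_sentences(text):
    """Hypothetical helper: two-word sequences that do not cross a period."""
    pairs = []
    for sentence in text.split("."):
        words = re.findall(r"\w+", sentence)
        # zip pairs each word with its successor inside the same sentence.
        pairs.extend(f"{first} {second}" for first, second in zip(words, words[1:]))
    return pairs

# Short sample in the Q6 format (a space before the period).
text = "he mounted the gunrest . He faced about"
pairs = bigrams_within_sentences(text)
print(pairs)
print(Counter(pairs).most_common(1))
```

With this variant, 'gunrest He' is not produced, whereas the loop in Q10 would count it.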
Q11. Plot the two-word sequences in Q6
- X : two words
- Y : frequency (DESC)
cnt_dict_2 = {}
cnt_dict_2['two_words'] = list(Counter(two_words).keys())
cnt_dict_2['count'] = list(Counter(two_words).values())
df_2 = pd.DataFrame(cnt_dict_2)
sorted_df_2 = df_2.sort_values('count', ascending=False)
sorted_df_2.reset_index(drop=True, inplace=True)
sorted_df_2
| | two_words | count |
|---|---|---|
| 0 | of the | 37120 |
| 1 | the LORD | 17892 |
| 2 | in the | 16890 |
| 3 | and the | 13029 |
| 4 | to the | 7739 |
| ... | ... | ... |
| 372272 | educational careers | 1 |
| 372273 | their educational | 1 |
| 372274 | find their | 1 |
| 372275 | Rathgar Did | 1 |
| 372276 | you immortal | 1 |
372277 rows × 2 columns
# plot top 30
fig = plt.figure()
fig.set_size_inches(30, 5, forward=True)
sns.barplot(x=sorted_df_2['two_words'][:30], y=sorted_df_2['count'][:30])
plt.show()
# plot the 2000 most frequent two-word sequences
fig = plt.figure()
fig.set_size_inches(30, 5, forward=True)
sns.barplot(x=sorted_df_2['two_words'][:2000], y=sorted_df_2['count'][:2000])
plt.show()
Extra Credit
Extra Credit 1 : Top 10 most frequent three-word sequences in Q6
three_words = []
len_lst = len(words_list)  # words_list: words extracted after Q6
for i in range(len_lst - 2):  # stop two short so every slice holds exactly three words
    three_words.append(" ".join(words_list[i : i+3]))
print(*three_words[0:5], sep=' | ')
Stately plump Buck | plump Buck Mulligan | Buck Mulligan came | Mulligan came from | came from the
# Top 10 most frequent three word sequence
Counter(three_words).most_common()[:10]
[('of the LORD', 4878),
('the son of', 3911),
('the children of', 3767),
('the house of', 2733),
('out of the', 2519),
('children of Israel', 1941),
('the land of', 1878),
('saith the LORD', 1845),
('the sons of', 1521),
('unto the LORD', 1464)]
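The two-word and three-word loops differ only in the window size, so they can be folded into one function. A sketch, assuming a hypothetical helper `ngrams` and a made-up word list:

```python
from collections import Counter

def ngrams(words, n):
    """Hypothetical helper: all runs of n consecutive words, joined by spaces."""
    # The range stops at len(words) - n so every slice has exactly n words.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# Made-up word list for illustration.
words = ["the", "house", "of", "the", "LORD"]
print(ngrams(words, 2))
print(ngrams(words, 3))
print(Counter(ngrams(words, 1)).most_common(1))
```

Bounding the range by `len(words) - n + 1` avoids the short trailing slices that appear when the loop runs over every index.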
Extra Credit 2 : What are the three truths that Michael learned?
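# `lines` is not defined earlier in this post; assumed here to be the raw text split into sentences on periods
lines = ' '.join(raw.strip() for raw in data).split('.')
for idx, sentence in enumerate(lines):
    if 'three truths' in sentence:
        print(idx, " : ", sentence)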
28040 : And I smiled three times, because God sent me to learn three truths, and I have learnt them
28044 : " And Simon said, "Tell me, Michael, what did God punish you for? and what were the three truths? that I, too, may know them
28061 : ' And God said: 'Go-take the mother's soul, and learn three truths: Learn What dwells in man, What is not given to man, and What men live by
for idx, sentence in enumerate(lines):
    if (idx >= 28040) and (idx < 28061):
        print(sentence)
And I smiled three times, because God sent me to learn three truths, and I have learnt them
One I learnt when your wife pitied me, and that is why I smiled the first time
The second I learnt when the rich man ordered the boots, and then I smiled again
And now, when I saw those little girls, I learn the third and last truth, and I smiled the third time
" And Simon said, "Tell me, Michael, what did God punish you for? and what were the three truths? that I, too, may know them
" And Michael answered: "God punished me for disobeying Him
I was an angel in heaven and disobeyed God
God sent me to fetch a woman's soul
I flew to earth, and saw a sick woman lying alone, who had just given birth to twin girls
They moved feebly at their mother's side, but she could not lift them to her breast
When she saw me, she understood that God had sent me for her soul, and she wept and said: 'Angel of God! My husband has just been buried, killed by a falling tree
I have neither sister, nor aunt, nor mother: no one to care for my orphans
Do not take my soul! Let me nurse my babes, feed them, and set them on their feet before I die
Children cannot live without father or mother
' And I hearkened to her
I placed one child at her breast and gave the other into her arms, and returned to the Lord in heaven
I flew to the Lord, and said: 'I could not take the soul of the mother
Her husband was killed by a tree; the woman has twins, and prays that her soul may not be taken
She says: "Let me nurse and feed my children, and set them on their feet
Children cannot live without father or mother
" I have not taken her soul
- Michael learned the three truths (what dwells in man, what is not given to man, and what men live by) on three occasions:
- One I learnt when your wife pitied me, and that is why I smiled the first time
- The second I learnt when the rich man ordered the boots, and then I smiled again
- And now, when I saw those little girls, I learn the third and last truth, and I smiled the third time
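The search-then-read pattern used for this question can be wrapped in two small functions. A sketch, assuming hypothetical helpers `find_sentences` and `context`, with shortened sentences standing in for the full sentence list:

```python
def find_sentences(sentences, phrase):
    """Hypothetical helper: indices of sentences that contain the phrase."""
    return [idx for idx, sentence in enumerate(sentences) if phrase in sentence]

def context(sentences, idx, before=0, after=2):
    """Hypothetical helper: the sentence at idx plus a few neighbours."""
    start = max(0, idx - before)
    return sentences[start:idx + after + 1]

# Shortened sentences standing in for the full sentence list.
sentences = [
    "God sent me to learn three truths",
    "One I learnt when your wife pitied me",
    "The second I learnt when the rich man ordered the boots",
    "I learn the third and last truth",
]
hits = find_sentences(sentences, "three truths")
print(hits)
for sentence in context(sentences, hits[0], after=3):
    print(sentence)
```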