Thursday, November 29, 2018

香港中文大學藏盛宣懷檔案解密 Unlocking the Sheng Xuanhuai Archive @ CUHK

盛宣懷(1844-1916),一位睿智及極具魄力的商人、政治家、銀行家、外交家、慈善家、教育家,是中國晚清至民國早期工業化及現代化的推動者。
香港中文大學文物館藏盛宣懷檔案乃於1985年獲已故利榮森博士及程伯奮先生的慷慨支持下購入。這批超過13,000封信件的檔案接近四百萬字, 分為77卷,包括盛宣懷與家人及同僚的書信往來。這些檔案揭示了中國自晚清至民國初期的財政、政治、家庭和社會狀況,是了解這個動盪時期至為重要的檔案。
在由文物館及圖書館合辦的座談會中,上海專家團隊會分享自2014年以來處理盛檔釋文工作的經驗及當中的研究價值。圖書館數碼服務組則會分享檔案的數碼化處理過程,及嘗試以新技術揭示及檢索檔案內容,為數字人文提供基礎。
日期: 2018年12月7日(星期五)
時間: 2:30 ‒ 5:30 p.m.
地點: 中文大學圖書館地下 數碼學術研究室
語言: 中文
報名: https://goo.gl/forms/jA4PrlOz5VYefjJl1
講者:
馮金牛 (項目主編)
高洪興 (檔案處理及釋文團隊)
廖頴康 (圖書館數碼服務主任)
程序:
歡迎辭
姚進莊教授 (文物館館長)
李露絲女士 (圖書館館長)
馮金牛先生演講香港中文大學藏《盛宣懷檔案》的研究價值
高洪興先生演講盛宣懷與賑災
廖頴康先生演講揭示香港中文大學藏《盛宣懷檔案》—圖書館的參與
問答環節(由卜永堅教授主持)
歷史系副教授
梁元生教授總結
歷史學講座教授、
前中國文化研究所所長

Sheng Xuanhuai (1844-1916), a powerful and resourceful merchant, politician, banker, diplomat, philanthropist, and educator, was the driving force of China’s industrialization and modernization during the late Qing to early Republican period.
The Sheng Xuanhuai Archive at the Chinese University of Hong Kong was acquired by the Art Museum in 1985, under the auspices of the late Dr. J.S. LEE and Mr. CHENG Chi. It comprises 77 volumes with almost 4 million characters in over 13,000 letters exchanged among Sheng Xuanhuai, his family members, and colleagues. The archive is key to understanding that period of seismic change, and it will revolutionize the study of late Qing and early Republican China.
To mark the completion of this important phase of digitisation and transcription, the Art Museum and the Library have jointly organised this event. The Shanghai expert team will share their experience of transcribing the archive and its potential research value. The Library's Digital Services team will speak on hosting the archive in the Digital Repository to enhance discovery, research, and experimentation in digital humanities.
Date: 7 Dec 2018 (Fri)
Time: 2:30-5:30 p.m.
Venue: Digital Scholarship Lab, G/F, University Library
Medium: Chinese
Registration: https://goo.gl/forms/jA4PrlOz5VYefjJl1
Speakers:
Mr. FENG Jinniu (Chief Editor of the project)
Mr. GAO Hongxing (Member of the Shanghai expert team)
Mr. Jeff LIU (Digital Services Librarian)
Program:
Opening Remarks by:
Prof. Josh YIU (Director of Art Museum) and Ms. Louise JONES (University Librarian)
Mr. FENG Jinniu on "Research Value of the Sheng Xuanhuai Archive @ CUHK"
Mr. GAO Hongxing on "Sheng Xuanhuai and Disaster Relief"
Mr. Jeff LIU on "Unveiling the Sheng Xuanhuai Archive @ CUHK: the Library's Participation"
Q & A Session moderated by Prof. PUK Wing Kin
Associate Professor, Department of History
Closing Remarks by:
Prof. LEUNG Yuen Sang
Chair Professor of History;
Former Director of Institute of Chinese Studies



Wednesday, March 14, 2018

Innovation Skills Workshops: Applying Digital Tools in Telling Stories with DATA (First Workshop on 23-24 March 2018)

The Library is pleased to collaborate with the Centre for Entrepreneurship in arranging a series of workshops - Innovation Skills Workshops: Applying Digital Tools in Telling Stories with DATA - from March to October 2018. The series consists of 3 well-structured workshops that equip participants with the skills to combine computational methods with narrative approaches, using data to develop web applications for scholarly and creative works. Popular digital tools like Python will be covered. Each workshop consists of 9 hours of theory and practice, starting on Friday night and ending on Saturday afternoon:
  • Workshop 1: Design Thinking Meets Computational Thinking - Digital Literacy in the Network Age
  • Workshop 2: Preparing and Exploring Your Data in Python
  • Workshop 3: Visualizing and Publishing Your Data in Python
CUHK faculty members, researchers, and students are all welcome to join this workshop and equip themselves to tell impactful stories with data. Please also keep an eye on the other two workshops. A Certificate of Attendance will be issued to participants who have attended ALL THREE workshops.
Workshop 1: Design Thinking Meets Computational Thinking - Digital Literacy in the Network Age (9 hrs) (will be held on 23-24 March 2018 (Fri–Sat)) 
a. T-shaped Talent and Digital Literacy
  1. From I-shaped to T-shaped: Talent development for the network age
  2. Design Thinking meets Computational Thinking: A STEAM approach to digital literacy
  3. Lessons from “Digital Humanities”: C.P. Snow, Nicholas Negroponte, and Lev Manovich revisited
  4. Telling stories with data: From data scraping to data visualisation and interaction
Date & Time: 23 March 2018 (Fri), 6:30 p.m. – 9:30 p.m.
b. The Big 3 of web publishing
  1. HTML - the noun in web publishing
  2. CSS - the adjective in web publishing
  3. JavaScript - the verb in web publishing
  4. Using Git, Bootstrap library and Pingendo Builder for web development and publishing
Date & Time: 24 March 2018 (Sat), 10:00 a.m. – 1:00 p.m. & 2:30 p.m. - 5:30 p.m.
Workshop 2: Preparing and Exploring Your Data in Python (9 hrs) (tentatively in mid-May to early June 2018) 
a. Preparing (pre-processing) your data for growth (3 hrs)
  1. Know your sources: interviews, field studies, open data, API, websites, IoT, and digital archives
  2. ETL (extraction, transformation, and loading) in CSV, XML, and JSON formats for data preparation
  3. Finding a home for your data - cloud computing and its infrastructure for growth and support
  4. Popular tools for data preparation (e.g. Knime, Open Refine, Google Sheet/Xpath, Scrapinghub, Beautiful Soup, and Scrapy)
b. Exploring your data in Python (6 hrs)
  1. Using Anaconda Jupyter Notebook for data exploration in Python
  2. Introduction to Python operations (operator and operand), control structure, data structure, and function
  3. Useful Python modules for data exploration, analysis and mining (Matplotlib, Numpy, Pandas, etc.)
  4. Free online resources for self-paced learning in Python (codecademy.com, coursera.org, udacity.com, cognitiveclass.ai, etc.)
Workshop 3: Visualizing and Publishing Your Data in Python (9 hrs) (tentatively in mid-September to early October 2018) 
a. Growing your data in the cloud: From Google Sheet to Airtable (3 hrs)
  1. Beyond Google Sheet — Building relational database in Airtable for storing and managing your data
  2. The power of views — Displaying and filtering data in form, grid, calendar, kanban, and gallery views
  3. Functions and API for more advanced data modelling and application development
  4. Integration with other web applications for team collaboration and project management
b. Data Visualisation in JavaScript (3 hrs)
  1. Front-end vs. back-end programming: interface with the user and interface with the data using the Python Flask framework
  2. Useful JavaScript libraries (jQuery, D3, Mpld3, Leaflet, etc.) for data visualization and front-end interactions
  3. Create your first interactive chart in Matplotlib and Mpld3
  4. Create your first interactive map in Leaflet
c. Publishing Your Project on the Web (3 hrs)
  1. The elements of user experience in web design
  2. The narrative components in user journey within a web design
  3. Combining Airtable and Bootstrap library for web publishing
  4. Use of Google Optimize and Google Analytics to track your web project
Venue: Digital Scholarship Lab, G/F, University Library
Registration: Click to register (for workshop 1)
Remarks: Users are required to bring their own devices to the workshop.
Enquiries: dslab@lib.cuhk.edu.hk.

Monday, January 22, 2018

First Collaborative Digital Scholarship Project Launched

The first Digital Scholarship Project, in collaboration with Prof. Celine Lai of the Faculty of Arts, was soft-launched together with the Digital Scholarship Projects website in early January 2018.

The project "GIS Mapping and Archaeology of Early China" was collaborated with Prof. Celine Lai of Faculty of Arts in using GIS for mapping archaeological sites embedded with unearthed bronzes details. The data originally in excel files was collected from her doctoral study on the topic and at the initiation of Prof. Lai, dynamic maps are employed to visualise the distribution of the archaeological sites for people who are interested to the topic to further study.

To enable researchers to make use of Prof. Lai's research data and to avoid re-inventing the wheel, the project website also provides downloadable files in three different formats, with the full list of information and references on the bronzes and archaeological sites used to produce the map.
  • PDF file of the table for full references and quotes of information
  • Files for visualising the maps and for further analysis:
    • Shapefile for use in GIS software
    • kmz file for use in Google Map / Google Earth
Please visit the project website: http://dsprojects.lib.cuhk.edu.hk/en/projects/gis-nao/ for these files and more information.  The project will continue as more data are provided by Prof. Lai.
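For readers who would like a quick look at the shapefile in R before opening a GIS application, here is a minimal sketch; it assumes the sf package is installed and that the shapefile has been downloaded and unzipped locally (the file name below is a hypothetical placeholder).
# load the sf package for reading spatial data
library(sf)
# read the downloaded shapefile (replace with the actual file name)
sites <- st_read('bronze_sites.shp')
# inspect the attribute table of the sites
head(sites)
# quick plot of the site locations
plot(st_geometry(sites))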

Apart from this collaborative project, the DS Service Team has also released another small project, Text Data Preparation: A Practice in R using the Sheng Xuanhuai Collection, demonstrating the use of R to prepare data for Chinese text mining and analysis. Future Digital Scholarship Projects will all be made available on this Digital Scholarship Projects website.


About Digital Scholarship Service in CUHK Library
The Digital Scholarship Service Team aims to collaborate with CUHK faculty members, researchers, and postgraduates to facilitate digital scholarship research. The CUHK community is welcome to visit our Digital Scholarship Services website or email us at dslab@lib.cuhk.edu.hk to learn more.

Thursday, December 28, 2017

Text Data Preparation: a Practice in R using the Sheng Xuanhuai Collection

In this post, we share a way of preparing Chinese text data for computational analysis; we do so in R using sample texts from a historical collection that is currently being digitized by our library - the Sheng Xuanhuai Collection.

The Sheng Xuanhuai collection contains over 70 volumes of correspondence between the entrepreneur Sheng Xuanhuai and other individuals. The texts in the collection have been digitized and are preserved as images and text files. The texts are also coded with labels/variables such as title, sender name, receiver name, date, keywords, and locations mentioned in the texts. The digitization and transcription of the correspondence into machine-readable formats allows researchers to conduct studies using computational text analysis and other relevant methods.

In the following sections, we demonstrate our way of importing text data into R, preparing texts for analysis, and exploring and visualizing texts. Basic knowledge of R will be helpful if you want to try this practice or apply it to your own data.

Import Text

First we need to read our data (the texts; CSV files are used in this demo) into R. To do so, we use the setwd() function to set the working directory, i.e., to let R know where we store the data on the computer, and then use read.csv() to load the data file named v36.csv into R.
# set your working directory
setwd('YOUR WORKING DIRECTORY')
# load the data in a spreadsheet to R
v36 <- read.csv('v36.csv', encoding = 'UTF-8', header=TRUE, row.names=1, stringsAsFactors=FALSE)

# view the first two rows of the data
head(v36, 2)
##                lid
## 1 36-001A-1—001A-2
## 2 36-001B-1—001B-2
##                                                                                                                                                                                                                                                                                                           ltext
## 1                                                              宮保鈞座:新正趨賀兩次,聞藎躬尚未康復,不可以風,未敢驚擾。昨往拜朱邇典。伊面稱久違侍鈞顔,欲於日內晉謁,兼答拜經方。頃接其來函,准於明日下午三句鐘來署拜謁。可否允其接見?謹請示復,以便函告該使。肅此。上叩勛祺,順頌新禧。經方謹上。初九日。
## 2 啟者:頃據民政部公啟稱:於本月十六日下午三點鐘,在部開第一次衛生會會議。請飭派出各員届時蒞會等因。査本部知醫者惟屈道永秋( 去年九月始行札調到部),及承政廳行走學部主事謝天保兩員。屈道在本部月支薪水二百兩,謝主事則於滬甯鐵路掛名,月支洋百元。此兩員在本部無事,可否派其前往會議?請酌示。經方謹上。十二日。
There are two variables in the data: lid and ltext - the correspondence letter ID and the letter text for volume 36 of the collection. There are 245 rows in this dataset, i.e., 245 letters in this volume.
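As a quick optional check, assuming the v36 object loaded above, we can confirm the number of rows and the variable names:
# number of letters (rows) and variables (columns)
dim(v36)
# variable names
names(v36)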

Segmenting Chinese Text

Words and terms are the basic units of many computational text analysis methods; however, Chinese text is not "naturally" divided by whitespace the way some other languages, such as English, are. A number of methods have been developed to segment Chinese text - here we try the widely used "jieba" segmenter on our sample texts. To use the R version of jieba, install the package by running install.packages('jiebaR') in your R console. Note you also need to run install.packages() for the other packages used in the following sections if you have not already installed them.
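For reference, a minimal one-off sketch for installing all the packages used later in this post (skip any you already have):
# install the packages used in this post (only needed once)
install.packages(c('jiebaR', 'stringr', 'quanteda', 'ggplot2', 'RColorBrewer'))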
# load the "jiebaR" and "stringr" packages/libraries
library(jiebaR)
library(stringr)
Initialize an engine for word segmentation, use all the default settings, and try it on a simple sentence.
# initialize jiebaR worker
cutter <- worker()

# test the worker
cutter["今天的天氣真好"]
## [1] "今天" "的"   "天氣" "真好"
We then define a function called seg_x by which we segment the texts stored in the ltext variable of the data v36 and save them as a new variable of v36 called ltext.seg.
# define the function of segmenting
seg_x <- function(x) {str_c(cutter[x], collapse = " ")} 

# apply the function to each document (row of ltext)
x.out <- sapply(v36$ltext, seg_x, USE.NAMES = FALSE)

# attach the segmented text back to the data frame
v36$ltext.seg <- x.out 

# view the first two rows of the data frame
head(v36, 2)
##                lid
## 1 36-001A-1—001A-2
## 2 36-001B-1—001B-2
##                                                                                                                                                                                                                                                                                                           ltext
## 1                                                              宮保鈞座:新正趨賀兩次,聞藎躬尚未康復,不可以風,未敢驚擾。昨往拜朱邇典。伊面稱久違侍鈞顔,欲於日內晉謁,兼答拜經方。頃接其來函,准於明日下午三句鐘來署拜謁。可否允其接見?謹請示復,以便函告該使。肅此。上叩勛祺,順頌新禧。經方謹上。初九日。
## 2 啟者:頃據民政部公啟稱:於本月十六日下午三點鐘,在部開第一次衛生會會議。請飭派出各員届時蒞會等因。査本部知醫者惟屈道永秋( 去年九月始行札調到部),及承政廳行走學部主事謝天保兩員。屈道在本部月支薪水二百兩,謝主事則於滬甯鐵路掛名,月支洋百元。此兩員在本部無事,可否派其前往會議?請酌示。經方謹上。十二日。
##                                                                                                                                                                                                                                                                                                                                                 ltext.seg
## 1                                                                               宮保 鈞座 新 正 趨 賀 兩次 聞 藎 躬 尚未 康復 不 可以 風 未敢 驚擾 昨往 拜 朱邇 典 伊面 稱 久違 侍鈞 顔 欲 於 日 內 晉謁 兼 答拜 經方 頃接 其 來函 准 於 明日 下午 三句 鐘來署 拜謁 可否 允其 接見 謹 請示 復 以便 函告 該 使 肅此 上叩 勛 祺 順頌 新禧 經方謹 上 初九 日
## 2 啟者 頃 據 民政部 公 啟稱 於 本月 十六日 下午 三點鐘 在 部開 第一次 衛生 會 會議 請 飭 派出 各員届 時 蒞會 等 因 査 本部 知 醫者 惟屈 道 永秋 去年 九月 始行 札 調到 部 及承政廳 行走 學部 主事謝 天保 兩員 屈道 在 本部 月 支 薪水 二百兩 謝主事則 於 滬 甯 鐵路 掛名 月 支洋 百元 此 兩員 在 本部 無事 可否 派 其 前往 會議 請 酌 示 經方謹 上 十二日

Create corpus and document-term/feature-matrix

With the texts segmented by whitespaces, we can move on to create a corpus and a document-term/feature-matrix (DTM/DFM), which are often used for further text analysis. Here we use functions from the quanteda package to create the corpus and DFMs, as well as to explore and visualize the texts. quanteda is an R package for managing and analyzing text data; it provides tools for corpus management, natural language processing, document-feature-matrix analysis, and more.
# load the library
library(quanteda)
We create a corpus from the texts stored in the ltext.seg variable using the corpus() function. We also tokenize the texts using tokens() and construct a document-feature-matrix using dfm(). Note “fasterword” is specified so that the texts are tokenized by whitespaces. We can then view the most frequent terms/features in this set of texts using topfeatures(). The quanteda package also offers a function textplot_wordcloud() by which you can easily plot a wordcloud from DFMs.
# create corpus
lcorpus <- corpus(v36$ltext.seg)
# summarize the lcorpus object
summary(lcorpus, showmeta = TRUE, 5)
## Corpus consisting of 245 documents, showing 5 documents:
## 
##   Text Types Tokens Sentences
##  text1    73     82         1
##  text2    82    101         1
##  text3   121    153         1
##  text4    64     70         1
##  text5   171    298         1
## 
## Source:  /Users/Guest/Desktop/sheng/* on x86_64 by Guest
## Created: Wed Dec 27 15:14:39 2017
## Notes:
# see the text in the 1st document of lcorpus
texts(lcorpus)[1]
##                                                                                                                                                                                                                                                                       text1 
## "宮保 鈞座 新 正 趨 賀 兩次 聞 藎 躬 尚未 康復 不 可以 風 未敢 驚擾 昨往 拜 朱邇 典 伊面 稱 久違 侍鈞 顔 欲 於 日 內 晉謁 兼 答拜 經方 頃接 其 來函 准 於 明日 下午 三句 鐘來署 拜謁 可否 允其 接見 謹 請示 復 以便 函告 該 使 肅此 上叩 勛 祺 順頌 新禧 經方謹 上 初九 日"
# create dfm with "terms/features" split by whitespaces;
# i.e., preserve what was done when segmenting with jiebaR

# tokenize:"tokens" from doc 1, split by whitespaces
tokens(lcorpus, what = "fasterword")[1]
## tokens from 1 document.
## text1 :
##  [1] "宮保"   "鈞座"   "新"     "正"     "趨"     "賀"     "兩次"  
##  [8] "聞"     "藎"     "躬"     "尚未"   "康復"   "不"     "可以"  
## [15] "風"     "未敢"   "驚擾"   "昨往"   "拜"     "朱邇"   "典"    
## [22] "伊面"   "稱"     "久違"   "侍鈞"   "顔"     "欲"     "於"    
## [29] "日"     "內"     "晉謁"   "兼"     "答拜"   "經方"   "頃接"  
## [36] "其"     "來函"   "准"     "於"     "明日"   "下午"   "三句"  
## [43] "鐘來署" "拜謁"   "可否"   "允其"   "接見"   "謹"     "請示"  
## [50] "復"     "以便"   "函告"   "該"     "使"     "肅此"   "上叩"  
## [57] "勛"     "祺"     "順頌"   "新禧"   "經方謹" "上"     "初九"  
## [64] "日"
# tokenize and create document-feature-matrix
ltokens <- tokens(v36$ltext.seg, what = "fasterword")
ldfm <- dfm(ltokens)

# a dfm with 245 documents and 8052 features
ldfm 
## Document-feature matrix of: 245 documents, 8,052 features (98.9% sparse).
# list top 20 features
topfeatures(ldfm, 20)
##     上   宮保   鈞座 經方謹     之   肅頌     復     與     已   崇綏 
##    216    194    176    171    168    149    147    134    134    130 
##     呈     為     又     在     於     請     附     係     以     有 
##    122    109    107    105     97     87     86     86     85     85
# plot wordcloud
par(family='Kaiti TC') # set Chinese font on Mac; you may not need to set font on Windows
textplot_wordcloud(ldfm, min.freq=30, random.order=FALSE,
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))



Combine multiple data files

In the lines above we show how to work with texts stored in a single file; however, it is also fairly common to have texts saved in multiple files. Here we demonstrate how to combine more than one text file in a more efficient way than doing it one by one, along with some more ways and options for segmenting, exploring, and visualizing text data.
Let’s start fresh by removing what we have loaded and created in R.
# remove everything in R environment
rm(list=ls())

We first define a function named multcomb to do the following: 1) list the file names of all the data files that you would like to combine into one file - in this case, we have two csv files to combine; 2) read in the files one by one and rbind them into one data frame.
Save all the data files in one folder, then plug the path of the folder into the multcomb function to combine all the data files - here we save the combined data frame as mydata.
# define the function of combining multiple files
multcomb <- function(mypath){
  # save all the file names (with path) in an object "filenames"
  filenames <- list.files(path=mypath, full.names=TRUE)
  # import all files and save them as "datalist"
  datalist <- lapply(filenames, function(x){
    read.csv(file=x, encoding='UTF-8', header=TRUE, row.names=1, stringsAsFactors=FALSE)})
  # combine the files (data frames in "datalist")
  Reduce(function(x,y) {rbind(x,y)}, datalist)}
# Use the function multcomb to combine the files in the folder:
# before executing the function, save all the .csv files in one folder;
# note the folder should not contain other files
mydata <- multcomb('YOUR PATH OF THE FOLDER')
# view the first two rows of mydata
head(mydata, 2)
##                lid
## 1 36-001A-1—001A-2
## 2 36-001B-1—001B-2
##                                                                                                                                                                                                                                                                                                           ltext
## 1                                                              宮保鈞座:新正趨賀兩次,聞藎躬尚未康復,不可以風,未敢驚擾。昨往拜朱邇典。伊面稱久違侍鈞顔,欲於日內晉謁,兼答拜經方。頃接其來函,准於明日下午三句鐘來署拜謁。可否允其接見?謹請示復,以便函告該使。肅此。上叩勛祺,順頌新禧。經方謹上。初九日。
## 2 啟者:頃據民政部公啟稱:於本月十六日下午三點鐘,在部開第一次衛生會會議。請飭派出各員届時蒞會等因。査本部知醫者惟屈道永秋( 去年九月始行札調到部),及承政廳行走學部主事謝天保兩員。屈道在本部月支薪水二百兩,謝主事則於滬甯鐵路掛名,月支洋百元。此兩員在本部無事,可否派其前往會議?請酌示。經方謹上。十二日。

Segmenting: stopwords and dictionary

Segment the words in the combined data file - this time we use a stopword list and a custom dictionary to modify the segmenting "worker".
# see the stopwords and dictionary
readLines('sheng_stop.txt', encoding = 'UTF-8')
##  [1] ""   "之" "與" "為" "也" "有" "在" "以" "於" "即" "係"
readLines('sheng_dic.txt', encoding = 'UTF-8')
## [1] ""     "經方" "謹上" "滬甯" "京奉" "匯豐" "匯理"
Here we include 10 words in our stopword list - words we think can be safely filtered out - and we have 6 terms in our custom dictionary so that each of these terms is segmented as is. We recommend using Notepad++ to create your custom stopword lists and dictionaries encoded in UTF-8. Note that if you need to use Windows Notepad to create these text files, it may be easier for R to work with them if the first row of each file is left blank - you can see that the first elements in our two text files are empty.
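If you prefer to stay in R rather than use a text editor, the same UTF-8 files can also be created directly from R; a minimal sketch, keeping the first element blank as noted above:
# write the stopword list to a UTF-8 text file (first element left blank)
con <- file('sheng_stop.txt', open = 'w', encoding = 'UTF-8')
writeLines(c('', '之', '與', '為', '也', '有', '在', '以', '於', '即', '係'), con)
close(con)
# write the custom dictionary in the same way
con <- file('sheng_dic.txt', open = 'w', encoding = 'UTF-8')
writeLines(c('', '經方', '謹上', '滬甯', '京奉', '匯豐', '匯理'), con)
close(con)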
# set up and apply the worker and function for segmenting
cutter <- worker(stop_word = 'sheng_stop.txt', user = 'sheng_dic.txt')
seg_x <- function(x) {str_c(cutter[x], collapse = " ")} 
mydata$ltext.seg <- sapply(mydata$ltext, seg_x, USE.NAMES = FALSE)

# view the first few rows
head(mydata, 2)
##                lid
## 1 36-001A-1—001A-2
## 2 36-001B-1—001B-2
##                                                                                                                                                                                                                                                                                                           ltext
## 1                                                              宮保鈞座:新正趨賀兩次,聞藎躬尚未康復,不可以風,未敢驚擾。昨往拜朱邇典。伊面稱久違侍鈞顔,欲於日內晉謁,兼答拜經方。頃接其來函,准於明日下午三句鐘來署拜謁。可否允其接見?謹請示復,以便函告該使。肅此。上叩勛祺,順頌新禧。經方謹上。初九日。
## 2 啟者:頃據民政部公啟稱:於本月十六日下午三點鐘,在部開第一次衛生會會議。請飭派出各員届時蒞會等因。査本部知醫者惟屈道永秋( 去年九月始行札調到部),及承政廳行走學部主事謝天保兩員。屈道在本部月支薪水二百兩,謝主事則於滬甯鐵路掛名,月支洋百元。此兩員在本部無事,可否派其前往會議?請酌示。經方謹上。十二日。
##                                                                                                                                                                                                                                                                                                                                 ltext.seg
## 1                                                                     宮保 鈞座 新 正 趨 賀 兩次 聞 藎 躬 尚未 康復 不 可以 風 未敢 驚擾 昨往 拜 朱邇 典 伊面 稱 久違 侍鈞 顔 欲 日 內 晉謁 兼 答拜 經方 頃接 其 來函 准 明日 下午 三句 鐘來署 拜謁 可否 允其 接見 謹 請示 復 以便 函告 該 使 肅此 上叩 勛 祺 順頌 新禧 經方 謹上 初九 日
## 2 啟者 頃 據 民政部 公 啟稱 本月 十六日 下午 三點鐘 部開 第一次 衛生 會 會議 請 飭 派出 各員届 時 蒞會 等 因 査 本部 知 醫者 惟屈 道 永秋 去年 九月 始行 札 調到 部 及承政廳 行走 學部 主事謝 天保 兩員 屈道 本部 月 支 薪水 二百兩 謝主事則 滬甯 鐵路 掛名 月 支洋 百元 此 兩員 本部 無事 可否 派 其 前往 會議 請 酌 示 經方 謹上 十二日

Create corpus, DFM and feature frequency tab

Now we have segmented texts saved in the variable ltext.seg of mydata. We then use functions corpus() and dfm() to create corpus and DFM from ltext.seg and save them as mycorpus and mydfm.
# create and examine corpus 
mycorpus <- corpus(mydata$ltext.seg)
summary(mycorpus, showmeta = TRUE, 5)
## Corpus consisting of 335 documents, showing 5 documents:
## 
##   Text Types Tokens Sentences
##  text1    73     79         1
##  text2    79     95         1
##  text3   119    148         1
##  text4    62     68         1
##  text5   165    277         1
## 
## Source:  /Users/Guest/Desktop/sheng/* on x86_64 by Guest
## Created: Wed Dec 27 15:14:41 2017
## Notes:
# view texts in the first document of the corpus
texts(mycorpus)[1]
##                                                                                                                                                                                                                                                                 text1 
## "宮保 鈞座 新 正 趨 賀 兩次 聞 藎 躬 尚未 康復 不 可以 風 未敢 驚擾 昨往 拜 朱邇 典 伊面 稱 久違 侍鈞 顔 欲 日 內 晉謁 兼 答拜 經方 頃接 其 來函 准 明日 下午 三句 鐘來署 拜謁 可否 允其 接見 謹 請示 復 以便 函告 該 使 肅此 上叩 勛 祺 順頌 新禧 經方 謹上 初九 日"
# create and examine DFM
mydfm <- dfm(tokens(mydata$ltext.seg, what = "fasterword"))
mydfm 
## Document-feature matrix of: 335 documents, 15,126 features (99.2% sparse).
# top 20 features
topfeatures(mydfm, 20)
##   亦 宮保   已 經方 鈞座 謹上   又   復   稟   不   者 肅頌   呈   均   其 
##  271  265  256  219  208  196  195  187  177  161  160  149  145  143  134 
##   請   再 崇綏   而   寳 
##  131  131  130  125  123
Note that in this DFM, the terms included in our stopword list are gone, and those in our dictionary are segmented as specified in the dictionary file.
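A quick check, assuming the mydfm object created above, confirms that a stopword such as 之 no longer appears among the features while a dictionary term such as 經方 does:
# check whether selected terms appear among the DFM features
'之' %in% featnames(mydfm)    # expected FALSE: removed as a stopword
'經方' %in% featnames(mydfm)  # expected TRUE: kept as a dictionary term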
We can then generate a data frame showing the frequency of each feature using textstat_frequency().
# tabulate feature frequency
dfmtab <- textstat_frequency(mydfm)
head(dfmtab)
##   feature frequency rank docfreq
## 1      亦       271    1     106
## 2    宮保       265    2     236
## 3      已       256    3     141
## 4    經方       219    4     211
## 5    鈞座       208    5     208
## 6    謹上       196    6     196
Sometimes you only care about, say, longer features/terms; use dfm_select() to choose those that meet certain conditions, e.g., terms containing two or more characters.
# select 2+ word features 
mydfm2 <- dfm_select(mydfm, min_nchar = 2) 
topfeatures(mydfm2, 20)
##   宮保   經方   鈞座   謹上   肅頌   崇綏   不知   合同   公司   不能 
##    265    219    208    196    149    130     97     90     85     80 
##   閣下   一切   謹悉 外務部   本部   尚未   本日   大人   如何   可以 
##     73     59     58     57     54     53     53     53     53     51
Plot a wordcloud from the DFM containing those two-or-more-character terms. Here we select terms appearing 5 or more times to plot and set 200 as the maximum number of terms to be included.
# plot wordcloud
par(family='Kaiti TC')
textplot_wordcloud(mydfm2, min.freq=5, random.order=FALSE, max.words = 200,
                   rot.per = .25, scale = c(2.8, .5),
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

Explore more about the data

The function textstat_frequency() can tabulate all feature frequencies, as we did above - we can also limit the number of features to be tabulated and plot the selected features using ggplot() from the ggplot2 package.
# tabulate the top 10 features
textstat_frequency(mydfm2, n=10)
##    feature frequency rank docfreq
## 1     宮保       265    1     236
## 2     經方       219    2     211
## 3     鈞座       208    3     208
## 4     謹上       196    4     196
## 5     肅頌       149    5     149
## 6     崇綏       130    6     130
## 7     不知        97    7      49
## 8     合同        90    8      47
## 9     公司        85    9      45
## 10    不能        80   10      56
# plot freq. by rank of the most frequent 50 features
library(ggplot2)
theme_set(theme_minimal())
textstat_frequency(mydfm2, n = 50) %>% 
  ggplot(aes(x = rank, y = frequency)) +
  geom_point() +
  labs(x = "Frequency rank", y = "Term frequency")

We can also use dfm_weight() to create a DFM representing weighted frequencies of the terms, for instance a DFM with relative feature/term frequencies, i.e., each feature's count expressed as a proportion of the total feature counts within its document.
# create dfm with relative term frequencies
dfmpct <- dfm_weight(mydfm2, type = "relfreq") 

# plot relative term frequencies
textstat_frequency(dfmpct, n = 10) %>% 
  ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
  geom_bar(stat = "identity") + coord_flip() + 
  labs(x = "", y = "Relative Term Frequency") +
  theme(text = element_text(family = 'STKaiti'))





In Sum...


In this blog post, we share our way of preparing Chinese texts for computational text analysis, mainly using two R packages - jiebaR and quanteda. We hope this will help our users either to work with our Sheng Xuanhuai correspondence collection or to apply this way of processing Chinese texts to their own text data. The two packages provide many more functions than we can introduce in this single post; to learn more, start from their documentation at https://qinwenfeng.com/jiebaR/ and http://docs.quanteda.io/index.html.
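As an optional last step not covered above, the prepared texts can be written back to disk so the segmentation does not have to be repeated in later analyses; a minimal sketch, assuming the mydata object created earlier and a hypothetical output file name:
# save the data frame with the segmented texts as a UTF-8 CSV for reuse
write.csv(mydata, 'sheng_segmented.csv', fileEncoding = 'UTF-8', row.names = FALSE)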