Monday, January 22, 2018

Conference: IP Protection and Enforcement: from Policy to Strategy to Practice

Faculty members and postgraduate students are increasingly aware of copyright issues in teaching, learning, and research. Copyright is only one part of intellectual property, which also covers other intangible property rights such as trade marks, patents, designs, plant varieties, and the layout-design of integrated circuits. Intellectual property protection is important to Hong Kong as an international trading centre. If you want to know more about IP policy and current practices, you are welcome to join the conference on IP Protection and Enforcement: from Policy to Strategy to Practice, organized by the Faculty of Law, CUHK.

The conference has two major parts. The first session focuses on the enforcement practices and IP initiatives of the Government of the Hong Kong SAR; the speakers are top officials from the Intellectual Property Department, the Customs and Excise Department, and the Hong Kong Trade Development Council. The second session focuses on best enforcement practices in the business sector; the speakers are industry leaders and top IP lawyers.



Date: 26 Jan 2018 (Fri)
Time: 9:00 a.m. – 1:40 p.m.
Venue: The CUHK Graduate Law Centre, 2/F Bank of America Tower, Central
Conference programme: http://www.law.cuhk.edu.hk/en/event-page/file/20180126_Event_agenda.pdf

Digital Scholarship Consultation Hours in the University Library Continue from 16 January 2018

The Digital Scholarship Consultation Hours service has been running since October 2017, offering on-the-spot discussion and advice to walk-in users with questions on:

  • Using the facilities of the Digital Scholarship Lab
  • Software and tools in the Lab such as ArcGIS, Gephi, Omeka and Voyant Tools
  • Locating data sources
  • Applying digital scholarship research methodologies and visualising data
  • Copyright issues on the use of research data and materials.
On alternate Tuesdays, the Digital Scholarship team holds in-depth discussions and demonstrations with academics and postgraduates who come to the Lab. The service will continue in the second semester, on alternate Tuesdays during term time, starting from 16 January 2018 (Tue):

Date: 16 & 30 January
        13 & 27 February
        13 & 27 March
        10 & 24 April

Time: 3:00 – 5:00 p.m.

Venue: Digital Scholarship Lab, G/F, University Library

CUHK colleagues and students are welcome to visit us for an on-the-spot discussion of your research, or to email your questions to dslab@lib.cuhk.edu.hk!



Thursday, December 28, 2017

Text Data Preparation: a Practice in R using the Sheng Xuanhuai Collection

In this post, we share a way of preparing Chinese text data for computational analysis; we do so in R using sample texts from a historical collection that is currently being digitized by our library - the Sheng Xuanhuai Collection.

The Sheng Xuanhuai collection contains over 70 volumes of correspondence between the entrepreneur Sheng Xuanhuai and other individuals. The texts in the collection have been digitized and are preserved as images and text files. The texts are also coded with labels/variables such as title, sender name, receiver name, date, keywords, and locations mentioned in the texts. The digitization and transcription of the correspondence, which turns these texts into machine-readable form, allows researchers to conduct studies using computational text analysis and other relevant methods.

In the following sections, we demonstrate our way of importing text data into R, preparing the texts for analysis, and exploring and visualizing the texts. Basic knowledge of R will be helpful if you want to try this yourself or apply it to your own data.

Import Text

First we need to read our data (the texts; csv files are used in this demo) into R. To do so, we use the setwd() function to set the working directory, i.e., to let R know where the data are stored on the computer, and then we use read.csv() to load the data file named v36.csv into R.
# set your working directory
setwd('YOUR WORKING DIRECTORY')
# load the data in a spreadsheet to R
v36 <- read.csv('v36.csv', encoding = 'UTF-8', header=TRUE, row.names=1, stringsAsFactors=FALSE)

# view the first two rows of the data
head(v36, 2)
##                lid
## 1 36-001A-1—001A-2
## 2 36-001B-1—001B-2
##                                                                                                                                                                                                                                                                                                           ltext
## 1                                                              ๅฎฎไฟ้ˆžๅบง:ๆ–ฐๆญฃ่ถจ่ณ€ๅ…ฉๆฌก,่ž่—Ž่บฌๅฐšๆœชๅบทๅพฉ,ไธๅฏไปฅ้ขจ,ๆœชๆ•ข้ฉšๆ“พ。ๆ˜จๅพ€ๆ‹œๆœฑ้‚‡ๅ…ธ。ไผŠ้ข็จฑไน…้•ไพ้ˆž้ก”,ๆฌฒๆ–ผๆ—ฅๅ…งๆ™‰่ฌ,ๅ…ผ็ญ”ๆ‹œ็ถ“ๆ–น。้ ƒๆŽฅๅ…ถไพ†ๅ‡ฝ,ๅ‡†ๆ–ผๆ˜Žๆ—ฅไธ‹ๅˆไธ‰ๅฅ้˜ไพ†็ฝฒๆ‹œ่ฌ。ๅฏๅฆๅ…ๅ…ถๆŽฅ่ฆ‹?่ฌน่ซ‹็คบๅพฉ,ไปฅไพฟๅ‡ฝๅ‘Š่ฉฒไฝฟ。่‚…ๆญค。ไธŠๅฉๅ‹›็ฅบ,้ †้ Œๆ–ฐ็ฆง。็ถ“ๆ–น่ฌนไธŠ。ๅˆไนๆ—ฅ。
## 2 ๅ•Ÿ่€…:้ ƒๆ“šๆฐ‘ๆ”ฟ้ƒจๅ…ฌๅ•Ÿ็จฑ:ๆ–ผๆœฌๆœˆๅๅ…ญๆ—ฅไธ‹ๅˆไธ‰้ปž้˜,ๅœจ้ƒจ้–‹็ฌฌไธ€ๆฌก่ก›็”Ÿๆœƒๆœƒ่ญฐ。่ซ‹้ฃญๆดพๅ‡บๅ„ๅ“กๅฑŠๆ™‚่’žๆœƒ็ญ‰ๅ› 。ๆŸปๆœฌ้ƒจ็Ÿฅ้†ซ่€…ๆƒŸๅฑˆ้“ๆฐธ็ง‹( ๅŽปๅนดไนๆœˆๅง‹่กŒๆœญ่ชฟๅˆฐ้ƒจ),ๅŠๆ‰ฟๆ”ฟๅปณ่กŒ่ตฐๅญธ้ƒจไธปไบ‹่ฌๅคฉไฟๅ…ฉๅ“ก。ๅฑˆ้“ๅœจๆœฌ้ƒจๆœˆๆ”ฏ่–ชๆฐดไบŒ็™พๅ…ฉ,่ฌไธปไบ‹ๅ‰‡ๆ–ผๆปฌ็”ฏ้ต่ทฏๆŽ›ๅ,ๆœˆๆ”ฏๆด‹็™พๅ…ƒ。ๆญคๅ…ฉๅ“กๅœจๆœฌ้ƒจ็„กไบ‹,ๅฏๅฆๆดพๅ…ถๅ‰ๅพ€ๆœƒ่ญฐ?่ซ‹้…Œ็คบ。็ถ“ๆ–น่ฌนไธŠ。ๅไบŒๆ—ฅ。
There are two variables in the data: lid and ltext - the correspondence letter ID and the letter text for volume 36 of the collection. There are 245 rows in this dataset, i.e., 245 letters in this volume.
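If you want to confirm the size and structure of the data yourself, a quick check (assuming v36 has been loaded as above) is:
# number of rows (letters) and columns (variables)
dim(v36)
# compact overview of the data frame's structure
str(v36)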

Segmenting Chinese Text

Words and terms are the basic units of many computational text analysis methods; however, Chinese text is not “naturally” divided by whitespace the way words are in languages such as English. A number of methods have been developed to segment Chinese text - here we try the widely used “jieba” segmenter on our sample texts. To use the R version of jieba, install the package by running install.packages('jiebaR') in your R console. Note that you also need to run install.packages() for the other packages used in the following sections if you have not already installed them.
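For reference, all the packages used in this post can be installed in one go; a minimal sketch (assuming none of them are installed yet):
# install (once) the packages used throughout this post
install.packages(c('jiebaR', 'stringr', 'quanteda', 'ggplot2', 'RColorBrewer'))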
# load the "jiebaR" and "stringr" packages/libraries
library(jiebaR)
library(stringr)
Initialize an engine for word segmentation, use all the default settings, and try it on a simple sentence.
# initialize jiebaR worker
cutter <- worker()

# test the worker
cutter["ไปŠๅคฉ็š„ๅคฉๆฐฃ็œŸๅฅฝ"]
## [1] "ไปŠๅคฉ" "็š„"   "ๅคฉๆฐฃ" "็œŸๅฅฝ"
We then define a function called seg_x by which we segment the texts stored in the ltext variable of the data v36 and save them as a new variable of v36 called ltext.seg.
# define the function of segmenting
seg_x <- function(x) {str_c(cutter[x], collapse = " ")} 

# apply the function to each document (row of ltext)
x.out <- sapply(v36$ltext, seg_x, USE.NAMES = FALSE)

# attach the segmented text back to the data frame
v36$ltext.seg <- x.out 

# view the first two rows of the data frame
head(v36, 2)
##                lid
## 1 36-001A-1—001A-2
## 2 36-001B-1—001B-2
##                                                                                                                                                                                                                                                                                                           ltext
## 1                                                              ๅฎฎไฟ้ˆžๅบง:ๆ–ฐๆญฃ่ถจ่ณ€ๅ…ฉๆฌก,่ž่—Ž่บฌๅฐšๆœชๅบทๅพฉ,ไธๅฏไปฅ้ขจ,ๆœชๆ•ข้ฉšๆ“พ。ๆ˜จๅพ€ๆ‹œๆœฑ้‚‡ๅ…ธ。ไผŠ้ข็จฑไน…้•ไพ้ˆž้ก”,ๆฌฒๆ–ผๆ—ฅๅ…งๆ™‰่ฌ,ๅ…ผ็ญ”ๆ‹œ็ถ“ๆ–น。้ ƒๆŽฅๅ…ถไพ†ๅ‡ฝ,ๅ‡†ๆ–ผๆ˜Žๆ—ฅไธ‹ๅˆไธ‰ๅฅ้˜ไพ†็ฝฒๆ‹œ่ฌ。ๅฏๅฆๅ…ๅ…ถๆŽฅ่ฆ‹?่ฌน่ซ‹็คบๅพฉ,ไปฅไพฟๅ‡ฝๅ‘Š่ฉฒไฝฟ。่‚…ๆญค。ไธŠๅฉๅ‹›็ฅบ,้ †้ Œๆ–ฐ็ฆง。็ถ“ๆ–น่ฌนไธŠ。ๅˆไนๆ—ฅ。
## 2 ๅ•Ÿ่€…:้ ƒๆ“šๆฐ‘ๆ”ฟ้ƒจๅ…ฌๅ•Ÿ็จฑ:ๆ–ผๆœฌๆœˆๅๅ…ญๆ—ฅไธ‹ๅˆไธ‰้ปž้˜,ๅœจ้ƒจ้–‹็ฌฌไธ€ๆฌก่ก›็”Ÿๆœƒๆœƒ่ญฐ。่ซ‹้ฃญๆดพๅ‡บๅ„ๅ“กๅฑŠๆ™‚่’žๆœƒ็ญ‰ๅ› 。ๆŸปๆœฌ้ƒจ็Ÿฅ้†ซ่€…ๆƒŸๅฑˆ้“ๆฐธ็ง‹( ๅŽปๅนดไนๆœˆๅง‹่กŒๆœญ่ชฟๅˆฐ้ƒจ),ๅŠๆ‰ฟๆ”ฟๅปณ่กŒ่ตฐๅญธ้ƒจไธปไบ‹่ฌๅคฉไฟๅ…ฉๅ“ก。ๅฑˆ้“ๅœจๆœฌ้ƒจๆœˆๆ”ฏ่–ชๆฐดไบŒ็™พๅ…ฉ,่ฌไธปไบ‹ๅ‰‡ๆ–ผๆปฌ็”ฏ้ต่ทฏๆŽ›ๅ,ๆœˆๆ”ฏๆด‹็™พๅ…ƒ。ๆญคๅ…ฉๅ“กๅœจๆœฌ้ƒจ็„กไบ‹,ๅฏๅฆๆดพๅ…ถๅ‰ๅพ€ๆœƒ่ญฐ?่ซ‹้…Œ็คบ。็ถ“ๆ–น่ฌนไธŠ。ๅไบŒๆ—ฅ。
##                                                                                                                                                                                                                                                                                                                                                 ltext.seg
## 1                                                                               ๅฎฎไฟ ้ˆžๅบง ๆ–ฐ ๆญฃ ่ถจ ่ณ€ ๅ…ฉๆฌก ่ž ่—Ž ่บฌ ๅฐšๆœช ๅบทๅพฉ ไธ ๅฏไปฅ ้ขจ ๆœชๆ•ข ้ฉšๆ“พ ๆ˜จๅพ€ ๆ‹œ ๆœฑ้‚‡ ๅ…ธ ไผŠ้ข ็จฑ ไน…้• ไพ้ˆž ้ก” ๆฌฒ ๆ–ผ ๆ—ฅ ๅ…ง ๆ™‰่ฌ ๅ…ผ ็ญ”ๆ‹œ ็ถ“ๆ–น ้ ƒๆŽฅ ๅ…ถ ไพ†ๅ‡ฝ ๅ‡† ๆ–ผ ๆ˜Žๆ—ฅ ไธ‹ๅˆ ไธ‰ๅฅ ้˜ไพ†็ฝฒ ๆ‹œ่ฌ ๅฏๅฆ ๅ…ๅ…ถ ๆŽฅ่ฆ‹ ่ฌน ่ซ‹็คบ ๅพฉ ไปฅไพฟ ๅ‡ฝๅ‘Š ่ฉฒ ไฝฟ ่‚…ๆญค ไธŠๅฉ ๅ‹› ็ฅบ ้ †้ Œ ๆ–ฐ็ฆง ็ถ“ๆ–น่ฌน ไธŠ ๅˆไน ๆ—ฅ
## 2 ๅ•Ÿ่€… ้ ƒ ๆ“š ๆฐ‘ๆ”ฟ้ƒจ ๅ…ฌ ๅ•Ÿ็จฑ ๆ–ผ ๆœฌๆœˆ ๅๅ…ญๆ—ฅ ไธ‹ๅˆ ไธ‰้ปž้˜ ๅœจ ้ƒจ้–‹ ็ฌฌไธ€ๆฌก ่ก›็”Ÿ ๆœƒ ๆœƒ่ญฐ ่ซ‹ ้ฃญ ๆดพๅ‡บ ๅ„ๅ“กๅฑŠ ๆ™‚ ่’žๆœƒ ็ญ‰ ๅ›  ๆŸป ๆœฌ้ƒจ ็Ÿฅ ้†ซ่€… ๆƒŸๅฑˆ ้“ ๆฐธ็ง‹ ๅŽปๅนด ไนๆœˆ ๅง‹่กŒ ๆœญ ่ชฟๅˆฐ ้ƒจ ๅŠๆ‰ฟๆ”ฟๅปณ ่กŒ่ตฐ ๅญธ้ƒจ ไธปไบ‹่ฌ ๅคฉไฟ ๅ…ฉๅ“ก ๅฑˆ้“ ๅœจ ๆœฌ้ƒจ ๆœˆ ๆ”ฏ ่–ชๆฐด ไบŒ็™พๅ…ฉ ่ฌไธปไบ‹ๅ‰‡ ๆ–ผ ๆปฌ ็”ฏ ้ต่ทฏ ๆŽ›ๅ ๆœˆ ๆ”ฏๆด‹ ็™พๅ…ƒ ๆญค ๅ…ฉๅ“ก ๅœจ ๆœฌ้ƒจ ็„กไบ‹ ๅฏๅฆ ๆดพ ๅ…ถ ๅ‰ๅพ€ ๆœƒ่ญฐ ่ซ‹ ้…Œ ็คบ ็ถ“ๆ–น่ฌน ไธŠ ๅไบŒๆ—ฅ

Create corpus and document-term/feature-matrix

With the texts segmented by whitespace, we can move on to create a corpus and document-term/feature matrices (DTM/DFM), which are often used for further text analysis. Here we use functions of the quanteda package to create the corpus and DFMs, as well as to explore and visualize the texts. quanteda is an R package for managing and analyzing text data; it provides tools for corpus management, natural language processing, document-feature-matrix analysis and more.
# load the library
library(quanteda)
We create a corpus from the texts stored in the ltext.seg variable using the corpus() function. We also tokenize the texts using tokens() and construct a document-feature matrix using dfm(). Note that “fasterword” is specified so that the texts are tokenized by whitespace, preserving the segmentation done with jiebaR. We can then view the most frequent terms/features in this set of texts using topfeatures(). The quanteda package also offers a function, textplot_wordcloud(), with which you can easily plot a wordcloud from a DFM.
# create corpus
lcorpus <- corpus(v36$ltext.seg)
# summarize the lcorpus object
summary(lcorpus, showmeta = TRUE, 5)
## Corpus consisting of 245 documents, showing 5 documents:
## 
##   Text Types Tokens Sentences
##  text1    73     82         1
##  text2    82    101         1
##  text3   121    153         1
##  text4    64     70         1
##  text5   171    298         1
## 
## Source:  /Users/Guest/Desktop/sheng/* on x86_64 by Guest
## Created: Wed Dec 27 15:14:39 2017
## Notes:
# see the text in the 1st document of lcorpus
texts(lcorpus)[1]
##                                                                                                                                                                                                                                                                       text1 
## "ๅฎฎไฟ ้ˆžๅบง ๆ–ฐ ๆญฃ ่ถจ ่ณ€ ๅ…ฉๆฌก ่ž ่—Ž ่บฌ ๅฐšๆœช ๅบทๅพฉ ไธ ๅฏไปฅ ้ขจ ๆœชๆ•ข ้ฉšๆ“พ ๆ˜จๅพ€ ๆ‹œ ๆœฑ้‚‡ ๅ…ธ ไผŠ้ข ็จฑ ไน…้• ไพ้ˆž ้ก” ๆฌฒ ๆ–ผ ๆ—ฅ ๅ…ง ๆ™‰่ฌ ๅ…ผ ็ญ”ๆ‹œ ็ถ“ๆ–น ้ ƒๆŽฅ ๅ…ถ ไพ†ๅ‡ฝ ๅ‡† ๆ–ผ ๆ˜Žๆ—ฅ ไธ‹ๅˆ ไธ‰ๅฅ ้˜ไพ†็ฝฒ ๆ‹œ่ฌ ๅฏๅฆ ๅ…ๅ…ถ ๆŽฅ่ฆ‹ ่ฌน ่ซ‹็คบ ๅพฉ ไปฅไพฟ ๅ‡ฝๅ‘Š ่ฉฒ ไฝฟ ่‚…ๆญค ไธŠๅฉ ๅ‹› ็ฅบ ้ †้ Œ ๆ–ฐ็ฆง ็ถ“ๆ–น่ฌน ไธŠ ๅˆไน ๆ—ฅ"
# create dfm with "terms/features" split by whitespaces,
# i.e., preserving the segmentation done by jiebaR

# tokenize: show the tokens of doc 1, split by whitespaces
tokens(lcorpus, what = "fasterword")[1]
## tokens from 1 document.
## text1 :
##  [1] "ๅฎฎไฟ"   "้ˆžๅบง"   "ๆ–ฐ"     "ๆญฃ"     "่ถจ"     "่ณ€"     "ๅ…ฉๆฌก"  
##  [8] "่ž"     "่—Ž"     "่บฌ"     "ๅฐšๆœช"   "ๅบทๅพฉ"   "ไธ"     "ๅฏไปฅ"  
## [15] "้ขจ"     "ๆœชๆ•ข"   "้ฉšๆ“พ"   "ๆ˜จๅพ€"   "ๆ‹œ"     "ๆœฑ้‚‡"   "ๅ…ธ"    
## [22] "ไผŠ้ข"   "็จฑ"     "ไน…้•"   "ไพ้ˆž"   "้ก”"     "ๆฌฒ"     "ๆ–ผ"    
## [29] "ๆ—ฅ"     "ๅ…ง"     "ๆ™‰่ฌ"   "ๅ…ผ"     "็ญ”ๆ‹œ"   "็ถ“ๆ–น"   "้ ƒๆŽฅ"  
## [36] "ๅ…ถ"     "ไพ†ๅ‡ฝ"   "ๅ‡†"     "ๆ–ผ"     "ๆ˜Žๆ—ฅ"   "ไธ‹ๅˆ"   "ไธ‰ๅฅ"  
## [43] "้˜ไพ†็ฝฒ" "ๆ‹œ่ฌ"   "ๅฏๅฆ"   "ๅ…ๅ…ถ"   "ๆŽฅ่ฆ‹"   "่ฌน"     "่ซ‹็คบ"  
## [50] "ๅพฉ"     "ไปฅไพฟ"   "ๅ‡ฝๅ‘Š"   "่ฉฒ"     "ไฝฟ"     "่‚…ๆญค"   "ไธŠๅฉ"  
## [57] "ๅ‹›"     "็ฅบ"     "้ †้ Œ"   "ๆ–ฐ็ฆง"   "็ถ“ๆ–น่ฌน" "ไธŠ"     "ๅˆไน"  
## [64] "ๆ—ฅ"
# tokenize and create document-feature-matrix
ltokens <- tokens(v36$ltext.seg, what = "fasterword")
ldfm <- dfm(ltokens)

# a dfm with 245 documents and 8052 features
ldfm 
## Document-feature matrix of: 245 documents, 8,052 features (98.9% sparse).
# list top 20 features
topfeatures(ldfm, 20)
##     ไธŠ   ๅฎฎไฟ   ้ˆžๅบง ็ถ“ๆ–น่ฌน     ไน‹   ่‚…้ Œ     ๅพฉ     ่ˆ‡     ๅทฒ   ๅด‡็ถ 
##    216    194    176    171    168    149    147    134    134    130 
##     ๅ‘ˆ     ็‚บ     ๅˆ     ๅœจ     ๆ–ผ     ่ซ‹     ้™„     ไฟ‚     ไปฅ     ๆœ‰ 
##    122    109    107    105     97     87     86     86     85     85
# plot wordcloud
par(family='Kaiti TC') # set Chinese font on Mac; you may not need to set font on Windows
textplot_wordcloud(ldfm, min.freq=30, random.order=FALSE,
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))
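Note that the wordcloud arguments above follow an older quanteda interface. If you are running a more recent version of quanteda, the equivalent call would likely look like the following (a sketch based on the newer argument names, which may differ on your version):
# equivalent wordcloud call for newer versions of quanteda
textplot_wordcloud(ldfm, min_count = 30, random_order = FALSE,
                   color = RColorBrewer::brewer.pal(8, "Dark2"))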



Combine multiple data files

In the lines above we showed how to work with texts stored in a single file, but it is also fairly common to have texts saved in multiple files. Here we demonstrate how to combine more than one text file more efficiently than doing it one by one, along with some more options for segmenting, exploring and visualizing text data.
Let’s start fresh by removing what we have loaded and created in R.
# remove everything in R environment
rm(list=ls())

We first define a function named multcomb that does the following: 1) list the file names of all the data files you would like to combine into one file - in this case, we have two csv files to combine; 2) read in the files one by one and rbind them into one data frame.
Save all the data files in one folder, then pass the path of that folder to the multcomb function to combine all the data files - here we save the combined data frame as mydata.
# define the function of combining multiple files
multcomb <- function(mypath){
  # save all the file names (with path) in an object "filenames"
  filenames <- list.files(path=mypath, full.names=TRUE)
  # import all files and save them as "datalist"
  datalist <- lapply(filenames, function(x){
    read.csv(file=x, encoding='UTF-8', header=TRUE, row.names=1, stringsAsFactors=FALSE)})
  # combine the files (data frames in "datalist")
  Reduce(function(x,y) {rbind(x,y)}, datalist)}
# use the function multcomb to combine the files in the folder;
# before executing the function, save all the csv files in one folder;
# note the folder should not contain any other files
mydata <- multcomb('YOUR PATH OF THE FOLDER')
# view the first two rows of mydata
head(mydata, 2)
##                lid
## 1 36-001A-1—001A-2
## 2 36-001B-1—001B-2
##                                                                                                                                                                                                                                                                                                           ltext
## 1                                                              ๅฎฎไฟ้ˆžๅบง:ๆ–ฐๆญฃ่ถจ่ณ€ๅ…ฉๆฌก,่ž่—Ž่บฌๅฐšๆœชๅบทๅพฉ,ไธๅฏไปฅ้ขจ,ๆœชๆ•ข้ฉšๆ“พ。ๆ˜จๅพ€ๆ‹œๆœฑ้‚‡ๅ…ธ。ไผŠ้ข็จฑไน…้•ไพ้ˆž้ก”,ๆฌฒๆ–ผๆ—ฅๅ…งๆ™‰่ฌ,ๅ…ผ็ญ”ๆ‹œ็ถ“ๆ–น。้ ƒๆŽฅๅ…ถไพ†ๅ‡ฝ,ๅ‡†ๆ–ผๆ˜Žๆ—ฅไธ‹ๅˆไธ‰ๅฅ้˜ไพ†็ฝฒๆ‹œ่ฌ。ๅฏๅฆๅ…ๅ…ถๆŽฅ่ฆ‹?่ฌน่ซ‹็คบๅพฉ,ไปฅไพฟๅ‡ฝๅ‘Š่ฉฒไฝฟ。่‚…ๆญค。ไธŠๅฉๅ‹›็ฅบ,้ †้ Œๆ–ฐ็ฆง。็ถ“ๆ–น่ฌนไธŠ。ๅˆไนๆ—ฅ。
## 2 ๅ•Ÿ่€…:้ ƒๆ“šๆฐ‘ๆ”ฟ้ƒจๅ…ฌๅ•Ÿ็จฑ:ๆ–ผๆœฌๆœˆๅๅ…ญๆ—ฅไธ‹ๅˆไธ‰้ปž้˜,ๅœจ้ƒจ้–‹็ฌฌไธ€ๆฌก่ก›็”Ÿๆœƒๆœƒ่ญฐ。่ซ‹้ฃญๆดพๅ‡บๅ„ๅ“กๅฑŠๆ™‚่’žๆœƒ็ญ‰ๅ› 。ๆŸปๆœฌ้ƒจ็Ÿฅ้†ซ่€…ๆƒŸๅฑˆ้“ๆฐธ็ง‹( ๅŽปๅนดไนๆœˆๅง‹่กŒๆœญ่ชฟๅˆฐ้ƒจ),ๅŠๆ‰ฟๆ”ฟๅปณ่กŒ่ตฐๅญธ้ƒจไธปไบ‹่ฌๅคฉไฟๅ…ฉๅ“ก。ๅฑˆ้“ๅœจๆœฌ้ƒจๆœˆๆ”ฏ่–ชๆฐดไบŒ็™พๅ…ฉ,่ฌไธปไบ‹ๅ‰‡ๆ–ผๆปฌ็”ฏ้ต่ทฏๆŽ›ๅ,ๆœˆๆ”ฏๆด‹็™พๅ…ƒ。ๆญคๅ…ฉๅ“กๅœจๆœฌ้ƒจ็„กไบ‹,ๅฏๅฆๆดพๅ…ถๅ‰ๅพ€ๆœƒ่ญฐ?่ซ‹้…Œ็คบ。็ถ“ๆ–น่ฌนไธŠ。ๅไบŒๆ—ฅ。
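As a quick sanity check after combining, you can look at the number of rows in mydata; it should equal the total number of letters across all the files you combined (335 in our case, matching the corpus summary further below):
# total number of letters after combining the files
nrow(mydata)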

Segmenting: stopwords and dictionary

Segment the words in the combined data file - this time we use a stopword list and a custom dictionary to modify the segmenting “worker”.
# see the stopwords and dictionary
readLines('sheng_stop.txt', encoding = 'UTF-8')
##  [1] ""   "ไน‹" "่ˆ‡" "็‚บ" "ไนŸ" "ๆœ‰" "ๅœจ" "ไปฅ" "ๆ–ผ" "ๅณ" "ไฟ‚"
readLines('sheng_dic.txt', encoding = 'UTF-8')
## [1] ""     "็ถ“ๆ–น" "่ฌนไธŠ" "ๆปฌ็”ฏ" "ไบฌๅฅ‰" "ๅŒฏ่ฑ" "ๅŒฏ็†"
Here we include 10 words in our stopword list - words we think can be safely filtered out - and 6 terms in our custom dictionary, so that each of these terms is kept intact by the segmenter. It is recommended to use Notepad++ to create your custom stopword lists and dictionaries encoded in UTF-8. If you have to use Windows Notepad to create these text files, it may be easier for R to work with them if the first row of each file is left blank - you can see that the first elements in our two text files are empty.
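If you prefer to stay within R, you can also create the two UTF-8 files with writeLines(); this is a minimal sketch in which the terms are our reading of the stopword list and dictionary printed above (replace them with your own):
# write the stopword list and custom dictionary as UTF-8 text files,
# one term per line, with the first line left blank
writeLines(c('', '之', '與', '為', '也', '有', '在', '以', '於', '即', '係'),
           con = file('sheng_stop.txt', encoding = 'UTF-8'))
writeLines(c('', '經方', '謹上', '滬甯', '京奉', '匯豐', '匯理'),
           con = file('sheng_dic.txt', encoding = 'UTF-8'))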
# set up and apply the worker and function for segmenting
cutter <- worker(stop_word = 'sheng_stop.txt', user = 'sheng_dic.txt')
seg_x <- function(x) {str_c(cutter[x], collapse = " ")} 
mydata$ltext.seg <- sapply(mydata$ltext, seg_x, USE.NAMES = FALSE)

# view the first few rows
head(mydata, 2)
##                lid
## 1 36-001A-1—001A-2
## 2 36-001B-1—001B-2
##                                                                                                                                                                                                                                                                                                           ltext
## 1                                                              ๅฎฎไฟ้ˆžๅบง:ๆ–ฐๆญฃ่ถจ่ณ€ๅ…ฉๆฌก,่ž่—Ž่บฌๅฐšๆœชๅบทๅพฉ,ไธๅฏไปฅ้ขจ,ๆœชๆ•ข้ฉšๆ“พ。ๆ˜จๅพ€ๆ‹œๆœฑ้‚‡ๅ…ธ。ไผŠ้ข็จฑไน…้•ไพ้ˆž้ก”,ๆฌฒๆ–ผๆ—ฅๅ…งๆ™‰่ฌ,ๅ…ผ็ญ”ๆ‹œ็ถ“ๆ–น。้ ƒๆŽฅๅ…ถไพ†ๅ‡ฝ,ๅ‡†ๆ–ผๆ˜Žๆ—ฅไธ‹ๅˆไธ‰ๅฅ้˜ไพ†็ฝฒๆ‹œ่ฌ。ๅฏๅฆๅ…ๅ…ถๆŽฅ่ฆ‹?่ฌน่ซ‹็คบๅพฉ,ไปฅไพฟๅ‡ฝๅ‘Š่ฉฒไฝฟ。่‚…ๆญค。ไธŠๅฉๅ‹›็ฅบ,้ †้ Œๆ–ฐ็ฆง。็ถ“ๆ–น่ฌนไธŠ。ๅˆไนๆ—ฅ。
## 2 ๅ•Ÿ่€…:้ ƒๆ“šๆฐ‘ๆ”ฟ้ƒจๅ…ฌๅ•Ÿ็จฑ:ๆ–ผๆœฌๆœˆๅๅ…ญๆ—ฅไธ‹ๅˆไธ‰้ปž้˜,ๅœจ้ƒจ้–‹็ฌฌไธ€ๆฌก่ก›็”Ÿๆœƒๆœƒ่ญฐ。่ซ‹้ฃญๆดพๅ‡บๅ„ๅ“กๅฑŠๆ™‚่’žๆœƒ็ญ‰ๅ› 。ๆŸปๆœฌ้ƒจ็Ÿฅ้†ซ่€…ๆƒŸๅฑˆ้“ๆฐธ็ง‹( ๅŽปๅนดไนๆœˆๅง‹่กŒๆœญ่ชฟๅˆฐ้ƒจ),ๅŠๆ‰ฟๆ”ฟๅปณ่กŒ่ตฐๅญธ้ƒจไธปไบ‹่ฌๅคฉไฟๅ…ฉๅ“ก。ๅฑˆ้“ๅœจๆœฌ้ƒจๆœˆๆ”ฏ่–ชๆฐดไบŒ็™พๅ…ฉ,่ฌไธปไบ‹ๅ‰‡ๆ–ผๆปฌ็”ฏ้ต่ทฏๆŽ›ๅ,ๆœˆๆ”ฏๆด‹็™พๅ…ƒ。ๆญคๅ…ฉๅ“กๅœจๆœฌ้ƒจ็„กไบ‹,ๅฏๅฆๆดพๅ…ถๅ‰ๅพ€ๆœƒ่ญฐ?่ซ‹้…Œ็คบ。็ถ“ๆ–น่ฌนไธŠ。ๅไบŒๆ—ฅ。
##                                                                                                                                                                                                                                                                                                                                 ltext.seg
## 1                                                                     ๅฎฎไฟ ้ˆžๅบง ๆ–ฐ ๆญฃ ่ถจ ่ณ€ ๅ…ฉๆฌก ่ž ่—Ž ่บฌ ๅฐšๆœช ๅบทๅพฉ ไธ ๅฏไปฅ ้ขจ ๆœชๆ•ข ้ฉšๆ“พ ๆ˜จๅพ€ ๆ‹œ ๆœฑ้‚‡ ๅ…ธ ไผŠ้ข ็จฑ ไน…้• ไพ้ˆž ้ก” ๆฌฒ ๆ—ฅ ๅ…ง ๆ™‰่ฌ ๅ…ผ ็ญ”ๆ‹œ ็ถ“ๆ–น ้ ƒๆŽฅ ๅ…ถ ไพ†ๅ‡ฝ ๅ‡† ๆ˜Žๆ—ฅ ไธ‹ๅˆ ไธ‰ๅฅ ้˜ไพ†็ฝฒ ๆ‹œ่ฌ ๅฏๅฆ ๅ…ๅ…ถ ๆŽฅ่ฆ‹ ่ฌน ่ซ‹็คบ ๅพฉ ไปฅไพฟ ๅ‡ฝๅ‘Š ่ฉฒ ไฝฟ ่‚…ๆญค ไธŠๅฉ ๅ‹› ็ฅบ ้ †้ Œ ๆ–ฐ็ฆง ็ถ“ๆ–น ่ฌนไธŠ ๅˆไน ๆ—ฅ
## 2 ๅ•Ÿ่€… ้ ƒ ๆ“š ๆฐ‘ๆ”ฟ้ƒจ ๅ…ฌ ๅ•Ÿ็จฑ ๆœฌๆœˆ ๅๅ…ญๆ—ฅ ไธ‹ๅˆ ไธ‰้ปž้˜ ้ƒจ้–‹ ็ฌฌไธ€ๆฌก ่ก›็”Ÿ ๆœƒ ๆœƒ่ญฐ ่ซ‹ ้ฃญ ๆดพๅ‡บ ๅ„ๅ“กๅฑŠ ๆ™‚ ่’žๆœƒ ็ญ‰ ๅ›  ๆŸป ๆœฌ้ƒจ ็Ÿฅ ้†ซ่€… ๆƒŸๅฑˆ ้“ ๆฐธ็ง‹ ๅŽปๅนด ไนๆœˆ ๅง‹่กŒ ๆœญ ่ชฟๅˆฐ ้ƒจ ๅŠๆ‰ฟๆ”ฟๅปณ ่กŒ่ตฐ ๅญธ้ƒจ ไธปไบ‹่ฌ ๅคฉไฟ ๅ…ฉๅ“ก ๅฑˆ้“ ๆœฌ้ƒจ ๆœˆ ๆ”ฏ ่–ชๆฐด ไบŒ็™พๅ…ฉ ่ฌไธปไบ‹ๅ‰‡ ๆปฌ็”ฏ ้ต่ทฏ ๆŽ›ๅ ๆœˆ ๆ”ฏๆด‹ ็™พๅ…ƒ ๆญค ๅ…ฉๅ“ก ๆœฌ้ƒจ ็„กไบ‹ ๅฏๅฆ ๆดพ ๅ…ถ ๅ‰ๅพ€ ๆœƒ่ญฐ ่ซ‹ ้…Œ ็คบ ็ถ“ๆ–น ่ฌนไธŠ ๅไบŒๆ—ฅ

Create corpus, DFM and feature frequency table

Now we have segmented texts saved in the variable ltext.seg of mydata. We then use functions corpus() and dfm() to create corpus and DFM from ltext.seg and save them as mycorpus and mydfm.
# create and examine corpus 
mycorpus <- corpus(mydata$ltext.seg)
summary(mycorpus, showmeta = TRUE, 5)
## Corpus consisting of 335 documents, showing 5 documents:
## 
##   Text Types Tokens Sentences
##  text1    73     79         1
##  text2    79     95         1
##  text3   119    148         1
##  text4    62     68         1
##  text5   165    277         1
## 
## Source:  /Users/Guest/Desktop/sheng/* on x86_64 by Guest
## Created: Wed Dec 27 15:14:41 2017
## Notes:
# view texts in the first document of the corpus
texts(mycorpus)[1]
##                                                                                                                                                                                                                                                                 text1 
## "ๅฎฎไฟ ้ˆžๅบง ๆ–ฐ ๆญฃ ่ถจ ่ณ€ ๅ…ฉๆฌก ่ž ่—Ž ่บฌ ๅฐšๆœช ๅบทๅพฉ ไธ ๅฏไปฅ ้ขจ ๆœชๆ•ข ้ฉšๆ“พ ๆ˜จๅพ€ ๆ‹œ ๆœฑ้‚‡ ๅ…ธ ไผŠ้ข ็จฑ ไน…้• ไพ้ˆž ้ก” ๆฌฒ ๆ—ฅ ๅ…ง ๆ™‰่ฌ ๅ…ผ ็ญ”ๆ‹œ ็ถ“ๆ–น ้ ƒๆŽฅ ๅ…ถ ไพ†ๅ‡ฝ ๅ‡† ๆ˜Žๆ—ฅ ไธ‹ๅˆ ไธ‰ๅฅ ้˜ไพ†็ฝฒ ๆ‹œ่ฌ ๅฏๅฆ ๅ…ๅ…ถ ๆŽฅ่ฆ‹ ่ฌน ่ซ‹็คบ ๅพฉ ไปฅไพฟ ๅ‡ฝๅ‘Š ่ฉฒ ไฝฟ ่‚…ๆญค ไธŠๅฉ ๅ‹› ็ฅบ ้ †้ Œ ๆ–ฐ็ฆง ็ถ“ๆ–น ่ฌนไธŠ ๅˆไน ๆ—ฅ"
# create and examine DFM
mydfm <- dfm(tokens(mydata$ltext.seg, what = "fasterword"))
mydfm 
## Document-feature matrix of: 335 documents, 15,126 features (99.2% sparse).
# top 20 features
topfeatures(mydfm, 20)
##   ไบฆ ๅฎฎไฟ   ๅทฒ ็ถ“ๆ–น ้ˆžๅบง ่ฌนไธŠ   ๅˆ   ๅพฉ   ็จŸ   ไธ   ่€… ่‚…้ Œ   ๅ‘ˆ   ๅ‡   ๅ…ถ 
##  271  265  256  219  208  196  195  187  177  161  160  149  145  143  134 
##   ่ซ‹   ๅ† ๅด‡็ถ   ่€Œ   ๅฏณ 
##  131  131  130  125  123
Note that in this DFM, the terms included in our stopword list are gone, and the terms in our dictionary are segmented exactly as listed in the dictionary file.
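You can verify this by checking whether a given term is still among the DFM's features using featnames(); a quick check (here 之 is the first word of our stopword list and 謹上 is one of our dictionary terms):
# a removed stopword should no longer appear as a feature
'之' %in% featnames(mydfm)    # expect FALSE
# a dictionary term should appear as a single feature
'謹上' %in% featnames(mydfm)  # expect TRUE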
We can then generate a data frame giving the frequency of each feature using textstat_frequency().
# tabulate feature frequency
dfmtab <- textstat_frequency(mydfm)
head(dfmtab)
##   feature frequency rank docfreq
## 1      ไบฆ       271    1     106
## 2    ๅฎฎไฟ       265    2     236
## 3      ๅทฒ       256    3     141
## 4    ็ถ“ๆ–น       219    4     211
## 5    ้ˆžๅบง       208    5     208
## 6    ่ฌนไธŠ       196    6     196
Sometimes you only care about, say, longer features/terms; use dfm_select() to keep those that meet certain conditions, e.g., terms consisting of two or more characters.
# select 2+ word features 
mydfm2 <- dfm_select(mydfm, min_nchar = 2) 
topfeatures(mydfm2, 20)
##   ๅฎฎไฟ   ็ถ“ๆ–น   ้ˆžๅบง   ่ฌนไธŠ   ่‚…้ Œ   ๅด‡็ถ   ไธ็Ÿฅ   ๅˆๅŒ   ๅ…ฌๅธ   ไธ่ƒฝ 
##    265    219    208    196    149    130     97     90     85     80 
##   ้–ฃไธ‹   ไธ€ๅˆ‡   ่ฌนๆ‚‰ ๅค–ๅ‹™้ƒจ   ๆœฌ้ƒจ   ๅฐšๆœช   ๆœฌๆ—ฅ   ๅคงไบบ   ๅฆ‚ไฝ•   ๅฏไปฅ 
##     73     59     58     57     54     53     53     53     53     51
Plot a wordcloud from the DFM containing these longer terms. Here we select terms appearing 5 or more times and set 200 as the maximum number of terms to include.
# plot wordcloud
par(family='Kaiti TC')
textplot_wordcloud(mydfm2, min.freq=5, random.order=FALSE, max.words = 200,
                   rot.per = .25, scale = c(2.8, .5),
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

Explore more about the data

The function textstat_frequency() can tabulate all feature frequencies, as we did above - we can also limit the number of features tabulated and plot these selected features using ggplot() from the ggplot2 package.
# tabulate the top 10 features
textstat_frequency(mydfm2, n=10)
##    feature frequency rank docfreq
## 1     ๅฎฎไฟ       265    1     236
## 2     ็ถ“ๆ–น       219    2     211
## 3     ้ˆžๅบง       208    3     208
## 4     ่ฌนไธŠ       196    4     196
## 5     ่‚…้ Œ       149    5     149
## 6     ๅด‡็ถ       130    6     130
## 7     ไธ็Ÿฅ        97    7      49
## 8     ๅˆๅŒ        90    8      47
## 9     ๅ…ฌๅธ        85    9      45
## 10    ไธ่ƒฝ        80   10      56
# plot freq. by rank of the most frequent 50 features
library(ggplot2)
theme_set(theme_minimal())
textstat_frequency(mydfm2, n = 50) %>% 
  ggplot(aes(x = rank, y = frequency)) +
  geom_point() +
  labs(x = "Frequency rank", y = "Term frequency")

We can also use dfm_weight() to create a DFM of weighted term frequencies, for instance a DFM of relative feature/term frequencies, i.e., each feature's count as a proportion of the total feature counts.
# create dfm with relative term frequencies
dfmpct <- dfm_weight(mydfm2, type = "relfreq") 

# plot relative term frequencies
textstat_frequency(dfmpct, n = 10) %>% 
  ggplot(aes(x = reorder(feature, -rank), y = frequency)) +
  geom_bar(stat = "identity") + coord_flip() + 
  labs(x = "", y = "Relative Term Frequency") +
  theme(text = element_text(family = 'STKaiti'))
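Note that type = 'relfreq' follows an older dfm_weight() interface; on more recent versions of quanteda, the equivalent call (to the best of our knowledge) is:
# relative term frequencies on newer versions of quanteda
dfmpct <- dfm_weight(mydfm2, scheme = "prop")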





In Sum...


In this blog post, we have shared our way of preparing Chinese texts for computational text analysis, mainly using two R packages - jiebaR and quanteda. We hope this will help our users, whether you want to use our Sheng Xuanhuai correspondence collection or apply this way of processing Chinese texts to your own data. The two packages provide many more functions than we can introduce in a single post; to learn more, start from their documentation at https://qinwenfeng.com/jiebaR/ and http://docs.quanteda.io/index.html.

Wednesday, December 27, 2017

Writing and Local Identity: Literary Women of the Pearl River Delta of Guangdong in the 18th and 19th Centuries

The Library is pleased to organise the following public talk by Prof. Grace S. Fong of McGill University. All are welcome to join!

Writing and Local Identity: Literary Women of the Pearl River Delta of Guangdong in the 18th and 19th Centuries

Speaker
Prof. Grace Fong
Professor of Chinese Literature, Department of East Asian Studies, McGill University
Visiting Professor, School of Chinese, The University of Hong Kong
Abstract of the talk
On the southern margins of the Qing empire, Guangdong has been seen as an evolving site of regional culture, education, and commerce, particularly in the Pearl River Delta counties surrounding the provincial capital Guangzhou in the eighteenth and nineteenth centuries. While anthropologists have uncovered a working-class women’s culture in this region, educated women from the counties of Panyu, Shunde, Xinhui, and Zhongshan were a significant but as yet understudied part of the elite culture.
Using data on Guangdong women in the Ming Qing Women’s Writings database and digital archive (http://digital.library.mcgill.ca/mingqing/) as primary sources for comparison, this paper aims to explore the relationship of these women’s writing to the construction of regional culture and identity before the Western powers had a significant impact through trade and missionary efforts beyond the Canton trading zone. To what extent were literary women part of a regional culture in Guangdong in the eighteenth and nineteenth centuries? Using both biographical and textual data and examining the paratexts, poetic themes and topics, and regional and social networks contained in fifteen individual works by Qing women writers from Guangdong, I will examine how a regional or local culture might have been constructed in these textual productions, and ask how the components of this regional women’s culture – the hopes and desires, social and cultural activities, reflections and ambitions of these women in the Pearl River Delta – show differences from or similarities to those of their contemporaries, the well-known elite educated women of the cultured Yangzi River Delta.
Date: 10 Jan, 2018 (Wed)
Time: 4:00 – 5:30 p.m.
Venue: Digital Scholarship Lab, G/F, University Library

Friday, June 2, 2017

The work behind a video digital collection in CUHK Digital Repository - A Grain of Sand: Poems from Hong Kong

What do you expect from a video digital collection? Fast streaming? High-definition pictures? Clear voices? Whenever our Digital Services Team has the opportunity to handle a digital collection with audio and video items, we investigate various ways to exploit the full potential of the collection. For instance, “Chinese Women and Hong Kong Christianity: An Oral History Archive” includes audio and video clips, images, and materials in other formats; they are all put together as a collection so that users can easily trace through all the materials relating to an interviewee’s oral history. Another example is the “United College General Education Senior Seminar Papers Database”, where, again, all materials relating to a paper are grouped together as a collection.



In a recently launched video digital collection, “A Grain of Sand: Poems from Hong Kong”, we experimented with a few features available in our repository system (Islandora) to give users a new experience of listening to English poems.
 
This video digital collection is a collaborative project between the Library and the Department of English. It consists of 33 video recordings of 4 poets who have connections with The Chinese University of Hong Kong (CUHK) in different capacities: Louise Ho, Andrew Parkin, Eddie Tay and Kit Fan. They recite their own poems about Hong Kong, and CUHK in particular, against a spectacular backdrop of landmarks in Hong Kong and on the CUHK campus.
 

In order to let users visualize the poems not only in beautiful pictures but also in words, we used a new Islandora module called the Oral Histories Solution Pack to display the subtitles of each poem in the video and in a time-coded transcript viewer underneath it. This solution pack was developed by the University of Toronto Scarborough Library, and the code is shared freely on GitHub with the Islandora community, so that Islandorians, including CUHK Library, can contribute their enhancements back to the community. This is a key benefit of using open source software. For more details of the Oral Histories Solution Pack, please refer to:


 
To prepare the subtitles for display, the team evaluated different subtitle editing tools, including those listed on the GitHub site mentioned above. In the end, an open source video subtitle editor called Subtitle Edit (http://www.nikse.dk/SubtitleEdit/) was adopted, as it is relatively easy to use and suited to our project's size and requirements.
 
The digital texts were provided by the English Department. We used Subtitle Edit to add time codes for all 33 video clips. The following is a screenshot of the application.

The next step is to use the application to export the subtitles into WebVTT files that meet the Solution Pack's requirements. Subtitle Edit is able to export the time-coded subtitles into various formats.
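For readers unfamiliar with the format, a WebVTT file is simply a plain-text list of time-coded cues; a minimal illustrative example (the timestamps and text below are made up, not taken from the collection) looks like this:
WEBVTT

00:00:01.000 --> 00:00:05.000
First line of the poem appears here

00:00:05.500 --> 00:00:09.000
Second line of the poem appears here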
 
It also offers functions similar to the “Templating Export” in OpenRefine (http://openrefine.org/), which we also make use of when preparing MODS files.


With these handy tools and the effort of the whole team, we were able to deliver the project within just a few weeks. We are now proud to present this new collection to the world so that users can appreciate the value of these English poems.
We value collaboration with Faculty. Our services can be found here. Please let us know if there are any opportunities to create interesting digital collections within our CUHK community.

Thursday, May 18, 2017

The first CUHK Library Digital Scholarship Symposium “Exploring Digital Scholarship Research at CUHK and Beyond” held successfully


On 31 March 2017, the CUHK Library organized the first Digital Scholarship Symposium, with the theme “Exploring Digital Scholarship Research at CUHK and Beyond”, held at the Digital Scholarship Lab on the G/F of the University Library. Six speakers from different disciplines presented their latest digital scholarship projects, attracting over 60 researchers, research students and librarians.

As a first-anniversary event of the Digital Scholarship Lab, the Symposium aimed to provide an avenue for all scholars interested in and conducting research in digital scholarship to get together to share their research, to spark further research in this area, and to enhance the partnership between the CUHK Library and the Faculty in conducting and supporting digital scholarship research.




The Symposium was officiated by Prof. Fanny M.C. CHEUNG, Pro-Vice-Chancellor/Vice-President and Ms. Louise JONES, University Librarian of CUHK. There were six presentations on network analysis, data visualization, GIS and big data analysis. They were:
  1. Prof. HUANG Bo from Department of Geography and Resource Management, CUHK: GIS and Big Data for Urban Applications
  2. Prof. LAI Chi-Tim from Department of Cultural and Religious Studies, CUHK: Guangzhou Daoist and Popular Temples Studies and the Development of Daoist Digital Museum
  3. Ms. Kitty Siu from Library, CUHK: A Collaborative Project in Opening Research Data: Archaeological Sites Mapping in China with GIS
  4. Dr. TSUI Lik-hang from China Biographical Database Project (CBDB), Harvard University: A Cyberinfrastructure for Studying Chinese History: A Proposal Based on the Experience of the China Biographical Database Project
  5. Prof. Angela WU from School of Journalism and Communication, CUHK: Re-presenting Web Use as Networks
  6. Prof. Michelle YE from Department of Translation, CUHK: The Social Network of an Early Republican Literary Magazine: a Visualization with Gephi


The post-symposium workshop in the afternoon, led by Dr. Tsui, introduced the use of the China Biographical Database (CBDB) and its incorporation with other tools such as MARKUS for China studies. The digital tools presented in the workshop sparked very lively discussion among participants. More photos of the symposium can be found here.


The Library is very grateful for the enthusiastic support from the Faculty and the participants. The Library will continue to support research activities across the entire research life cycle by leveraging the latest digital technologies.

Friday, March 17, 2017

1st anniversary of the CUHK Digital Repository ๐Ÿ˜€๐ŸŽ‚๐ŸŽˆ

Today is the 1st anniversary of the CUHK Digital Repository (http://repository.lib.cuhk.edu.hk). Over the past year, we have ingested over 1 million objects into the system. The objects include those items migrated from our legacy digital collections and also new items for our newly built digital collections.
Meanwhile, this month the repository platform also recorded more than 1 million accesses to our objects.

We will highlight some of our collections and their items in the next blog posts.