Topic Map v2

eParticipation 2.0 Topic Map
Collection
Pre-processing
Analysis: Social computing
• Sentiment analysis
• Classification
• Topic discovery
• Information extraction
Analysis: Multi-dimensional analysis
• Temporal
• Location
• Trends
• Manual analysis
• Comparison of different approaches
• Archival
Collection
Because research is now data-driven, a databank of Philippine language resources is needed. To
address this need, interested students will develop tools and techniques that aid the automatic
collection and categorization of texts. This includes crawling the web for language resources and
automatically storing and organizing them by language. Related work includes clustering the
languages and annotating each collected text.
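As one possible starting point for organizing collected texts by language, the sketch below compares
character trigram profiles using cosine similarity. It is a minimal illustration, not the method of
any listed reference: the function names, the SAMPLES snippets, and the choice of cosine similarity
are all assumptions, and the sample texts are far too small for real use.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    normalized = " ".join(text.lower().split())
    padded = " " + normalized + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify_language(text, profiles):
    """Return the language whose profile is most similar to the text."""
    query = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(query, profiles[lang]))


# Toy training snippets (illustrative assumptions, not project data).
SAMPLES = {
    "tagalog": "ang mga bata ay naglalaro sa labas ng bahay",
    "english": "the children are playing outside the house",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}
```

A crawler could call identify_language on each downloaded page and file the text under the returned
label. Closely related languages produce similar profiles, so pairwise similarity scores could also
feed the language clustering mentioned above.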
Possible Resource Person(s):
• Mr. Nathaniel Oco
• Prof. Rachel Edita Roxas
Target Venue(s):
• Local: PCSC, NNLPRS
• International (SCOPUS): TENCON, IALP
Starting Reference(s):
Authors Oco, Nathaniel; Syliongka, Leif Romeritch; Allman, Tod; Roxas, Rachel Edita
Title Resources for Philippine Languages: Collection, Annotation, and Modeling
Publication The 30th Pacific Asia Conference on Language, Information and Computation
Pages 433-438
Year 2016
Publisher Institute for the Study of Language and Information at Kyung Hee University

Authors Dita, Shirley N; Roxas, Rachel EO; Inventado, Paul
Title Building Online Corpora of Philippine Languages
Publication The 23rd Pacific Asia Conference on Language, Information and Computation
Pages 646-653
Year 2009
Publisher City University of Hong Kong

Authors Oco, Nathaniel; Ilao, Joel; Roxas, Rachel Edita; Syliongka, Leif Romeritch
Title Measuring language similarity using trigrams: Limitations of language identification
Publication 2013 International Conference on Recent Trends in Information Technology (ICRTIT)
Pages 478-481
Year 2013
Publisher IEEE
Pre-Processing
Textual data is the main resource for many software programs, so properly representing and using
high-quality data is an integral consideration. Achieving that quality requires text pre-processors:
subprograms that modify raw data to fit a given system or to provide it with new data features. Many
pre-processors are available, but no compilation of tools exists that is lightweight and flexible
across different kinds of systems or language domains. Students are to develop pre-processing tools
for textual data.
These may consist of the following:
• Tokenization
  o Cleaning
  o URLs
  o Special Characters
  o Length Limit
  o Duplicates
  o Stop words
• True-casing (e.g. john -> John)
• Feature Extraction (Affixes)
• Stemming (Root words)
• Text Transformation
  o Standard text normalization (e.g. resume -> résumé, canonicalization)
  o Unicode normalization (e.g. ñ -> U+00F1, Å -> U+00C5)
  o Shortcut text normalization (e.g. LOL -> Laughing Out Loud, gr8 -> great)
  o Spell / grammar check
  o Translation
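Several of the steps listed above can be chained into one lightweight function. The sketch below is
only an illustration under stated assumptions: SHORTCUTS and STOP_WORDS are tiny hypothetical lookup
tables, tokenization is plain whitespace splitting, and true-casing, stemming, spell checking, and
translation are left out.

```python
import re
import unicodedata

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
SPECIAL_CHARS = re.compile(r"[^\w\s]")

# Hypothetical lookup tables; a real tool would load curated resources.
SHORTCUTS = {"lol": "laughing out loud", "gr8": "great"}
STOP_WORDS = {"the", "a", "is", "ang", "ng", "sa"}


def preprocess(text, max_tokens=50):
    """Normalize, clean, and tokenize one text into a list of tokens."""
    text = unicodedata.normalize("NFC", text)    # Unicode normalization
    text = URL_PATTERN.sub(" ", text)            # remove URLs
    text = SPECIAL_CHARS.sub(" ", text.lower())  # remove special characters
    tokens = []
    for token in text.split():                   # whitespace tokenization
        # Expand shortcut text; an expansion may yield several tokens.
        tokens.extend(SHORTCUTS.get(token, token).split())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens[:max_tokens]                   # length limit


def drop_duplicates(texts):
    """Keep the first text of each group that pre-processes identically."""
    seen = set()
    unique = []
    for text in texts:
        key = tuple(preprocess(text))
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

For example, preprocess("LOL the food is gr8!!!") yields the tokens laughing, out, loud, food, great.
Normalizing to NFC before stripping special characters matters here: a decomposed ñ (n followed by a
combining tilde) would otherwise lose its tilde during cleaning.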
Possible Resource Person(s):
• Mr. Nicco Nocon
• Mr. Matthew Phillip Go
Target Venue(s):
• Local: PCSC, NNLPRS
• International (SCOPUS): TENCON, IALP, PACLIC
Starting Reference(s):
Authors Nocon, Nicco; Borra, Allan
Title SMTPOST: Using Statistical Machine Translation Approach in Filipino Part-of-Speech Tagging
Publication Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation
Pages 391-396
Year 2016
Publisher Institute for the Study of Language and Information at Kyung Hee University

Authors Nocon, Nicco; Oco, Nathaniel; Ilao, Joel; Roxas, Rachel Edita
Title Philippine component of the network-based ASEAN language translation public service
Publication 2014 International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM)
Year 2014
Publisher IEEE

Authors Oco, Nathaniel; Roxas, Rachel Edita
Title Pattern matching refinements to dictionary-based code-switching point detection
Publication Proceedings of the 26th Pacific Asia Conference on Language, Information, and Computation
Pages 07-10
Year 2012

Authors Oco, Nathaniel; Borra, Allan
Title A grammar checker for Tagalog using LanguageTool
Publication Asian Language Resources collocated with IJCNLP 2011
Pages 2-9
Year 2011
Sentiment Analysis
Data collection • Tweets; Game chat; News articles; Facebook posts; Blogs
Data filtering • Language identification; Geolocation
Data annotation • POS tagging; Code-switching detection; Named entity recognition
Data processing • Text normalization; Grammar checking; Machine translation; Language modeling
Data analysis • Classification; WordNet
Result Evaluation • Accuracy; F-score; Kappa Statistics
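The evaluation measures named in the table above can be computed directly from gold and predicted
labels. The sketch below assumes that "Kappa Statistics" refers to Cohen's kappa between two label
sequences and that the F-score is the balanced F1 for one chosen class; the function names are
illustrative.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f_score(gold, predicted, positive):
    """Balanced F1 score for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cohen_kappa(labels_a, labels_b):
    """Agreement between two label sequences, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1 - expected)
```

Kappa is useful alongside accuracy because it discounts agreement that would occur by chance, which
matters when one sentiment class dominates the data.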
Possible Resource Person(s):
• Mr. Alron Lam
• Mr. Leif Romeritch Syliongka
Target Venue(s):
• Local: PCSC, NNLPRS
• Journal (SCOPUS): Philippine Political Science Journal, ACM Transactions on Asian Language Information Processing, Literary and Linguistic Computing
Starting Reference(s):
Authors Lam, Alron Jan
Title Improving Twitter Community Detection through Contextual Sentiment Analysis of Tweets
Publication 54th Annual Meeting of the Association for Computational Linguistics
Pages 30-36
Year 2016
Publisher ACL

Authors Regalado, Ralph Vincent J; Chua, Jenina L; Co, Justin L; Tiam-Lee, Thomas James Z
Title Subjectivity Classification of Filipino Text with Features Based on Term Frequency-Inverse Document Frequency
Publication 2013 International Conference on Asian Language Processing (IALP)
Pages 113-116
Year 2013
Publisher IEEE

Authors Regalado, Ralph Vincent J; Cheng, Charibeth K
Title Feature-Based Subjectivity Classification of Filipino Text
Pages 57-60
Year 2012
Publisher IEEE
Classification
Data filtering • Language identification; Geolocation
Data annotation • POS tagging; Code-switching detection; Named entity recognition
Data processing • Text normalization; Grammar checking;
Data analysis • Probabilistic classifiers; Decision trees; Convolutional Neural Nets
Result Evaluation • Accuracy; F-score; Kappa Statistics
Possible Topic: Classification of Typhoon-related Tweets
Twitter has been found to be a potentially useful source of information in times of disaster. As a
microblogging platform, it tends to be used for near-real-time updates. In the context of disasters
specifically, some users report damage, request assistance, look for missing persons, and so on.
These posts could be useful to entities that conduct disaster response, such as government agencies.
However, the sheer volume of tweets makes it hard for people to scour them manually; the task is
sometimes likened to finding a needle in a haystack. Automatic classification of relevant tweets is
therefore useful in these situations. Students interested in this area will experiment with different
features (such as word embeddings) and classification algorithms to achieve this goal.
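As a baseline for such experiments, a probabilistic classifier over bag-of-words features is a
reasonable first step. The sketch below is a small multinomial Naive Bayes with add-one smoothing; it
is an assumption-laden illustration, not the project's classifier. The class name, the "relevant" and
"irrelevant" labels, and the TRAIN messages are all invented for the example and are not real tweets.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # class prior
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word in self.vocabulary:  # skip words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented messages for illustration only.
TRAIN = [
    ("need rescue family trapped on roof", "relevant"),
    ("bridge damaged road closed by flood", "relevant"),
    ("good morning enjoying my coffee", "irrelevant"),
    ("watching a movie with friends tonight", "irrelevant"),
]
model = NaiveBayes().fit([t for t, _ in TRAIN], [l for _, l in TRAIN])
```

Swapping the token counts for other features, such as averaged word embeddings, or replacing the
classifier with a decision tree or neural network, keeps the same fit and predict shape and makes the
approaches easy to compare.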
Possible Resource Person(s) / Mentor(s):
• Mr. Alron Lam
Target Venue(s):
• Local: PCSC, NNLPRS
Starting Reference(s):
10NNLPRS Proceedings and 11NNLPRS Proceedings (https://sites.google.com/site/11nnlprs/past-symposia)
Topic Discovery
Data filtering • Language identification; Geolocation; Sampling
Data annotation • Coding
Data processing • Text normalization; Grammar checking;
Data analysis • Unsupervised clustering; Topic modeling
Result Evaluation • Silhouette Index, Purity Index
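Of the two evaluation measures above, the purity index is the simpler to compute when manually coded
labels are available: each cluster is credited with its most frequent gold label, and the credits are
divided by the number of items. The sketch below assumes this standard definition; the function name
and argument layout are illustrative.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, gold_labels):
    """Share of items that carry the majority gold label of their cluster."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, gold_labels):
        members[cluster].append(label)
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(gold_labels)
```

Purity rises toward 1.0 as the number of clusters grows, so it is best compared across runs that use
the same number of clusters or reported together with the silhouette index.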
Possible Resource Person(s) / Mentor(s):
• Mr. Nathaniel Oco
Target Venue(s):
• Local: PCSC, NNLPRS
Starting Reference(s):
Authors Ligutom III, Cerino; Orio, Jay Vincent; Ramacho, Dyannah Alexa Marie; Montenegro, Chuchi; Roxas, Rachel Edita; Oco, Nathaniel
Title Using Topic Modelling to make sense of typhoon-related tweets
Pages 362-365
Year 2017
Publisher IEEE

Authors Soriano, Cheryll Ruth; Roldan, Ma Divina Gracia; Cheng, Charibeth; Oco, Nathaniel
Title Social media and civic engagement during calamities: the case of Twitter use during typhoon Yolanda
Publication Philippine Political Science Journal
Volume 37
Number 1
Pages 06-25
Year 2016
Publisher Routledge

Authors Syliongka, Leif Romeritch; Oco, Nathaniel; Lam, Alron Jan; Soriano, Cheryll Ruth; Roldan, Ma Divina Gracia; Magno, Francisco; Cheng, Charibeth
Title Combining Automatic and Manual Approaches: Towards a Framework for Discovering Themes in Disaster-related Tweets
Publication Proceedings of the 24th International Conference on World Wide Web
Pages 1239-1244
Year 2015
Publisher ACM
Information Extraction
Data collection • Tweets; News articles; Facebook posts; Blogs
Data filtering • Language identification; Geolocation
Data annotation • POS tagging; Named entity recognition
Data processing • Text normalization; Grammar checking;
Data analysis • Rule-based IE; Deep Learning
Result Evaluation • Accuracy; Word Error Rate
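Word Error Rate, listed above, is the word-level edit distance between a reference and a system
output, divided by the reference length: WER = (S + D + I) / N, where S, D, and I are substitutions,
deletions, and insertions and N is the number of reference words. The sketch below assumes whitespace
tokenization; the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the distance between the processed reference prefix
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        previous = current
    return previous[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the output is much longer than the
reference.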
Possible Topic: Visualizing Disaster Information Extracted from Philippine News Articles / Tweets
News articles and tweets carry a great deal of information about disasters before, during, and after
they happen. These sources mention typhoon names, dates of occurrence, locations hit, casualties, the
financial and material needs of victims, and more. They also report donations, and their types, given
to victims by countries, organizations, and individuals. In this research, students will build an
automated way to extract this information from these sources and display it visually as the series of
events related to each typhoon.
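A rule-based extractor is one simple way to begin pulling such fields out of English text. The sketch
below uses two regular expressions to collect typhoon names and casualty counts. The patterns, the
DisasterRecord structure, and the function name are assumptions made for illustration; real articles
and tweets, especially Filipino or code-switched ones, would need far richer rules or a learned
model. Visualization is not covered.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for English text only.
TYPHOON = re.compile(r"\b[Tt]yphoon\s+([A-Z][a-z]+)")
CASUALTY = re.compile(r"\b(\d[\d,]*)\s+(?:people\s+)?(dead|killed|injured|missing)\b")


@dataclass
class DisasterRecord:
    typhoons: list = field(default_factory=list)
    casualties: dict = field(default_factory=dict)


def extract(text):
    """Collect typhoon names and casualty counts mentioned in the text."""
    record = DisasterRecord()
    for name in TYPHOON.findall(text):
        if name not in record.typhoons:
            record.typhoons.append(name)
    for number, kind in CASUALTY.findall(text):
        value = int(number.replace(",", ""))
        record.casualties[kind] = record.casualties.get(kind, 0) + value
    return record
```

Summing counts across mentions is a simplification: successive reports usually update a running
total, so a fuller system would keep each figure with its date and source instead of adding them.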
Possible Resource Person(s):
• Mr. Matthew Phillip Go
• Mr. Nicco Nocon
Target Venue(s):
• Local: PCSC, NNLPRS
• International (SCOPUS): TENCON, IALP, PACLIC, IJCNLP
Starting Reference(s):
• https://www.aclweb.org/anthology/W/W14/W14-2905.pdf
• https://www.aclweb.org/anthology/W/W16/W16-3906.pdf
• https://www.aclweb.org/anthology/C/C08/C08-3001.pdf
Resources
http://bit.ly/1MpcFoT
• Tweets – From 2013 to present (filtered tweets available upon request)
• WordNets – Filipino WordNet
• Dictionaries – Filipino dictionary
• Tagged data – Tagged
• Language models – Religious text in different languages
• Multilingual corpora – Religious articles; Parallel corpus
• English and Filipino monolingual corpora – Wikipedia articles
Projects
LanguageTool: https://languagetool.org/
ASEANMT: http://aseanmt.org/
California Report Card: http://californiareportcard.org/
QuakeCAFE: http://quakecafe.org/
Opinion Space: http://opinion.berkeley.edu/
Online Tools
Twitter4J: http://twitter4j.org/en/
Moses SMT Engine: http://www.statmt.org/moses/
SentiWordNet: http://sentiwordnet.isti.cnr.it/
Weka: http://www.cs.waikato.ac.nz/ml/weka/
eParticipation 2.0 PCARI-funded Project

https://www.dropbox.com/sh/avvs2qxo0f0qe92/AACkXSGBnSt2Urlgd41lOTD3a?dl=0
