Author Black, Alan E. (ORCID) 0000-0002-6199-4198
Title Investigating Supplemental Context for Word Sense Disambiguation [electronic resource] / Alan E. Black
Imprint Ann Arbor : ProQuest Dissertations & Theses, 2018
ISBN 9780438155190
Description 1 online resource
Note Source: Dissertation Abstracts International, Volume: 79-12(E), Section: A
Adviser: Rosina O. Weber
Thesis (Ph.D.)--Drexel University, 2018
The key to word sense disambiguation is context. Because most words have multiple meanings (i.e., senses), ambiguity arises when a word is interpreted in isolation. Additional information is required to resolve the ambiguity. This additional information is referred to as context. Microtext (e.g., tweets) presents a special case where messages are limited to a small number of characters, thereby severely limiting the available context in the text itself. This work is motivated by the importance of Twitter as a unique data source, and the difficulties of precise data collection when confronted with the daunting volume of messages that flow through the system on a regular basis. Twitter has become a valuable source of information for academic researchers, industry analysts, marketing organizations, and others. The ultimate goal is to develop a tool that can help users collect relevant Twitter data from the vast sea of messages in the Twitter search index, or from streaming data sources, by leveraging additional context opportunities to aid in word sense disambiguation of a search term. While the language used in tweets presents problems, the Twitter platform offers opportunities to help make sense of users' messages. Various data retrieval mechanisms made available through a variety of APIs (application programming interfaces) allow developers to request additional information that may be used to form a supplemented context within which to better understand a message. In particular, this combination of message text and supplemental context may be employed to disambiguate among the meanings (i.e., senses) of a search word. We investigate two sources of supplemental context: previous tweets from a message's author (i.e., a Twitter timeline) and tweets within a temporal window relative to the creation timestamp of the tweet under investigation.
Contexts for the former were collected on demand from a RESTful Twitter API, while context for the latter was collected in bulk using a streaming API, resulting in an experimental pool of over 10 million tweets. We propose a simple heuristic that can aid in the automated collection of supplemental context. The results of the heuristic's application are explored using a standard approach to word sense disambiguation combined with a variety of underlying concept-to-concept similarity measures. In addition, we develop a Blue Standard approach to generating sense-tagged test data. This work is motivated by issues and limitations associated with employing human coders. The performance of systems that process human (natural) language has always been evaluated against a "gold standard" that is human derived. Engaging people to create tagged corpora for use as gold standards in natural language processing research is a time-consuming and expensive proposition. Reuse of existing data is often the only viable option, even if the domain of discourse, writing style (e.g., formal vs. informal), or other significant attributes of the text or coding are not well suited to answering a particular research question. We propose a semi-supervised instance tagging methodology designed to produce sense-tagged Twitter data for the purpose of studying word sense disambiguation in social media. The proposed approach leverages the foundational work by Yarowsky (1993), who demonstrated that collocations are strongly indicative of word sense. We use the Blue Standard approach to build a test set of over 380,000 sense-tagged tweets for use in our WSD experiments.
School code: 0065
Host Item Dissertation Abstracts International 79-12A(E)
Subject Information science
Computer science
Alt Author Drexel University. Information Studies