MARC LEADER 00000cam  2200385 i 4500 
001    AAI10838841 
005    20181227095853.5 
008    180531s2018    miu           000 0 eng   
020    9780438155190 
035    (MiAaPQ)AAI10838841 
035    (MiAaPQ)drexel:11578 
040    MiAaPQ|beng|cMiAaPQ|dAS 
100 1  Black, Alan E.|0(orcid)0000-0002-6199-4198 
245 10 Investigating Supplemental Context for Word Sense 
       Disambiguation|h[electronic resource] /|cAlan E. Black 
260    Ann Arbor :|bProQuest Dissertations & Theses,|c2018 
300    1 online resource 
500    Source: Dissertation Abstracts International, Volume: 79-
       12(E), Section: A 
500    Adviser: Rosina O. Weber 
502    Thesis (Ph.D.)--Drexel University, 2018 
520    The key to word sense disambiguation is context. Because 
       most words have multiple meanings (i.e., senses), there 
       arises ambiguity when interpreting a word in isolation. 
       Additional information is required to resolve the 
       ambiguity. This additional information is referred to as 
       context. Microtext (e.g., tweets) presents a special case 
       where messages are limited to a small number of characters,
       thereby severely limiting the available context in the 
       text itself. This work is motivated by the importance of 
       Twitter as a unique data source, and the difficulties of 
       precise data collection when confronted with the daunting 
       volume of messages that flow through the system on a regular
       basis. Twitter has become a valuable source of information
       for academic researchers, industry analysts, marketing 
       organizations, and others. The ultimate goal is to develop
       a tool that can help users collect relevant Twitter data 
       from the vast sea of messages in the Twitter search index,
       or from streaming data sources, by leveraging additional 
       context opportunities to aid in word sense disambiguation 
       of a search term. While the language used in tweets 
       presents problems, the Twitter platform offers 
       opportunities to help make sense of users' messages. 
       Various data retrieval mechanisms made available through a
       variety of APIs (application programming interfaces) allow 
       developers to request additional information that may be 
       used to form a supplemented context within which to better
       understand a message. In particular, this combination of 
       message text and supplemental context may be employed to 
       disambiguate among the meanings (i.e., senses) of a search 
       word. We investigate two sources of supplemental context: 
       previous tweets from a message's author (i.e., a Twitter 
       timeline) and tweets within a temporal window relative to 
       the creation timestamp of the tweet under investigation. 
       Context for the former was collected on demand from a 
       RESTful Twitter API, while context for the latter was 
       collected in bulk using a streaming API, resulting in an 
       experimental pool of over 10 million tweets. We propose a 
       simple heuristic that can aid in the automated collection 
       of supplemental context. The results of the heuristic's 
       application are explored using a standard approach to word
       sense disambiguation combined with a variety of underlying
       concept-to-concept similarity measures. In addition, we 
       develop a Blue Standard approach to generating sense-
       tagged test data. This work is motivated by issues and 
       limitations associated with employing human coders. The 
       performance of systems that process human (natural) 
       language has always been evaluated against a "gold 
       standard" that is human derived. Engaging people to create
       tagged corpora for use as gold standards in natural 
       language processing research is a time-consuming and 
       expensive proposition. Reuse of existing data is often the
       only viable option, even if the domain of discourse, 
       writing style (e.g., formal vs. informal) or other 
       significant attributes of the text or coding are not well 
       suited to answering a particular research question. We 
       propose a semi-supervised instance tagging methodology 
       designed to produce sense-tagged Twitter data for the 
       purpose of studying word sense disambiguation in social 
       media. The proposed approach leverages the foundational 
       work by Yarowsky (1993), who demonstrated that collocations
       are strongly indicative of word sense. We use the Blue 
       Standard approach to build a test set of over 380,000 
       sense-tagged tweets for use in our WSD experiments. 
590    School code: 0065 
650  4 Information science 
650  4 Computer science 
710 2  Drexel University.|bInformation Studies 
773 0  |tDissertation Abstracts International|g79-12A(E) 
856 40 |zDigital Dissertation Consortium|u