Author Ganti, Venkatesh, author
Title Data cleaning : a practical perspective / Venkatesh Ganti, Anish Das Sarma
Publication San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2013
ISBN 9781608456789 (electronic bk.)
9781608456772 (pbk.)
Standard number 10.2200/S00523ED1V01Y201307DTM036 doi
Description 1 PDF (xv, 69 pages) : illustrations
text rdacontent
electronic isbdmedia
online resource rdacarrier
Series Synthesis lectures on data management, 2153-5426 ; # 36
Synthesis digital library of engineering and computer science
Synthesis lectures on data management ; # 36. 2153-5426
Note Part of: Synthesis digital library of engineering and computer science
Series from website
Includes bibliographical references (pages 65-67)
1. Introduction -- 1.1 Enterprise data warehouse -- 1.2 Comparison shopping database -- 1.3 Data cleaning tasks -- 1.4 Record matching -- 1.5 Schema matching -- 1.6 Deduplication -- 1.7 Data standardization -- 1.8 Data profiling -- 1.9 Focus of this book --
2. Technological approaches -- 2.1 Domain-specific verticals -- 2.2 Generic platforms -- 2.3 Operator-based approach -- 2.4 Generic data cleaning operators -- 2.4.1 Similarity join -- 2.4.2 Clustering -- 2.4.3 Parsing -- 2.5 Bibliography --
3. Similarity functions -- 3.1 Edit distance -- 3.2 Jaccard similarity -- 3.3 Cosine similarity -- 3.4 Soundex -- 3.5 Combinations and learning similarity functions -- 3.6 Bibliography --
4. Operator: similarity join -- 4.1 Set similarity join (SSJoin) -- 4.2 Instantiations -- 4.2.1 Edit distance -- 4.2.2 Jaccard containment and similarity -- 4.3 Implementing the SSJoin operator -- 4.3.1 Basic SSJoin implementation -- 4.3.2 Filtered SSJoin implementation -- 4.4 Bibliography --
5. Operator: clustering -- 5.1 Definitions -- 5.2 Techniques -- 5.2.1 Hash partition -- 5.2.2 Graph-based clustering -- 5.3 Bibliography --
6. Operator: parsing -- 6.1 Regular expressions -- 6.2 Hidden Markov models -- 6.2.1 Training HMMs -- 6.2.2 Use of HMMs for parsing -- 6.3 Bibliography --
7. Task: record matching -- 7.1 Schema matching -- 7.2 Record matching -- 7.2.1 Bipartite graph construction -- 7.2.2 Weighted edges -- 7.2.3 Graph matching -- 7.3 Bibliography --
8. Task: deduplication -- 8.1 Graph partitioning approach -- 8.1.1 Graph construction -- 8.1.2 Graph partitioning -- 8.2 Merging -- 8.3 Using constraints for deduplication -- 8.3.1 Candidate sets of partitions -- 8.3.2 Maximizing constraint satisfaction -- 8.4 Blocking -- 8.5 Bibliography --
9. Data cleaning scripts -- 9.1 Record matching scripts -- 9.2 Deduplication scripts -- 9.3 Support for script development -- 9.3.1 User interface for developing scripts -- 9.3.2 Configurable data cleaning scripts -- 9.4 Bibliography --
10. Conclusion -- Bibliography -- Authors' biographies
Abstract freely available; full-text restricted to subscribers or individual document purchasers
Compendex
INSPEC
Google scholar
Google book search
Mode of access: World Wide Web
System requirements: Adobe Acrobat Reader
Data warehouses consolidate various activities of a business and often form the backbone for generating reports that support important business decisions. Errors in data tend to creep in for a variety of reasons. Some of these reasons include errors during input data collection and errors while merging data collected independently across different databases. These errors in data warehouses often result in erroneous upstream reports, and could impact business decisions negatively. Therefore, one of the critical challenges while maintaining large data warehouses is that of ensuring the quality of data in the data warehouse remains high. The process of maintaining high data quality is commonly referred to as data cleaning. In this book, we first discuss the goals of data cleaning. Often, the goals of data cleaning are not well defined and could mean different solutions in different scenarios. Toward clarifying these goals, we abstract out a common set of data cleaning tasks that often need to be addressed. This abstraction allows us to develop solutions for these common data cleaning tasks. We then discuss a few popular approaches for developing such solutions. In particular, we focus on an operator-centric approach for developing a data cleaning platform. The operator-centric approach involves the development of customizable operators that could be used as building blocks for developing common solutions. This is similar to the approach of relational algebra for query processing. The basic set of operators can be put together to build complex queries. Finally, we discuss the development of custom scripts which leverage the basic data cleaning operators along with relational operators to implement effective solutions for data cleaning tasks.
Also available in print
Title from PDF title page (viewed on October 16, 2013)
Morgan & Claypool
Link Print version: 9781608456772
Subject Database management
Data warehousing -- Quality control
Electronic data processing -- Data preparation
data cleaning
deduplication
record matching
data cleaning scripts
schema matching
ETL
clustering
data standardization
ETL data flows
set similarity join
segmentation
parsing
string similarity functions
edit distance
edit similarity
jaccard similarity
cosine similarity
soundex
constrained deduplication
blocking
Alt Author Das Sarma, Anish, author