Detailed Information

Data preparation for data mining

Material type
Monograph
Personal author
Pyle, Dorian.
Title / Statement of responsibility
Data preparation for data mining / Dorian Pyle.
Publication
San Francisco, Calif. : Morgan Kaufmann Publishers, c1999.
Physical description
xix, 540 p. : ill. ; 24 cm. + 1 computer laser optical disc (4 3/4 in.).
ISBN
1558605290 (pbk./CD-ROM)
Bibliography note
Includes bibliographical references (p. 509-511) and index.
Subject headings
Database management. Data mining. Electronic data processing -- Data preparation.
000 00952pamuu2200253 a 4500
001 000000698284
005 20010316141803
008 990120s1999 caua b 001 0 eng
010 ▼a 99017280
020 ▼a 1558605290 (pbk./CD-ROM)
040 ▼a DLC ▼c DLC ▼d YDX ▼d UKM ▼d 211009
050 0 0 ▼a QA76.9.D3 ▼b P95 1999
082 0 0 ▼a 005.74 ▼2 21
090 ▼a 005.74 ▼b P996d
100 1 ▼a Pyle, Dorian.
245 1 0 ▼a Data preparation for data mining / ▼c Dorian Pyle.
260 ▼a San Francisco, Calif. : ▼b Morgan Kaufmann Publishers, ▼c c1999.
300 ▼a xix, 540 p. : ▼b ill. ; ▼c 24 cm. + ▼e 1 computer laser optical disc (4 3/4 in.).
504 ▼a Includes bibliographical references (p. 509-511) and index.
538 ▼a System requirements for accompanying computer disc: Windows 95 or later.
650 0 ▼a Database management.
650 0 ▼a Data mining.
650 4 ▼a Electronic data processing ▼x Data preparation.

Holdings

| No. | Location | Call number | Accession number | Status | Due date |
|---|---|---|---|---|---|
| 1 | Main Library / Stacks, 6th floor | 005.74 P996d | 111181787 (borrowed 5 times) | Available for loan | |

Contents Information

Book Description

Data Preparation for Data Mining addresses an issue unfortunately ignored by most authorities on data mining: data preparation. Thanks largely to its perceived difficulty, data preparation has traditionally taken a backseat to the more alluring question of how best to extract meaningful knowledge. But without adequate preparation of your data, the return on the resources invested in mining is certain to be disappointing.

Dorian Pyle corrects this imbalance. A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals. Apply his techniques and watch your mining efforts pay off in the form of improved performance, reduced distortion, and more valuable results.

On the enclosed CD-ROM, you'll find a suite of programs as C source code and compiled into a command-line-driven toolkit. This code illustrates how the author's techniques can be applied to arrive at an automated preparation solution that works for you. Also included are demonstration versions of three commercial products that help with data preparation, along with sample data with which you can practice and experiment.

Features

* Offers in-depth coverage of an essential but largely ignored subject.
* Goes far beyond theory, leading you step by step through the author's own data preparation techniques.
* Provides practical illustrations of the author's methodology using realistic sample data sets.
* Includes algorithms you can apply directly to your own project, along with instructions for understanding when automation is possible and when greater intervention is required.
* Explains how to identify and correct data problems that may be present in your application.
* Prepares miners, helping them head into preparation with a better understanding of data sets and their limitations.


Information provided by : Aladin

Table of Contents


CONTENTS

Preface = xvii

Introduction = 1

Chapter 1 Data Exploration as a Process = 9

 1.1 The Data Exploration Process = 10

  1.1.1 Stage 1 : Exploring the Problem Space = 12

  1.1.2 Stage 2 : Exploring the Solution Space = 19

  1.1.3 Stage 3 : Specifying the Implementation Method = 22

  1.1.4 Stage 4 : Mining the Data = 22

  1.1.5 Exploration : Mining and Modeling = 28

 1.2 Data Mining, Modeling, and Modeling Tools = 28

  1.2.1 Ten Golden Rules = 29

  1.2.2 Introducing Modeling Tools = 30

  1.2.3 Types of Models = 32

  1.2.4 Active and Passive Models = 33

  1.2.5 Explanatory and Predictive Models = 33

  1.2.6 Static and Continuously Learning Models = 35

 1.3 Summary = 37

  Supplemental Material = 39

   A Continuously Learning Model Application = 39

   How the Continuously Learning Model Worked = 40

Chapter 2 The Nature of the World and Its Impact on Data Preparation = 45

 2.1 Measuring the World = 46

  2.1.1 Objects = 46

  2.1.2 Capturing Measurements = 47

  2.1.3 Errors of Measurement = 48

  2.1.4 Tying Measurements to the Real World = 53

 2.2 Types of Measurements = 53

  2.2.1 Scalar Measurements = 54

  2.2.2 Nonscalar Measurements = 60

 2.3 Continua of Attributes of Variables = 60

  2.3.1 The Qualitative-Quantitative Continuum = 61

  2.3.2 The Discrete-Continuous Continuum = 61

 2.4 Scale Measurement Example = 66

 2.5 Transformations and Difficulties - Variables, Data, and Information = 66

 2.6 Building Mineable Data Representations = 67

  2.6.1 Data Representation = 68

  2.6.2 Building Data - Dealing with Variables = 69

  2.6.3 Building Mineable Data Sets = 77

 2.7 Summary = 86

  Supplemental Material = 87

   Combinations = 87

Chapter 3 Data Preparation as a Process = 89

 3.1 Data Preparation : Inputs, Outputs, Models, and Decisions = 90

  3.1.1 Step 1 : Prepare the Data = 92

  3.1.2 Step 2 : Survey the Data = 97

  3.1.3 Step 3 : Model the Data = 98

  3.1.4 Use the Model = 98

 3.2 Modeling Tools and Data Preparation = 100

  3.2.1 How Modeling Tools Drive Data Preparation = 102

  3.2.2 Decision Trees = 104

  3.2.3 Decision Lists = 104

  3.2.4 Neural Networks = 107

  3.2.5 Evolution Programs = 107

  3.2.6 Modeling Data with the Tools = 107

  3.2.7 Predictions and Rules = 109

  3.2.8 Choosing Techniques = 111

  3.2.9 Missing Data and Modeling Tools = 111

 3.3 Stages of Data Preparation = 112

  3.3.1 Stage 1 : Accessing the Data = 112

  3.3.2 Stage 2 : Auditing the Data = 113

  3.3.3 Stage 3 : Enhancing and Enriching the Data = 114

  3.3.4 Stage 4 : Looking for Sampling Bias = 114

  3.3.5 Stage 5 : Determining Data Structure (Super-, Macro-, and Micro-) = 115

  3.3.6 Stage 6 : Building the PIE = 116

  3.3.7 Stage 7 : Surveying the Data = 121

  3.3.8 Stage 8 : Modeling the Data = 122

 3.4 And the Result Is...? = 122

Chapter 4 Getting the Data : Basic Preparation = 125

 4.1 Data Discovery = 127

  4.1.1 Data Access Issues = 127

 4.2 Data Characterization = 129

  4.2.1 Detail/Aggregation Level (Granularity) = 129

  4.2.2 Consistency = 131

  4.2.3 Pollution = 132

  4.2.4 Objects = 133

  4.2.5 Relationship = 133

  4.2.6 Domain = 133

  4.2.7 Defaults = 134

  4.2.8 Integrity = 134

  4.2.9 Concurrency = 135

  4.2.10 Duplicate or Redundant Variables = 135

 4.3 Data Set Assembly = 135

  4.3.1 Reverse Pivoting = 136

  4.3.2 Feature Extraction = 137

  4.3.3 Physical or Behavioral Data Sets = 138

  4.3.4 Explanatory Structure = 138

  4.3.5 Data Enhancement or Enrichment = 139

  4.3.6 Sampling Bias = 140

 4.4 Example 1 : CREDIT = 141

  4.4.1 Looking at the Variables = 141

  4.4.2 Relationships between Variables = 146

 4.5 Example 2 : SHOE = 149

  4.5.1 Looking at the Variables = 149

  4.5.2 Relationships between Variables = 150

 4.6 The Data Assay = 151

Chapter 5 Sampling, Variability, and Confidence = 155

 5.1 Sampling, or First Catch Your Hare! = 155

  5.1.1 How Much Data? = 155

  5.1.2 Variability = 156

  5.1.3 Converging on a Representative Sample = 159

  5.1.4 Measuring Variability = 162

  5.1.5 Variability and Deviation = 162

 5.2 Confidence = 166

 5.3 Variability of Numeric Variables = 167

  5.3.1 Variability and Sampling = 168

  5.3.2 Variability and Convergence = 168

 5.4 Variability and Confidence in Alpha Variables = 170

  5.4.1 Ordering and Rate of Discovery = 171

 5.5 Measuring Confidence = 172

  5.5.1 Modeling and Confidence with the Whole Population = 172

  5.5.2 Testing for Confidence = 173

  5.5.3 Confidence Tests and Variability = 176

 5.6 Confidence in Capturing Variability = 178

  5.6.1 A Brief Introduction to the Normal Distribution = 178

  5.6.2 Normally Distributed Probabilities = 180

  5.6.3 Capturing Normally Distributed Probabilities : An Example = 181

  5.6.4 Capturing Confidence, Capturing Variance = 182

 5.7 Problems and Shortcomings of Taking Samples Using Variability = 184

  5.7.1 Missing Values = 184

  5.7.2 Constants (Variables with Only One Value) = 185

  5.7.3 Problems with Sampling = 185

  5.7.4 Monotonic Variable Detection = 186

  5.7.5 Interstitial Linearity = 187

  5.7.6 Rate of Discovery = 187

 5.8 Confidence and Instance Count = 188

 5.9 Summary = 188

  Supplemental Material = 189

   Confidence Samples = 189

Chapter 6 Handling Nonnumerical Variables = 191

 6.1 Representing Alphas and Remapping = 192

  6.1.1 One-of-n Remapping = 193

  6.1.2 m-of-n Remapping = 194

  6.1.3 Remapping to Eliminate Ordering = 195

  6.1.4 Remapping One-to-Many Patterns, or Ill-Formed Problems = 196

  6.1.5 Remapping Circular Discontinuity = 200

 6.2 State Space = 202

  6.2.1 Unit State Space = 202

  6.2.2 Pythagoras in State Space = 204

  6.2.3 Position in State Space = 204

  6.2.4 Neighbors and Associates = 205

  6.2.5 Density and Sparsity = 206

  6.2.6 Nearby and Distant Nearest Neighbors = 211

  6.2.7 Normalizing Measured Point Separation = 211

  6.2.8 Contours, Peaks, and Valleys = 213

  6.2.9 Mapping State Space = 213

  6.2.10 Objects in State Space = 213

  6.2.11 Phase Space = 214

  6.2.12 Mapping Alpha Values = 215

  6.2.13 Location, Location, Location! = 216

  6.2.14 Numerics, Alphas, and the Montreal Canadiens = 216

 6.3 Joint Distribution Tables = 222

  6.3.1 Two-Way Tables = 223

  6.3.2 More Values, More Variables, and Meaning of the Numeration = 228

  6.3.3 Dealing with Low-Frequency Alpha Labels and Other Problems = 229

 6.4 Dimensionality = 230

  6.4.1 Multidimensional Scaling = 230

  6.4.2 Squashing a Triangle = 231

  6.4.3 Projecting Alpha Values = 234

  6.4.4 Scree Plots = 234

 6.5 Practical Consideration - Implementing Alpha Numeration in the Demonstration Code = 235

  6.5.1 Implementing Neighborhoods = 235

  6.5.2 Implementing Numeration in All Alpha Data Sets = 237

  6.5.3 Implementing Dimensionality Reduction for Variables = 237

 6.6 Summary = 238

Chapter 7 Normalizing and Redistributing Variables = 239

 7.1 Normalizing a Variable's Range = 240

  7.1.1 Review of Data Preparation and Modeling (Training, Testing, and Execution) = 241

  7.1.2 The Nature and Scope of the Out-of-Range Values Problem = 242

  7.1.3 Discovering the Range of Values When Building the PIE = 243

  7.1.4 Out-of-Range Values When Training = 247

  7.1.5 Out-of-Range Values When Testing = 249

  7.1.6 Out-of-Range Values When Executing = 250

  7.1.7 Scaling Transformations = 251

  7.1.8 Softmax Scaling = 257

  7.1.9 Normalizing Ranges = 258

 7.2 Redistributing Variable Values = 259

  7.2.1 The Nature of Distributions = 259

  7.2.2 Distributive Difficulties = 260

  7.2.3 Adjusting Distributions = 261

  7.2.4 Modified Distributions = 266

 7.3 Summary = 269

  Supplemental Material = 271

   The Logistic Function = 271

   Modifying the Linear Part of the Logistic Function Range = 274

Chapter 8 Replacing Missing and Empty Values = 275

 8.1 Retaining Information about Missing Values = 275

  8.1.1 Missing-Value Patterns = 276

  8.1.2 Capturing Patterns = 277

 8.2 Replacing Missing Values = 278

  8.2.1 Unbiased Estimators = 279

  8.2.2 Variability Relationships = 279

  8.2.3 Relationships between Variables = 282

  8.2.4 Preserving Between-Variable Relationships = 284

 8.3 Summary = 285

  Supplemental Material = 286

   Using Regression to Find Least Information-Damaging Missing Values = 286

   Alternative Methods of Missing-Value Replacement = 294

Chapter 9 Series Variables = 299

 9.1 Here There Be Dragons! = 300

 9.2 Types of Series = 300

 9.3 Describing Series Data = 301

  9.3.1 Constructing a Series = 302

  9.3.2 Features of a Series = 302

  9.3.3 Describing a Series - Fourier = 303

  9.3.4 Describing a Series - Spectrum = 307

  9.3.5 Describing a Series - Trend, Seasonality, Cycles, Noise = 314

  9.3.6 Describing a Series - Autocorrelation = 316

 9.4 Modeling Series Data = 320

 9.5 Repairing Series Data Problems = 320

  9.5.1 Missing Values = 320

  9.5.2 Outliers = 322

  9.5.3 Nonuniform Displacement = 322

  9.5.4 Trend = 323

 9.6 Tools = 325

  9.6.1 Filtering = 325

  9.6.2 Moving Averages = 326

  9.6.3 Smoothing 1 - PVM Smoothing = 333

  9.6.4 Smoothing 2 - Median Smoothing, Resmoothing, and Hanning = 333

  9.6.5 Extraction = 335

  9.6.6 Differencing = 336

 9.7 Other Problems = 339

  9.7.1 Numerating Alpha Values = 341

  9.7.2 Distribution = 341

  9.7.3 Normalization = 344

 9.8 Preparing Series Data = 344

  9.8.1 Looking at the Data = 346

  9.8.2 Signposts on the Rocky Road = 347

 9.9 Implementation Notes = 348

Chapter 10 Preparing the Data Set = 351

 10.1 Using Sparsely Populated Variables = 351

  10.1.1 Increasing Information Density Using Sparsely Populated Variables = 351

  10.1.2 Binning Sparse Numerical Values = 353

  10.1.3 Present-Value Patterns (PVPs) = 353

 10.2 Problems with High-Dimensionality Data Sets = 355

  10.2.1 Information Representation = 357

  10.2.2 Representing High-Dimensionality Data in Fewer Dimensions = 358

 10.3 Introducing the Neural Network = 360

  10.3.1 Training a Neural Network = 361

  10.3.2 Neurons = 362

  10.3.3 Reshaping the Logistic Curve = 363

  10.3.4 Single-Input Neurons = 363

  10.3.5 Multiple-Input Neurons = 366

  10.3.6 Networking Neurons to Estimate a Function = 368

  10.3.7 Network Learning = 368

  10.3.8 Network Prediction - Hidden Layer = 371

  10.3.9 Network Prediction - Output Layer = 371

  10.3.10 Stochastic Network Performance = 372

  10.3.11 Network Architecture 1 - The Autoassociative Network = 373

  10.3.12 Network Architecture 2 - The Sparsely Connected Network = 375

 10.4 Compressing Variables = 376

  10.4.1 Using Compressed Dimensionality Data = 376

 10.5 Removing Variables = 378

  10.5.1 Estimating Variable Importance 1 : What Doesn't Work = 379

  10.5.2 Estimating Variable Importance 2 : Clues = 379

  10.5.3 Estimating Variable Importance 3 : Configuring and Training the Network = 380

 10.6 How Much Data Is Enough? = 383

  10.6.1 Joint Distribution = 384

  10.6.2 Capturing Joint Variability = 390

  10.6.3 Degrees of Freedom = 391

 10.7 Beyond Joint Distribution = 392

  10.7.1 Enhancing the Data Set = 393

  10.7.2 Data Sets in Perspective = 396

 10.8 Implementation Notes = 396

  10.8.1 Collapsing Extremely Sparsely Populated Variables = 397

  10.8.2 Reducing Excessive Dimensionality = 397

  10.8.3 Measuring Variable Importance = 398

  10.8.4 Feature Enhancement = 398

 10.9 Where Next? = 399

Chapter 11 The Data Survey = 401

 11.1 Introduction to the Data Survey = 402

 11.2 Information and Communication = 403

  11.2.1 Measuring Information : Signals and Dictionaries = 405

  11.2.2 Measuring Information : Signals = 406

  11.2.3 Measuring Information : Bits of Information = 407

  11.2.4 Measuring Information : Surprise = 410

  11.2.5 Measuring Information : Entropy = 411

  11.2.6 Measuring Information : Dictionaries = 412

 11.3 Mapping Using Entropy = 414

  11.3.1 Whole Data Set Entropy = 416

  11.3.2 Conditional Entropy between Inputs and Outputs = 417

  11.3.3 Mutual Information = 420

  11.3.4 Other Survey Uses for Entropy and Information = 420

  11.3.5 Looking for Information = 421

 11.4 Identifying Problems with a Data Survey = 423

  11.4.1 Confidence and Sufficient Data = 424

  11.4.2 Detecting Sparsity = 426

  11.4.3 Manifold Definition = 427

 11.5 Clusters = 435

 11.6 Sampling Bias = 436

 11.7 Making the Data Survey = 439

 11.8 Novelty Detection = 442

 11.9 Other Directions = 443

  Supplemental Material = 446

   Entropic Analysis - Example = 446

   Surveying Data Sets = 451

Chapter 12 Using Prepared Data = 483

 12.1 Modeling Data = 485

  12.1.1 Assumptions = 485

  12.1.2 Models = 485

  12.1.3 Data Mining vs. Exploratory Data Analysis = 486

 12.2 Characterizing Data = 489

  12.2.1 Decision Trees = 490

  12.2.2 Clusters = 491

  12.2.3 Nearest Neighbor = 492

  12.2.4 Neural Networks and Regression = 493

 12.3 Prepared Data and Modeling Algorithms = 494

  12.3.1 Neural Networks and the CREDIT Data Set = 494

  12.3.2 Decision Trees and the CREDIT Data Set = 499

 12.4 Practical Use of Data Preparation and Prepared Data = 500

 12.5 Looking at Present Modeling Tools and Future Directions = 501

  12.5.1 Near Future = 503

  12.5.2 Farther Out = 504

Appendix

 Using the Demonstration Code on the CD-ROM = 505

 Further Reading = 509

 Index = 513

 About the Author = 537

 About the CD-ROM = 539


