From Data Mining to Knowledge Discovery in Databases
Paper Review
AI/ML
2/20/2026
2 min read
Paper source: https://ojs.aaai.org/aimagazine/index.php/aimagazine/article/view/1230
We are collecting huge amounts of data.
Humans cannot manually analyze all of it anymore.
So we need automated methods to turn data into useful knowledge.
The field that does this is called:
KDD — Knowledge Discovery in Databases
Inside KDD, there is one important step called:
Data Mining
KDD is the whole process of turning raw data into useful knowledge.
Data mining is just one step inside that process.
📊 Why Do We Need This?
Before computers:
Experts manually analyzed reports
Scientists examined data by hand
Analysts created summaries manually
But now:
Databases have millions or billions of records
Each record may have hundreds of variables
Humans cannot process this scale
So we face:
Data overload
We need algorithms to help us extract meaning from massive data.
🔄 What Is the KDD Process?
KDD is a multi-step process:
1. Understand the problem
What are we trying to learn?
2. Select data
Choose the relevant subset.
3. Clean data
Fix missing values, errors, and noise.
4. Transform data
Reduce dimensions and create features.
5. Apply data mining
Run algorithms to find patterns.
6. Evaluate patterns
Are they valid? Useful? Interesting?
7. Use the knowledge
Make decisions or deploy models.
👉 Important: Data mining is only step 5.
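To make the steps concrete, here is a minimal sketch of the process as a chain of small functions. Everything in it is an assumption for illustration: the records, the function names, and the "drop incomplete rows" cleaning policy are hypothetical and do not come from the paper.

```python
from statistics import mean

# Hypothetical raw records: (age, income). None marks a missing value.
RAW = [(25, 30_000), (40, None), (35, 52_000), (None, 41_000), (52, 88_000)]

def select(records):
    """Step 2: keep only the fields relevant to the question."""
    return [(age, income) for age, income in records]

def clean(records):
    """Step 3: drop records with missing values (one simple policy among many)."""
    return [r for r in records if None not in r]

def transform(records):
    """Step 4: derive a feature, here income expressed in thousands."""
    return [(age, income / 1000) for age, income in records]

def mine(records):
    """Step 5: the data mining step. Here it is only a trivial summary."""
    return {
        "mean_age": mean(a for a, _ in records),
        "mean_income_k": mean(i for _, i in records),
    }

def evaluate(pattern, n_records, min_records=3):
    """Step 6: reject patterns that rest on too few records."""
    return n_records >= min_records

def run_kdd(raw):
    data = transform(clean(select(raw)))
    pattern = mine(data)
    return pattern if evaluate(pattern, len(data)) else None

result = run_kdd(RAW)
```

Note how little of the code is the mining step itself. Most of it is selection, cleaning, transformation, and evaluation, which is the point the paper makes.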
🔍 What Is Data Mining?
The paper defines data mining as:
Applying algorithms to extract patterns or models from data
Patterns could be:
Rules
Trends
Clusters
Predictions
Relationships
But not all patterns are useful.
They must be:
✔ Valid (work on new data)
✔ Novel (not already known)
✔ Useful (help decision-making)
✔ Understandable
🎯 Two Main Goals of Data Mining
The paper says there are two big goals:
1. Prediction
Use data to predict something unknown.
Examples:
Will a customer default?
Will a stock price rise?
Will a patient survive?
2. Description
Find patterns that help us understand data.
Examples:
Customer groups
Items frequently bought together
Fraud patterns
🧩 Main Data Mining Methods
📌 1. Classification
Put items into categories.
Example:
Loan approved vs denied
Fraud vs normal transaction
Think: Decision Trees, Neural Networks
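As a rough illustration, the smallest possible decision tree is a single threshold on one feature, sometimes called a decision stump. The sketch below assumes made-up transaction amounts and fraud labels. The names `fit_stump` and `predict` are hypothetical.

```python
def fit_stump(values, labels):
    """Pick the threshold on one numeric feature that misclassifies
    the fewest training examples. Predicts True when value > threshold."""
    best_threshold, best_errors = None, len(values) + 1
    for t in sorted(set(values)):
        errors = sum((v > t) != y for v, y in zip(values, labels))
        if errors < best_errors:
            best_threshold, best_errors = t, errors
    return best_threshold

def predict(threshold, value):
    return value > threshold

# Hypothetical transaction amounts and fraud labels
amounts = [12, 25, 40, 300, 450, 900]
is_fraud = [False, False, False, True, True, True]

threshold = fit_stump(amounts, is_fraud)   # 40 for this toy data
```

A real classifier would use many features and would be checked on data it has not seen, but the idea of learning a category boundary from labeled examples is the same.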
📌 2. Regression
Predict a number.
Example:
Predict house price
Predict credit risk score
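A minimal sketch of regression, assuming the simplest case: one input variable and a straight line fitted by ordinary least squares. The house sizes and prices are invented so that the answer is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: size in square meters, price in thousands.
# Prices were generated as 2 * size + 10, so the fit should recover that.
sizes = [50, 80, 100, 120]
prices = [110, 170, 210, 250]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 90 + intercept   # estimate for a 90 m² house
```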
📌 3. Clustering
Group similar things together.
Example:
Customer segments
Disease subtypes
No labels required.
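To show what "no labels" means in practice, here is a small sketch of k-means on one-dimensional data. k-means is used only as a familiar example of a clustering method. The points and starting centers are assumptions chosen for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center
    and moving each center to the mean of its assigned points."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Keep the old center if a group ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

# Hypothetical values, e.g. monthly purchases per customer
points = [1, 2, 3, 10, 11, 12]
centers, groups = kmeans_1d(points, centers=[0, 5])
```

The algorithm is never told which customer belongs to which segment. The two groups emerge from the distances alone.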
📌 4. Dependency Modeling
Find relationships between variables.
Example:
If X increases, Y also increases
Medical variable relationships
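One very simple way to quantify "if X increases, Y also increases" is the Pearson correlation coefficient. It only captures linear relationships, so treat this as a sketch of the idea rather than the dependency models the paper has in mind. The data is made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: +1 means perfectly increasing together,
    -1 perfectly opposite, 0 no linear relationship."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical measurements
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]     # rises with x
y_down = [10, 8, 6, 4, 2]   # falls as x rises

r_up = pearson(x, y_up)
r_down = pearson(x, y_down)
```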
📌 5. Change and Deviation Detection
Find the most significant changes from previously measured or normal values.
Example:
Fraud detection
Network attack detection
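A minimal sketch of the idea, assuming "normal" is described by the mean and standard deviation of past values and "abnormal" means more than three standard deviations away. Both the threshold and the numbers are assumptions for illustration.

```python
from statistics import mean, stdev

def find_deviations(baseline, new_values, z_threshold=3.0):
    """Flag new values that lie far from the baseline's mean,
    measured in baseline standard deviations (a z-score)."""
    mu, sigma = mean(baseline), stdev(baseline)
    return [v for v in new_values if abs(v - mu) / sigma > z_threshold]

# Hypothetical daily card spending: past behaviour vs. new transactions
baseline = [20, 22, 19, 21, 20, 23, 18, 21]
new_values = [22, 19, 95, 21]

flagged = find_deviations(baseline, new_values)   # only 95 stands out
```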
🧠 The Structure of Any Data Mining Algorithm
The authors simplify algorithms into three components:
Model Representation
What form does the pattern take? (tree, equation, neural network, rule)
Model Evaluation
How do we measure if it’s good?
Search
How do we find the best model?
This abstraction matters because it lets us describe and compare very different algorithms by the same three choices.
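The three components can be separated in code. In this sketch the representation is a line through the origin, the evaluation is squared error, and the search is a plain grid over candidate slopes. All three are deliberately simple assumptions. Swapping any one of them gives a different algorithm.

```python
# Representation: the model is a single number, the slope of y = slope * x.
def predict(slope, x):
    return slope * x

# Evaluation: sum of squared errors on the data (lower is better).
def squared_error(slope, data):
    return sum((predict(slope, x) - y) ** 2 for x, y in data)

# Search: try every candidate slope and keep the best-scoring one.
def grid_search(data, candidates):
    return min(candidates, key=lambda s: squared_error(s, data))

# Hypothetical (x, y) observations
data = [(1, 2.1), (2, 3.9), (3, 6.2)]
candidates = [i / 10 for i in range(0, 51)]   # 0.0, 0.1, ..., 5.0

best_slope = grid_search(data, candidates)
```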
⚠️ Important Warnings in the Paper
The authors warn against:
❌ Overfitting
Finding patterns that only fit the training data.
❌ False discoveries
If you search enough, you’ll find “patterns” in random noise.
❌ Media hype
Data mining is powerful — but not magic.
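The false-discovery warning is easy to reproduce. The sketch below generates a random target and many random features that have nothing to do with it, then reports the strongest correlation found. The sample size, feature count, and seed are arbitrary assumptions.

```python
import random
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def strongest_noise_correlation(n_samples=20, n_features=200, seed=0):
    """Search many pure-noise features for the one that best
    'explains' a pure-noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best

best = strongest_noise_correlation()
print(f"Strongest correlation found in pure noise: {best:.2f}")
```

With only 20 samples and 200 candidates, the winner typically looks like a meaningful relationship even though none exists. This is why patterns need to be checked on new data.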
🏗 Real-World Applications Mentioned
The paper lists real examples:
Marketing: market basket analysis
Fraud detection: credit card fraud systems
Astronomy: classifying sky objects
Telecom: detecting alarm sequences
Manufacturing: predicting equipment failures
These are real deployed systems, not theory.
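For a feel of what market basket analysis computes, here is a small sketch using support and confidence, two measures commonly used for "customers who bought X also bought Y" rules. The baskets are invented, and the paper itself does not prescribe this exact calculation.

```python
def support(transactions, items):
    """Fraction of transactions that contain all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent,
    the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical shopping baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})        # 2 of 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})   # 2 of the 3 bread baskets
```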
🔬 Research Challenges Identified
The paper highlights major problems:
Massive databases (terabytes)
High dimensional data
Noisy and missing data
Overfitting
Statistical significance
Changing data over time
Need for interpretability
These are still active research topics today.
🤖 Role of AI
The authors argue that AI contributes through:
Machine learning
Uncertainty reasoning
Planning
Intelligent agents
Natural language processing
Knowledge representation
So KDD is interdisciplinary.
💡 The Core Takeaway
The most important conceptual message of the paper is:
· Data mining is not the whole story.
· Knowledge discovery is a process.
· Algorithms alone are not enough.
· Understanding the problem and evaluating results are critical.
This idea is central to the paper’s definition of KDD.