Data Analysis and Machine Learning¶
Overview
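A course in the international top-talent training program: Data Analysis and Machine Learning. The coverage is quite shallow; there are not even any coding assignments.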
"Eighty percent will be course lessons, and twenty percent will be life lessons." -- Raja Sooriamurthi
Lec 1: Introduction¶
Real World Problem Solving (From abstract world to real world):
- Puzzle-Based Learning: Domain Independent, Logical Reasoning
- System-based Learning: Reasoning with domain-specific methods (Learn physics knowledge to solve physics problems, etc.)
- Project-based Learning: Working with teams, Dealing with uncertainty
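For example, if a problem asks you to compute \(100!\), what matters is not the answer but the reasoning process. For instance, you can see at once that the result ends in at least two zeros, contributed by the factor 100 (the full count is 24, since every multiple of 5 contributes as well). (Quite a cringeworthy example.)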
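The same style of reasoning extends to the exact count: a trailing zero needs a factor of 10 = 2 × 5, and factors of 5 are the scarcer ones, so counting the factors of 5 in 1..n gives the number of trailing zeros of n!. A minimal sketch; the function names are illustrative and not from the course:

```python
import math


def trailing_zeros_of_factorial(n: int) -> int:
    """Count trailing zeros of n! by counting factors of 5 in 1..n."""
    count = 0
    power = 5
    while power <= n:
        count += n // power  # multiples of 5, 25, 125, ... each add one more factor
        power *= 5
    return count


def trailing_zeros_direct(n: int) -> int:
    """Brute-force check: compute n! and count the zeros at the end."""
    digits = str(math.factorial(n))
    return len(digits) - len(digits.rstrip("0"))


print(trailing_zeros_of_factorial(100))  # 20 multiples of 5 plus 4 multiples of 25
print(trailing_zeros_direct(100))
```

The first function never computes the factorial itself, which is the point of the exercise: the reasoning replaces the computation.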
Learning occurs when someone wants to learn, not when someone wants to teach. -- Roger Schank
Information System¶
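Class core, ultimate goal: an information system is all about adding value to organizations by using technology.
Machine learning never exists in isolation. It generally happens within some business context and is used to add value, for example a hospital predicting the probability that a patient will miss an appointment. (My understanding: an emphasis on practical use?) So when facing an ML problem, think about it from the perspective of value.
The figure shows the process of extracting value from data, where Visualize stands for Data Analysis and Modeling stands for Machine Learning.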
Use of Data¶
Visualization and Prediction.
Process when solving a problem:
- Consider whether the problem is worth solving
- Invention: WHAT CAN I DO?
Tools¶
"Matplotlib is too low-level"
Machine Learning¶
Learning: Improvement.
e.g.
- Recommendation System: Netflix
- Association: People who read this book also read...
- Email spam classification
Lec 2: Computational Thinking & Tidy Data¶
If there is no action, there is no value.
Computational Thinking¶
Five aspects:
- Decomposition: Divide and Conquer
- Abstraction: Separate the "What" from the "How"
- Recognition: Look for similarities between problems
- Generalization: Adapt previous solutions to new problems
- Computation: How to express a solution unambiguously
Abstraction¶
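Already covered in CS61A.
This is the approach of stratified design, the notion that a complex system should be structured as a sequence of levels that are described using a sequence of languages. -- Abelson and Sussman
That is, a complex system should be layered when it is designed or described, with each layer described in its own language.
e.g. the seven-layer structure of the Internet (the OSI model), and Git's split into two layers: Porcelain (user-facing) and Plumbing (core Git).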
Effective Visualization¶
Tidy Data: a format for organizing data that makes the data easier to process (for example, as a table).
e.g.
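Even tables can be messy. In the figure below, the blue table is more readable for humans, while the green one (the tidy version) is friendlier to computers.
Three properties to keep in mind in data analysis:
- Variable: unlike a variable in a programming language, here it means a property or quantity that can be measured
- Value: the result of measuring a variable at a particular moment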
- Observation: The values of several variables measured under similar conditions.
Reshaping DataFrames¶
- Column headers are values, not variable names
- Row headers are Observations.
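When column headers are really values (years, for instance), the usual fix is to "melt" the wide table into a long one with one observation per row. The sketch below does this in plain Python on made-up data; the table contents and the `melt` helper are illustrative assumptions. In pandas, `DataFrame.melt` with `id_vars`, `var_name`, and `value_name` performs the same reshaping.

```python
def melt(rows, id_column, var_name, value_name):
    """Turn each non-id column of a wide row into its own (id, variable, value) row."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append({id_column: row[id_column], var_name: column, value_name: value})
    return tidy


# Messy: the headers "2022" and "2023" are values of a variable called "year".
wide = [
    {"city": "A", "2022": 10, "2023": 12},
    {"city": "B", "2022": 7, "2023": 9},
]

tidy = melt(wide, id_column="city", var_name="year", value_name="sales")
for record in tidy:
    print(record)
```

After melting, every column is a variable (`city`, `year`, `sales`) and every row is one observation.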
Lec 3: Reshape Data - Introduction to Visualization¶
Creativity, Curiosity, and Compassion
e.g.
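Subway map: put yourself in others' shoes.
Data gathered: about movies.
Given this data, we can ask many questions, for example: which movies do people of different genders prefer? How does age affect movie ratings? And so on. We can then build a tidy table: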
| MovieID | Title | Male_Rating | Female_Rating | Diff |
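A table with these columns can be derived from per-viewer ratings. The sketch below uses invented ratings, and it assumes that the rating columns hold mean ratings and that `Diff` means male mean minus female mean; the notes do not state either convention.

```python
from collections import defaultdict

# Hypothetical observations: one row per (movie, viewer) rating.
ratings = [
    {"movie_id": 1, "title": "Movie A", "gender": "M", "rating": 4},
    {"movie_id": 1, "title": "Movie A", "gender": "F", "rating": 5},
    {"movie_id": 1, "title": "Movie A", "gender": "M", "rating": 2},
    {"movie_id": 2, "title": "Movie B", "gender": "F", "rating": 3},
    {"movie_id": 2, "title": "Movie B", "gender": "M", "rating": 4},
]


def rating_gap_table(rows):
    """Build one row per movie: mean male rating, mean female rating, difference."""
    by_movie = defaultdict(lambda: {"M": [], "F": []})
    titles = {}
    for row in rows:
        by_movie[row["movie_id"]][row["gender"]].append(row["rating"])
        titles[row["movie_id"]] = row["title"]

    table = []
    for movie_id in sorted(by_movie):
        male, female = by_movie[movie_id]["M"], by_movie[movie_id]["F"]
        if not male or not female:
            continue  # skip movies rated by only one group
        male_mean = sum(male) / len(male)
        female_mean = sum(female) / len(female)
        table.append({
            "MovieID": movie_id,
            "Title": titles[movie_id],
            "Male_Rating": male_mean,
            "Female_Rating": female_mean,
            "Diff": male_mean - female_mean,  # assumed sign convention
        })
    return table


table = rating_gap_table(ratings)
for row in table:
    print(row)
```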
Effective Mapping¶
Types of Data:
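- N (Nominal)
    - Operation: =, !=
    - e.g.: postal code, blood type, eye color, ethnicity, political party
    - Properties that cannot be quantified at all
- O (Ordered)
    - Operation: =, !=, <, >, <=, >=
    - e.g.: income level (low/medium/high), satisfaction level (high, medium, low)
    - Rough ranges that have a ranking (order)
- Q (Interval: the location of zero is arbitrary; 0 is only a reference mark and is itself an ordinary value of the quantity)
    - Operation: =, !=, <, >, <=, >=, -
    - e.g.: temperature (Celsius and Fahrenheit), pH, SAT score
- Q (Ratio: the location of zero is fixed; 0 means the complete absence of the quantity)
    - Operation: =, !=, <, >, <=, >=, -, /
    - e.g.: physical measurements (mass, length, temperature in kelvin)
    - Put another way, negative values do not exist (?)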
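The difference between the two Q scales shows up as soon as you divide. A small sketch with two made-up temperatures: subtraction is meaningful in Celsius, but a ratio only makes sense once zero is fixed, as it is in kelvin.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Shift to a scale whose zero means 'no thermal energy'."""
    return celsius + 273.15


cold_c, warm_c = 10.0, 20.0  # illustrative values

# Interval scale: differences are meaningful and survive a change of zero point.
difference_c = warm_c - cold_c
difference_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cold_c)

# Dividing Celsius values gives a number, but not a meaningful one:
# 20 °C is not "twice as hot" as 10 °C.
naive_ratio = warm_c / cold_c

# Ratio scale: dividing kelvin values is meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cold_c)

print(difference_c, round(difference_k, 6))
print(naive_ratio, round(kelvin_ratio, 4))
```

The naive ratio comes out as 2, while the kelvin ratio is only about 1.035, which is why `/` appears in the operation list for Ratio but not for Interval.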
Lec 4: Overview of Machine Learning¶
Value Proposition (end-to-end)
- Pain Point
- Problem Formulation (measure the pain)
- Solution Development
- Deployment
- Evaluation (reduction in pain)
- Maintenance / Sustainability
Measurement of learning¶
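Let P be a measure of a program's performance, E its experience, and T a class of tasks. The program learns if its performance on T improves with additional experience: \(P(T,E+\Delta) > P(T,E)\).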
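A toy illustration of the inequality, under invented assumptions: the task T is labeling numbers in [0, 1] against a hidden cut-off of 0.3, the experience E is a list of labeled examples, and P is accuracy on fixed test points. The learner and all names are hypothetical.

```python
def true_label(x: float) -> int:
    """The hidden concept the learner is trying to recover (assumed cut-off 0.3)."""
    return 1 if x >= 0.3 else 0


def learn_threshold(experience):
    """Place the threshold midway between the closest negative and positive examples."""
    largest_negative = max(x for x, y in experience if y == 0)
    smallest_positive = min(x for x, y in experience if y == 1)
    return (largest_negative + smallest_positive) / 2


def performance(threshold, test_points):
    """P: fraction of test points classified correctly."""
    correct = sum((1 if x >= threshold else 0) == true_label(x) for x in test_points)
    return correct / len(test_points)


test_points = [(2 * i + 1) / 20 for i in range(10)]  # 0.05, 0.15, ..., 0.95

E = [(0.1, 0), (0.9, 1)]          # initial experience
delta = [(0.25, 0), (0.35, 1)]    # additional experience near the boundary

p_before = performance(learn_threshold(E), test_points)
p_after = performance(learn_threshold(E + delta), test_points)
print(p_before, p_after)
```

With only E the threshold lands at 0.5 and two test points are misclassified; adding the extra examples moves it to about 0.3 and the errors disappear, so P increases.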
ML tasks:
- Classification / Regression
    - assign a label (classification) or numerical value (regression) to an unknown entity based on a set of features and known labels (or numerical values)
- Clustering
    - group a bunch of entities that share common features
- Optimization
    - from amongst a set of alternatives pick the "best" while balancing competing value metrics
- Forecasting
    - based on the past, forecast the future
- Recommendation
    - based on prior behavior rank order candidate preferences
- Association
    - identify which items co-occur e.g., bread and peanut butter
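As one concrete instance from the list, the association task can be sketched as counting how often pairs of items show up in the same basket. The baskets below are invented, and plain pair counting is a simplification of what real association-rule mining does.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"bread", "peanut butter", "milk"},
    {"bread", "peanut butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so that each pair is always counted under the same key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

most_common_pair, count = pair_counts.most_common(1)[0]
print(most_common_pair, count)
```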
How do we judge how well an ML task has been completed? (mea 1. First decide on the evaluation metric (e.g.
Types of Learning¶
- Supervised Learning
    - We know both the input and the output
    - 'Teacher'
- Unsupervised Learning
    - We only know the input
- Reinforcement Learning
    - We know what is desired (correct) and what is not desired
    - the 'credit/blame assignment' problem
Pull out features from data, and then feed them into a model.
Two phases of ML: 1. Training (with training data) 2. Testing (with testing data)
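The two phases need two disjoint sets of examples. A minimal sketch of a shuffled split; the 80/20 ratio, the seed, and the function name are assumptions, not something the course specified.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and cut it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(10))  # stand-ins for (features, label) pairs
train, test = train_test_split(examples)
print(len(train), len(test))
```

The model is fitted on `train` only, and `test` is touched once, at the end, to measure how well it generalizes.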
Lec 5: Evaluating a Classifier¶
Cross-Validation and Data Leakage¶
Cross-Validation¶
I did not fully follow this part. Cross-validation is used to reduce the influence of "luck" on the result, so as to measure authentic learning.
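One common form is k-fold cross-validation, assumed here since the notes do not say which variant was taught: split the data into k folds, let each fold serve as the test set once, and average the k scores so that a single lucky or unlucky split matters less. In this sketch `score_fold` is a hypothetical callback that would train on the first index list and score on the second.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for each of k folds, without shuffling."""
    indices = list(range(n_examples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(score_fold, n_examples, k):
    """Average the score over all k train/test splits."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_examples, k)]
    return sum(scores) / len(scores)


folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every example is used for testing exactly once and never appears in the training indices of its own fold.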
Data Leakage¶
Leakage: the testing data overlaps with the training data.
The twain shall never meet.
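A simple guard against this kind of leakage is to check for rows that appear in both sets before evaluating. The rows and the `find_leakage` helper below are made up for illustration, and exact-duplicate detection is only the most basic check.

```python
def find_leakage(train_rows, test_rows):
    """Return the test rows that also appear in the training data."""
    seen = set(train_rows)
    return [row for row in test_rows if row in seen]


# Hypothetical (name, age) records.
train_rows = [("alice", 34), ("bob", 51), ("carol", 29)]
test_rows = [("dave", 45), ("bob", 51)]

leaked = find_leakage(train_rows, test_rows)
clean_test = [row for row in test_rows if row not in set(train_rows)]
print("leaked:", leaked)
print("clean test set:", clean_test)
```

A model scored on the leaked row has already seen the answer, so its measured performance would overstate what it actually learned.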
Classifier Evaluation¶
A table resembling the one from the lesson on inverted indexes.
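Assuming the table in question is the usual 2×2 confusion matrix (the notes do not show it), its four cells and the metrics read off it can be computed as below. The labels and predictions are invented.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def precision_recall_accuracy(counts):
    """Derive the three headline metrics from the four cells."""
    tp, fp, fn, tn = counts["tp"], counts["fp"], counts["fn"], counts["tn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # hypothetical ground truth
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # hypothetical classifier output

counts = confusion_counts(y_true, y_pred)
precision, recall, accuracy = precision_recall_accuracy(counts)
print(counts)
print(precision, recall, accuracy)
```

Precision asks how many predicted positives were right, and recall asks how many actual positives were found; these are the same two quantities used to evaluate search results in information retrieval.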