Week 1 - Motivation and Basics
Table of Contents
- Weekly Objectives
- Motivate the study on
- Machine learning, AI, Data mining, …
- Why? What?
- Overview of the field
- Short questions and answers on a story
- What does machine learning consist of?
- MLE
- MAP
- Some basics
- Probability
- Distribution
- And some rules…
Maximum Likelihood Estimation
Binomial Distribution
- Binomial distribution is
- The discrete probability distribution
- Of the number of successes in a sequence of n independent yes/no experiments, where each success has probability \(\theta\)
- Also called a Bernoulli experiment
- Flips are i.i.d.
- Independent events (the first flip and the second flip do not affect each other)
- Identically distributed according to the binomial distribution (every flip follows the same probability distribution)
- P(H) = \(\theta\), P(T)=1 - \(\theta\)
- P(HHTHT)= \(\theta \cdot \theta \cdot (1- \theta) \cdot \theta \cdot (1- \theta) = \theta ^3(1 - \theta)^2\)
- Let’s say
- D as Data = H, H, T, H, T
- n = 5
- k = \(a_H\) =3
- p = \(\theta\)
- p(D\(\mid \theta\)) = \(\theta ^{a_H} (1-\theta)^{a_T}\)
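To make the product form above concrete, here is a minimal Python sketch (not from the lecture) that evaluates \(P(D \mid \theta) = \theta^{a_H}(1-\theta)^{a_T}\) for the observed sequence; the function name `likelihood` is just for illustration.

```python
# Minimal sketch: evaluate P(D | theta) = theta^aH * (1 - theta)^aT
# for the observed sequence D = H, H, T, H, T.
def likelihood(theta, data):
    a_h = data.count("H")          # number of heads, a_H
    a_t = data.count("T")          # number of tails, a_T
    return theta ** a_h * (1 - theta) ** a_t

D = ["H", "H", "T", "H", "T"]
for theta in (0.3, 0.5, 0.6, 0.7):
    print(f"theta={theta}: P(D|theta)={likelihood(theta, D):.5f}")
```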
How to get the probability?
Maximum Likelihood Estimation
- p(D\(\mid \theta\)) = \(\theta ^{a_H} (1-\theta)^{a_T}\)
- Data: We have observed the sequence data of D with \(a_H\) and \(a_T\)
- Our hypothesis:
- The gambling result follows the binomial distribution of \(\theta\)
- How to make our hypothesis strong (how can we claim that it is 'true')?
- Finding out a better distribution of the observation
- Can be done, but you need more rationale.
- If there is a hypothesis structure better than the binomial distribution, that is also possible, but it requires more rationale.
- Finding out the best candidate of \(\theta\)
- What's the condition to make \(\theta\) most plausible?
- We can make the hypothesis stronger by optimizing \(\theta\).
Which \(\theta\) should we choose so that it explains the data best?
- One candidate is the __Maximum Likelihood Estimation (MLE)__ of \(\theta\)
- Choose \(\theta\) that maximizes the probability of the observed data
MLE calculation
- \[\hat{\theta} = argmax_\theta P(D| \theta) = argmax_\theta \theta ^{a_H} (1-\theta)^{a_T}\]
- This is hard to work with directly, so you use a trick (the powers are awkward to handle)
- Using the log function.
- \[\hat{\theta} = argmax_\theta \ln P(D| \theta) = argmax_\theta \ln(\theta ^{a_H} (1-\theta)^{a_T}) = argmax_\theta ( a_H \ln\theta + a_T \ln (1-\theta) )\]
- Then, this is a maximization problem, so you take the derivative and set it to zero
- \[\frac{d}{d\theta} ( a_H \ln\theta + a_T \ln (1-\theta) ) = 0\]
- \[\frac{a_H}{\theta} - \frac{a_T}{1-\theta} = 0\]
- \[\theta = \frac{a_H}{a_T + a_H}\]
- When \(\theta = \frac{a_H}{a_T + a_H}\), it is the best candidate of \(\theta\) from the MLE perspective.
- \[\hat{\theta} = \frac{a_H}{a_H + a_T}\]
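As a sanity check, here is a small Python sketch (not part of the lecture) comparing the closed-form MLE \(\hat{\theta} = \frac{a_H}{a_H + a_T}\) with a brute-force grid search over the log-likelihood, using the counts \(a_H = 3, a_T = 2\) from the example above.

```python
# Sketch: verify the closed-form MLE against a numeric grid search.
import numpy as np

a_h, a_t = 3, 2
theta_mle = a_h / (a_h + a_t)                     # closed form: a_H / (a_H + a_T)

thetas = np.linspace(0.001, 0.999, 9999)          # candidate theta values
log_lik = a_h * np.log(thetas) + a_t * np.log(1 - thetas)
theta_grid = thetas[np.argmax(log_lik)]           # grid-search maximizer

print(theta_mle, theta_grid)                      # both are approximately 0.6
```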
Note
The 3/5 and 2/5 probabilities from flipping the thumbtack are computed through the following steps:
binomial distribution, MLE, optimization (derivative).
Number of Trials & Simple Error Bound
- What changes if the number of trials increases?
- The same \(\hat{\theta}\) value comes out.
- Additional trials reduce the error of our estimation.
- Right now, we have \(\hat{\theta} = \frac{a_H}{a_H + a_T}, N = a_H + a_T\)
- Let's say \(\theta^*\) is the true parameter of the thumbtack flipping. For any error bound \(\epsilon > 0\),
- We have a simple upper bound on the probability provided by Hoeffding's inequality: \(P(|\hat{\theta} - \theta^*| \ge \epsilon) \le 2e^{-2N\epsilon^2}\)
- Can you calculate the required number of trials, N? (see the sketch below)
- To obtain \(\epsilon = 0.1\) with a 0.01% error probability
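A short sketch of one way to answer the question, assuming we solve the Hoeffding bound \(2e^{-2N\epsilon^2} \le \delta\) for N; the variable names are illustrative.

```python
# Sketch: solve 2 * exp(-2 * N * eps^2) <= delta for N,
# with eps = 0.1 and delta = 0.01% = 0.0001 as in the question above.
import math

eps, delta = 0.1, 0.0001
n_required = math.ceil(math.log(2 / delta) / (2 * eps ** 2))
print(n_required)   # about 496 trials
```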
Note
This is Probably Approximately Correct (PAC) learning.
Maximum a Posteriori Estimation
Incorporating Prior Knowledge
- Merge the previous knowledge into our trials
- \[P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}\]
- Posterior = (Likelihood x Prior Knowledge) / Normalizing Constant
- Already dealt with \(P(D \mid \theta ) = \theta^ {a_H} (1- \theta) ^{a_T}\)
- The probability of observing this data, assuming the parameter is \(\theta\).
- \(P(\theta)\) is part of the prior knowledge
- Why does the thumbtack land on heads/tails with these probabilities?
- Why do people behave this way?
- P(D) is not very important
- We are actually interested in \(\theta\).
- We can repeat the trials many times, and there is plenty of data these days.
- \(P( \theta \mid D)\): Posterior
- Then, is it the conclusion influenced by the data and the prior knowledge? Yes, and it will be our future prior knowledge!
- That is, it expresses the probability that \(\theta\) holds, given the data.
More Formula from Bayes Viewpoint
- \[P(\theta|D) \propto P(D|\theta) P(\theta)\]
- \[P(D|\theta) = \theta^{a_H} (1-\theta)^{a_T}\]
- \(P(\theta)\) = ????
- It has to be expressed through some distribution (e.g., a binomial distribution).
- The 50%/50% prior knowledge cannot be used as is.
- There are several ways to do this, but one option is:
- The Beta distribution: it is confined to the range [0, 1], so it captures the nature of a probability well (see the Beta Distribution section below)
- \[P(\theta) = \frac{\theta ^{\alpha-1} (1-\theta)^{\beta -1 }}{B(\alpha, \beta)}, B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}, \Gamma (\alpha) = (\alpha -1 )!\]
- The result is
- \[P(\theta|D) \propto P(D|\theta) P(\theta) \propto \theta ^{a_H} (1-\theta)^{a_T }\theta ^{\alpha-1} (1-\theta)^{\beta -1 } = \theta ^{a_H + \alpha-1} (1-\theta)^{a_T + \beta -1 }\]
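The relation above means the Beta prior is conjugate to the binomial likelihood: the posterior is again a Beta distribution with updated parameters. A minimal sketch, assuming the prior hyperparameters \(\alpha = \beta = 2\) (not from the lecture) and using `scipy.stats.beta`:

```python
# Sketch of the Beta-Binomial conjugacy shown above:
# prior Beta(alpha, beta) + data (a_H heads, a_T tails)
# -> posterior Beta(a_H + alpha, a_T + beta).
from scipy.stats import beta

alpha, beta_prior = 2, 2        # assumed prior hyperparameters (not from the lecture)
a_h, a_t = 3, 2                 # observed counts from D = H, H, T, H, T

posterior = beta(a_h + alpha, a_t + beta_prior)
print(posterior.mean())         # posterior mean of theta, 5/9 ~ 0.556
```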
Maximum a Posteriori Estimation
- Goal: the most probable and more approximate \(\theta\)
- Previously in MLE, we found \(\theta\) from \(\hat{\theta} = argmax_\theta P(D\mid \theta)\)
- \[P(D| \theta) = \theta ^{a_H} (1-\theta)^{a_T }\]
- \[\hat{\theta} = argmax_\theta P(D| \theta)\]
- Now in MAP, we find \(\theta\) from \(\hat{\theta} = argmax_\theta P(\theta \mid D)\)
- \[P( \theta|D) \propto \theta ^{a_H + \alpha-1} (1-\theta)^{a_T + \beta -1 }\]
- \[\hat{\theta} = \frac{a_H + \alpha - 1}{a_H + \alpha + a_T + \beta - 2}\]
- The calculation is the same as before because, either way, it is a maximization (see the sketch below).
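A tiny sketch comparing the MAP estimate above with the earlier MLE, again assuming \(\alpha = \beta = 2\) as an illustrative prior:

```python
# Sketch: MAP estimate (a_H + alpha - 1) / (a_H + a_T + alpha + beta - 2)
# next to the MLE a_H / (a_H + a_T). alpha = beta = 2 is an assumed prior.
a_h, a_t = 3, 2
alpha, beta = 2, 2

theta_mle = a_h / (a_h + a_t)
theta_map = (a_h + alpha - 1) / (a_h + a_t + alpha + beta - 2)
print(theta_mle, theta_map)     # 0.6 vs 4/7 ~ 0.571
```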
Note
Instead of maximizing the likelihood (MLE), we maximize the posterior (MAP).
Conclusion from Anecdote
- If \(a_H\) and \(a_T\) become large, \(\alpha\) and \(\beta\) become negligible.
- But \(\alpha\) and \(\beta\) are important when we do not have many trials.
- The result changes depending on the choice of \(\alpha\) and \(\beta\) (see the sketch below).
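A short sketch illustrating this point: with a fixed head ratio, the gap between MLE and MAP shrinks as the number of trials grows. The prior \(\alpha = \beta = 10\) is an assumed, deliberately strong choice.

```python
# Sketch: as a_H and a_T grow (with the same head ratio),
# the prior hyperparameters alpha and beta matter less and less.
alpha, beta = 10, 10            # assumed, deliberately strong prior
for n in (5, 50, 500, 5000):
    a_h = int(0.6 * n)
    a_t = n - a_h
    mle = a_h / n
    map_ = (a_h + alpha - 1) / (n + alpha + beta - 2)
    print(n, round(mle, 4), round(map_, 4))   # MAP approaches MLE as n grows
```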
Basics
Probability
- Philosophically, either of the two:
- Objectivists assign numbers to describe states of events, i.e. counting
- Subjectivists assign numbers by your own belief to events, i.e. betting
- Mathematically
- A function with the below characteristics (the axioms of probability)
- \(P(E) \ge 0\) for every event \(E \subseteq \Omega\)
- \(P(\Omega) = 1\)
- \(P(E_1 \cup E_2 \cup \cdots) = \sum_i P(E_i)\) for mutually exclusive events \(E_1, E_2, \ldots\)
Conditional Probability
- We often do not handle the universe, \(\Omega\)
- Somehow, we always make conditions
- Assuming that the parameters are X, Y, Z, …
- Assuming that the events are in the scope of X, Y, Z, …
- \[P(A \mid B) = \frac{P(A\cap B)}{P(B)}\]
- The conditional probability of A given B
- Some handy formulas
- \[P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}\]
- Nice to see that we can switch the condition and the target event
- Posterior = (Likelihood x Prior Knowledge) / Normalizing Constant
- \[P(A) = \sum_{n} P(A \mid B_n) P(B _n)\]
- Nice to see that we can recover the target event by adding the whole conditional probs and priors
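A toy numeric example (the probabilities are made up) applying the two formulas above, first the law of total probability and then Bayes' rule:

```python
# Toy example: total probability P(A) = sum_n P(A|B_n) P(B_n), then Bayes' rule P(B_1|A).
p_b = [0.3, 0.7]                      # priors P(B_1), P(B_2)
p_a_given_b = [0.9, 0.2]              # likelihoods P(A|B_1), P(A|B_2)

p_a = sum(pa * pb for pa, pb in zip(p_a_given_b, p_b))      # P(A) = 0.41
p_b1_given_a = p_a_given_b[0] * p_b[0] / p_a                # P(B_1|A) ~ 0.659
print(p_a, p_b1_given_a)
```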
Probability Distribution
- Probability distribution assigns
- A probability to a subset of the potential events of a random trial, experiment, survey, etc.
- A function mapping an event to a probability
- Because we call it a probability, the probability should keep its own characteristics (or axioms)
- An event can be
- A continuous numeric value from surveys, trials, experiments …
- A discrete categorical value from surveys, trials, experiments …
- For example,
- f: a probability distribution function
- x: a continuous value
- f(x): the assigned probability density \(\begin{aligned} f(x) = \frac{e^{- \frac{1}{2} x^2}}{\sqrt{2 \pi}} \end{aligned}\)
Normal Distribution
- Very commonly observed distribution
- Continuous numerical value
- \[f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e ^{- \frac{(x - \mu)^2}{2 \sigma ^2}}\]
- Notation: \(N(\mu, \sigma ^2)\)
- Mean: \(\mu\)
- Variance: \(\sigma^2\)
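A quick sketch checking these facts with `scipy.stats.norm`; note that `scale` is \(\sigma\), not \(\sigma^2\).

```python
# Quick check of N(mu, sigma^2) mean, variance, and density.
from scipy.stats import norm

mu, sigma = 1.0, 2.0
dist = norm(loc=mu, scale=sigma)      # scale is sigma, so the variance is sigma^2
print(dist.mean(), dist.var())        # 1.0, 4.0
print(dist.pdf(mu))                   # density at the mean: 1 / (sigma * sqrt(2*pi))
```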
Beta Distribution
- Supports a closed interval
- Continuous numerical value
- [0,1]
- Very nice characteristic
- Why?
- Matches up the characteristics of probs
- \[f(\theta; \alpha, \beta) = \frac{\theta ^{\alpha-1} (1-\theta)^{\beta -1 }}{B(\alpha, \beta)}, B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}, \Gamma (\alpha) = (\alpha -1 )!\]
- Notation: Beta(\(\alpha, \beta\))
- Mean: \(\frac{\alpha}{\alpha + \beta}\)
- Variance: \(\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta +1)}\)
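A quick numeric check of the mean and variance formulas with `scipy.stats.beta`, for the arbitrary choice \(\alpha = 2, \beta = 3\):

```python
# Quick check of the Beta(alpha, beta) mean and variance formulas above.
from scipy.stats import beta

a, b = 2.0, 3.0
dist = beta(a, b)
print(dist.mean(), a / (a + b))                               # both 0.4
print(dist.var(), a * b / ((a + b) ** 2 * (a + b + 1)))       # both 0.04
```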
Binomial Distribution
- Simplest distribution for discrete values
- Bernoulli trial, yes or no
- 0 or 1
- Selection, switch …
- \[f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}, \binom{n}{k} = \frac{n!}{k! (n-k)!}\]
- Notation: \(B(n,p)\)
- Mean: \(np\)
- Variance: \(np(1-p)\)
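A quick check with `scipy.stats.binom`, using the arbitrary values \(n = 5, p = 0.6\):

```python
# Quick check of the B(n, p) pmf, mean, and variance above.
from scipy.stats import binom

n, p = 5, 0.6
dist = binom(n, p)
print(dist.pmf(3))                    # P(k = 3 successes in 5 trials) = 0.3456
print(dist.mean(), n * p)             # 3.0, 3.0
print(dist.var(), n * p * (1 - p))    # 1.2, 1.2
```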
Multinomial Distribution
- The generalization of the binomial distribution
- e.g., Choose A, B, C, D, E,….,Z
- \[f(x_1, ..., x_k; n, p_1, ..., p_k) = \frac{n!}{x_1! \cdots x_k !} p_1^{x_1} \cdots p_k^{x_k}\]
- Notation: Multi(P), \(P=< p_1, ..., p_k>\)
- Mean: \(E(x_i) = np_i\)
- Variance: Var\((x_i) = np_i (1-p_i)\)
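A quick check with `scipy.stats.multinomial`, for an arbitrary three-category example:

```python
# Quick check of the multinomial pmf above, plus the per-category moments.
from scipy.stats import multinomial

n, p = 10, [0.2, 0.3, 0.5]
dist = multinomial(n, p)
print(dist.pmf([2, 3, 5]))                   # probability of observing counts (2, 3, 5)
print([n * p_i for p_i in p])                # per-category means E(x_i) = n * p_i
print([n * p_i * (1 - p_i) for p_i in p])    # per-category variances n * p_i * (1 - p_i)
```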
Summary
Through the thumbtack-flipping example, we saw that the probability of the next outcome is not simply asserted as (count / trials); it is derived through the process of assuming a binomial distribution -> MLE or MAP (using prior knowledge) -> optimization (taking the derivative).
MAP extends the MLE procedure of maximizing the likelihood by bringing in prior knowledge: Bayes' rule is applied to compute the posterior, and the prior knowledge was assumed to follow a Beta distribution (a continuous distribution over the range 0 to 1).
- However, as the number of trials grows, MLE and MAP end up giving the same result.