Week 1 - Motivation and Basics
Table of Contents
- Weekly Objectives
  - Motivate the study of machine learning, AI, data mining, ...
    - Why? What?
    - Overview of the field
  - Short questions and answers on a story: what constitutes machine learning?
    - MLE
    - MAP
  - Some basics
    - Probability
    - Distribution
    - And some rules...
 
 
Maximum Likelihood Estimation
Binomial Distribution
- The binomial distribution is
  - The discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, where each success has probability \(\theta\)
  - Each such yes/no experiment is also called a Bernoulli experiment
 
- Flips are i.i.d.
  - Independent events (the outcome of the first flip has no influence on the second flip)
  - Identically distributed according to the binomial distribution (every flip follows the same probability distribution)
 
- P(H) = \(\theta\), P(T)=1 - \(\theta\)
- P(HHTHT)= \(\theta \cdot \theta \cdot (1- \theta) \cdot \theta \cdot (1- \theta) = \theta ^3(1 - \theta)^2\)
- Let’s say
  - D as Data = H, H, T, H, T
  - n = 5
  - k = \(a_H\) = 3
  - p = \(\theta\)

- p(D\(\mid \theta\)) = \(\theta ^{a_H} (1-\theta)^{a_T}\) (evaluated numerically in the sketch below)
 
- D as Data = H, H, T, H, T
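To make the likelihood concrete, here is a minimal sketch that evaluates \(P(D \mid \theta) = \theta^{a_H}(1-\theta)^{a_T}\) for the observed sequence at a few candidate values of \(\theta\) (the function name and the candidate values are illustrative choices, not part of the lecture):

```python
def likelihood(theta, a_h, a_t):
    """P(D | theta) for a_h heads and a_t tails: theta^a_h * (1 - theta)^a_t."""
    return theta ** a_h * (1 - theta) ** a_t

a_h, a_t = 3, 2  # counts from D = H, H, T, H, T

for theta in (0.3, 0.5, 0.6, 0.7):
    print(f"theta = {theta:.1f} -> P(D | theta) = {likelihood(theta, a_h, a_t):.5f}")
# theta = 0.6 (= 3/5) gives the largest likelihood among these candidates,
# which previews the MLE result derived in the next section.
```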
How to get the probability?
Maximum Likelihood Estimation
- p(D\(\mid \theta\)) = \(\theta ^{a_H} (1-\theta)^{a_T}\)
  - Data: We have observed the sequence D with \(a_H\) heads and \(a_T\) tails
 
- Our hypothesis:
  - The gambling result follows the binomial distribution of \(\theta\)
 
- How to make our hypothesis strong (how can we claim it is “true”)?
  - Finding out a better distribution of the observation
    - Can be done, but you need more rationale.
    - If there is a better hypothesis structure than the binomial distribution, that is also possible, but it requires more rational justification.
 
  - Finding out the best candidate of \(\theta\)
    - What is the condition that makes \(\theta\) most plausible?
    - We can strengthen the hypothesis by optimizing \(\theta\).
 
 
- Finding out a better distribution of the observation
Which choice of \(\theta\) explains the observed data best?
- One candidate is the __Maximum Likelihood Estimation (MLE)__ of \(\theta\)
  - Choose \(\theta\) that maximizes the probability of the observed data
 
MLE calculation
- \[\hat{\theta} = argmax_\theta P(D| \theta) = argmax_\theta \theta ^{a_H} (1-\theta)^{a_T}\]
- This is hard to optimize directly (the products of powers are awkward to handle), so we use a trick
  - Using the log function.
 
- \[\hat{\theta} = argmax_\theta \ln P(D| \theta) = argmax_\theta \ln(\theta ^{a_H} (1-\theta)^{a_T}) = argmax_\theta ( a_H \ln\theta + a_T \ln (1-\theta) )\]
- Then, this is a maximization problem, so we take the derivative and set it to zero
  - \[\frac{d}{d\theta} ( a_H \ln\theta + a_T \ln (1-\theta) ) = 0\]
- \[\frac{a_H}{\theta} - \frac{a_T}{1-\theta} = 0\]
- \[\theta = \frac{a_H}{a_T + a_H}\]
- When \(\theta = \frac{a_H}{a_T + a_H}\), it is the best candidate of \(\theta\) from the MLE perspective.
 
- \[\hat{\theta} = \frac{a_H}{a_H + a_T}\]
Note
Probabilities such as 2/5 or 3/5 for a thumbtack flip are not just raw counts; they are computed through the steps above:
binomial distribution assumption, MLE, and optimization (setting the derivative to zero).
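As a quick numerical check of the closed form \(\hat{\theta} = \frac{a_H}{a_H + a_T}\), here is a sketch that compares it with a simple grid search over the log-likelihood (the grid resolution is an arbitrary choice):

```python
import numpy as np

a_h, a_t = 3, 2  # observed heads/tails from D = H, H, T, H, T

# Closed-form MLE derived above
theta_closed = a_h / (a_h + a_t)

# Numerical check: maximize a_H*ln(theta) + a_T*ln(1 - theta) over a fine grid
grid = np.linspace(1e-6, 1 - 1e-6, 100001)
log_lik = a_h * np.log(grid) + a_t * np.log(1 - grid)
theta_grid = grid[np.argmax(log_lik)]

print(f"closed form : {theta_closed:.4f}")  # 0.6000
print(f"grid search : {theta_grid:.4f}")    # ~0.6000
```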
Number of Trials & Simple Error Bound
- What changes if the number of trials grows?
  - The same value of \(\hat{\theta}\) can come out.
- Additional trials reduce the error of our estimation.
- Right now, we have \(\hat{\theta} = \frac{a_H}{a_H + a_T}, N = a_H + a_T\)
- Let’s say \(\theta^*\) is the true parameter of the thumbtack flipping; then for any error bound \(\epsilon > 0\)
- We have a simple upper bound on the probability provided by Hoeffding’s inequality
  - \[P(|\hat{\theta} - \theta^*| \ge \epsilon) \le 2e^{-2N\epsilon^2}\]
 
- Can you calculate the required number of trials, N?
  - To obtain \(\epsilon\) = 0.1 with a 0.01% failure probability (see the sketch below)
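Assuming the Hoeffding bound above, requiring \(2e^{-2N\epsilon^2} \le \delta\) and solving for N gives \(N \ge \frac{\ln(2/\delta)}{2\epsilon^2}\). A minimal sketch of that calculation (the helper name `required_trials` is mine, not from the lecture):

```python
import math

def required_trials(epsilon, delta):
    """Smallest N such that 2*exp(-2*N*epsilon^2) <= delta (Hoeffding bound)."""
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

# epsilon = 0.1 error, delta = 0.01% = 0.0001 failure probability
print(required_trials(0.1, 0.0001))  # 496
```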
 
Note
This is Probably Approximately Correct (PAC) learning.
Maximum a Posteriori Estimation
Incorporating Prior Knowledge
- Merge the prior knowledge into my trial
  - \[P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}\]
  - Posterior = (Likelihood x Prior Knowledge) / Normalizing Constant
 
- Already dealt with \(P(D \mid \theta ) = \theta^ {a_H} (1- \theta) ^{a_T}\)
  - The probability of observing this data under the assumption that the parameter is \(\theta\).
 
- \(P(\theta)\) is part of the prior knowledge
  - Why does the thumbtack land on its head/tail with these probabilities?
  - Why do people behave this way?
 
- P(D) is not that important
  - We are really interested in \(\theta\).
  - We can repeat the trials, and there is plenty of data these days.
 
- \(P( \theta \mid D)\) : Posterior
  - Then, is it the conclusion influenced by the data and the prior knowledge? Yes, and it will be our future prior knowledge!
  - In other words, it expresses how probable \(\theta\) is given the observed data.
 
 
-  \[P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}\]
More Formulas from the Bayesian Viewpoint
-  \[P(\theta|D) \propto P(D|\theta) P(\theta)\]
  - \[P(D|\theta) = \theta^{a_H} (1-\theta)^{a_T}\]
  - \(P(\theta)\) = ????
    - It has to be expressed through some distribution (e.g., a binomial-style distribution).
    - Prior knowledge such as “50%, 50%” cannot be plugged in directly as a raw number.
 
- There are several ways to do this; one of them:
  - The Beta distribution, a continuous distribution confined to the interval [0,1], which matches the character of a probability nicely
  - \[P(\theta) = \frac{\theta ^{\alpha-1} (1-\theta)^{\beta -1 }}{B(\alpha, \beta)}, \quad B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}, \quad \Gamma (\alpha) = (\alpha -1 )! \;\text{ for integer } \alpha\]
 
 

- The result is
  - \[P(\theta|D) \propto P(D|\theta) P(\theta) \propto \theta ^{a_H} (1-\theta)^{a_T }\theta ^{\alpha-1} (1-\theta)^{\beta -1 } = \theta ^{a_H + \alpha-1} (1-\theta)^{a_T + \beta -1 }\]
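Because the Beta prior is conjugate to this likelihood, the posterior is again a Beta distribution, \(P(\theta \mid D) = \text{Beta}(a_H + \alpha,\; a_T + \beta)\). A sketch using `scipy.stats` (the prior hyperparameters \(\alpha = \beta = 3\) are an illustrative assumption):

```python
from scipy.stats import beta

a_h, a_t = 3, 2   # observed heads/tails
alpha, b = 3, 3   # illustrative Beta prior hyperparameters

# Conjugacy: Beta prior x binomial likelihood -> Beta posterior
posterior = beta(a_h + alpha, a_t + b)

print(f"posterior mean : {posterior.mean():.4f}")  # (a_H + alpha) / (a_H + alpha + a_T + beta)
print(f"posterior mode : {(a_h + alpha - 1) / (a_h + alpha + a_t + b - 2):.4f}")  # the MAP estimate
```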
 
Maximum a Posteriori Estimation
- Goal: the most probable and more appropriate \(\theta\)
- Previously in MLE, we found \(\theta\) from \(\hat{\theta} = argmax_\theta P(D\mid \theta)\)
  - \[P(D| \theta) = \theta ^{a_H} (1-\theta)^{a_T }\]
  - \[\hat{\theta} = argmax_\theta P(D| \theta)\]
 
- Now in MAP, we find \(\theta\) from \(\hat{\theta} = argmax_\theta P(\theta \mid D)\)
  - \[P( \theta|D) \propto \theta ^{a_H + \alpha-1} (1-\theta)^{a_T + \beta -1 }\]
  - \[\hat{\theta} = \frac{a_H + \alpha - 1}{a_H + \alpha + a_T + \beta - 2}\]
 
- The calculation is the same as in MLE because it is again a maximization problem (worked out below).
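For completeness, applying the same log-and-derivative trick to the posterior yields the estimate above:

\[\begin{aligned} \frac{d}{d\theta}\Big( (a_H + \alpha - 1)\ln\theta + (a_T + \beta - 1)\ln(1-\theta) \Big) &= \frac{a_H + \alpha - 1}{\theta} - \frac{a_T + \beta - 1}{1-\theta} = 0 \\ \Rightarrow \quad \hat{\theta} &= \frac{a_H + \alpha - 1}{a_H + \alpha + a_T + \beta - 2} \end{aligned}\]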
Note
In MAP we maximize the posterior, rather than maximizing the likelihood as in MLE.
Conclusion from Anecdote
- If \(a_H\) and \(a_T\) become big, \(\alpha\) and \(\beta\) matter less and less.
  - But \(\alpha\) and \(\beta\) are important if I don’t give you more trials.
  - The result changes depending on the choice of \(\alpha\) and \(\beta\) (see the sketch below).
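A small simulation sketch of this point (the true \(\theta\), the prior hyperparameters, and the trial counts are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.6   # assumed "true" thumbtack parameter, for illustration only
alpha, b = 10, 10  # illustrative Beta prior hyperparameters

for n in (10, 100, 10000):
    flips = rng.random(n) < theta_true
    a_h = int(flips.sum())
    a_t = n - a_h
    mle = a_h / (a_h + a_t)
    map_ = (a_h + alpha - 1) / (a_h + alpha + a_t + b - 2)
    print(f"N={n:6d}  MLE={mle:.3f}  MAP={map_:.3f}")
# With small N the prior (alpha, beta) pulls MAP toward 0.5;
# as N grows, MLE and MAP become nearly identical.
```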
 
 
 
Basics
Probability
- Philosophically, either of the two
  - Objectivists assign numbers to describe states of events, i.e., counting
  - Subjectivists assign numbers based on their own belief in events, i.e., betting
 
- Mathematically
  - A function with the following characteristics (the Kolmogorov axioms)
    - \(P(E) \ge 0\) for every event \(E\)
    - \(P(\Omega) = 1\)
    - \(P(E_1 \cup E_2 \cup \dots) = \sum_i P(E_i)\) for mutually exclusive events \(E_1, E_2, \dots\)
 

Conditional Probability
- We often do not handle the universe, \(\Omega\)
- Somehow, we always impose conditions
  - Assuming that the parameters are X, Y, Z, …
  - Assuming that the events are in the scope of X, Y, Z, …
 

-  \[P(A \mid B) = \frac{P(A\cap B)}{P(B)}\]
  - The conditional probability of A given B
 
- Some handy formulas
  -  \[P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}\]
    - Nice to see that we can switch the condition and the target event
 
- Posterior = (Likelihood x Prior Knowledge) / Normalizing Constant
-  \[P(A) = \sum_{n} P(A \mid B_n) P(B_n)\]
  - Nice to see that we can recover the target event by summing the conditional probabilities weighted by the priors
 
 
-  \[P(B \mid A) = \frac{P(A \mid B) P(B)}{P(A)}\]
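A small numeric sketch of these two formulas (the event probabilities below are made up purely for illustration):

```python
# Hypothetical example: B1 and B2 partition the sample space.
p_b = {"B1": 0.3, "B2": 0.7}
p_a_given_b = {"B1": 0.9, "B2": 0.2}

# Total probability: P(A) = sum_n P(A | B_n) P(B_n)
p_a = sum(p_a_given_b[b] * p_b[b] for b in p_b)

# Bayes' rule: P(B1 | A) = P(A | B1) P(B1) / P(A)
p_b1_given_a = p_a_given_b["B1"] * p_b["B1"] / p_a

print(f"P(A)      = {p_a:.3f}")           # 0.410
print(f"P(B1 | A) = {p_b1_given_a:.3f}")  # 0.659
```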
Probability Distribution
- A probability distribution assigns
  - A probability to a subset of the potential events of a random trial, experiment, survey, etc.
 
- A function mapping an event to a probability
  - Because we call it a probability, it should keep the characteristics (or axioms) of a probability
  - An event can be
    - A continuous numeric value from surveys, trials, experiments …
    - A discrete categorical value from surveys, trials, experiments …
 
 
- For example,
  - f: a probability distribution function
  - x: a continuous value
  - f(x): the assigned probability density \(\begin{aligned} f(x) = \frac{e^{- \frac{1}{2} x^2}}{\sqrt{2 \pi}} \end{aligned}\)
 

Normal Distribution
- Very commonly observed distribution
  - Continuous numerical value
 
- \[f(x; \mu, \sigma) = \frac{1}{\sigma \sqrt{2\pi}} e ^{- \frac{(x - \mu)^2}{2 \sigma ^2}}\]
- Notation: \(N(\mu, \sigma ^2)\)
- Mean: \(\mu\)
- Variance: \(\sigma^2\)
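As a quick sanity check of the density formula, a sketch comparing a direct evaluation against `scipy.stats.norm` (the values of \(\mu\), \(\sigma\), and x are arbitrary):

```python
import math
from scipy.stats import norm

mu, sigma, x = 1.0, 2.0, 0.5  # arbitrary example values

# Direct evaluation of f(x; mu, sigma)
f_direct = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(f"direct : {f_direct:.6f}")
print(f"scipy  : {norm(loc=mu, scale=sigma).pdf(x):.6f}")  # should match
```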

Beta Distribution
- Supports a closed interval
  - Continuous numerical value on [0,1]
  - Very nice characteristic
    - Why? It matches up with the characteristics of a probability, which also lies in [0,1]
 
 
- \[f(\theta; \alpha, \beta) = \frac{\theta ^{\alpha-1} (1-\theta)^{\beta -1 }}{B(\alpha, \beta)}, \quad B(\alpha, \beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}, \quad \Gamma (\alpha) = (\alpha -1 )!\]
- Notation: Beta(\(\alpha, \beta\))
- Mean: \(\frac{\alpha}{\alpha + \beta}\)
- Variance: \(\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta +1)}\)
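A quick numerical check of the stated mean and variance against `scipy.stats.beta` (the hyperparameters are arbitrary examples):

```python
from scipy.stats import beta

a, b = 3, 5  # arbitrary example hyperparameters
dist = beta(a, b)

mean_formula = a / (a + b)
var_formula = a * b / ((a + b) ** 2 * (a + b + 1))

print(f"mean : formula={mean_formula:.4f}  scipy={dist.mean():.4f}")
print(f"var  : formula={var_formula:.4f}  scipy={dist.var():.4f}")
```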

Binomial Distribution
- Simplest distribution for discrete values
  - Bernoulli trial, yes or no
  - 0 or 1
  - Selection, switch …
 
- \[f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}, \quad \binom{n}{k} = \frac{n!}{k! (n-k)!}\]
- Notation: \(B(n,p)\)
- Mean: \(np\)
- Variance: \(np(1-p)\)

Multinomial Distribution
- The generalization of the binomial distribution
  - e.g., choose among A, B, C, D, E, …, Z
 
- \[f(x_1, ..., x_k; n, p_1, ..., p_k) = \frac{n!}{x_1! \cdots x_k !} p_1^{x_1} \cdots p_k^{x_k}\]
- Notation: Multi(P), \(P=< p_1, ..., p_k>\)
- Mean: \(E(x_i) = np_i\)
- Variance: Var\((x_i) = np_i (1-p_i)\)
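A sketch checking the stated means and variances numerically (parameters chosen arbitrarily; the multinomial check is empirical via sampling):

```python
import numpy as np
from scipy.stats import binom

# Binomial: exact mean/variance from scipy should equal np and np(1-p)
n, p = 10, 0.3
print(binom(n, p).mean(), binom(n, p).var())  # 3.0, 2.1

# Multinomial: empirical check of E(x_i) = n*p_i and Var(x_i) = n*p_i*(1-p_i)
probs = [0.2, 0.3, 0.5]
samples = np.random.default_rng(0).multinomial(n, probs, size=100_000)
print(samples.mean(axis=0))  # approx [2.0, 3.0, 5.0]
print(samples.var(axis=0))   # approx [1.6, 2.1, 2.5]
```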
Summary
- Through the thumbtack-flipping story, we saw that the probability of the next outcome is not obtained simply as (count / number of trials); it is derived through the chain: assume a binomial distribution -> MLE or MAP (using prior knowledge) -> optimization (derivative).
- MAP applies Bayes’ rule on top of the MLE procedure (which maximizes the likelihood) in order to incorporate prior knowledge into a posterior, and the prior was assumed to follow a Beta distribution (a continuous distribution on [0,1]).
  - However, as the number of trials grows, MLE and MAP end up giving essentially the same result.
 
