Table of Contents

Classification
  1、Classification with logistic regression
    1.1、Motivations
    1.2、Logistic regression
    1.3、Decision boundary
  2、Cost function
  3、Gradient descent
    3.1、(Supplement) Scikit-Learn
  4、The problem of overfitting
    4.1、Regularization

Classification

1、Classification with logistic regression

1.1、Motivations

We can't use linear regression to solve a binary classification problem, even if we pick a threshold of, say, 0.5. It doesn't work well when outliers occur, because a single extreme example can shift the fitted line and move the threshold crossing.

This motivates a model designed for classification: logistic regression.

1.2、Logistic regression

The formula for a sigmoid function is as follows:

$$g(z) = \frac{1}{1+e^{-z}}\tag{1}$$

In the case of logistic regression, z (the input to the sigmoid function) is the output of a linear regression model.

$$f(\mathbf{x}) = g(w_0x_0 + b)$$

or

$$f(\mathbf{x}) = g(w_0x_0 + w_1x_1 + b)$$

or something more complex. (How to fit these parameters to the data is covered later in the course.)
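As a minimal sketch of these definitions: the sigmoid helper below matches the name the code later in this post relies on, while the f_wb wrapper and the example values are my own illustrative additions, not from the course labs.

import numpy as np

def sigmoid(z):
    """Sigmoid of z, element-wise; z can be a scalar or an ndarray."""
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    """Logistic model: sigmoid applied to the linear model w·x + b."""
    return sigmoid(np.dot(w, x) + b)

# Example: with w = [1, 1] and b = -3, the point x = [2, 2] gives z = 1 > 0,
# so the model outputs a probability above 0.5.
print(f_wb(np.array([2.0, 2.0]), np.array([1.0, 1.0]), -3.0))  # ~0.73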

1.3、Decision boundary

For formula (1), we know that $g(z) \geq 0.5$ for $z \geq 0$. Therefore:

if $\mathbf{w} \cdot \mathbf{x} + b \geq 0$, the model predicts $y=1$;

if $\mathbf{w} \cdot \mathbf{x} + b < 0$, the model predicts $y=0$.

We can use $\mathbf{w} \cdot \mathbf{x} + b = 0$ as the decision boundary.
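As a small illustration (the values w = [1, 1] and b = -3 are made up for this example, not fitted), the decision boundary is the line x0 + x1 = 3, and points on either side of it receive different predictions:

import numpy as np

w = np.array([1.0, 1.0])
b = -3.0  # decision boundary: x0 + x1 = 3

for x in [np.array([1.0, 1.0]), np.array([2.5, 2.5])]:
    z = np.dot(w, x) + b
    y_hat = 1 if z >= 0 else 0   # threshold at z = 0, i.e. g(z) = 0.5
    print(x, "-> z =", z, "-> prediction:", y_hat)
# [1. 1.]   -> z = -1.0 -> prediction: 0
# [2.5 2.5] -> z =  2.0 -> prediction: 1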

2、Cost function

If we use the mean squared error as the cost for logistic regression, the cost function is non-convex, so it is more difficult for gradient descent to find optimal values for the parameters w and b.

A real-valued function is called convex if the line segment between any two points on the graph of the function lies above the graph between the two points.

Instead we use the logistic loss, defined piecewise as $-\log(f_{\mathbf{w},b}(\mathbf{x}^{(i)}))$ when $y^{(i)}=1$ and $-\log(1-f_{\mathbf{w},b}(\mathbf{x}^{(i)}))$ when $y^{(i)}=0$. This loss can be rewritten as a single expression that is easier to implement:

$$loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) = -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)$$

The loss is calculated on a single training example, while the cost is the average loss over the whole training set. The cost function is of the form

$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ loss(f_{\mathbf{w},b}(\mathbf{x}^{(i)}), y^{(i)}) \right]$$

import numpy as np

def compute_cost_logistic(X, y, w, b):
    """
    Computes cost

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter

    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        z_i = np.dot(X[i], w) + b        # linear model for example i
        f_wb_i = sigmoid(z_i)            # logistic prediction for example i
        cost += -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1-f_wb_i)
    cost = cost / m
    return cost
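As a quick sanity check (the toy data and variable names below are made up for illustration), evaluating the cost at w = 0, b = 0 should give log(2) ≈ 0.693 regardless of the labels, because the model then predicts 0.5 for every example:

X_tmp = np.array([[0.5, 1.5], [1.0, 1.0], [2.0, 2.0]])
y_tmp = np.array([0, 0, 1])
w_tmp = np.zeros(X_tmp.shape[1])
b_tmp = 0.0
print(compute_cost_logistic(X_tmp, y_tmp, w_tmp, b_tmp))  # ~0.693 = ln(2)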

3、Gradient descent

$$\begin{align*} &\text{repeat until convergence:} \; \lbrace \\ & \; \; \;w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \; & \text{for j := 0..n-1} \\ & \; \; \; \; \;b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\ &\rbrace \end{align*}$$

Surprisingly, the update rule has the same form as the one derived from the squared-error cost in linear regression, so we can reuse the same gradient descent procedure for logistic regression; the only difference is that $f_{\mathbf{w},b}(\mathbf{x})$ is now the sigmoid of the linear model.

$$\begin{align*} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{2} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{3} \end{align*}$$

The code:

def compute_gradient_logistic(X, y, w, b):
    """
    Computes the gradient for logistic regression

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter

    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape
    dj_dw = np.zeros((n,))                           #(n,)
    dj_db = 0.
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)        #(n,)(n,)=scalar
        err_i = f_wb_i - y[i]                        #scalar
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i, j]    #scalar
        dj_db = dj_db + err_i
    dj_dw = dj_dw / m                                #(n,)
    dj_db = dj_db / m                                #scalar

    return dj_db, dj_dw

import math

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters, lambda_):
    """
    Performs batch gradient descent to learn w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha

    Args:
      X : (array_like Shape (m, n))
      y : (array_like Shape (m,))
      w_in : (array_like Shape (n,)) Initial values of parameters of the model
      b_in : (scalar)                Initial value of parameter of the model
      cost_function     : function to compute cost
      gradient_function : function to compute the gradient
      alpha : (float) Learning rate
      num_iters : (int) number of iterations to run gradient descent
      lambda_ : (scalar, float) regularization constant

    Returns:
      w : (array_like Shape (n,)) Updated values of parameters of the model after
          running gradient descent
      b : (scalar) Updated value of parameter of the model after
          running gradient descent
    """
    # number of training examples
    m = len(X)

    # Arrays to store cost J and w at each iteration, primarily for graphing later
    J_history = []
    w_history = []

    for i in range(num_iters):
        # Calculate the gradient and update the parameters
        dj_db, dj_dw = gradient_function(X, y, w_in, b_in, lambda_)

        # Update parameters using w, b, alpha and the gradient
        w_in = w_in - alpha * dj_dw
        b_in = b_in - alpha * dj_db

        # Save cost J at each iteration
        if i < 100000:      # prevent resource exhaustion
            cost = cost_function(X, y, w_in, b_in, lambda_)
            J_history.append(cost)

        # Print cost at intervals 10 times, or every iteration if num_iters < 10
        if i % math.ceil(num_iters/10) == 0 or i == (num_iters-1):
            w_history.append(w_in)
            print(f"Iteration {i:4}: Cost {float(J_history[-1]):8.2f}   ")

    return w_in, b_in, J_history, w_history  # return w, b and the J, w history for graphing
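Here is a minimal, assumed usage sketch (the toy data, the alpha and num_iters values, and the lambda-ignoring wrapper functions are my own; the unregularized cost and gradient functions above do not take a lambda_ argument, so small adapters are needed to match the signature gradient_descent expects):

X_train = np.array([[0.5, 1.5], [1, 1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Adapters: accept and ignore lambda_ so the signatures match gradient_descent.
cost_fn = lambda X, y, w, b, lambda_: compute_cost_logistic(X, y, w, b)
grad_fn = lambda X, y, w, b, lambda_: compute_gradient_logistic(X, y, w, b)

w_init = np.zeros(X_train.shape[1])
b_init = 0.0
w_out, b_out, J_hist, w_hist = gradient_descent(
    X_train, y_train, w_init, b_init, cost_fn, grad_fn,
    alpha=0.1, num_iters=10000, lambda_=0.0)
print("Learned parameters:", w_out, b_out)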

The prediction function:

def predict(X, w, b):
    """
    Predict whether the label is 0 or 1 using learned logistic
    regression parameters w, b

    Args:
      X : (ndarray Shape (m, n))
      w : (array_like Shape (n,)) Parameters of the model
      b : (scalar, float)         Parameter of the model

    Returns:
      p : (ndarray (m,)) The predictions for X using a threshold at 0.5
    """
    # number of training examples
    m, n = X.shape
    p = np.zeros(m)

    ### START CODE HERE ###
    # Loop over each example
    for i in range(m):
        z_wb = 0
        # Loop over each feature
        for j in range(n):
            # Add the corresponding term to z_wb
            z_wb += X[i][j] * w[j]

        # Add the bias term
        z_wb += b

        # Calculate the prediction for this example
        f_wb = sigmoid(z_wb)

        # Apply the threshold
        p[i] = f_wb > 0.5
    ### END CODE HERE ###
    return p
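A short follow-up, assuming the X_train, y_train, w_out, and b_out from the gradient descent sketch above: predictions and training accuracy can be checked like this:

p = predict(X_train, w_out, b_out)
print("Training accuracy:", np.mean(p == y_train))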

3.1、(Supplement) Scikit-Learn

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5, 1.5], [1, 1], [1.5, 0.5], [3, 0.5], [2, 2], [1, 2.5]])
y = np.array([0, 0, 0, 1, 1, 1])

lr_model = LogisticRegression()
lr_model.fit(X, y)

y_pred = lr_model.predict(X)
print("Prediction on training set:", y_pred)

4、The problem of overfitting

Overfitting means the model fits the training data very well but generalizes poorly to new examples. To address this problem we can:

- Collect more training examples
- Use fewer features
- Reduce the size of the parameters $w_j$ (regularization)

With regularization, all features are kept, but the weights of unimportant features are set to 0 or shrunk to small values, so the feature weights become sparse (or nearly so) and each such feature has little impact on the prediction.

4.1、Regularization

Add a penalty (regularization) term to the cost function that discourages large parameter values, shrinking each parameter toward zero.

For linear regression:

$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$

where:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$$

For logistic regression:

$$J(\mathbf{w},b) = \frac{1}{m} \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m} \sum_{j=0}^{n-1} w_j^2$$

where:

$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = sigmoid(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)$$

$\lambda$ is a parameter we choose ourselves, just like the learning rate $\alpha$. For both linear and logistic regression, the regularized gradients are:

$$\begin{align*} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m} w_j\\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \end{align*}$$

Note that the bias term b is not regularized. The code for the logistic model:

def compute_cost_logistic_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization

    Returns:
      total_cost (scalar): cost
    """
    m, n = X.shape
    cost = 0.
    for i in range(m):
        z_i = np.dot(X[i], w) + b                                   #(n,)(n,)=scalar, see np.dot
        f_wb_i = sigmoid(z_i)                                       #scalar
        cost += -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1-f_wb_i)    #scalar
    cost = cost/m                                                   #scalar

    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)                                       #scalar
    reg_cost = (lambda_/(2*m)) * reg_cost                           #scalar

    total_cost = cost + reg_cost                                    #scalar
    return total_cost                                               #scalar

def compute_gradient_logistic_reg(X, y, w, b, lambda_):
    """
    Computes the gradient for regularized logistic regression

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization

    Returns:
      dj_dw (ndarray Shape (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar)            : The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape
    dj_dw = np.zeros((n,))                           #(n,)
    dj_db = 0.0                                      #scalar
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i], w) + b)        #(n,)(n,)=scalar
        err_i = f_wb_i - y[i]                        #scalar
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i, j]    #scalar
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m                                  #(n,)
    dj_db = dj_db/m                                  #scalar

    # Add the regularization term to the weight gradients (b is not regularized)
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]

    return dj_db, dj_dw
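A brief, assumed usage sketch (the random toy data and the temporary variable names are my own): this evaluates the regularized cost and gradient once to show how lambda_ is passed.

np.random.seed(1)
X_tmp = np.random.rand(5, 3)
y_tmp = np.array([0, 1, 0, 1, 0])
w_tmp = np.random.rand(X_tmp.shape[1]) - 0.5
b_tmp = 0.5
lambda_tmp = 0.7

print("Regularized cost:", compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp))
dj_db_tmp, dj_dw_tmp = compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)
print("dj_db:", dj_db_tmp)
print("dj_dw:", dj_dw_tmp)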
