AGI Fundamentals: Week 0

I would like to thank UCLA for finally granting me unrestricted MathType access, my matcha latte for granting me the energy to learn Overleaf this afternoon, and AGI safety fundamentals for allowing me into this course. Here are my notes for this week. 


Given a linear regression equation written in the form: 


$$y = \beta_0 + \beta_1 X + \varepsilon$$

$\varepsilon$ is the random error term, $\beta_0$ is the y-intercept, and $\beta_1$ is the slope. We aim to find the parameters that minimize the error in the model's predictions.


$$\mathrm{COST} = \frac{1}{2n}\sum_{i=1}^{n}\left(\left(\beta_1 x_i + \beta_0\right) - y_i\right)^2$$

The cost (loss) function measures how inaccurate the model's predictions are.
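As a quick sketch (my own illustrative names, not from the course materials), the cost above can be computed directly:

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """Mean squared error with the 1/(2n) scaling used above."""
    n = len(x)
    residuals = (beta1 * x + beta0) - y
    return np.sum(residuals ** 2) / (2 * n)

# Example: a perfect fit y = 2x + 1 gives zero cost.
x = np.array([0.0, 1.0, 2.0])
y = 2.0 * x + 1.0
print(cost(1.0, 2.0, x, y))  # 0.0
```

With the correct parameters the residuals vanish and the cost is zero; any other choice of betas makes it strictly positive.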


Here, X, Y, and n are given. The function is $f(\beta_0, \beta_1) = z$, with the beta parameters as inputs. We perform gradient descent by first computing partial derivatives.


Specifically, the gradient $\left[\frac{\partial z}{\partial \beta_0}, \frac{\partial z}{\partial \beta_1}\right]$ gives the change in total loss with respect to changes in $\beta_0$ and $\beta_1$. Since we want to minimize total loss, if $\frac{\partial z}{\partial \beta_1}$ is negative, we want to increase $\beta_1$.
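A minimal sketch of one descent step (names are my own; the partial derivatives follow from the 1/(2n)-scaled squared-error cost):

```python
import numpy as np

def gradient_step(beta0, beta1, x, y, lr=0.1):
    """One gradient descent update for the 1/(2n)-scaled squared-error cost."""
    n = len(x)
    residuals = (beta1 * x + beta0) - y
    d_beta0 = np.sum(residuals) / n       # dz/d(beta0)
    d_beta1 = np.sum(residuals * x) / n   # dz/d(beta1)
    # Step against the gradient: a negative dz/d(beta1) increases beta1.
    return beta0 - lr * d_beta0, beta1 - lr * d_beta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
b0, b1 = 0.0, 0.0
for _ in range(2000):
    b0, b1 = gradient_step(b0, b1, x, y)
# b0, b1 approach the true parameters 1.0 and 2.0
```

Repeating the update drives the parameters toward the values that minimize the cost.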


Bias is the error that comes from explaining real-world data with an overly simple model, whereas variance is how much the model's predictions change from adhering "too much" to the training data (or so I gathered). To reduce variance (at the price of a little bias), we add a regularization term.

Regularization term: $\lambda \sum_{i=0}^{1} \beta_i^2$

 

This reduces the chance of any one feature dominating the loss. For example, large beta coefficients are penalized by summing their squares and multiplying by lambda. The lambda coefficient is a hyperparameter, often tuned by interns, that controls how strongly the model is penalized for overfitting.


Putting it all together: 


$$\mathrm{COST} = \frac{1}{2n}\sum_{i=1}^{n}\left(\left(\beta_1 x_i + \beta_0\right) - y_i\right)^2 + \lambda \sum_{i=0}^{1} \beta_i^2$$
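The full regularized cost is just the squared-error term plus the L2 penalty. A sketch (again, my own illustrative names):

```python
import numpy as np

def ridge_cost(beta0, beta1, x, y, lam):
    """Squared-error cost plus the L2 penalty lam * (beta0^2 + beta1^2)."""
    n = len(x)
    residuals = (beta1 * x + beta0) - y
    mse_term = np.sum(residuals ** 2) / (2 * n)
    penalty = lam * (beta0 ** 2 + beta1 ** 2)
    return mse_term + penalty

x = np.array([0.0, 1.0, 2.0])
y = 2.0 * x + 1.0
print(ridge_cost(1.0, 2.0, x, y, lam=0.0))  # 0.0 (plain cost, no penalty)
print(ridge_cost(1.0, 2.0, x, y, lam=1.0))  # 5.0 (penalty 1*(1 + 4) alone)
```

Setting lam to 0 recovers the unregularized cost; raising it makes large betas increasingly expensive even when the fit is perfect.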


This is one concept of supervised learning. 


______________________________________________________



1. In a CNN, weights are not hand-specified; instead, they are learned via optimization. Optimization minimizes an objective (loss) function chosen based on the characteristics of the dataset. The most common optimization algorithm is gradient descent, which updates the weights in steps based on the direction of the gradient of the objective function with respect to the weights.

2. Fully connected networks are not used as widely as convolutional networks, recurrent networks, and transformers.

3. Reinforcement learning requires designing a reward function in the training setup. Running into the data-insufficiency bottleneck is imminent, so DL tends to generalize **AND** conduct transfer learning.

Self-supervised learning is tasked with, but not limited to:
1. Image generation
2. Language modeling
3. Behavioral cloning (a good read)

Self-supervised learning techniques include:
1. Autoregression
2. GANs
3. Diffusion modeling

Supervised learning is tasked with, but not limited to:
1. Regression
2. Classification
3. Reward modeling

Supervised learning techniques include:
1. Support vector machines
2. Gaussian processes

Reinforcement learning is tasked with, but not limited to:
1. Exploration
2. Temporal credit assignment
3. Multiagent credit assignment

Reinforcement learning techniques include:
1. Q-learning
2. Policy gradients
3. Intrinsic motivation??? [what is this?]



-More to follow-
