03 Jan 2024

Vanishing gradient problem

The problem

As more layers using certain activation functions are added to a network, the gradients of the loss function approach zero, making the network harder to train.

This is because some activation functions (for example the sigmoid function) squish a large input space into a small output range between 0 and 1. Therefore a large change in the input of the sigmoid causes only a small change in the output, and the derivative becomes small.

When the inputs of the sigmoid function become very large or very small, the derivative is close to zero.

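For small networks this isn’t a problem, but in deep networks these small derivatives are multiplied together during backpropagation, so the gradient reaching the early layers shrinks exponentially and they become hard to train effectively.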
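A minimal sketch of the effect, assuming a chain of sigmoid layers where each layer contributes one derivative factor to the backpropagated gradient. Weights are ignored for simplicity and the function names are only illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 (value 0.25) and shrinks quickly away from 0.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}  sigmoid'(x) = {sigmoid_derivative(x):.6f}")

def best_case_gradient_factor(num_layers):
    # Chain rule: one derivative factor per layer, each at most 0.25.
    return sigmoid_derivative(0.0) ** num_layers

for n in (1, 5, 10, 20):
    print(f"{n:2d} layers -> gradient factor <= {best_case_gradient_factor(n):.2e}")
```

Even in the best case the factor shrinks by 4x per layer, which is why early layers of a deep sigmoid network receive almost no gradient.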

Solutions

→ ReLU: its derivative is 1 for positive inputs, so it doesn’t shrink the gradient

→ Residual networks

→ Batch normalization

03 Jan 2024

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the initial phase of data analysis in which you explore and summarize the main characteristics, patterns, and relationships present in a dataset. The goal of EDA is to gain a better understanding of the data, uncover potential insights, and identify any data quality issues or anomalies before proceeding with further analysis or modeling.

EDA Checklist

What questions are you trying to solve or prove wrong?
What kind of data are you dealing with and how do you treat the different types?
What is missing from the data and how do you deal with it?
What are the outliers and what should we do about them?
How can you add, change or remove features to get more out of your data?

02 Jan 2024

Docker Volumes

By default all the changes inside the container are lost when the container stops. If we want to keep data between runs, Docker volumes and bind mounts can help.

→ A Docker container runs the software stack defined in an image.

→ Volumes can be shared between containers.

Commands

docker volume ls shows all the volumes known to Docker
docker volume ls -f name=data allows us to filter volumes
docker volume rm data_vol removes the volume
docker volume prune removes all unused volumes

Bind Mounts

→ A connection from the container to a directory on the host machine. It allows the host to share its own file system with the container.

Docker Volumes

A bind mount uses the host file system, but Docker volumes are native to Docker. The volume has a lifecycle that’s longer than the container’s, allowing it to persist until no longer needed. Volumes can be shared between containers.

Managing Volumes

docker volume allows you to manage volumes. For example:

$ docker volume ls
DRIVER   VOLUME NAME
local    data_volume

To remove volume:

$ docker volume rm data_volume

To remove all unused volumes:

$ docker volume prune

04 Dec 2023

Flyweight

Flyweight lets you fit more objects into the available amount of RAM by sharing common parts between multiple objects.

For example, in text editors, each character in a document can be represented as a flyweight object. The character's intrinsic state (such as its Unicode code point) is shared among all instances, while the extrinsic state (e.g., position and formatting) is stored separately for each character.
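The text editor example could be sketched as follows. The class names Glyph and GlyphFactory are hypothetical, chosen only for this illustration; the shared object holds the intrinsic state and the position is kept outside it.

```python
class Glyph:
    """Flyweight: holds only the intrinsic (shared) state."""
    __slots__ = ("code_point",)

    def __init__(self, code_point):
        self.code_point = code_point

class GlyphFactory:
    """Hands out one shared Glyph per distinct character."""
    def __init__(self):
        self._cache = {}

    def get(self, char):
        code_point = ord(char)
        if code_point not in self._cache:
            self._cache[code_point] = Glyph(code_point)
        return self._cache[code_point]

factory = GlyphFactory()
text = "banana"

# Extrinsic state (the position) is stored per character, next to a shared Glyph.
document = [(factory.get(ch), position) for position, ch in enumerate(text)]

print(len(document))        # characters in the document
print(len(factory._cache))  # distinct Glyph objects actually created
```

Six characters are stored, but only three Glyph objects exist, because every "a" and every "n" reuses the same instance.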

Cons

You might be saving RAM at the cost of CPU cycles when some of the data needs to be recalculated each time.
The code can become much more complicated.

17 Nov 2023

@Published wrapper in SwiftUI

@Published is a property wrapper in SwiftUI that allows us to create observable objects that automatically announce when changes occur.

Whenever the property is changed, all views using that object will be reloaded to reflect those changes.

class Bag: ObservableObject {
    var items = [String]()
}

This conforms to the ObservableObject protocol (which means views can watch it for changes). But since items isn’t marked with @Published, no change announcements will be sent. Adding the wrapper fixes that:

class Bag: ObservableObject {
    @Published var items = [String]()
}

31 Oct 2023

Dropout

Regularization is the training of the model so that it can generalize over data it hasn’t seen before. We can regularize models using data augmentation, weight decay or dropout.

In dropout, during training, a random subset of neurons or units in a neural network is "dropped out" or temporarily turned off with a probability, usually specified as a hyperparameter, typically in the range of 0.2 to 0.5. This means that for each training example, some portion of the network's units are not used in the forward and backward passes. As a result, the network becomes more robust and generalizes better to new data.

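The key idea behind dropout is that it prevents the network from relying too heavily on any specific neuron, making the network more adaptive and less likely to overfit. During inference (i.e., when making predictions), all neurons are active, and their outputs are scaled by the keep probability 1 - p (where p is the dropout probability used during training) so that the expected activations match those seen in training. Many implementations instead scale the surviving activations by 1 / (1 - p) during training ("inverted dropout") and leave inference unscaled.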
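A minimal pure-Python sketch of dropout on a list of activations. It assumes the "inverted dropout" variant, where surviving activations are scaled up by 1 / (1 - p) during training so that nothing needs rescaling at inference. The function name and arguments are illustrative, not a library API.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout: drop each unit with probability p during training."""
    if not training or p == 0.0:
        # Inference: all units are active and passed through unchanged.
        return list(activations)
    keep = 1.0 - p
    # Kept units are scaled by 1 / keep so the expected activation is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # seeded so the run is reproducible
layer_output = [1.0, 2.0, 3.0, 4.0]

print(dropout(layer_output, p=0.5, training=True, rng=rng))  # some units zeroed
print(dropout(layer_output, p=0.5, training=False))          # unchanged
```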

Summary

It drops a unit (e.g. a node) and its connections with a specified probability value p (a common value is p = 0.5).

Why? To prevent co-adaptation, where the neural network becomes too reliant on particular connections (this could mean overfitting is happening).

19 Oct 2023

Neuron saturation

"Saturating" refers to a situation where the activation function of a neuron reaches its maximum or minimum output value, and any further changes in the input have little to no effect on the output and the neuron becomes unresponsive. This can hinder the learning process in a neural network.

Examples

Sigmoid function: as the input moves toward positive infinity, the output approaches 1, and as the input moves toward negative infinity, the output approaches 0. When the output is close to 1 or 0, the gradient of the function (which is used in the backpropagation algorithm for training) becomes very small, making it difficult for the network to learn from errors.

Hyperbolic tangent function: the tanh function has a similar S-shaped curve, and it saturates when its input values are very large or very small.

That's one of the reasons why Rectified Linear Units (ReLUs) have become popular in neural networks. ReLUs do not saturate for positive input values; they output the input as-is if it's positive and zero for negative input values.
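To make the comparison concrete, here is a small sketch that evaluates the gradient of each activation at a few inputs. The helper names are hypothetical; only the standard math module is used.

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    # ReLU passes the gradient through unchanged for positive inputs.
    return 1.0 if x > 0 else 0.0

for x in (-10.0, -1.0, 1.0, 10.0):
    print(f"x = {x:6.1f}  sigmoid: {sigmoid_grad(x):.6f}  "
          f"tanh: {tanh_grad(x):.6f}  relu: {relu_grad(x):.1f}")
```

At x = 10 the sigmoid and tanh gradients are practically zero, while the ReLU gradient is still 1. For negative inputs ReLU's gradient is 0, which is why it only avoids saturation on the positive side.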

17 Oct 2023

Hypothesis

In ML the hypothesis is the equation or formula that your model uses to make predictions based on input data.

Imagine you have a dataset of houses with features like size, number of bedrooms, and location, and you want to predict their prices. Your hypothesis function might look like this:

H(size, bedrooms, location) = w_1 * size + w_2 * bedrooms + w_3 * location + b

Where

H represents your hypothesis function.
size, bedrooms, and location are the input features of a house.
w_1, w_2, w_3 are the weights (parameters) that your model learns during training to adjust the importance of each feature.
b is the bias term, which accounts for factors not directly related to the input features.

During the training process, your machine learning algorithm tries to find the best values for w_1, w_2, w_3 and b that minimize the difference between the predicted prices and the actual prices in your training dataset. Once trained, this hypothesis function can be used to predict the prices of new houses based on their size, bedrooms, and location.
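The hypothesis above can be written as a plain function. The weight and bias values below are made up for illustration (real ones come from training), and location is assumed to be encoded as a numeric score.

```python
def hypothesis(size, bedrooms, location, weights, bias):
    """Linear hypothesis: weighted sum of the features plus a bias."""
    w_1, w_2, w_3 = weights
    return w_1 * size + w_2 * bedrooms + w_3 * location + bias

# Hypothetical parameters, only for illustration.
weights = (150.0, 10000.0, 5000.0)
bias = 20000.0

# location is assumed to be a numeric score (e.g. 1 to 10)
price = hypothesis(size=100, bedrooms=3, location=7, weights=weights, bias=bias)
print(price)  # 15000 + 30000 + 35000 + 20000 = 100000.0
```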

16 Oct 2023

#problem

Negative MSE

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

SVM_reg = SVR()
param_grid = [
        {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000.]},
        {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
         'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    ]

grid_search = GridSearchCV(SVM_reg, param_grid, cv=5,
                            scoring='neg_mean_squared_error',
                            verbose=2)
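# housing_prepared / housing_labels: the prepared training features and targets.
# The search must be fitted before best_score_ is available.
grid_search.fit(housing_prepared, housing_labels)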

negative_mse = grid_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse

Why is the MSE negative in this case?

The scoring parameter expects a utility function, where higher values indicate better performance. However, mean squared error (MSE) is an error metric, where lower values indicate better performance. To convert the MSE into a utility function, the negative value is used.

By using negative MSE as the scoring metric, the grid search algorithm will try to maximize the negative MSE, which is equivalent to minimizing the MSE.
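A tiny sketch of why the sign flip works, using hand-written predictions instead of a real model. The model names and numbers are made up for the example.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 7.0]
predictions = {
    "model_a": [2.0, 5.0, 8.0],   # MSE = (1 + 0 + 1) / 3
    "model_b": [0.0, 5.0, 10.0],  # MSE = (9 + 0 + 9) / 3
}

# Score each model with the negated error, as 'neg_mean_squared_error' does.
neg_mse = {name: -mse(y_true, y_pred) for name, y_pred in predictions.items()}

# Picking the highest score is the same as picking the lowest MSE.
best = max(neg_mse, key=neg_mse.get)
print(best, neg_mse[best])
```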

16 Aug 2023

Utility function

What is a utility function?

A utility function, in the context of machine learning and optimization, is a mathematical function that assigns a numerical value to a particular outcome or decision. It is used to represent the preference or desirability of different outcomes or decisions.

In the case of model evaluation and hyperparameter tuning, a utility function is used as a scoring metric to measure the performance of a model or a combination of hyperparameters. The utility function provides a way to quantify how well the model is performing or how desirable a set of hyperparameters is.

One common approach in AI decision-making is to maximize the expected utility, which considers the probability of different outcomes occurring. The expected utility is calculated by multiplying the utility of each outcome by its corresponding probability and summing up the results. The AI system chooses the action with the highest expected utility as the optimal choice.

Examples:

Self-Driving Cars: In the self-driving cars application, the utility function may consider factors such as time taken, fuel consumption, safety, and comfort, and assign utility values to different routes based on these factors. The self-driving car can then use the utility values to calculate the expected utility of each route, taking into account the probabilities of different traffic conditions or road obstacles, and choose the route with the highest expected utility to reach the destination.
Recommendation Systems: Consider a recommendation system that suggests movies to users based on their preferences. The utility function of the recommendation system may assign higher utility values to movies that match the user's preferred genre, actors, or directors and lower utility values to films that do not match these preferences. The recommendation system can then use the utility values to rank and recommend movies to the user based on their utility values, with higher utility movies being recommended more prominently.

Code example

# Define the utility function
def calculate_utility(salary, commute_time):
    # Utility is a linear combination of salary and negative commute time
    return salary - 2 * commute_time

# Job offer 1 details
salary_job1 = 60000  # in dollars per year
commute_time_job1 = 30  # in minutes

# Job offer 2 details
salary_job2 = 55000  # in dollars per year
commute_time_job2 = 20  # in minutes

# Calculate the utility for each job offer
utility_job1 = calculate_utility(salary_job1, commute_time_job1)
utility_job2 = calculate_utility(salary_job2, commute_time_job2)

# Compare the utilities to make a decision
if utility_job1 > utility_job2:
    print("Job offer 1 is the better choice.")
elif utility_job2 > utility_job1:
    print("Job offer 2 is the better choice.")
else:
    print("Both job offers are equally desirable.")
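The example above compares outcomes that are certain. The expected-utility idea described earlier adds probabilities; a sketch with made-up routes and numbers, where utility is assumed to be the negative travel time in minutes:

```python
def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)

# Hypothetical routes; probabilities for each route sum to 1.
routes = {
    "highway":    [(0.8, -20), (0.2, -60)],  # 20% chance of a traffic jam
    "back_roads": [(1.0, -30)],              # always 30 minutes
}

scores = {name: expected_utility(o) for name, o in routes.items()}
best_route = max(scores, key=scores.get)

print(scores)
print(best_route)
```

The highway wins here (expected 28 minutes against 30) even though its worst case is much slower, which is exactly the trade-off expected utility encodes.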

01 Aug 2023

Polymorphism in Django ORM

Challenges of polymorphism

Django ORM maps attributes to columns in the database.

How should Django ORM map attributes to the columns in the table?
Should different objects reside in the same table?
Should there be multiple tables?

Example

Simplest way

Ecommerce website that sells books. You need the following models: Book, Cart. This approach is non-polymorphic.

from django.contrib.auth import get_user_model
from django.db import models

class Book(models.Model):
    name = models.CharField(
        max_length=100,
    )
    price = models.PositiveIntegerField(
        help_text='in cents',
    )
    weight = models.PositiveIntegerField(
        help_text='in grams',
    )

class Cart(models.Model):
    user = models.OneToOneField(
        get_user_model(),
        primary_key=True,
        on_delete=models.CASCADE,
    )
    books = models.ManyToManyField(Book)

To create a cart with books:

book = Book.objects.create(name="Awesome book", price=100, weight=200)
user = get_user_model().objects.create_user("Some user")
cart = Cart.objects.create(user=user)

# the many-to-many field on Cart is called books
cart.books.add(book)

Abstract Base Model

Abstract base model creates an abstract base class that only exists in the code, not in the database.

class Product(models.Model):
    class Meta:
        abstract = True

    name = models.CharField(
        max_length=100,
    )
    price = models.PositiveIntegerField(
        help_text='in cents',
    )

class Book(Product):
    weight = models.PositiveIntegerField()

class EBook(Product):
    download_link = models.URLField()

Both Book and EBook inherit from Product. Now let’s create the Cart.

class Cart(models.Model):
    user = models.OneToOneField(
        get_user_model(),
        primary_key=True,
        on_delete=models.CASCADE,
    )

    # we have to create a different field for each type of product
    books = models.ManyToManyField(Book)
    ebooks = models.ManyToManyField(EBook)

Concrete Base Model

Concrete base model is an approach that does create a separate table in the database.

In this example we have both Books and EBooks in our store. Note that the Product Model is not defined with abstract = True.

class Product(models.Model):
    name = models.CharField(
        max_length=100,
    )
    price = models.PositiveIntegerField(
        help_text='in cents',
    )

class Book(Product):
    weight = models.PositiveIntegerField()

class EBook(Product):
    download_link = models.URLField()

Create new products:

book = Book.objects.create(...)
ebook = EBook.objects.create(...)

Each create call adds a row to the Product table, which holds all the common fields, plus a row in the book or ebook table for the values specific to that product type.

Notice the product_ptr_id column in the Book table, which points to the corresponding row in the Product table.

Column	Type
product_ptr_id	integer
weight	integer

Django performs a join in the background when fetching a book.

Since all products are managed in the Product table, you can reference them from the Cart model with a single ManyToManyField.

class Cart(models.Model):
    ...
    items = models.ManyToManyField(Product)

cart = Cart.objects.create(user=user)
cart.items.add(book, ebook)

31 Jul 2023

Django Signals

To receive a signal, register a receiver function:

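from django.core.signals import request_finished

def my_callback(sender, **kwargs):
    print("Request finished!")

request_finished.connect(my_callback)

Alternatively, use the receiver() decorator:

from django.core.signals import request_finished
from django.dispatch import receiver

@receiver(request_finished)
def my_callback(sender, **kwargs):
    print("Request finished!")

The my_callback function will be called each time a request finishes.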

Connect to signals sent by specific senders

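from django.db.models.signals import pre_save
from django.dispatch import receiver

@receiver(pre_save, sender=MyModel)  # MyModel: any model class from your app
def my_handler(sender, **kwargs):
    print("MyModel instance is about to be saved!")

The function will only be called when an instance of MyModel is saved (pre_save fires just before the save happens).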

Defining and sending signals

import django.dispatch

task_done = django.dispatch.Signal()

def send_task_signal(self, task_name):
    task_done.send(sender=self.__class__, task_name=task_name)

Disconnecting from a signal

To disconnect a receiver from a signal, call disconnect() on the signal and pass the receiver:

request_finished.disconnect(my_callback)

22 Jul 2023

Weight Decay

On the data side data augmentation helps models generalize.

On the model side weight decay can help us generalize.

To prevent overfitting we shouldn’t allow our models to get too complex (e.g. a high-degree polynomial that passes through every training point but generalizes poorly). Having fewer parameters can prevent your model from getting overly complex, but it’s a limiting strategy.

To penalize complexity we can

→ add all our parameters (weights) to the loss function,

but since some of them are positive and some negative, we can

→ add the squares of all the parameters to the loss function

but this might make the loss so large that the best model would have all parameters set to 0, so

→ we multiply the sum of squares by another, smaller number called *weight decay*, or wd.

Our loss function will look as follows:

Loss = MSE(y_hat, y) + wd * sum(w^2)

y_hat = predicted or estimated value of the target variable

y = actual target value

When we update weights using gradient descent we do the following:

w(t) = w(t-1) - lr * dLoss / dw

lr = learning rate, a hyperparameter that determines the step size or the rate at which weights are updated during training

dLoss / dw = derivative of the loss function wrt the weights, represents how the loss function changes as the weights are modified
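A minimal sketch of this update rule, assuming a one-weight model y_hat = w * x so the gradient can be written by hand. The penalty term wd * w^2 contributes 2 * wd * w to the gradient. The data, learning rate and step count are made up for the example.

```python
def loss_and_grad(w, xs, ys, wd):
    """Loss = MSE(y_hat, y) + wd * w^2 for the model y_hat = w * x."""
    n = len(xs)
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    loss = mse + wd * w ** 2
    # d(MSE)/dw plus the derivative of the penalty, 2 * wd * w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n + 2 * wd * w
    return loss, grad

def train(xs, ys, wd, lr=0.05, steps=200):
    w = 0.0
    for _ in range(steps):
        _, grad = loss_and_grad(w, xs, ys, wd)
        w = w - lr * grad  # w(t) = w(t-1) - lr * dLoss / dw
    return w

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # data generated by y = 2x

print(train(xs, ys, wd=0.0))  # close to the true slope 2
print(train(xs, ys, wd=1.0))  # pulled toward 0 by the penalty
```

With wd = 1.0 the learned weight settles below 2: the penalty trades some fit on the training data for smaller weights.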

The value of wd

Generally wd = 0.1 works pretty well.

→ Too much weight decay and no matter how much you train, the model will never fit quite well.

→ Too little weight decay and you can still train well, but you need to stop a bit early.

25 Jun 2023

Recipe for Training Neural Networks

Neural net training fails silently

When you break your code you will often get some kind of exception. In neural nets everything could be correct syntactically, but the whole thing isn’t arranged properly and it’s very hard to tell.

E.g. you forgot to flip your labels when you flipped the images during data augmentation. Your neural network can still work pretty well because it can internally learn to detect flipped images and then flip its predictions.

The recipe

1 Get to know the data

Do not touch any neural network code. Instead spend a couple of hours scanning and analyzing the data. You can find things like corrupted images, duplicates etc.

Look for imbalances and biases. Pay attention to your own process for classifying the data, which hints at the kinds of architectures you’ll eventually wanna explore.

Search/filter/sort by whatever you can think of, visualize distributions and outliers along any axis. Outliers almost always uncover some bugs in data quality.

2 End-to-end training skeleton + get dumb baselines

Set up a full training + evaluation skeleton and gain trust in its correctness via experiments. Pick some simple model, e.g. a linear classifier.

Fix random seed
Simplify, disable any unnecessary fanciness, turn off data augmentation
Verify that loss starts at the correct loss value
Input-independent baseline: e.g. set all your inputs to zero. This should perform worse than when you plug in your actual data. Does it?
Visualize the data that’s going into the network (just before the net)
Generalize a special case. People often bite off more than they can chew writing a relatively general functionality from scratch. Write a specific function, get that to work and then generalize it later making sure that you get the same result.
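Two of the checks above (fix the random seed, verify the initial loss) can be sketched in a few lines. This assumes a softmax classifier whose final layer starts out predicting uniform probabilities, in which case the initial cross-entropy loss should be -log(1 / n_classes). The helper name and the class count are illustrative.

```python
import math
import random

def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

random.seed(0)  # fixed seed so the run is reproducible
n_classes = 10

# Assumption: at initialization the logits are all (near) zero.
initial_logits = [0.0] * n_classes
label = random.randrange(n_classes)

loss = softmax_cross_entropy(initial_logits, label)
expected = -math.log(1.0 / n_classes)
print(loss, expected)
```

If the measured loss at step 0 is far from the expected value, the final layer initialization or the loss wiring is a likely suspect.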

3 Overfit

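First get a model large enough that it can overfit (i.e. focus on the training loss), and then regularize it (give up some training loss to improve the validation loss). If we are not able to reach a low error rate with any model at all, that may indicate some issues, bugs or misconfiguration.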

Picking the model: Don’t be a hero. Don’t be too crazy or creative with various exotic architectures. In the early stages of your project simply find the most related paper and copy-paste their simplest architecture that achieves good performance. E.g. for classifying images simply copy a ResNet-50 for your first run.
Adam is a safe choice of optimizer in the early stages; it is fairly forgiving of hyperparameters.
Complexify only one thing at a time. If you have multiple signals to plug into your classifier, plug them in one by one.

4 Regularize

Get more data. The best and preferred way to regularize a model is to add more real training data. It’s a very common mistake to spend a lot of time squeezing performance out of a small dataset when you could be collecting more data instead.
Data augment. The next best thing to real data is half-fake data.
Creative augmentation (fake data)
Use a pretrained network if you can.
Smaller input dimensionality. Remove features that may contain spurious signal.
Add dropout.

5 Tune

Random over grid search. Best to use random search because neural nets are often much more sensitive to some parameters than others. If param a matters but b has no effect then you want to sample a more thoroughly.
Hyperparameter optimization
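A small sketch of why random search covers an important parameter better than a grid with the same budget. The parameters a and b and their ranges are placeholders.

```python
import random

budget = 9  # same number of trials for both strategies

# Grid search: 3 values of a times 3 values of b.
grid_values = [0.0, 0.5, 1.0]
grid_trials = [(a, b) for a in grid_values for b in grid_values]

# Random search: every trial draws a fresh value for each parameter.
rng = random.Random(0)
random_trials = [(rng.random(), rng.random()) for _ in range(budget)]

distinct_a_grid = len({a for a, _ in grid_trials})
distinct_a_random = len({a for a, _ in random_trials})

print(distinct_a_grid)    # the grid only ever tries 3 values of a
print(distinct_a_random)  # random search tries a new value of a in every trial
```

If only a matters, the grid effectively wastes two thirds of its trials re-testing the same three values.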

6 Squeeze out the juice

Ensemble models
Leave it training, for weeks even.

Notes from karpathy.github.io

07 Jun 2023

Memoization

Memoization is a technique that improves the performance of a function by caching its results.

When a memoized function is called with a set of inputs, it checks if it has already computed and stored the result for those inputs. If it has, it returns the cached result instead of recomputing it, saving time and resources.


fib_cache = {}

def fibonacci(n):
    # Check if the result is already cached
    if n in fib_cache:
        return fib_cache[n]

    # Compute and cache the result
    if n <= 1:
        result = n
    else:
        result = fibonacci(n-1) + fibonacci(n-2)

    fib_cache[n] = result
    return result

# The first call will compute and cache the result
print(fibonacci(5))  # Output: 5

# The second call will retrieve the cached result
print(fibonacci(5))  # Output: 5

In this example when the function is called with a specific input (e.g., fibonacci(5)), it first checks if the result is already present in the cache. If it is, it returns the cached result.

Memoization is particularly useful for functions with expensive computations or repetitive calculations. By caching and reusing results, the function can avoid redundant work and achieve a significant performance improvement.
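Python's standard library also provides a ready-made memoization decorator, functools.lru_cache, so the hand-written cache dictionary is optional. A sketch of the same Fibonacci function using it:

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # unbounded cache, keyed by the function arguments
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)

print(fibonacci(30))           # 832040
print(fibonacci.cache_info())  # hit/miss counters kept by lru_cache
```

Each value from 0 to 30 is computed only once; every other recursive call is answered from the cache.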

29 May 2023

:nth-child()

:nth-child() is a CSS pseudo-class that matches elements based on their position in a group of siblings.

Example

<ul>
  <li>Item 1</li>
  <li>Item 2</li>
  <li>Item 3</li>
  <li>Item 4</li>
  <li>Item 5</li>
</ul>

li:nth-child(2) {
  color: grey;
}

Result

Item 1
Item 2 (rendered in grey)
Item 3
Item 4
Item 5

Other Examples

/* select odd elements */
:nth-child(odd)

/* select even elements */
:nth-child(even)

/* select every 5th element */
:nth-child(5n)

19 May 2023

pandas.DataFrame.iloc

DataFrame.iloc allows you to select rows and columns using integer-based indexing. You can pass it an integer, a list of integers, a boolean array or a function.

Examples

import pandas as pd

mydict = [{'a': 1, 'b': 2},
          {'a': 100, 'b': 200}]

df = pd.DataFrame(mydict)

# Output: 
#       a     b
# 0     1     2
# 1   100   200

df.iloc[0]

# Output: 
# a    1
# b    2

df.iloc[[True, False]]
#       a     b
# 0     1     2

17 May 2023

Python itertools Module

→ Iterators are fast and memory-efficient

→ Can be used with lists, tuples, dictionaries, sets, etc.

→ Can be finite or infinite

`count(start, step)`

Starts at start and counts up by step forever, so the loop needs its own stopping condition.

import itertools

for i in itertools.count(5, 5):
    if i == 20:
        break
    else:
        print(i, end=" ")

# Output: 5 10 15

`cycle(iterable)`

count = 0

for i in itertools.cycle('AB'):
    if count > 7:
        break
    else:
        print(i, end=" ")
        count += 1

# Output: A B A B A B A B

`repeat(val, num)`

Repeats val indefinitely. If the optional num is provided, it repeats val num times.

print(list(itertools.repeat(25, 4)))

# Output: [25, 25, 25, 25]

`product()`

Computes the cartesian product on input iterables.

→ Use the optional repeat keyword to compute the product of an iterable with itself.

list(itertools.product([1, 2], repeat=2))

# Output: [(1, 1), (1, 2), (2, 1), (2, 2)]

list(itertools.product(['hello', 'world'], '2'))

# Output: [('hello', '2'), ('world', '2')]

`permutations()`

Generates all possible permutations of an iterable.

→ The optional second argument r (the group size) defaults to the length of the iterable if not specified.

list(itertools.permutations([1, 'hello'], 2))

# Output: [(1, 'hello'), ('hello', 1)]

`chain()`

Chains iterables one after another.

li1 = [1, 2, 3, 4]
li2 = [5, 6, 7, 8]

list(itertools.chain(li1, li2))

# Output: [1, 2, 3, 4, 5, 6, 7, 8]

`filterfalse()`

Only returns values that return false for the passed function.

li = [2, 4, 5, 7]
list(itertools.filterfalse(lambda x : x % 2 == 0, li))

# Output: [5, 7]

`takewhile()`

Returns values until the function returns false for the first time. The opposite of takewhile() is dropwhile().

li = [2, 4, 6, 7]
list(itertools.takewhile(lambda x : x % 2 == 0, li))

# Output: [2, 4, 6]
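A short side-by-side sketch of takewhile() and dropwhile() on the same list (with an extra 8 appended, to show that dropwhile() stops testing after the first failure):

```python
import itertools

li = [2, 4, 6, 7, 8]

def is_even(x):
    return x % 2 == 0

taken = list(itertools.takewhile(is_even, li))
dropped = list(itertools.dropwhile(is_even, li))

print(taken)    # [2, 4, 6]
print(dropped)  # [7, 8]
```

Note that 8 is kept by dropwhile() even though it is even: once the predicate has failed for 7, everything after it is returned unfiltered.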

30 Apr 2023

Classifier-Free Guidance

Classifier Guidance

A method that combines the score estimate of a diffusion model with the gradient of an image classifier.
It requires training an image classifier separately from the diffusion model (in other words, an extra classifier is needed).

BUT

Guidance can also be performed by a pure generative model without such a classifier → Classifier-free guidance.

Classifier-Free Guidance

→ Jointly train a conditional (one that uses a prompt, e.g. text) and an unconditional (no prompt) diffusion model, and combine the resulting conditional and unconditional scores to get a trade-off between sample quality and diversity.

→ CFG improves quality while reducing sample diversity in the diffusion model.

→ Images generated using CFG are very similar, but also high in quality.

→ Avoids training another classifier.

→ Very simple to implement (one-line code change)
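The "one-line change" can be sketched as below. This assumes the commonly used formulation in which the guided prediction is the unconditional prediction pushed toward the conditional one by a guidance scale; other parameterizations of the scale exist. Plain lists stand in for the model outputs, and all names and numbers are illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional predictions element-wise."""
    # The guidance step itself: uncond + scale * (cond - uncond)
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Made-up predictions standing in for the two model outputs.
eps_uncond = [0.10, -0.20, 0.30]
eps_cond = [0.30, 0.00, 0.10]

print(cfg_combine(eps_uncond, eps_cond, 0.0))  # purely unconditional
print(cfg_combine(eps_uncond, eps_cond, 1.0))  # purely conditional
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # extrapolates past the conditional
```

A larger scale follows the prompt more strongly, which matches the quality versus diversity trade-off described above.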

26 Apr 2023

Temperature in Natural Language Processing

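Temperature rescales the model’s output scores (logits) before the softmax, which increases or decreases the “confidence” a model has in its most likely response.

High temperature flattens the probability distribution, so the model assigns more probability to other responses too, not just the one it deems the most correct.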

→ High Temperature = More Creative Model

→ Low Temperature = Less Creative Model
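A minimal sketch of how temperature is usually applied, assuming the standard approach of dividing the logits by the temperature before the softmax. The logits are made up for the example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature (temperature must be > 0)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature most of the probability goes to the top token; at a high temperature the distribution flattens and the less likely tokens get sampled more often.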