03 Jan 2024
As more layers using certain activation functions are added to a network, the gradients of the loss function approach zero, making the network harder to train.
This is because some activation functions (for example the sigmoid function) squish a large input space into a small output range between 0 and 1. Therefore a large change in the input of the sigmoid causes only a small change in the output, and the derivative becomes small.
When the inputs of the sigmoid function become very large or very small, the derivative is close to zero.
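For small networks this isn’t a problem, but in deep networks these small derivatives are multiplied together during backpropagation, so the gradient that reaches the early layers shrinks towards zero and they become hard to train effectively.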
→ ReLU: doesn’t cause a small derivative
→ Residual networks
→ Batch normalization
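To see why depth matters, here is a minimal sketch of the effect. It assumes every layer has a single sigmoid unit, a weight of 1 and a pre-activation of 0 (where the sigmoid derivative is at its maximum of 0.25); the function names are made up for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x: float) -> float:
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_gradient(num_layers: int, pre_activation: float = 0.0) -> float:
    """Product of sigmoid derivatives across num_layers (weights assumed to be 1)."""
    grad = 1.0
    for _ in range(num_layers):
        grad *= sigmoid_derivative(pre_activation)
    return grad


# The gradient shrinks geometrically with depth, even in the best case.
for depth in (1, 5, 20):
    print(depth, chained_gradient(depth))
```

Even with the largest possible sigmoid derivative, 20 layers already multiply the gradient by 0.25 twenty times.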
03 Jan 2024
Exploratory Data Analysis (EDA) is the initial phase of data analysis in which you explore and summarize the main characteristics, patterns, and relationships present in a dataset. The goal of EDA is to gain a better understanding of the data, uncover potential insights, and identify any data quality issues or anomalies before proceeding with further analysis or modeling.
02 Jan 2024
By default all the changes inside the container are lost when the container stops. If we want to keep data between runs, Docker volumes and bind mounts can help.
→ A Docker container runs the software stack defined in an image.
→ Can be shared between containers.
docker volume ls
shows all the volumes known to Docker
docker volume ls -f name=data
allows us to filter volumes
docker volume rm data_vol
removes the volume
docker volume prune
removes all unused volumes
→ A connection from the container to a directory on the host machine. It allows the host to share its own file system with the container.
A bind mount uses the host file system, but Docker volumes are native to Docker. The volume has a lifecycle that’s longer than the container’s, allowing it to persist until no longer needed. Volumes can be shared between containers.
docker volume
allows you to manage volumes. For example:
$ docker volume ls
DRIVER VOLUME NAME
local data_volume
To remove a volume:
$ docker volume rm data_volume
To remove all unused volumes:
$ docker volume prune
04 Dec 2023
Flyweight lets you fit more objects into the available amount of RAM by sharing common parts between multiple objects.
For example, in text editors, each character in a document can be represented as a flyweight object. The character's intrinsic state (such as its Unicode code point) is shared among all instances, while the extrinsic state (e.g., position and formatting) is stored separately for each character.
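The text editor example can be sketched roughly as follows. The class names (Glyph, GlyphFactory, PlacedCharacter) are hypothetical and only illustrate the split between shared intrinsic state and per-character extrinsic state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """Intrinsic, shareable state: the character's code point."""
    code_point: int


class GlyphFactory:
    """Hands out one shared Glyph per code point."""

    def __init__(self) -> None:
        self._cache = {}

    def get(self, char: str) -> Glyph:
        code_point = ord(char)
        if code_point not in self._cache:
            self._cache[code_point] = Glyph(code_point)
        return self._cache[code_point]

    def __len__(self) -> int:
        return len(self._cache)


@dataclass
class PlacedCharacter:
    """Extrinsic state (position, formatting) plus a reference to the shared glyph."""
    glyph: Glyph
    position: int
    bold: bool = False


factory = GlyphFactory()
text = "banana"
document = [PlacedCharacter(factory.get(ch), pos) for pos, ch in enumerate(text)]

print(len(document))  # 6 placed characters
print(len(factory))   # 3 shared glyphs: b, a, n
```

Every "a" in the document points at the same Glyph instance, so the memory cost of the intrinsic state grows with the number of distinct characters rather than with the length of the text.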
17 Nov 2023
@Published
is a property wrapper in SwiftUI that allows us to create observable objects that automatically announce when changes occur.
Whenever the property is changed, all views using that object will be reloaded to reflect those changes.
class Bag: ObservableObject {
    var items = [String]()
}
This conforms to the ObservableObject
protocol, which means views can watch it for changes. But since items isn’t marked with @Published
, no change announcements will be sent.
class Bag: ObservableObject {
    @Published var items = [String]()
}
31 Oct 2023
Regularization is a set of techniques applied during training so that the model generalizes to data it hasn’t seen before. We can regularize models using data augmentation, weight decay or dropout.
In dropout, during training, a random subset of neurons or units in a neural network is "dropped out" or temporarily turned off with a probability, usually specified as a hyperparameter, typically in the range of 0.2 to 0.5. This means that for each training example, some portion of the network's units are not used in the forward and backward passes. As a result, the network becomes more robust and generalizes better to new data.
The key idea behind dropout is that it prevents the network from relying too heavily on any specific neuron, making the network more adaptive and less likely to overfit. During inference (i.e., when making predictions), all neurons are active. To keep the expected activations the same as during training, their contributions are scaled by the keep probability 1 - p, where p is the dropout probability; many implementations instead scale the surviving activations by 1 / (1 - p) during training ("inverted dropout"), so nothing needs to change at inference.
It drops a unit (e.g. a node) and its connections with a specified probability value p (a common value is p = 0.5).
Why? To prevent co-adaptation, where the neural network becomes too reliant on particular connections (this could mean overfitting is happening).
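As an illustration, here is a minimal sketch of the "inverted dropout" variant, in which surviving activations are scaled up by 1 / (1 - p) during training so inference needs no extra scaling. The function name and arguments are assumptions made for this example, and it works on a plain list rather than a tensor.

```python
import random


def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(activations)  # inference: every unit is active, no scaling
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
layer_output = [1.0] * 10_000

train_out = dropout(layer_output, p=0.5, training=True, rng=rng)
eval_out = dropout(layer_output, p=0.5, training=False)

dropped_fraction = train_out.count(0.0) / len(train_out)
mean_train = sum(train_out) / len(train_out)
print(dropped_fraction, mean_train)
```

Roughly half of the units are zeroed on each training pass, yet the mean activation stays close to the value seen at inference.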
19 Oct 2023
"Saturating" refers to a situation where the activation function of a neuron reaches its maximum or minimum output value, and any further changes in the input have little to no effect on the output and the neuron becomes unresponsive. This can hinder the learning process in a neural network.
That's one of the reasons why Rectified Linear Units (ReLUs) have become popular in neural networks. ReLUs do not saturate for positive input values; they output the input as-is if it's positive and zero for negative input values.
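A small numeric check makes the difference visible. This is only a sketch: it compares the derivative of the sigmoid with the derivative of ReLU at a few hand-picked inputs.

```python
import math


def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


def relu_grad(x):
    # Derivative of max(0, x); taken as 0 at x == 0 by convention here.
    return 1.0 if x > 0 else 0.0


for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid_grad(x), relu_grad(x))
```

The sigmoid gradient is nearly zero at both -10 and 10 (saturated on both sides), while the ReLU gradient stays at 1 for any positive input.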
17 Oct 2023
In ML the hypothesis is the equation or formula that your model uses to make predictions based on input data.
Imagine you have a dataset of houses with features like size, number of bedrooms, and location, and you want to predict their prices. Your hypothesis function might look like this:
H(size, bedrooms, location) = w_1 * size + w_2 * bedrooms + w_3 * location + b
Where:
→ H represents your hypothesis function.
→ size, bedrooms, and location are the input features of a house.
→ w_1, w_2, w_3 are the weights (parameters) that your model learns during training to adjust the importance of each feature.
→ b is the bias term, which accounts for factors not directly related to the input features.
During the training process, your machine learning algorithm tries to find the best values for w_1, w_2, w_3 and b that minimize the difference between the predicted prices and the actual prices in your training dataset. Once trained, this hypothesis function can be used to predict the prices of new houses based on their size, bedrooms, and location.
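The hypothesis above translates almost directly into code. In this sketch the weight and bias values are invented (as if they had already been learned), and location is assumed to be encoded as a number.

```python
def hypothesis(size, bedrooms, location, weights, bias):
    """Linear hypothesis: weighted sum of the features plus a bias."""
    w_1, w_2, w_3 = weights
    return w_1 * size + w_2 * bedrooms + w_3 * location + bias


# Hypothetical parameter values, as if they had been learned during training.
weights = (150.0, 10_000.0, 5_000.0)
bias = 20_000.0

price = hypothesis(size=100, bedrooms=3, location=2, weights=weights, bias=bias)
print(price)  # 75000.0
```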
16 Oct 2023
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR
SVM_reg = SVR()
param_grid = [
    {'kernel': ['linear'], 'C': [10., 30., 100., 300., 1000., 3000.]},
    {'kernel': ['rbf'], 'C': [1.0, 3.0, 10., 30., 100., 300., 1000.0],
     'gamma': [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
]
grid_search = GridSearchCV(SVM_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           verbose=2)
# Run the search first (slow); best_score_ only exists after fit().
# housing_prepared and housing_labels are the prepared features and targets.
# grid_search.fit(housing_prepared, housing_labels)
negative_mse = grid_search.best_score_
rmse = np.sqrt(-negative_mse)
rmse
The scoring parameter expects a utility function, where higher values indicate better performance. However, mean squared error (MSE) is an error metric, where lower values indicate better performance. To convert the MSE into a utility function, the negative value is used.
By using negative MSE as the scoring metric, the grid search algorithm will try to maximize the negative MSE, which is equivalent to minimizing the MSE.
16 Aug 2023
A utility function, in the context of machine learning and optimization, is a mathematical function that assigns a numerical value to a particular outcome or decision. It is used to represent the preference or desirability of different outcomes or decisions.
In the case of model evaluation and hyperparameter tuning, a utility function is used as a scoring metric to measure the performance of a model or a combination of hyperparameters. The utility function provides a way to quantify how well the model is performing or how desirable a set of hyperparameters is.
One common approach in AI decision-making is to maximize the expected utility, which considers the probability of different outcomes occurring. The expected utility is calculated by multiplying the utility of each outcome by its corresponding probability and summing up the results. The AI system chooses the action with the highest expected utility as the optimal choice.
Examples:
# Define the utility function
def calculate_utility(salary, commute_time):
    # Utility is a linear combination of salary and negative commute time
    return salary - 2 * commute_time

# Job offer 1 details
salary_job1 = 60000  # in dollars per year
commute_time_job1 = 30  # in minutes

# Job offer 2 details
salary_job2 = 55000  # in dollars per year
commute_time_job2 = 20  # in minutes

# Calculate the utility for each job offer
utility_job1 = calculate_utility(salary_job1, commute_time_job1)
utility_job2 = calculate_utility(salary_job2, commute_time_job2)

# Compare the utilities to make a decision
if utility_job1 > utility_job2:
    print("Job offer 1 is the better choice.")
elif utility_job2 > utility_job1:
    print("Job offer 2 is the better choice.")
else:
    print("Both job offers are equally desirable.")
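The expected-utility rule described earlier (multiply each outcome's utility by its probability, sum, then pick the action with the highest total) can be sketched like this. The actions, probabilities and utilities are invented for the example.

```python
def expected_utility(outcomes):
    """outcomes: iterable of (probability, utility) pairs for one action."""
    return sum(p * u for p, u in outcomes)


# Hypothetical actions with made-up probabilities and utilities.
actions = {
    "safe_job": [(1.0, 50.0)],
    "startup": [(0.2, 200.0), (0.8, 10.0)],
}

scores = {name: expected_utility(outcomes) for name, outcomes in actions.items()}
best_action = max(scores, key=scores.get)
print(scores, best_action)
```

Here the risky option has one very attractive outcome, but its expected utility (48) is still slightly below the certain option (50), so the certain option is chosen.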
01 Aug 2023
Django ORM maps attributes to columns in the database.
Consider an ecommerce website that sells books. You need the following models: Book and Cart. This approach is non-polymorphic.
from django.contrib.auth import get_user_model
from django.db import models

class Book(models.Model):
    name = models.CharField(
        max_length=100,
    )
    price = models.PositiveIntegerField(
        help_text='in cents',
    )
    weight = models.PositiveIntegerField(
        help_text='in grams',
    )

class Cart(models.Model):
    user = models.OneToOneField(
        get_user_model(),
        primary_key=True,
        on_delete=models.CASCADE,
    )
    books = models.ManyToManyField(Book)
To create a cart with books:
book = Book.objects.create(name="Awesome book", price=100, weight=200)
user = get_user_model().objects.create_user("Some user")
cart = Cart.objects.create(user=user)
cart.books.add(book)
An abstract base model is a base class that only exists in the code, not in the database.
class Product(models.Model):
    class Meta:
        abstract = True

    name = models.CharField(
        max_length=100,
    )
    price = models.PositiveIntegerField(
        help_text='in cents',
    )

class Book(Product):
    weight = models.PositiveIntegerField()

class EBook(Product):
    download_link = models.URLField()
Both Book and EBook inherit from Product. Now let’s create the Cart.
class Cart(models.Model):
    user = models.OneToOneField(
        get_user_model(),
        primary_key=True,
        on_delete=models.CASCADE,
    )
    # we have to create a different field for each type of product
    books = models.ManyToManyField(Book)
    ebooks = models.ManyToManyField(EBook)
Concrete base model is an approach that does create a separate table in the database.
In this example we have both Books and EBooks in our store. Note that the Product Model is not defined with abstract = True
.
class Product(models.Model):
    name = models.CharField(
        max_length=100,
    )
    price = models.PositiveIntegerField(
        help_text='in cents',
    )

class Book(Product):
    weight = models.PositiveIntegerField()

class EBook(Product):
    download_link = models.URLField()
Create new products:
book = Book.objects.create(...)
ebook = EBook.objects.create(...)
This will add an entry in the Product table for each product, containing all the common fields, and one entry each in the Book and EBook tables for the values specific to the respective product.
Notice the product_ptr_id column in the Book table, which points to the corresponding row in the Product table.
Column | Type |
---|---|
product_ptr_id | integer |
weight | integer |
Django performs a join in the background when fetching a book.
Since all products are managed in the Product table, you can reference them with a single relation to Product in the Cart model.
class Cart(models.Model):
    ...
    items = models.ManyToManyField(Product)

cart = Cart.objects.create(user=user)
cart.items.add(book, ebook)
31 Jul 2023
To receive a signal, register a receiver function:
from django.core.signals import request_finished

def my_callback(sender, **kwargs):
    print("Request finished!")

request_finished.connect(my_callback)
or
from django.dispatch import receiver

@receiver(request_finished)
def my_callback(sender, **kwargs):
    print("Request finished!")
my_callback
function will be called each time a request finishes.
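from django.db.models.signals import pre_save
from django.dispatch import receiver

@receiver(pre_save, sender=MyModel)
def my_handler(sender, **kwargs):
    print("MyModel instance is about to be saved!")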
The handler will only be called when an instance of MyModel is saved.
import django.dispatch

task_done = django.dispatch.Signal()

# a method on whichever class sends the signal
def send_task_signal(self, task_name):
    task_done.send(sender=self.__class__, task_name=task_name)
To disconnect from a signal simply use:
Signal.disconnect()
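To make the connect / send / disconnect cycle concrete without a Django project, here is a toy stand-in. ToySignal is a made-up class for illustration only; it is not Django's implementation and ignores features such as sender filtering and weak references.

```python
class ToySignal:
    """Toy stand-in for a signal; not Django's implementation."""

    def __init__(self):
        self._receivers = []

    def connect(self, receiver):
        if receiver not in self._receivers:
            self._receivers.append(receiver)

    def disconnect(self, receiver):
        if receiver in self._receivers:
            self._receivers.remove(receiver)
            return True
        return False

    def send(self, sender, **kwargs):
        # Call every connected receiver and collect (receiver, response) pairs.
        return [(r, r(sender=sender, **kwargs)) for r in self._receivers]


task_done = ToySignal()
log = []


def on_task_done(sender, **kwargs):
    log.append(kwargs["task_name"])


task_done.connect(on_task_done)
task_done.send(sender=None, task_name="resize")
task_done.disconnect(on_task_done)
task_done.send(sender=None, task_name="upload")  # no receivers left
print(log)  # ['resize']
```

After the receiver is disconnected, later sends no longer reach it.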
22 Jul 2023
On the data side, data augmentation helps models generalize.
On the model side, weight decay can help us generalize.
To prevent overfitting we shouldn’t allow our models to get too complex (e.g. a high-degree polynomial that passes through every training point but overfits badly). Having fewer parameters can prevent your model from getting overly complex, but it’s a limiting strategy.
To penalize complexity we can
→ add all our parameters (weights) to the loss function,
but since some of them are positive and some negative, we can
→ add the squares of all the parameters to the loss function
but this might make the loss so large that the best model would have all parameters set to 0, so
→ we multiply the sum of squares by a small number called weight decay, or wd.
Our loss function will look as follows:
Loss = MSE(y_hat, y) + wd * sum(w^2)
y_hat
= predicted or estimated value of the target variable
y
= actual target value
When we update weights using gradient descent we do the following:
w(t) = w(t-1) - lr * dLoss / dw
lr
= learning rate, a hyperparameter that determines the step size or the rate at which weights are updated during training
dLoss
= derivative of the loss function wrt the weights, represents how the loss function changes as the weights are modified
Generally wd = 0.1
works pretty well.
→ Too much weight decay: no matter how much you train, the model will never fit quite well enough.
→ Too little weight decay: you can still train well, but you need to stop a bit early.
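The loss and update rule above can be tried on a tiny problem. This sketch assumes a one-parameter model y_hat = w * x and a hand-made dataset; note that the penalty wd * w^2 contributes 2 * wd * w to the gradient.

```python
def loss_and_grad(w, xs, ys, wd):
    """MSE of the model y_hat = w * x plus wd * w^2, and its derivative wrt w."""
    n = len(xs)
    mse = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
    d_mse = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
    loss = mse + wd * w ** 2
    d_loss = d_mse + 2 * wd * w  # derivative of wd * w^2 is 2 * wd * w
    return loss, d_loss


def train(xs, ys, wd, lr=0.05, steps=500):
    w = 0.0
    for _ in range(steps):
        _, d_loss = loss_and_grad(w, xs, ys, wd)
        w = w - lr * d_loss  # w(t) = w(t-1) - lr * dLoss / dw
    return w


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2 * x

w_plain = train(xs, ys, wd=0.0)
w_decayed = train(xs, ys, wd=0.1)
print(w_plain, w_decayed)
```

Without weight decay the weight converges to 2; with wd = 0.1 it settles slightly below 2, because the penalty pulls it towards zero.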
25 Jun 2023
When you break your code you will often get some kind of exception. In neural nets everything could be correct syntactically, but the whole thing isn’t arranged properly and it’s very hard to tell.
E.g. you forgot to flip your labels when you flipped the images during data augmentation. Your neural network can still work pretty well because it internally learns to detect flipped images and then flips its predictions.
Do not touch any neural network code. Instead spend a couple of hours scanning and analyzing the data. You can find things like corrupted images, duplicates etc.
Look for imbalances and biases. Pay attention to your own process for classifying the data, which hints at the kinds of architectures you’ll eventually wanna explore.
Search/filter/sort by whatever you can think of, visualize distributions and outliers along any axis. Outliers almost always uncover some bugs in data quality.
Set up a full training + evaluation skeleton and gain trust in its correctness via experiments. Pick some simple model, e.g. a linear classifier.
First get a model large enough that it can overfit, then regularize it. If we are not able to reach a low error rate with any model at all, that may indicate some issues, bugs or misconfiguration.
Notes from karpathy.github.io
07 Jun 2023
Memoization is a technique that improves the performance of a function by caching its results.
When a memoized function is called with a set of inputs, it checks if it has already computed and stored the result for those inputs. If it has, it returns the cached result instead of recomputing it, saving time and resources.
fib_cache = {}

def fibonacci(n):
    # Check if the result is already cached
    if n in fib_cache:
        return fib_cache[n]
    # Compute and cache the result
    if n <= 1:
        result = n
    else:
        result = fibonacci(n-1) + fibonacci(n-2)
    fib_cache[n] = result
    return result

# The first call will compute and cache the result
print(fibonacci(5))  # Output: 5
# The second call will retrieve the cached result
print(fibonacci(5))  # Output: 5
In this example when the function is called with a specific input (e.g., fibonacci(5)
), it first checks if the result is already present in the cache. If it is, it returns the cached result.
Memoization is particularly useful for functions with expensive computations or repetitive calculations. By caching and reusing results, the function can avoid redundant work and achieve a significant performance improvement.
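Python's standard library also offers this pattern as a decorator. A minimal sketch of the same Fibonacci function using functools.lru_cache, with an unbounded cache, could look like this:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n - 1) + fibonacci(n - 2)


print(fibonacci(30))           # 832040
print(fibonacci.cache_info())  # hits/misses recorded by the cache
```

The decorator keeps the cache out of the function body, and cache_info() shows how many calls were answered from the cache.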
29 May 2023
The :nth-child()
is a pseudo-class that matches elements based on their position in a group of siblings.
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
<li>Item 4</li>
<li>Item 5</li>
</ul>
li:nth-child(2) {
color: grey;
}
Result: only the second list item (Item 2) is rendered in grey.
/* select odd elements */
:nth-child(odd)
/* select even elements */
:nth-child(even)
/* select every 5th element */
:nth-child(5n)
19 May 2023
DataFrame.iloc
allows you to select rows and columns using integer-based indexing. You can pass it an integer, a list of integers, a boolean array or a function.
import pandas as pd
mydict = [{'a': 1, 'b': 2},
          {'a': 100, 'b': 200}]
df = pd.DataFrame(mydict)
# Output:
# a b
# 0 1 2
# 1 100 200
df.iloc[0]
# Output:
# a 1
# b 2
df.iloc[[True, False]]
# a b
# 0 1 2
17 May 2023
→ Iterators are fast and memory-efficient
→ Can be used with lists, tuples, dictionaries, sets, etc.
→ Can be finite or infinite
count(start, step)
Starts from start and keeps counting up by step indefinitely.
import itertools
for i in itertools.count(5, 5):
    if i == 20:
        break
    else:
        print(i, end=" ")
# Output: 5 10 15
cycle(iterable)
count = 0
for i in itertools.cycle('AB'):
    if count > 7:
        break
    else:
        print(i, end=" ")
        count += 1
# Output: A B A B A B A B
repeat(val, num)
Repeats val an infinite number of times. If the optional num is provided, it repeats val num times.
print(list(itertools.repeat(25, 4)))
# Output: [25, 25, 25, 25]
product()
Computes the cartesian product on input iterables.
→ Use the optional repeat
keyword to compute the product of an iterable with itself.
list(itertools.product([1, 2], repeat=2))
# Output: [(1, 1), (1, 2), (2, 1), (2, 2)]
list(itertools.product(['hello', 'world'], '2'))
# Output: [('hello', '2'), ('world', '2')]
permutations()
Generates all possible permutations of an iterable.
→ Optional argument group_size
becomes length of the iterable if not specified.
list(itertools.permutations([1, 'hello'], 2))
# Output: [(1, 'hello'), ('hello', 1)]
chain()
Chains iterables one after another.
li1 = [1, 2, 3, 4]
li2 = [5, 6, 7, 8]
list(itertools.chain(li1, li2))
# Output: [1, 2, 3, 4, 5, 6, 7, 8]
filterfalse()
Only returns the values for which the passed function returns false.
li = [2, 4, 5, 7]
list(itertools.filterfalse(lambda x : x % 2 == 0, li))
# Output: [5, 7]
takewhile()
Returns values until the function returns false for the first time. The opposite of takewhile()
is dropwhile()
.
li = [2, 4, 6, 7]
list(itertools.takewhile(lambda x : x % 2 == 0, li))
# Output: [2, 4, 6]
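Since dropwhile() is mentioned but not shown, here is a short side-by-side sketch on a slightly longer list (the extra trailing 8 is added to show that dropwhile() keeps everything after the first failing element, even values that would pass the test again).

```python
import itertools

li = [2, 4, 6, 7, 8]
taken = list(itertools.takewhile(lambda x: x % 2 == 0, li))
dropped = list(itertools.dropwhile(lambda x: x % 2 == 0, li))
print(taken)    # [2, 4, 6]
print(dropped)  # [7, 8]
```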
30 Apr 2023
However, guidance can also be performed by a pure generative model without a separate classifier → Classifier-free guidance (CFG)
.
→ Jointly train a conditional (one that uses a prompt, e.g. text) and an unconditional (no prompt) diffusion model, and combine the resulting conditional and unconditional scores to get a trade-off between sample quality and diversity.
→ CFG improves quality while reducing sample diversity in the diffusion model.
→ Images generated using CFG are very similar to each other, but also high in quality.
→ Avoids training another classifier.
→ Very simple to implement (one-line code change)
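The "one-line" combination can be sketched as below. This uses one common formulation, uncond + scale * (cond - uncond); other write-ups parameterize the scale differently. The function name, guidance_scale and the example values are assumptions made for illustration, and plain lists stand in for the model's predictions.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Combine unconditional and conditional predictions element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Made-up predictions from the same network, without and with the prompt.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]

print(classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=0.0))  # unconditional
print(classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=1.0))  # conditional
print(classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0))  # pushed past conditional
```

With this convention a scale of 0 ignores the prompt, 1 reproduces the conditional prediction, and larger values push further in the direction the prompt suggests, which is where the quality versus diversity trade-off comes from.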
26 Apr 2023
Temperature increases or decreases the “confidence” a model has in its most likely response.
A high temperature means the model will also give noticeable probability to other responses, not just the one it deems the most correct.
→ High Temperature = More Creative Model
→ Low Temperature = Less Creative Model
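One common way temperature is applied is by dividing the model's scores (logits) by it before the softmax. The sketch below assumes that setup; the logits are invented scores for three candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, softmax_with_temperature(logits, t))
```

At a low temperature most of the probability goes to the top-scoring token; at a high temperature the distribution flattens, so sampling picks the other tokens more often.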