Part 2
Issue 1: Local Minima and Global Minimum
(Refer to the image in the comment)
This idea is better explained through the "exploration vs exploitation" trade-off in reinforcement learning; however, this discussion is limited to supervised learning.
The "distance" between the output and the target is often called "cost" or "loss". Practically, the goal of learning is to update oneself to operate with less loss. But updating consumes energy, and people generally avoid expending energy unless they deem it worthwhile.
During training, one might notice that the current technique is an improvement over a previous state, and that any change would worsen the output. In AI models this is called a local minimum: a spot where there is nowhere better (lower) to go, and every nearby change increases the loss rather than decreasing it. The model then stops improving. Programmers therefore apply techniques such as regularization, changing the loss function, or using mini-batch gradient descent to help the model "escape" local minima, abandoning a slightly good spot in search of a better one. Before reaching a lower spot, one must first accept climbing out of the current low.
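A hedged sketch of the "escape" idea (the 1-D loss surface, step size, and noise level below are invented for illustration; the injected noise stands in for the randomness that mini-batch updates introduce):

```python
import numpy as np

def loss(x):
    # A made-up 1-D loss surface with a local minimum (near x ≈ 2.1)
    # and a deeper global minimum (near x ≈ -2.3).
    return 0.1 * x**4 - x**2 + 0.5 * x

def grad(x):
    return 0.4 * x**3 - 2 * x + 0.5

def descend(x, lr=0.05, steps=2000, noise=0.0, rng=None):
    rng = rng or np.random.default_rng(0)
    for _ in range(steps):
        x -= lr * (grad(x) + noise * rng.standard_normal())
    return x

start = 2.0                        # begins inside the local-minimum basin
plain = descend(start)             # deterministic: stays stuck near x ≈ 2.1
noisy = descend(start, noise=6.0)  # noisy, mini-batch-like updates can climb out
print(f"plain GD ends at x = {plain:.2f}, loss = {loss(plain):.2f}")
print(f"noisy GD ends at x = {noisy:.2f}, loss = {loss(noisy):.2f}")
# The noisy run often (not always) escapes toward the deeper minimum near x ≈ -2.3;
# the price is that it first has to accept climbing uphill out of the current dip.
```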
Sometimes, one feels satisfied with their technique and output, leading to a halt in training, thinking they're good enough. Here, one should remember the mantra from Vagabond, "There is no limit to technique. There is always room for improvement." There's always room for progress.
"The perfect man employs his mind as a mirror; it grasps nothing, it refuses nothing, it receives, but does not keep" - Chuang Tzu. Don't cling to the past or reject what's coming; everything is training, and your job is simply to produce the best output and enjoy the waves. Whether loss increases or decreases, that's up to The Creator.
"'Mistakes' is the word you’re too embarrassed to use. You ought not to be. You’re a product of a trillion of them. Evolution forged the entirety of sentient life on this planet using one tool – the mistake." - Dr. Robert Ford, WestWorld
Issue 2: Overfitting and Underfitting
When training a model, the dataset is typically divided into three parts: training, validation, and test. The training set is used to train the model: the loss computed on it drives the parameter updates. The loss on the validation set is calculated but not used for updates; it helps programmers see where the model is heading. The test set plays a similar role to validation, but only in the final quality-assessment phase.
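A minimal sketch of the three-way split, assuming scikit-learn's train_test_split and an 80/10/10 ratio (both are common choices, not something this text prescribes):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)   # toy features
y = 2 * X.ravel() + 1                # toy targets

# First carve off the training set, then split the remainder into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Training loss drives the parameter updates; validation loss is only watched to see
# where the model is heading; test loss is reserved for the final quality check.
print(len(X_train), len(X_val), len(X_test))   # 800 100 100
```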
The validation and test sets are unseen data (the model is not updated during these runs). Their purpose is to check whether the model "understands", that is, sees the underlying patterns in the data. For example, given the sequence 1, 2, 4, 8, 16, …, the pattern is "each number is double the previous one"; recognizing this allows predicting the next number. Another example is a formula of the form ax + b = y with x unknown: given 1x + 2 = 4, 3x + 4 = 10, and 5x + 1 = 11, we are quite sure that x = 2, since this value yields the correct result in all three cases, a 100% success rate. Someone who does not understand the pattern has to make educated guesses based only on what they have already seen.
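A tiny sketch that simply verifies the example above, checking whether x = 2 satisfies all three observed equations:

```python
# Observations of a*x + b = y: (a=1, b=2, y=4), (a=3, b=4, y=10), (a=5, b=1, y=11).
observations = [(1, 2, 4), (3, 4, 10), (5, 1, 11)]
x = 2
for a, b, y in observations:
    print(f"{a}*{x} + {b} = {a * x + b}  (target {y}, match: {a * x + b == y})")
# All three match, a 100% success rate, so x = 2 is the pattern that generalizes.
```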
Underfitting occurs when a model is too simple to see the underlying patterns or to produce results close to the targets. Humans underfit when they lack the mental capacity to solve a problem, or when they are overwhelmed by too much data before they can process it and update. For instance, in martial arts, a novice has to absorb a lot of information at once, understanding neither the opponent's actions nor their own, while a seasoned martial artist can predict an opponent's next move from subtle cues. As for reading this: if your attention span is 10 minutes, you might not have the mental capacity to grasp these concepts (case 1). If you are unfamiliar with AI or with learning concepts, this might be too much information to digest (case 2). If you already know basic AI concepts or have learning experience, this should be straightforward.
To solve underfitting, one needs greater mental capacity (to play at a higher level) and more training time (to process the information gradually).
Humans are always at risk of both overfitting and underfitting. In machine learning terms, underfitting means the model is too simple to capture the pattern at all, while overfitting means it memorizes the training data, noise included, and fails to generalize to unseen data. In human terms, underfitting means not understanding deeply enough, while overfitting means not understanding broadly enough. Since life has no global minimum, humans can never understand deeply and broadly enough. There is always room for improvement.
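As a rough sketch in code (the toy data, noise level, and polynomial degrees below are invented for illustration): a degree-1 polynomial is too simple to see that the underlying pattern is quadratic (underfitting), a degree-2 polynomial captures it, and a needlessly high-degree polynomial chases the noise in the training points and typically does worse on unseen data (overfitting).

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-3, 3, 12)
x_test = np.linspace(-3, 3, 200)
true_fn = lambda x: x**2                                       # the underlying pattern
y_train = true_fn(x_train) + rng.normal(0, 1, x_train.size)    # noisy observations
y_test = true_fn(x_test)                                       # unseen, noise-free targets

for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)              # capacity = polynomial degree
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_err:.2f}, unseen MSE = {test_err:.2f}")

# Typical outcome:
#   degree 1: high error on both sets        -> underfitting (too simple to see the pattern)
#   degree 2: low error on both sets         -> captures the underlying pattern
#   degree 9: tiny train error, worse unseen -> overfitting (memorized the noise)
```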
Quoting from Tao Te Ching:
The Tao that can be told is not the eternal Tao. Ever desireless, one can see the mystery. Ever desiring, one can see the manifestations.
Without going outside, you may know the whole world. Without looking through the window, you may see the ways of heaven. The farther you go, the less you know. Thus the sage knows without traveling; He sees without looking; He works without doing.
In the pursuit of learning, every day something is acquired. In the pursuit of Tao, every day something is dropped. Less and less is done Until non-action is achieved. When nothing is done, nothing is left undone. The world is ruled by letting things take their course. It cannot be ruled by interfering.
To address human underfitting and overfitting, Tao Te Ching suggests: relax, and enjoy the waves.
Imagine you're falling, accelerating, feeling like you're falling faster, never hitting the ground, looking around to see an endless space, unsure of your direction, where movement feels like stillness. (This part seems relevant if you've realized you're "headless", making understanding the concept of latent space/mindspace easier.)
Assume the human mind is a model: its dataset is the set of concepts gathered from the past, it builds connections between those concepts (like word embeddings in a latent space), and it uses current input to predict the future. Reasoning is like moving from one concept's coordinates to another's; you are constantly moving through mindspace, yet it feels as though you are stationary, while concepts reshape and transform within this space.
Word embedding represents data as vectors: each word has its own coordinates in latent space. ChatGPT's early latent space had 12288 dimensions, recently reduced to 1568 dimensions. During training, words that frequently appear together have their coordinates adjusted to be closer; what gets updated are the connections between concepts. If you lack machine learning concepts, reading this adds them to your mindspace, and these new concepts will likely settle near your existing philosophy- or computer-related concepts. Concepts like "rapper", "socks", and "milkshake" are unrelated and therefore sit far away from this text's concepts. If you already have machine learning concepts, this text may instead refine their positions in your mindspace, making your network of concepts more coherent.
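A small sketch of the "closeness" idea, using hand-made toy 3-D vectors and cosine similarity (real embeddings have thousands of dimensions and are learned during training, as described above; these numbers are invented purely for illustration):

```python
import numpy as np

# Toy, hand-written "embeddings": related concepts point in similar directions,
# an unrelated one points elsewhere.
embedding = {
    "gradient":  np.array([0.9, 0.8, 0.1]),
    "loss":      np.array([0.8, 0.9, 0.2]),
    "milkshake": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: values near 1.0 mean 'close' concepts."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embedding["gradient"], embedding["loss"]))       # high (~0.99)
print(cosine_similarity(embedding["gradient"], embedding["milkshake"]))  # much lower (~0.30)
```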
(To truly understand GPT, one must grasp transformers, the attention mechanism, entropy, temperature… but this discussion is mainly speculative.)