1. Duration
Monday, November 14th, 2022 - Saturday, November 19th, 2022
2. Learning Record
2.1 Fine-Tuned the ViT Model
I fine-tuned the vision transformer model on the cat_and_dog dataset and the flowers dataset. The model achieved approximately 86% accuracy on the flowers dataset.
I also refactored the micro-expression spotting code to fit the input shape of the vision transformer, but the results were very poor; I need more time to fine-tune the hyperparameters.
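For reference, here is a minimal sketch of this week's fine-tuning setup. The tf_flowers dataset from TensorFlow Datasets stands in for my flowers data, and build_vit is a hypothetical helper for the ViT classifier I already had; everything else is standard Keras fine-tuning code.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

IMG_SIZE, BATCH = 224, 32

def preprocess(image, label):
    # Resize to the ViT input resolution and scale pixels to [0, 1].
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE)) / 255.0
    return image, label

# tf_flowers (5 classes, ~3.7k images) stands in for the flowers dataset.
train_ds = (tfds.load("tf_flowers", split="train[:80%]", as_supervised=True)
            .map(preprocess).shuffle(1024).batch(BATCH).prefetch(tf.data.AUTOTUNE))
val_ds = (tfds.load("tf_flowers", split="train[80%:]", as_supervised=True)
          .map(preprocess).batch(BATCH).prefetch(tf.data.AUTOTUNE))

model = build_vit(num_classes=5)  # hypothetical helper returning a Keras ViT classifier
model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-4),  # small learning rate for fine-tuning
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
model.fit(train_ds, validation_data=val_ds, epochs=10)
```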
2.2 Learned Swin Vision Transformer
I found a vision transformer variant with a different structure called the Swin Transformer (Swin ViT). I watched an introductory video and planned to read the code the next week.
2.3 Learned SL-ViT
I read the paper [1] and refactored the code in TensorFlow so that its structure is similar to the vision transformer code shown in the d2l notebook.
The code worked well and gave similar results on the cat_and_dog and flowers datasets.
I read the code carefully and found that SL-ViT changes two parts of the vision transformer: the patch embedding (Shifted Patch Tokenization, SPT) and the self-attention module (Locality Self-Attention, LSA).
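To keep the two changes straight, I sketched both in TensorFlow. This is my own rough reading of the paper, not the authors' reference code: tf.roll stands in for the paper's zero-padded diagonal shift, and the learnable temperature is a single scalar rather than a per-head parameter.

```python
import tensorflow as tf

class ShiftedPatchTokenization(tf.keras.layers.Layer):
    """SPT: concatenate the image with four diagonally shifted copies, then patchify."""
    def __init__(self, patch_size, embed_dim):
        super().__init__()
        self.p = patch_size
        self.norm = tf.keras.layers.LayerNormalization()
        self.proj = tf.keras.layers.Dense(embed_dim)

    def call(self, images):
        s = self.p // 2
        # Four diagonal half-patch shifts (tf.roll wraps around; the paper zero-pads).
        shifted = [tf.roll(images, shift=[dy, dx], axis=[1, 2])
                   for dy in (-s, s) for dx in (-s, s)]
        x = tf.concat([images] + shifted, axis=-1)           # (B, H, W, 5*C)
        patches = tf.image.extract_patches(
            x, sizes=[1, self.p, self.p, 1], strides=[1, self.p, self.p, 1],
            rates=[1, 1, 1, 1], padding="VALID")             # (B, H/p, W/p, p*p*5*C)
        b = tf.shape(patches)[0]
        patches = tf.reshape(patches, (b, -1, patches.shape[-1]))
        return self.proj(self.norm(patches))                 # (B, N, embed_dim)

class LocalitySelfAttention(tf.keras.layers.Layer):
    """LSA: self-attention with a learnable temperature and the diagonal masked out."""
    def __init__(self, dim, num_heads):
        super().__init__()
        self.h, self.dk = num_heads, dim // num_heads
        self.qkv = tf.keras.layers.Dense(3 * dim)
        self.out = tf.keras.layers.Dense(dim)
        # Learnable temperature, initialised at the usual 1/sqrt(d_k) scale.
        self.tau = tf.Variable(self.dk ** -0.5, trainable=True, name="temperature")

    def call(self, x):
        b, n = tf.shape(x)[0], tf.shape(x)[1]
        qkv = tf.reshape(self.qkv(x), (b, n, 3, self.h, self.dk))
        q, k, v = tf.unstack(tf.transpose(qkv, [2, 0, 3, 1, 4]))  # each (B, h, N, dk)
        scores = tf.matmul(q, k, transpose_b=True) * self.tau
        # Mask each token's attention to itself so it must attend to its neighbours.
        scores += tf.eye(n, dtype=scores.dtype) * -1e9
        out = tf.matmul(tf.nn.softmax(scores, axis=-1), v)        # (B, h, N, dk)
        out = tf.reshape(tf.transpose(out, [0, 2, 1, 3]), (b, n, self.h * self.dk))
        return self.out(out)
```

Both pieces drop into the d2l-style ViT: SPT replaces the plain patch embedding, and LSA replaces the standard multi-head self-attention inside each encoder block.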
3. Feeling
I don't feel good even though the code ran well, because I have been locked down in my dorm for ten days.
4. Reference
[1] S. H. Lee, S. Lee, and B. C. Song, “Vision Transformer for Small-Size Datasets,” arXiv, 2021. doi: 10.48550/arXiv.2112.13492.