This project is an enhanced version of naklecha/llama3-from-scratch. It comprehensively improves and optimizes the original project, aiming to help readers more easily understand and master the implementation principles of the Llama3 model and its detailed reasoning process. Thanks to the original author for their contribution :)
- Structural Optimization
- Code Annotations
- Dimension Tracking
- Principle Explanation
- KV-Cache Insights
- Bilingual Documents
- Loading the model
- Convert the input text into embeddings
- Build the first Transformer block
- Normalization
- Implementing the single-head attention mechanism from scratch
- Obtain the QKV vectors corresponding to the input tokens
- Add positional information to the query and key vectors
- Everything's ready. Let's start calculating the attention weights between tokens.
- Finally! Calculate the final result of the single-head attention mechanism!
- Calculate the multi-head attention mechanism (a simple loop to repeat the above process)
- Perform the residual operation (add)
- Perform the second normalization operation
- Perform the calculation of the FFN (Feed-Forward Neural Network) layer
- Perform the residual operation again (Finally, we get the final output of the Transformer block!)
- Everything is here. Let's complete the calculation of all 32 Transformer blocks (a minimal sketch of a single block follows this contents list). Happy reading :)
- Let's complete the last step and predict the next token
- Let's dive deeper and see how different embeddings or token masking strategies might affect the prediction results :)
- Need to predict multiple tokens? Just use the KV-Cache! (It really took me a lot of effort to sort this out. Orz)
- Thank you all for your continued learning. Love you all :)
- LICENSE
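
For orientation, below is a minimal sketch of the per-block computation the sections above walk through (RMSNorm -> RoPE attention -> residual add -> RMSNorm -> SwiGLU FFN -> residual add). The helper names, weight-dictionary keys, and tiny dimensions are illustrative assumptions rather than the notebook's actual code, and the attention shown here is plain multi-head attention without a KV-Cache (the real Llama3 model uses grouped-query attention).

```python
# Illustrative sketch of one Llama-style Transformer block (not the notebook's exact code).
import torch
import torch.nn.functional as F

def rms_norm(x, weight, eps=1e-5):
    # Normalize each token vector by its root mean square, then scale element-wise.
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

def rope(x, base=500000.0):
    # Rotary positional embedding: rotate pairs of dimensions by a
    # position-dependent angle so dot products encode relative position.
    seq_len, n_heads, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)  # (seq, head_dim/2)
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def transformer_block(x, w, n_heads=4):
    # x: (seq_len, dim); w: dict of this block's weights (hypothetical key names).
    seq_len, dim = x.shape
    head_dim = dim // n_heads

    # 1) Pre-attention RMSNorm.
    h = rms_norm(x, w["attn_norm"])
    # 2) Project to Q, K, V; apply RoPE to the query and key vectors.
    q = rope((h @ w["wq"]).view(seq_len, n_heads, head_dim))
    k = rope((h @ w["wk"]).view(seq_len, n_heads, head_dim))
    v = (h @ w["wv"]).view(seq_len, n_heads, head_dim)
    # 3) Causal attention per head, then concatenate heads and project out.
    scores = torch.einsum("qhd,khd->hqk", q, k) / head_dim**0.5
    mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
    attn = torch.softmax(scores + mask, dim=-1)
    ctx = torch.einsum("hqk,khd->qhd", attn, v).reshape(seq_len, dim)
    x = x + ctx @ w["wo"]  # first residual add

    # 4) Pre-FFN RMSNorm, SwiGLU feed-forward, second residual add.
    h = rms_norm(x, w["ffn_norm"])
    ffn = (F.silu(h @ w["w1"]) * (h @ w["w3"])) @ w["w2"]
    return x + ffn

# Tiny usage example with random weights (dim=8, hidden=16, seq_len=5).
dim, hidden, seq = 8, 16, 5
w = {
    "attn_norm": torch.ones(dim), "ffn_norm": torch.ones(dim),
    "wq": torch.randn(dim, dim), "wk": torch.randn(dim, dim),
    "wv": torch.randn(dim, dim), "wo": torch.randn(dim, dim),
    "w1": torch.randn(dim, hidden), "w3": torch.randn(dim, hidden),
    "w2": torch.randn(hidden, dim),
}
print(transformer_block(torch.randn(seq, dim), w).shape)  # torch.Size([5, 8])
```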