- Add loop unrolling for some operations
- Fine-tune performance for specific CPUs
- Add documentation
- Use more AVX512 features, in particular better use of __mmask (masked operations to replace blend)
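
As a rough illustration of what the loop unrolling item could look like, here is a minimal sketch in portable C. The function name `sum_unrolled4`, the unroll factor of 4, and the choice of a sum reduction are assumptions made for this example; they are not taken from the project's code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum an int array with the loop unrolled by 4.
 * Four independent accumulators shorten the dependency chain so the CPU
 * can overlap the additions. */
static long sum_unrolled4(const int *data, size_t n)
{
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    assert(data != NULL || n == 0);

    /* Main unrolled body: handles 4 elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += data[i];
        acc1 += data[i + 1];
        acc2 += data[i + 2];
        acc3 += data[i + 3];
    }

    /* Remainder loop: handles the last 0 to 3 elements. */
    for (; i < n; ++i) {
        acc0 += data[i];
    }

    return acc0 + acc1 + acc2 + acc3;
}
```

Calling `sum_unrolled4(values, 10)` on an array holding 1 through 10 returns 55. Whether manual unrolling actually helps depends on the compiler and the target CPU, so it is worth measuring before and after.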
