Are these model in any kind optimized for open cv (i doubt it)?
You don't optimize a network for OpenCV but you can design a network for speed. See for instance MobileNets (MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications) architecture.
The first one has been trained with (in my opinion) speed performance in mind: how_to_train_face_detector.txt
. fp16
should mean that weights are stored in 16-bits floating points (Half-precision floating-point format), so less memory usage / better memory transfer.
This is a brief description of training process which has been used to get
res10_300x300_ssd_iter_140000.caffemodel.
The model was created with SSD
framework using ResNet-10 like
architecture as a backbone. Channels
count in ResNet-10 convolution layers
was significantly dropped (2x- or 4x-
fewer channels). The model was trained
in Caffe framework on some huge and
available online dataset.
The second model openface.nn4.small2.v1.t7
is a reduced network (3733968 vs 6959088 parameters for nn4
) to try to achieve good accuracy with less parameters and so better computation time.
More information from my different research.
Convolution, deep learning and performance
Convolution is a very expensive operation, as you can see in the following image (Learning Semantic Image Representations at a Large Scale, Yangqing Jia):
To perform efficient convolution, matrix-matrix multiplication (GEMM or General matrix-matrix multiplication) is done instead (Why GEMM is at the heart of deep learning). Data must be re-ordered in a way suitable to perform convolution as a GEMM operation. This is known as im2col operation.
BLAS (Basic Linear Algebra Subprograms) libraries are crucial to have good performance on CPU. You can see for instance the prerequisite dependencies of the Caffe framework:
Caffe has several dependencies:
CUDA is required for GPU mode.
library version 7+ and the latest driver version are recommended, but 6.* is fine too
5.5, and 5.0 are compatible but considered legacy
BLAS via ATLAS, MKL, or OpenBLAS.
Boost >= 1.55
protobuf, glog, gflags, hdf5
Optional dependencies:
OpenCV >= 2.4 including 3.0
IO libraries: lmdb, leveldb (note: leveldb requires snappy)
cuDNN for GPU acceleration (v6)
To achieve better performance, expensive deep learning operations are accelerated using dedicated libraries:
In OpenCV, convolution layer has been carefully optimized using multithreading, vectorized instruction, cache friendly optimizations. But still, better performance is achieved when Intel Inference Engine is installed.
FP16, INT8 precision
Half or reduced precision is also a mean to achieve better performance with better memory transfer, memory usage and higher operations per second. Higher theoretical FLOPS of FP16 vs FP32 is subject to the hardware used since if there is no dedicated FP16 CPU "unit", the operation should be done in FP32 precision. The following image shows FP16 hardware (Half Precision Benchmarking for HPC):
See also Mixed-Precision Training of Deep Neural Networks:
Mixed-precision training lowers the required resources by using lower-precision arithmetic, which has the following benefits.
Decrease the required amount of memory. Half-precision floating point format (FP16) uses 16 bits, compared to 32 ...
(more)
"does it have a "model zoo" ? -- yes, look here
wow - thats a lot of models - thank you!