How to Create an AI Model for a Mobile Environment
The essence of good technology is that people do not feel its presence. AI is a good example: it is everywhere in daily life, yet it feels perfectly ordinary. In 2013 only 13 apps with AI features appeared on the Google Play store; that number grew to 144 in 2015, and in 2016 another 873 new AI-powered apps were uploaded. With the help of AI, an app can simplify the user's operations and feel more intuitively driven. At the same time, building an app with AI still poses real challenges for developers. Don't worry: this blog will give you a basic idea of how to choose your AI model and how to fit that model into a mobile app environment.
TOPIC 1: Depthwise Convolution and Pointwise Convolution
Depthwise (DW) convolution and pointwise (PW) convolution are together called depthwise separable convolution (see Google's Xception). The structure plays the same role as a conventional convolution and can be used to extract features, but with far fewer parameters and a lower computational cost. That is why you see it in lightweight networks such as MobileNet.
Normal convolution
Take a 5×5, three-channel color input image (shape 5×5×3). After a convolution layer with 3×3 kernels (assuming 4 output channels, so the kernel tensor has shape 3×3×3×4), 4 feature maps are finally produced. With same padding, the output keeps the size of the input layer (5×5); without it, the size shrinks to 3×3.
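To make the shapes concrete, here is a minimal sketch of the standard convolution just described. PyTorch is an assumption on my part; the blog does not fix a framework.

```python
import torch
import torch.nn as nn

# Standard convolution: a 3-channel 5x5 input and 3x3 kernels with 4 output channels.
x = torch.randn(1, 3, 5, 5)             # (batch, channels, height, width)

conv_same = nn.Conv2d(3, 4, kernel_size=3, padding=1)   # "same" padding
conv_valid = nn.Conv2d(3, 4, kernel_size=3, padding=0)  # no padding

print(conv_same(x).shape)   # torch.Size([1, 4, 5, 5])
print(conv_valid(x).shape)  # torch.Size([1, 4, 3, 3])
```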
Depthwise Convolution
Different from the conventional convolution operation, in depthwise convolution each convolution kernel is responsible for one channel, and each channel is convolved by exactly one kernel. In the conventional convolution described above, every kernel operates on all channels of the input image at once.
For a 5×5, three-channel color input image (shape 5×5×3), depthwise convolution performs the first convolution step entirely in the two-dimensional plane, unlike the conventional convolution above. The number of convolution kernels equals the number of channels of the previous layer (channels and kernels correspond one-to-one). A three-channel image therefore produces three feature maps after the operation (with same padding, the size stays 5×5, matching the input), as shown in the figure below.
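In PyTorch (again an assumption), depthwise convolution is usually expressed through the `groups` argument; a small sketch of the same 5×5×3 example:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 5, 5)

# Depthwise convolution: groups equals the number of input channels,
# so each 3x3 kernel only ever sees its own channel.
depthwise = nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3)

print(depthwise(x).shape)                               # torch.Size([1, 3, 5, 5])
print(sum(p.numel() for p in depthwise.parameters()))   # 3*3*3 weights + 3 biases = 30
```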
Pointwise convolution
Pointwise convolution is very similar to the conventional convolution operation, except that the kernel size is 1×1×M, where M is the depth (number of channels) of the previous layer. The convolution therefore combines the maps from the previous step as a weighted sum along the depth direction to generate new feature maps: there is one output feature map per filter, as shown in the figure below.
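Chaining the two pieces gives the full depthwise separable convolution. A hedged PyTorch sketch of the pointwise step and the combined block:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 5, 5)
depthwise = nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3)
dw_maps = depthwise(x)                     # the three per-channel feature maps

# Pointwise convolution: 1x1 kernels combine the three depthwise maps
# (a weighted sum along the depth direction) into 4 output channels.
pointwise = nn.Conv2d(3, 4, kernel_size=1)
print(pointwise(dw_maps).shape)            # torch.Size([1, 4, 5, 5])

# Chained together they form one depthwise separable convolution.
separable = nn.Sequential(depthwise, pointwise)
print(separable(x).shape)                  # torch.Size([1, 4, 5, 5])
```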
The parameters saved
Under normal circumstances this sharply reduces the memory and computation needed to run the model. For a Dk×Dk kernel with M input channels and N output channels, a standard convolution needs Dk·Dk·M·N weights, while the depthwise separable version needs only Dk·Dk·M + M·N, a reduction by a factor of 1/N + 1/Dk². With 3×3 kernels and realistic channel counts, that works out to roughly 8 to 9 times fewer parameters and multiply-adds, and in practice the savings cost very little in output quality.
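As a sanity check on the counts for the running 5×5×3 example with 4 output channels, a small PyTorch snippet that counts the weights of both versions. The ratio here (about 36%) is larger than the typical MobileNet savings only because the channel counts are tiny:

```python
import torch.nn as nn

def n_params(m):
    """Count trainable parameters of a module."""
    return sum(p.numel() for p in m.parameters() if p.requires_grad)

standard = nn.Conv2d(3, 4, kernel_size=3, padding=1, bias=False)       # 3*3*3*4 = 108
separable = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=3, padding=1, groups=3, bias=False),   # 3*3*3   = 27
    nn.Conv2d(3, 4, kernel_size=1, bias=False),                        # 1*1*3*4 = 12
)

print(n_params(standard))   # 108
print(n_params(separable))  # 39 -> ratio 1/N + 1/Dk^2 = 1/4 + 1/9, about 0.36 here
```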
TOPIC 2: Faster R-CNN
Faster R-CNN is the model we want to put into our mobile environment. It has two main parts: a VGG-19 backbone and a Region Proposal Network (RPN). Faster R-CNN accepts an input image of any size, say P×Q×3, and normalizes the image scale before it enters the network: for example, the short edge can be capped at 600 pixels and the long edge at 1000 pixels, so that an overly large resolution does not create an excessive computational burden. Assume the resized image is M×N = 1000×600 (if the image is smaller than this, the remaining pixels are zero-filled, giving the image a black border). After the conv layers, which downsample by a factor of 16, the picture becomes (M/16)×(N/16), roughly 60×40 (1000/16 ≈ 60, 600/16 ≈ 40). The feature map is then 60×40×512: spatial size 60×40 with 512 channels.
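A small, hypothetical helper (not taken from any particular library) that mimics this rescaling rule and the 16× downsampling of the feature map:

```python
def rescale_shape(h, w, short_max=600, long_max=1000):
    """Scale so the short edge is at most short_max and the long edge at most long_max."""
    scale = min(short_max / min(h, w), long_max / max(h, w))
    return round(h * scale), round(w * scale)

h, w = rescale_shape(800, 1333)   # an example P*Q input
fh, fw = h // 16, w // 16         # the VGG conv layers downsample 16x
print((h, w), (fh, fw))           # feature map is roughly (h/16) x (w/16), with 512 channels
```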
What is the nature of the anchor box?
Its essence is the spatial pyramid pooling (SPP) idea in reverse. SPP resizes inputs of different sizes into an output of a fixed size; reversing that means starting from a fixed-size output and working backward to obtain inputs of different sizes.
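In practice the anchors are simply enumerated: at every feature-map cell you place k boxes of fixed scales and aspect ratios, mapped back to image coordinates. A rough sketch (the scales, ratios and stride follow the common VGG-based setup, but are assumptions here):

```python
import itertools

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate k = len(scales) * len(ratios) anchors (cx, cy, w, h) per
    feature-map cell, in image coordinates; ratio is height / width."""
    anchors = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
        for scale, ratio in itertools.product(scales, ratios):
            w = scale / (ratio ** 0.5)
            h = scale * (ratio ** 0.5)
            anchors.append((cx, cy, w, h))
    return anchors

print(len(make_anchors(60, 40)))  # 60 * 40 * 9 = 21600 anchors
```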
Main contributions of Faster R-CNN:
1) Region Proposal Network (RPN). The RPN implements the proposal step with a deep network, replacing the earlier Selective Search (SS) method and greatly improving detection speed.
First, shared convolutional layers extract features from the whole image, and the resulting feature maps are fed into the RPN, which generates the boxes to be detected (specifying the locations of the RoIs) and makes a first correction of each RoI's bounding box. After that comes the Fast R-CNN architecture: based on the RPN output, the RoI pooling layer selects the features corresponding to each RoI on the feature map and maps them to a fixed dimension. Finally, fully connected (FC) layers classify each box and perform a second correction of the target box. In particular, Faster R-CNN truly achieves end-to-end training.
For the generated anchors, the RPN has two jobs. The first is to judge whether each anchor is foreground or background, that is, whether it covers a target. The second is to make a first coordinate correction for the anchors that belong to the foreground. For the former, Faster R-CNN trains directly with a softmax loss, excluding anchors that cross the image boundary during training; for the latter, it trains with a smooth L1 loss.
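A minimal sketch of an RPN head in PyTorch, assuming 9 anchors per position and the 512-channel feature map from above; the classification branch would be trained with the softmax loss and the regression branch with smooth L1, as described:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding-window head: a 3x3 conv followed by two 1x1 branches,
    one for foreground/background scores and one for box regression."""
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)  # fg/bg per anchor
        self.reg = nn.Conv2d(512, num_anchors * 4, kernel_size=1)  # dx, dy, dw, dh

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

feat = torch.randn(1, 512, 40, 60)    # roughly the 60x40x512 feature map (N, C, H, W)
scores, deltas = RPNHead()(feat)
print(scores.shape, deltas.shape)     # [1, 18, 40, 60] and [1, 36, 40, 60]
```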
Final Effect
(Figures: the original image and the identification result.)
References:
[1]. https://arxiv.org/pdf/1506.01497.pdf
[2]. https://arxiv.org/pdf/1609.03605.pdf
[3]. https://arxiv.org/pdf/2008.08272.pdf
[4]. https://arxiv.org/pdf/1704.04861v1.pdf