A Computer Vision and Machine learning blog

My photo
I'm a Computer Vision and Machine learning developer.

Friday, December 5, 2014

Image Fisher Vector In Python

Although the state of the art in image classification is deep learning,
Bag of words approaches still perform well on many image datasets.

Fisher vectors is the state of the art in that approach, allowing training more discriminative classifiers with a lower vocabulary size.

I wrote a simple Python implementation for calculating fisher vectors and using it to classify image categories:

You might want to look here for a derivation: 

The main improvement here is extracting a richer feature vector from images compared to bag of words.

In Bag of Words, for each local feature we find the closest word in the vocabulary, and add +1 to the histogram of the vocabulary in the input image.
But we could have sampled more data: 
- How far is each feature from its closest vocabulary word.
- How far is the feature from other vocabulary words.
- The distribution of the vocabulary words themselves. 

Brief outline of fisher vectors

Vocabulary learning with GMM:
- Sample many features from input images.
- Fit a Gaussian Mixture Model on those features. 
- The result is a vocabulary of dominant features in the image, and their distributions.

Image representation based on the vocabulary:
- Measure the expectation of the difference and distance of the image features, from each Gaussian distrubution, using the likelihood a feature belongs to certain gaussian. 
- Concatenate the resulting vector for each vocabulary word, into one large descriptor vector.

There is also a normalization step that I will skip here but is a part of the implementation, that is important if the features are fed into a classifier like SVM that needs normalized inputs.

This is a generalization of bag of words. If you set the likelihood of a feature to a vocabulary word to be 1 to it's closest word and 0 to the rest, 
and if you redefine the distance to be a constant "1", you get the original bag of words model.

Trying out the implementation:
python fisher.py <path_to_image_directory> <vocabulary size>
The image directory should contain two sub folders, one for the images of each class.

It currently just trains a model and then classifies the images.
The input images definitely need to be partitioned into training and validation parts.

One more thing:
Fisher vectors are successfully used in image recognition, check out:
In their paper they extract features densely from a grid,  reduce the dimensionality with PCA, and augment the features with their spacial location.

Monday, May 5, 2014

Bag Of Visual Words model for image classification

I wanted to play around with Bag Of Words for visual classification, so I coded a Matlab implementation that uses VLFEAT for the features and clustering.
It was tested on classifying Mac/Windows desktop screenshots.

For a small testing data set (about 50 images for each category), the best vocabulary size was about 80.
It scored 97% accuracy on the training set, and 85% accuracy on the cross validation set,
so the over-fitting can be improved a bit more.


1. Collect a data set of examples. I used a python script to download images from Google.
2. Partition the data set into a training set, and a cross validation set (80% - 20%).
3. Find key points in each image, using SIFT.
4. Take a patch around each key point, and calculate it's Histogram of Oriented Gradients (HoG). Gather all these features.
5. Build a visual vocabulary by finding representatives of the gathered features (quantization).
This done by k-means clustering.
6. Find the distribution of the vocabulary in each image in the training set.
This is done by a histogram with a bin for each vocabulary word.
The histogram values can be either hard values, or soft values.
Hard values means that for each descriptor of a key point patch in an image, we add 1 to the bin of the vocabulary word closest to it in absolute square value.
Soft values means that each patch votes to all histogram bins, but give a higher weight to bin representing words that are similar to that patch. Take a look here.
7. Train an SVM on the resulting histograms (each histogram is a feature vector, with a label).
8. Test the classifier on the cross validation set.
9. If results are not satisfactory, repeat 5 for a different vocabulary size and a different SVM parameters.

Visualization of the vocabulary learned by the clustering

Source Code


Wednesday, April 16, 2014

Refining the Hough Transform with CAMSHIFT

The Circular Hough Transform result is often not very accurate due to noise\details\occlusions.
Typical ways of dealing with this are:
1. Hand tuning the Hough Transform parameters.
2. Pre-processing the image aggressively before the transform is applied.

One trick I use to fix the circles positions is an iterative search in windows around the initial circles, I hope to have a future post about this here.

But now I will share a much simpler strategy that works well in some cases: Use CAMSHIFT to track the circular object in a window around the initial circles positions.

The idea is that the initial circle center position area holds information about how the circular object looks like, for example its color distribution. This is complementary to the Hough transform that uses only spatial information (the binary votes in the Hough space).

The steps:

  1. Find circles with the Circular Hough Transform.
  2. Find the histogram inside a small box around each circle. In a more general case we can use any kind of features we like, like texture features or something, but here we will stick with color features.
  3. For each pixel, find the probability it belongs to the circular object (back-projection).
  4. Optional: apply some strategy to fill holes in the back-projection image caused by occlusions. We can use morphology operations like dilating for example.
  5. Use CAMSHIFT to track the the circular object starting in a window around the initial circle position.

Conveniently for us, CAMSHIFT is included in OpenCV!

I encourage you to read the original CAMSHIFT paper to learn more about it:
Computer Vision Face Tracking For Use in a Perceptual User
Gary R. Bradski, Microcomputer Research Lab, Santa Clara, CA, Intel Corporation
Link to the paper

Code (C++, using OpenCV):
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/core/core.hpp"
#include "opencv2/video/tracking.hpp"
#include <tuple>

using namespace cv;
using namespace std;

//This is used to obtain a window inside the image,
//and cut it around the image borders.
Rect GetROI(Size const imageSize, Point const center, int windowSize)
    Point topLeft (center.x - windowSize / 2, center.y - windowSize / 2);

     if (topLeft.x + windowSize > imageSize.width || topLeft.y + windowSize >
         windowSize = 
         min(imageSize.width - topLeft.x, imageSize.height - topLeft.y);
    return Rect(topLeft.x, topLeft.y, windowSize, windowSize);

//This is used to find pixels that likely belong to the circular object
//we wish to track.
Mat HistogramBackProjectionForTracking(Mat const& image, Rect const window)
     const int sizes[] = {256,256,256};
     float rRange[] = {0,255};
     float gRange[] = {0,255};
     float bRange[] = {0,255};
     const float *ranges[] = {rRange,gRange,bRange};
     const int channels[] = {0, 1, 2};

     Mat roi = image(window);

     Mat hist;
     if (image.channels() == 3)
      calcHist(&roi, 1, channels, Mat(), hist, 3, sizes, ranges);
      calcHist(&roi, 1, &channels[0], Mat(), hist, 1, &sizes[0], &ranges[0]);

     Mat backproj;
     calcBackProject(&image, 1, channels, hist, backproj, ranges);
     return backproj;

//Return a new circle by using CAMSHIFT to track the object around the initial circle.
tuple<Point, int> HoughShift(Point const center, int const radius, Mat const& image)
    Mat backproj = HistogramBackProjectionForTracking(image, 
    GetROI(image.size(),center, radius));

     //Fill holes:
     cv::dilate(backproj, backproj, cv::Mat(), cv::Point(-1,-1));
     cv::dilate(backproj, backproj, cv::Mat(), cv::Point(-1,-1));

    const int windowTrackingSize = 4 * radius;
    RotatedRect track = CamShift(backproj, GetROI(image.size(), center,
    TermCriteria( CV_TERMCRIT_EPS | CV_TERMCRIT_ITER, 10, 1 ));

    return make_tuple(track.center, (track.size.width + track.size.height )/ 4);

int main(int argc, char** argv)
     Mat image = cv::imread("image.jpg");

     Mat before, after; image.copyTo(before); image.copyTo(after);
     Mat gray; cv::cvtColor(image, gray, CV_BGR2GRAY);

     std::vector<cv::Vec3f> circles;
     HoughCircles(gray, circles, CV_HOUGH_GRADIENT, 2, gray.cols / 3, 20, 40,
     gray.cols / 20, gray.cols / 5);

     for (int i = 0; i < circles.size(); ++i)
         auto circle = HoughShift(Point(circles[i][0], circles[i][1]), 
         circles[i][2], image);

         circle(before, Point(circles[i][0], circles[i][1]), circles[i][2], 
         Scalar(128, 128, 30),  2);
         circle(after, get<0>(circle), get<1>(circle), Scalar(255, 0 , 0), 2);

     imshow("Initial Circles", before);
     imshow("Refined Circles", after);

     return 0;