For a project I've been working on recently, I needed a method that would identify regions of interest (ROIs) in a depth camera image that were likely to contain objects, without knowing anything about the objects beforehand. I searched for a ROS library that would handle this kind of request out of the box, and while I found some pre-existing options, ORK and REIN are centered around object recognition and require pre-training, and the Tabletop Object Detector segments a PointCloud. I decided to write my own detector, and over the last couple of days had fun developing it using the new scikit-image library. Although a simple prototype, it works surprisingly well for low-entropy environments, and the performance is limited primarily by the scikit-learn clustering algorithm rather than the image processing.
Scikit-image is a new library being developed as a domain-specific companion to the popular scikit-learn, under the same Modified BSD license. While at the time of writing it is still in the early phases of development, it is easier to use than the Python bindings to OpenCV and comes with a ton of cool features, with more on the way. It was an excellent tool for prototyping, and I'm excited to see it continue to develop.
For an overview of state-of-the-art methods using just RGB, see Hosang et al. EdgeBoxes is the best-performing method from that paper according to the authors' criteria, but unfortunately I've only found a Matlab implementation from the authors. BING is also worth looking into if you have a real need for speed; it exists as a simple C++ file.
Finding Good Blobs
The key to ROI detection is finding good blobs inside an RGB-D image. I tried a couple of the scikit-image threshold methods, but the one that worked best for me was Otsu's method, which produced rich, dense blobs. Otsu's method converts the RGB image to gray-scale and computes a histogram of intensities. It then chooses the threshold value that splits the image into two classes while minimizing the intra-class variance. It is very fast, but doesn't produce very meaningful separations when there is a high degree of variance in the image.
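The thresholding step can be sketched with scikit-image as below (in current releases the module is `skimage.filters`; early versions used the singular `skimage.filter`). The random image is just a stand-in for a real frame:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu

rgb = np.random.rand(120, 160, 3)  # stand-in for a real RGB frame
gray = rgb2gray(rgb)               # RGB -> gray-scale intensities in [0, 1]
t = threshold_otsu(gray)           # split that minimizes intra-class variance
binary = gray > t                  # candidate blob mask
```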
As shown in the images above, Otsu's method works best when there is high contrast between the background and the foreground. We can determine which side of the threshold is background and which is foreground by picking the side with the smaller total number of pixels, since we assume interesting objects are sparse in the picture. While Otsu's method removed the floor and the walls, the true foreground and background in my picture are not well separated, since those parts of the background are high entropy (disordered). Unfortunately, I do not believe there is much you can do about this with the raw RGB image without resorting to machine learning. However, because I have the depth channel from my Xtion sensor, I can mask out the distant parts of the image that survive Otsu's method using the depth image and a threshold. This improves the blob detection significantly.
As a final step, I use the remove_small_objects method from scikit-image to remove small spurious detections.
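A minimal sketch of that cleanup step, on a toy mask:

```python
import numpy as np
from skimage.morphology import remove_small_objects

mask = np.zeros((50, 50), dtype=bool)
mask[10:30, 10:30] = True   # one real 400-pixel blob
mask[40, 40] = True         # a spurious single-pixel detection

# Drop connected components smaller than min_size pixels
clean = remove_small_objects(mask, min_size=64)
```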
Okay great, but what depth threshold should we use? Here again we can apply Otsu's method, this time to the depth image, to determine the optimal threshold separating foreground from background, and simply cap that threshold at your_personal_preference meters. This helps ensure that blob detection remains good even when the background is not very distant from the ROIs.
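A sketch of that adaptive cap, assuming depths in meters; the 4-meter limit here is just an illustrative stand-in for the personal-preference value:

```python
import numpy as np
from skimage.filters import threshold_otsu

MAX_DEPTH = 4.0                          # illustrative cap, in meters

depth = np.random.rand(120, 160) * 8.0   # stand-in depth image, 0-8 m
t_depth = min(threshold_otsu(depth), MAX_DEPTH)  # Otsu split, capped
near_mask = depth < t_depth              # keep only the nearer pixels
```

`near_mask` would then be ANDed with the Otsu blob mask from the RGB image.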
The average time for all image processing described above using scikit-image was 0.0074 sec.
Now that I have good, separated blobs, I need to group them into distinct clusters. Fortunately, scikit-learn gives us a large number of options to choose from. Since one cannot know the number of clusters beforehand for this kind of problem, and the number will change over time, we cannot use methods that take the number of clusters as a parameter. Of the remaining choices, I was already familiar with Mean Shift and DBSCAN, so I only looked at those. For features, I used (x, y, depth, gray-scale), with depth and gray-scale intensity being optional.
While I didn't conduct an exhaustive study, my anecdotal finding is that Mean Shift performs better than DBSCAN when using only (x, y) as features. As I added more features, Mean Shift's processing time scaled well, but its results grew worse and increasingly unstable from image sample to image sample. For DBSCAN, using all the features resulted in good, stable cluster assignments, but the processing time grew larger than Mean Shift's.
From right to left: RGB Image, pre-processing data, Mean Shift clustering, DBSCAN clustering
Limiting the blob detection data points to a sample size of 3,000 resulted in a processing time of around 0.13 seconds for both DBSCAN and Mean Shift using (x, y, depth). I am unsure if switching to a direct C++ version of DBSCAN will result in faster performance.
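The clustering step might look like the following sketch, using scikit-learn's DBSCAN on (x, y, depth) features with the 3,000-point cap mentioned above; the synthetic blobs and the `eps`/`min_samples` values are my own illustrative choices, not the post's actual settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic foreground mask with two well-separated blobs
mask = np.zeros((120, 160), dtype=bool)
mask[20:60, 20:60] = True
mask[70:110, 90:130] = True
depth = np.ones((120, 160))          # stand-in depth channel

ys, xs = np.nonzero(mask)
feats = np.column_stack([xs, ys, depth[ys, xs]]).astype(float)

# Cap the sample size at 3,000 points to bound clustering time
if len(feats) > 3000:
    feats = feats[rng.choice(len(feats), 3000, replace=False)]

labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(feats)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
```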
While the cluster assignments provide a separated mask of the image, a lot of useful information about the detected region is thrown away by the thresholding process. Therefore, I feed the (x, y) data points for each cluster assignment into a GMM to fit a mean and covariance matrix for the cluster. Since I am only fitting one model to each cluster independently, this is very efficient, and I can use the fitted parameters to obtain regions that capture more of the object than the mask alone. Because the clustering algorithm can be overly aggressive, I merge two independently detected regions if they share a large amount of common area relative to their own size. This prevents objects from fragmenting into two separate but heavily overlapping regions. The downside of such a simplistic approach, however, is that it prevents two objects from getting too close to one another, even when the clustering algorithm has correctly separated them.
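The per-cluster fit and the overlap-based merge test might be sketched as follows. The point sets, the single-component `GaussianMixture` (which reduces to taking the sample mean and covariance), the ±2-sigma boxes, and the 0.5 merge threshold are all my own illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Hypothetical (x, y) points for two cluster assignments
clusters = {
    0: rng.normal([30.0, 30.0], 5.0, size=(200, 2)),
    1: rng.normal([90.0, 80.0], 5.0, size=(200, 2)),
}

# One single-component GMM per cluster: just a mean and covariance
regions = {}
for label, pts in clusters.items():
    gmm = GaussianMixture(n_components=1).fit(pts)
    regions[label] = (gmm.means_[0], gmm.covariances_[0])

def bbox(mean, cov, k=2.0):
    """Axis-aligned box spanning +/- k standard deviations."""
    sd = np.sqrt(np.diag(cov))
    return mean - k * sd, mean + k * sd

def overlap_ratio(b1, b2):
    """Intersection area relative to the smaller box."""
    lo = np.maximum(b1[0], b2[0])
    hi = np.minimum(b1[1], b2[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    return inter / min(np.prod(b1[1] - b1[0]), np.prod(b2[1] - b2[0]))

# Merge two regions when they overlap heavily relative to their own size
r = overlap_ratio(bbox(*regions[0]), bbox(*regions[1]))
merge = r > 0.5
```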
The following are some sample results taken with an Asus Xtion sensor at my desktop. I am using (x, y, depth) features, DBSCAN as the clustering algorithm, and a max depth threshold of 4 meters.
The main benefits of this method are:
- No training, generic region of interest detection.
- Simple to understand behavior.
- Adaptive choice of depth threshold.
The main drawbacks are:
1. Bounding boxes are a crude way to represent the detected regions.
2. It will fail in high-entropy environments.
You can find my python file here.
- http://wg-perception.github.io/object_recognition_core/ (ORK)
- http://wiki.ros.org/rein (REIN)
- http://wiki.ros.org/tabletop_object_detector
- http://en.wikipedia.org/wiki/Otsu's_method
- http://scikit-image.org/
- Cheng, Yizong (August 1995). "Mean Shift, Mode Seeking, and Clustering". IEEE Transactions on Pattern Analysis and Machine Intelligence 17 (8): 790–799.
- Ester, M., H.-P. Kriegel, J. Sander, and X. Xu (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, AAAI Press, pp. 226–231.