Sunday, September 8, 2013

A quickie on machine vision

Ok, so I have some fairly unfounded but strong feelings about how to approach machine vision, and how not to. Some of this might even be unpopular... but here goes anyway:

1. OpenCV? Nah. Everybody is using it, and everybody is getting the same mediocre results.

2. CMU-whatever? Nah. Last time I looked it was $400 for the hardware.

3. Step 1: Convert to HSV! Really? That's not very creative...

4. It takes two (or three!) cameras to do stereo vision. Wrong! (although it is handy...)

5. Machine vision is really hard because the bar is set so high by the example of the human vision system. Ummm... No. If the human vision system were so awesome we'd never misread anything, walk into things, or make mistakes as witnesses.

I actually think the only reason the human vision system is kinda nifty is the massive volume of data we discard.

That's based on the idea that the more data you throw away, the less you have to push through a processing pipeline, so the more time you can spend dwelling on what's left, and the better the end result is going to be.

Seems ok, right?

I'm also pretty sure the 'problem' of machine vision is often stated very poorly - probably because there is more than one 'problem' at hand. The underlying thing I think gets missed when defining the problem is that our vision is an answer-seeking system.

Imagine if you could hear the questions your brain had to ask to get your eyes (and head and body) to look where they do in a typical 5 minute walk down the street:

-Is that a "Don't Walk" light?
-Is that car going too fast to stop?
-What is barking, is it a friendly dog?
-Don't trip on that crack in the sidewalk!
-Is the other path clear? I don't want to walk under that tree full of birds...

As a participant in our thought processes, vision is pretty unique. Hearing, in contrast, is more temporally transient: once the sound is gone, it's gone. With vision we can choose which things to focus on.

I set out today to code a Processing sketch to reduce a moving visual scene to simpler parameters, so I could then feed an optical flow truth table (which turns out to be just a bunch of two- or three-bit ANDs), but that code isn't here yet.
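
To give a rough idea of what an AND-based motion test could look like, here is a minimal sketch in plain Java. It is only a guess, not the truth table mentioned above: the class name FlowBits, the method direction, and the idea of comparing one thresholded scanline against the previous one are all assumptions made for illustration.

```java
// Hypothetical sketch: one way a bitwise motion test could look.
// FlowBits and direction() are made-up names, not from the original sketch.
final class FlowBits {
    static final int LEFT = -1, NONE = 0, RIGHT = 1;

    // prev and cur are the same scanline from two consecutive frames, already
    // reduced to one bit per pixel (true = "feature present").
    // Returns the direction the feature at index i appears to have moved.
    static int direction(boolean[] prev, boolean[] cur, int i) {
        // Each test is a three-input AND: the feature was here, the
        // neighbour was empty, and the neighbour is now occupied.
        boolean right = i + 1 < cur.length && prev[i] && !prev[i + 1] && cur[i + 1];
        boolean left  = i - 1 >= 0         && prev[i] && !prev[i - 1] && cur[i - 1];
        if (right == left) return NONE; // no motion, or ambiguous
        return right ? RIGHT : LEFT;
    }
}
```

A feature that sits at index 1 in the previous frame and at index 2 in the current one gives RIGHT; if the two frames are identical the result is NONE.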

But to get started I needed a toolkit so I could modify an input stream and make some pretty quick and dirty adjustments to the image to see what else I might need to do to it.

The easiest way to start was simply using Processing, since that has a video library built in and dead-simple access to pixel-level data. I also threw in some on-screen controls for the threshold values using the controlP5 library (also dead-simple). Oh, and for source material I recorded some 640x480 scenes in a local park, with the camera height fairly low to simulate the mast height of the rover.

Here is the result after an hour of coding; you might want to flip to full-screen mode for it:

And here is the code:

// vidproc demo 0.1 - 2013 - Noel Dodd (
// License terms: Creative Commons (see
// and GPL 2 (see
import processing.video.*;  // provides Movie
import controlP5.*;

Movie vid;
ControlP5 cp5;
PImage source;       // Latest video frame, unmodified
PImage destination;  // Thresholded output image
int cwidth = 640;
int cheight = 480;
int numPix;
int b_threshold = 50;   // darker than this becomes black
int w_threshold = 220;  // brighter than this becomes white
int g_threshold = 50;   // less channel spread than this becomes grey
boolean falsecolor = false;

void setup() {
  size(640, 480);
  println("Start... ... ...");
  numPix = cwidth * cheight;
  cp5 = new ControlP5(this);
  // On-screen controls. The original layout isn't shown here, so the
  // positions and ranges are a guess. Each control is named after the
  // variable it adjusts, which is how controlP5 links the two.
  cp5.addSlider("b_threshold").setRange(0, 255).setValue(50).setPosition(10, 10);
  cp5.addSlider("w_threshold").setRange(0, 255).setValue(220).setPosition(10, 30);
  cp5.addSlider("g_threshold").setRange(0, 255).setValue(50).setPosition(10, 50);
  cp5.addToggle("falsecolor").setPosition(10, 70);
  // source = loadImage("source.jpg");
  vid = new Movie(this, "vid.MOV");
  vid.loop();
  // The source and destination images are created blank, at the video size.
  source = createImage(cwidth, cheight, RGB);
  destination = createImage(cwidth, cheight, RGB);
}

void movieEvent(Movie m) {
  m.read();  // fetch the new frame before touching its pixels
  for (int x = 0; x < cwidth; x++) {
    for (int y = 0; y < cheight; y++) {
      // copy the pixel to give an unthresholded view to overwrite
      source.set(x, y, m.get(x, y));
    }
  }
}

void draw() {
  source.loadPixels();
  destination.loadPixels();
  for (int x = 0; x < source.width; x++) {
    for (int y = 0; y < source.height; y++) {
      int loc = x + y*source.width;
      destination.pixels[loc] = source.pixels[loc];
      // Should it be grey?
      float red, green, blue;
      float rg, rb, bg;
      red = red(destination.pixels[loc]);
      green = green(destination.pixels[loc]);
      blue = blue(destination.pixels[loc]);
      rg = abs(red - green);
      rb = abs(red - blue);
      bg = abs(blue - green);
      if ((rg+rb+bg) < g_threshold) {
        destination.pixels[loc] = color(127);  // grey
      }
      // Test the brightness against the thresholds
      if (brightness(source.pixels[loc]) < b_threshold) {
        destination.pixels[loc] = color(0);    // black
      }
      if (brightness(source.pixels[loc]) > w_threshold) {
        destination.pixels[loc] = color(255);  // white
      }
      // we can also test on colors
      if (falsecolor) {
        // Re-read the channels so grey, black and white pixels are left alone
        red = red(destination.pixels[loc]);
        green = green(destination.pixels[loc]);
        blue = blue(destination.pixels[loc]);
        // check blue
        if ((blue > red) && (blue > green)) {
          destination.pixels[loc] = color(0, 0, 255);
        }
        // check red
        if ((red > blue) && (red > green)) {
          destination.pixels[loc] = color(255, 0, 0);
        }
        // check green
        if ((green > red) && (green > blue)) {
          destination.pixels[loc] = color(0, 255, 0);
        }
      }
    }
  }
  // We changed the pixels in destination
  destination.updatePixels();
  // Display the destination
  image(destination, 0, 0);
}
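
If you want to poke at the thresholding rules without a video file or the Processing runtime, the per-pixel logic can be pulled out into a small pure function. This is a sketch in plain Java, not part of the sketch above: the names PixelRules and classify are made up, and it assumes brightness is simply the largest of the three channels (which is what HSB brightness works out to on a 0-255 scale).

```java
// Plain-Java restatement of the per-pixel rules used in draw().
// PixelRules and classify are hypothetical names, not part of the sketch.
final class PixelRules {
    static final int GREY  = 0xFF7F7F7F; // same value as color(127)
    static final int BLACK = 0xFF000000; // color(0)
    static final int WHITE = 0xFFFFFFFF; // color(255)

    // Assumption: brightness is the largest of the three channels.
    static int brightness(int r, int g, int b) {
        return Math.max(r, Math.max(g, b));
    }

    static int classify(int argb, int bThreshold, int wThreshold,
                        int gThreshold, boolean falseColor) {
        int r = (argb >> 16) & 0xFF;
        int g = (argb >> 8) & 0xFF;
        int b = argb & 0xFF;
        int out = argb | 0xFF000000; // start from the opaque source pixel

        // Nearly equal channels: call it grey.
        if (Math.abs(r - g) + Math.abs(r - b) + Math.abs(b - g) < gThreshold) {
            out = GREY;
        }
        // The brightness tests look at the source pixel, so they can override grey.
        int bright = brightness(r, g, b);
        if (bright < bThreshold) out = BLACK;
        if (bright > wThreshold) out = WHITE;

        if (falseColor) {
            // Re-read the channels from the output so grey, black and white stay put.
            int r2 = (out >> 16) & 0xFF, g2 = (out >> 8) & 0xFF, b2 = out & 0xFF;
            if (b2 > r2 && b2 > g2) out = 0xFF0000FF; // mostly blue
            if (r2 > b2 && r2 > g2) out = 0xFFFF0000; // mostly red
            if (g2 > r2 && g2 > b2) out = 0xFF00FF00; // mostly green
        }
        return out;
    }
}
```

With the default thresholds (50, 220, 50), a dark pixel like (10, 10, 10) comes out black, a mid-grey (128, 128, 128) comes out grey, and a bluish (60, 120, 180) is left alone unless false color is on, in which case it becomes pure blue.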
