Friday, January 14, 2005

Prototype

"My Cat" prototype

"This is my cat. I have a little cat, and her name is coco.
I love her so much.
Coco eats a lot.
Every time I put her food in her plate, it seems like she hasn't eaten for a month.
She is also very curious about a lot of things, especially butterflies.
When she sees a butterfly around, she chases it and never stops.
Sometimes she even jumps when chasing the butterfly."

What I wish to realize is a scenario like this: the user only needs to draw the shapes, and the motions are performed by the system automatically, as a result of commonsense-based recognition and generation of shape and motion.
Two of the main examples of how it might work are below (a minimal Python sketch of this pipeline follows the list):
  • "Every time I put her food in her plate"
    • putting a cat's food into her plate
      • commonsense inferencing - it might be adding something gradually
      • motion recognition - (adding something : the animation of appearance)
      • shape recognition - the food is put "into" the plate, but the plate could be full, so the food might look like it is "above" the plate
    • her (a cat's) food
      • commonsense inferencing - a cat's food might be a bunch of grains/pieces
      • shape recognition - (a bunch of : the look of repetition)
      • shape recognition - (grains : small circles)
    • her (a cat's) plate
      • commonsense inferencing - a cat's plate might be on the ground
      • commonsense inferencing - a plate is a container, similar to a bowl or a dish
      • shape recognition (the plate : the shape of a plate)
  • "Coco eats a lot", "It seems like she hasn't eaten for month"
    • commonsense inferencing - someone who doesn't eat for a month tends to be very hungry
    • commonsense inferencing - someone who eats a lot gets hungry easily
    • commonsense inferencing - someone who seems very hungry usually eats very fast and leaves little food behind
    • commonsense inferencing - when someone eats food, it goes into their mouth and does not appear anymore
    • commonsense inferencing - people need to get the food before they eat it
    • commonsense inferencing - sometimes one has to approach a thing before getting it
    • motion generation - (approaching : drawing near on the screen)
    • motion generation - (eat the food -> food disappearing on the screen)
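
To make the pipeline above concrete, here is a minimal Python sketch of how phrase fragments might flow through the three steps. Everything in it (the names, the table entries) is a hypothetical placeholder: a real system would query a commonsense knowledge base such as ConceptNet instead of hard-coded dictionaries.

# Hypothetical sketch: phrase fragment -> commonsense properties
# -> shape/motion primitives. The dicts stand in for the real
# inferencing and recognition components.
COMMONSENSE = {
    "a cat's food":  ["a bunch of grains"],
    "a cat's plate": ["on the ground", "is a container"],
    "put into":      ["adding something gradually"],
}
SHAPES = {
    "a bunch of grains": "repeated small circles",
    "is a container":    "a plate-like silhouette, open side up",
}
MOTIONS = {
    "adding something gradually": "circles appearing one by one above the plate",
}

def interpret(phrase):
    """Chain inferencing -> shape recognition -> motion generation."""
    for prop in COMMONSENSE.get(phrase, []):
        yield prop, SHAPES.get(prop), MOTIONS.get(prop)

for step in interpret("a cat's food"):
    print(step)   # ('a bunch of grains', 'repeated small circles', None)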
Still other examples of inferencing/recognition/generation include (a sketch of such motion primitives follows this list):
  • "sees a butterfly around" -> the butterfly might be flying in the air, might be staying on something -> flying means to move above
  • "jumps" -> go up and down
The main problems of using the current ShapeNet might be (a sketch of a possible workaround follows this list):
  1. Recognition of objects when they are connected - Because the shapes in ShapeNet are basically silhouettes, it is not trivial to recognize objects once they touch each other. For instance, the grains and the plate are connected, so matching against the whole merged silhouette would give the wrong result. To get correct matches, recognition has to run on each shape as soon as it is drawn, and be completed by the time the drawing is finished.
  2. Utilizing the timing of the spoken sentences and the drawings - A person's focus in speech and in drawing is typically on the same thing at the same moment, so I think the timing is quite important.
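
A possible workaround, sketched below under two assumptions of mine: each stroke is kept separate and timestamped, and the speech recognizer gives timestamped words. recognize_silhouette() is only a stand-in for the real ShapeNet matcher.

import time

strokes = []       # [(timestamp, points)]
recognized = []    # [(timestamp, label)]

def recognize_silhouette(points):
    """Stand-in for the real ShapeNet silhouette matcher."""
    return "unknown shape"

def on_stroke_finished(points):
    """Recognize each stroke the moment it is completed, before its
    silhouette merges with its neighbors (the grains vs. the plate)."""
    now = time.time()
    strokes.append((now, points))
    recognized.append((now, recognize_silhouette(points)))

def align(words, max_gap=2.0):
    """Pair each (timestamp, word) with the shape drawn closest in
    time, assuming the focus in speech and drawing coincides."""
    if not recognized:
        return []
    pairs = []
    for w_time, word in words:
        t, label = min(recognized, key=lambda r: abs(r[0] - w_time))
        if abs(t - w_time) <= max_gap:
            pairs.append((word, label))
    return pairs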


2 Comments:

At 12:35 AM, Blogger Drake said...

For ShapeNet's problem of recognizing objects when they are connected, is it possible to teach the computer? For example, how about circling the plate and the grains separately? For each circling, tell the computer what you want it to learn, and let the computer figure out the (spatial) relationship between the grains and the plate.
As humans, people know several relationships between grains and plates, so they can use a kind of pattern recognition to figure out what's happening in front of their eyes. Am I right?

 
At 4:35 AM, Blogger Edward Shen 沈育德 said...

The problem here is how to build a universal computational representation of the spatial relationships taught by users. I mean, what should the computer do with the circles? Can we interpret them without any form of knowledge base? If we didn't need any existing knowledge base, then this problem would be solved simply by introducing users' encirclings. However, I don't think computers are going to understand all the encirclings if they have no idea about spatial relationships.

So the way to do it, in my mind, is to use ConceptNet as a knowledge base (in which all knowledge is described in text), and try to map users' encirclings to the spatial descriptions in ConceptNet. For example, suppose there's a website where VR scenes are shown one after another, and users can describe what's happening (in a spatial sense) in this simulated world, perhaps by using both text descriptions and gestural hints. Is that what you were thinking of?
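
As a rough illustration of what that mapping might look like, here is a toy Python sketch. The relation rules, coordinates, and labels are all invented; a real system would match the derived relation against ConceptNet's textual spatial descriptions.

def bbox(points):
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def spatial_relation(a, b):
    """Turn two encircled regions into a textual spatial relation
    (screen coordinates, so y grows downward)."""
    ax0, ay0, ax1, ay1 = bbox(a)
    bx0, by0, bx1, by1 = bbox(b)
    if ax0 >= bx0 and ax1 <= bx1 and ay0 >= by0 and ay1 <= by1:
        return "inside"
    if ay1 <= by0:
        return "above"
    return "near"

# the user's two encirclings, given as point lists
grains = [(40, 40), (60, 40), (60, 55), (40, 55)]
plate  = [(20, 30), (90, 30), (90, 70), (20, 70)]
print(spatial_relation(grains, plate))   # -> "inside"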

 
