Wednesday, January 12, 2005

Beyond Textual Concepts - Commonsense in Spatial Form

Scenario
Motivation
The Problem

As long as we can link the ontology between the text "throw the ball on the ground" and the throwing motion we perform in physical space, software agents will be able to benefit us proactively in the world we live in. More concretely, I think the problem can be framed as follows. Right now the nodes in ConceptNet are purely textual representations; if we can hook data about human motion, 3D object shapes, and other spatial information onto those nodes, then inferencing, learning, and other commonsense reasoning tasks will be able to proceed in different forms.
Similar to how commonsense was first brought into practice, three main problems need to be overcome:
  • How do we represent commonsense with spatial information (motion, 3D shape, etc.)?
  • How do we acquire such information? Or rather, how do we acquire such an ontology (between motion and text)?
  • How do we make use of such "commonsense in spatial form"?
Knowledge representation of what we do and how we move in the physical world

The first question is not easy to answer. I've surveyed some related literature, and it is clear that we can simply use the rotation angles of a set of joints to describe a person's motion. However, when it comes to manipulation of objects, the knowledge representation becomes much harder to deal with. For example, how do we describe the action "opening a Coke can"? Pulling the ring-pull involves the person's motion, the force he/she applies, and the deformation of the ring-pull and the can over time. If we can find a proper knowledge representation for such a situation, I think we can then hook it onto the nodes in the current ConceptNet, but we will probably still have to spend some time figuring that out.
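To make the shape of the problem concrete, here is a minimal sketch in Python of one possible representation I can imagine: a time-stamped sequence of joint angles, an applied force, and a coarse object state, all hooked onto a ConceptNet-style textual node. The class and field names are my own illustration, not an existing API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Frame:
    """One time step of a manipulation: body pose, applied force, object state."""
    time: float                      # seconds from the start of the action
    joint_angles: Dict[str, float]   # e.g. {"r_elbow": 1.2, "r_wrist": 0.3}, in radians
    grip_force: float                # force applied at the contact point, in Newtons
    object_state: Dict[str, float]   # coarse deformation, e.g. {"ring_pull_angle": 0.8}

@dataclass
class SpatialConcept:
    """A ConceptNet-style node with spatial data hooked onto it."""
    text: str                        # the textual concept, e.g. "open a coke can"
    frames: List[Frame] = field(default_factory=list)

# A toy instance of "opening a coke can": the ring-pull rotates as force is applied.
open_can = SpatialConcept(
    text="open a coke can",
    frames=[
        Frame(0.0, {"r_elbow": 1.5, "r_wrist": 0.0}, 2.0, {"ring_pull_angle": 0.0}),
        Frame(0.5, {"r_elbow": 1.3, "r_wrist": 0.4}, 8.0, {"ring_pull_angle": 0.6}),
        Frame(1.0, {"r_elbow": 1.1, "r_wrist": 0.7}, 3.0, {"ring_pull_angle": 1.2}),
    ],
)
```

A representation like this is still just a flat time series; whether it captures enough about contact and deformation to support real reasoning is exactly the open question here.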

Acquiring the ontology between text and motion/3D shapes

Second, I think we can refer to how researchers at MIT CSAIL teach computers the meanings of our motions and of our manipulations of certain objects. In Charles Kemp's project, what he sees and how he moves are recorded by the camera and sensors he wears, and with offline annotations the computers eventually "learn" what such spatial data means. What we need to do may not be exactly the same, but this project gives a basic idea of how we might acquire such ontologies.
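To illustrate the flavor of that acquisition step, here is a rough sketch (my own simplification, not Charles Kemp's actual pipeline) of pairing offline text annotations over time intervals with recorded sensor frames, so that each textual concept ends up with example motion segments.

```python
from typing import Dict, List, Tuple

# Recorded stream: (timestamp in seconds, joint angles in radians) per frame.
SensorFrame = Tuple[float, Dict[str, float]]

def segment_by_annotation(
    frames: List[SensorFrame],
    annotations: List[Tuple[float, float, str]],  # (start, end, text label)
) -> Dict[str, List[List[SensorFrame]]]:
    """Group recorded frames under the text label whose interval contains them."""
    examples: Dict[str, List[List[SensorFrame]]] = {}
    for start, end, label in annotations:
        segment = [f for f in frames if start <= f[0] <= end]
        examples.setdefault(label, []).append(segment)
    return examples

# Toy usage: two seconds of wrist motion, annotated afterwards as "open a door".
stream = [(t / 10.0, {"r_wrist": 0.1 * t}) for t in range(20)]
pairs = segment_by_annotation(stream, [(0.0, 1.0, "open a door")])
print(len(pairs["open a door"][0]))  # number of frames in the annotated segment
```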





Another problem that has to be overcome after we acquire all this information is: how should we define the similarity between two different motions, tasks, or objects? Put another way, how do we categorize the motions, tasks, or objects? I think the literature in computer graphics has more or less addressed this for motion and for object shapes, but I'm not sure about manipulation tasks. This is something to be explored too.
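As one concrete example of the kind of answer the graphics literature gives for motion, here is a minimal sketch of dynamic time warping over joint-angle sequences; the joint names and the per-frame distance are placeholders I chose for illustration.

```python
import math
from typing import Dict, List

Pose = Dict[str, float]  # joint name -> angle in radians

def pose_distance(a: Pose, b: Pose) -> float:
    """Euclidean distance between two poses over their shared joints."""
    joints = a.keys() & b.keys()
    return math.sqrt(sum((a[j] - b[j]) ** 2 for j in joints))

def dtw_distance(seq_a: List[Pose], seq_b: List[Pose]) -> float:
    """Dynamic time warping: total cost of the best temporal alignment."""
    n, m = len(seq_a), len(seq_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = pose_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Toy usage: a slow and a fast version of the same elbow bend still score as close.
slow = [{"r_elbow": 0.1 * t} for t in range(10)]
fast = [{"r_elbow": 0.2 * t} for t in range(5)]
print(dtw_distance(slow, fast))
```

Categorizing motions could then amount to clustering under such a distance; whether the same trick extends to manipulation tasks, where forces and object deformation also matter, is the open part.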

Making use of such "commonsense in spatial form"

Basically, I think reasoning, inferencing, and other computing tasks on this "commonsense in spatial form" are similar to what has been done for textual commonsense knowledge. But if we would like to solve more practical problems, the motion data we have will be applied in the video understanding domain rather than only to motion in the 3D world. That means a human's motion in a 2D video will need to be recognized and mapped onto motions in the commonsense database. Therefore, the lower-level technique of extracting human motion from videos will be important.
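Assuming that lower-level extraction step hands us a pose sequence, the mapping onto the database could be as simple as the nearest-neighbor sketch below; the database entries and the crude per-frame distance are my own placeholders (a real system would use something like the warping distance sketched earlier).

```python
from typing import Dict, List, Tuple

Pose = Dict[str, float]  # joint name -> angle, as recovered from video frames

def frame_sum(a: List[Pose], b: List[Pose]) -> float:
    """Crude stand-in distance: frame-by-frame shoulder-angle difference, no time warping."""
    return sum(abs(x.get("r_shoulder", 0.0) - y.get("r_shoulder", 0.0)) for x, y in zip(a, b))

def label_motion(query: List[Pose], database: List[Tuple[str, List[Pose]]], distance=frame_sum) -> str:
    """Return the concept whose stored motion is closest to the extracted one."""
    best_label, best_cost = "", float("inf")
    for label, motion in database:
        cost = distance(query, motion)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label

# Toy usage: a rising-arm query matches "throw a ball" rather than "sit down".
db = [
    ("throw a ball", [{"r_shoulder": 0.2 * t} for t in range(5)]),
    ("sit down", [{"r_shoulder": 0.0} for _ in range(5)]),
]
query = [{"r_shoulder": 0.25 * t} for t in range(4)]
print(label_motion(query, db))
```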

More Applications

What else can be achieved with this new form of commonsense? I think at least the following research topics may be pushed forward by spatial-form commonsense knowledge: video understanding, affective computing, 3D model retrieval, and cross-media information retrieval.
  1. Video Understanding - Content analysis researchers have been trying to make machines understand videos, or even still images, so that multimedia retrieval, categorization, and so on can shift from lower-level feature extraction and processing to semantic-level computing. However, they just can't get rid of the constraint that the video content has to be bound to a specific domain, e.g., sports (tennis, soccer, baseball, and so on), because computers need more information to process such video data. Conversation understanding was also a tough problem, because traditional natural language processing techniques reason about the passages without the untold knowledge shared by the speakers. Commonsense computing has solved this problem to some degree (see Push's paper), and I think using commonsense to help solve video understanding is simply an analogy.
  2. Affective Computing - As illustrated in the scenario, affective computing in our physical world definitely needs such a technique. Only with spatial-form commonsense will computers be able to read people's motions (or even facial expressions), and then they can act as someone caring and thoughtful.
  3. 3D Model Retrieval - I worked on a 3D model retrieval project in my first year in the graphics group at NTU. I realized that almost all techniques to date determine the similarity among 3D models using only the models' shapes. Many important things, such as the objects' functions, are missing from the descriptions of these objects. This is why people still can't find everything they think they're looking for, even though researchers have been working hard on this. If we could hook spatial information, including the 3D shapes of objects, onto the nodes in ConceptNet, then we could in turn use all the related information as descriptions of a 3D object and enhance the search engine. For example, when we look for a phone, the search results would include cell phones, phones we use indoors, public phones on the street, etc., even if they aren't alike in shape. Why? Because the motions of pressing the buttons and holding the ear to the phone are similar, those motions are linked with the term "making phone calls" in the current ConceptNet, and we already have the textual commonsense that "a phone is a tool for someone to make a call."
  4. Cross-Media Information Retrieval - Actually, the idea that I want to push is this sentence: "A concept, substantial or intangible, is formed from multiple elements - what we see, what we hear, what we feel, and what it means to us." Once we can describe things with all these elements together, I think the boundaries among text, motion, image, video, audio - all media - will become much blurrier. In the past, the meanings of concepts were hard to form, but now I think commonsense computing has stepped forward toward this end. If we can hook all related media descriptions onto the nodes of ConceptNet, I think we will eventually realize cross-media information retrieval in practice. That is, I could find everything about a car, be it video, audio, or a text document, by simply typing the three letters - C, A, R. (A rough sketch of such a concept-keyed lookup follows this list.)
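Here is that sketch: a toy concept graph whose nodes carry attached media, so a single text query can collect 3D models, motion clips, and videos by walking a couple of links. The relations, file names, and graph contents are purely illustrative, not the actual ConceptNet data or API.

```python
from typing import Dict, List, Set, Tuple

# A toy concept graph: node -> list of (relation, neighbor).
graph: Dict[str, List[Tuple[str, str]]] = {
    "car": [("UsedFor", "drive"), ("HasA", "wheel")],
    "drive": [("MotionOf", "turn steering wheel")],
}

# Media hooked onto nodes: any node may carry 3D models, motion clips, videos, audio.
media: Dict[str, List[str]] = {
    "car": ["car_sedan.obj", "car_commercial.mpg"],
    "turn steering wheel": ["steering_motion.bvh"],
    "wheel": ["wheel.obj"],
}

def cross_media_search(query: str, depth: int = 2) -> List[str]:
    """Collect media hooked onto the query node and onto nodes within `depth` links of it."""
    results: List[str] = []
    seen: Set[str] = set()
    frontier = [(query, 0)]
    while frontier:
        node, d = frontier.pop()
        if node in seen or d > depth:
            continue
        seen.add(node)
        results.extend(media.get(node, []))
        for _, neighbor in graph.get(node, []):
            frontier.append((neighbor, d + 1))
    return results

print(cross_media_search("car"))  # 3D models, a video, and a motion clip, from one text query
```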

1 Comments:

At 1:49 PM, Anonymous Anonymous said...

I'm working on developing commonsense in 3D form in my nelements translator project

http://translator.nelements.org

 
