Commonsense-Based Cross-Media Information Retrieval
I’ve been thinking about commonsense computing and information retrieval.
And here’s my idea.
Today's content analysis work mostly depends on domain knowledge, so each technique is basically limited to its own domain.
It’s nearly impossible to find a general way to make computers understand what’s going on in any medium, whether images, videos, or 3D models. I believe the reason is that we’re relying only on low-level features, which by nature can't give us high-level analytical results unless we introduce related high-level information.
Yesterday, when I stopped my motorcycle in front of a restaurant and saw a bicycle beside me, “a person riding the bicycle” came to mind as soon as I looked at the bike’s shape. I immediately realized that the shape of a thing in our lives is highly related to its function, which is something long missing from modern information retrieval techniques. Today, for example, 3D models are recognized through totally different approaches from videos. That's a serious problem, because the function of a thing actually helps us, or at least me, in the recognition process. I wouldn’t be able to recognize many things at first sight without thinking of their function. If the recognition/classification of all kinds of media could be driven by the inherent CONCEPT of what a medium shows instead of its SHAPE, COLOR, and other low-level features, then we might get somewhere very different.




3 Comments:
More explicitly, I think the problem can be formulated as follows:
Suppose we already have a large number of concepts about a bicycle, such as "A bicycle is something with two wheels for people to ride." If we can bind these concepts to media files, in other words, point out in photos or 3D models which parts are the wheels and what shape they have, then computers would be able to use common sense to understand what a bicycle really means to us humans, and lots of applications could be built to benefit people.
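To make the binding idea a bit more concrete, here is a rough Python sketch of how I imagine it. Everything in it is my own assumption for illustration: the class names (Concept, Region, Binding), the file names, and the shape labels are hypothetical, not part of any existing commonsense system.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A commonsense concept, e.g. 'a bicycle has two wheels for people to ride'."""
    name: str
    parts: dict      # part name -> how many of that part we expect
    functions: list  # what people do with the thing


@dataclass
class Region:
    """A marked-up part of one media file (hypothetical annotation format)."""
    media_id: str  # e.g. a photo or 3D model file name
    part: str      # which part of the concept this region shows
    shape: str     # coarse low-level description of the region


@dataclass
class Binding:
    """Ties one concept to the regions found in media files."""
    concept: Concept
    regions: list = field(default_factory=list)

    def missing_parts(self):
        """Return the parts the concept expects but no region accounts for yet."""
        found = {}
        for region in self.regions:
            found[region.part] = found.get(region.part, 0) + 1
        return {
            part: expected - found.get(part, 0)
            for part, expected in self.concept.parts.items()
            if found.get(part, 0) < expected
        }


bicycle = Concept(name="bicycle", parts={"wheel": 2}, functions=["ride"])
photo = Binding(bicycle, [Region("photo_001.jpg", "wheel", "circle")])
print(photo.missing_parts())
```

With only one wheel marked in the photo, the binding reports that one wheel is still unaccounted for. That is the kind of gap where the high-level concept could tell the low-level analysis what to look for next.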
From one of the professors in our lab, I heard that this is a problem called "knowledge representation." People have been trying to solve it with a bottom-up approach. That is, they assert that if we wish to narrow the SEMANTIC GAP between text-represented concepts and visual multimedia, we have to push our content analysis techniques even further. However, I don't think this has to be true. I've been thinking that maybe we could approach the solution from both sides: high-level concepts and low-level features. How to integrate these two types of information, however, is still unclear to me.
The knowledge representation stuff made me think about something even more interesting and maybe more narrowed-down.
When we see a ball in a box, we know it's IN the box.
We also recognize other spatial relationships, such as on, above, below, and beside, at first sight in many situations.
When I see a button, I know it can be pressed. When I see a knob on a door, I know it can be turned. By experience we learn a lot of spatial commonsense, which helps us a great deal in our daily lives. If computers could have all the SPATIAL COMMONSENSE that we do, in other words, if at first sight computers could know how the things they see may be manipulated, I believe they would be able to benefit us much more. I think the Roboverse Domain http://csc.media.mit.edu/RoboverseHome.htm is worth referencing.
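As a toy illustration of what I mean by spatial commonsense, here is a small Python sketch. It assumes objects have already been detected and given 2D bounding boxes with the y axis pointing up, which is a big assumption in itself. The Box class, the relation function, and the AFFORDANCES table are all names I made up for this example, and the rules are deliberately crude.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned 2D bounding box; y grows upward (an assumption)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def relation(a, b):
    """Return a coarse spatial relation of box a with respect to box b."""
    inside = (a.x_min >= b.x_min and a.x_max <= b.x_max
              and a.y_min >= b.y_min and a.y_max <= b.y_max)
    if inside:
        return "in"
    # Do the two boxes share any horizontal extent?
    overlap_x = a.x_min < b.x_max and b.x_min < a.x_max
    if not overlap_x:
        return "beside"
    if a.y_min >= b.y_max:
        return "above"
    if a.y_max <= b.y_min:
        return "below"
    return "overlapping"


# Hand-written table of how things can be manipulated (hypothetical entries).
AFFORDANCES = {
    "button": ["press"],
    "door knob": ["turn"],
}


def how_to_manipulate(label):
    """Look up the known manipulations for an object label, if any."""
    return AFFORDANCES.get(label, [])


ball = Box(2, 2, 3, 3)
box = Box(0, 0, 10, 10)
print(relation(ball, box))
print(how_to_manipulate("button"))
```

A real system would have to learn such relations and affordances from experience instead of reading them from a hand-written table, but the sketch shows the kind of answer I would like a computer to give at first sight.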