Wednesday, January 19, 2005

Processing - An Interesting Design Tool

http://processing.org

It's kinda fun


Sunday, January 16, 2005

Commonsense-Based End-User-Programming

From the prototype we see the potential of commonsense computing for end-user-programming.

The prototype is an example of creating 2D sequential animation from speech, but we could actually build autonomous character animation in 2D or 3D, and even teach the computer to perform particular tasks such as web browsing or document editing in a PROGRAMMING fashion. This new kind of end-user-programming, different from all the traditional work, is expected to be both more general and easier to use, since it is based on commonsense computing. It would be more general because it gets rid of the major problem of traditional end-user-programming systems - that they don't apply to general cases and end up feeling like toys. And it is going to be easier to use because the programming language will be exactly what we use every day - our natural language.

Of course I'm not saying that commonsense technology is ready for this scenario right now. But I think incorporating shape and motion into ConceptNet should help with understanding natural language in end-user-programming tasks, so it should not be too early to start working on these topics.

Thinking about the end-user-programming problem, an interesting question came to mind. Suppose I want to teach my computer to find all the references of a paper from Google, and I want all the files saved in a particular folder, with their names set to the respective titles. The question is, should we build a sense of "self" into computers, and tell them "you are capable of manipulating all the things within yourself" (e.g. save, delete, and move files as you wish)?
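Just to make the question concrete, here is a minimal Python sketch (not anything the prototype actually contains) of what that "sense of self" might look like: the computer's own file-manipulation abilities are registered as named capabilities, and a natural-language request - already reduced to a plan by some commonsense/NL front end, which is faked here - is mapped onto them. Names like save_reference and SELF_CAPABILITIES are made up for illustration.

```python
# A toy sketch of a "self-aware" file-manipulating agent. The NL/commonsense
# front end is faked; only the capability-dispatch part is shown.
from pathlib import Path

def save_reference(folder: Path, title: str, content: bytes) -> Path:
    """Save one downloaded reference, using its title as the file name."""
    folder.mkdir(parents=True, exist_ok=True)
    safe_name = "".join(c if c.isalnum() or c in " -_" else "_" for c in title)
    target = folder / f"{safe_name}.pdf"
    target.write_bytes(content)
    return target

# "You are capable of manipulating all the things within yourself."
SELF_CAPABILITIES = {
    "save": save_reference,
    "delete": lambda folder, title, _content=None: (folder / f"{title}.pdf").unlink(),
}

# Pretend the commonsense/NL front end reduced the user's request to this plan:
plan = {"action": "save", "folder": Path("papers/references"),
        "items": [("Some Cited Paper", b"%PDF- dummy bytes")]}

for title, content in plan["items"]:
    SELF_CAPABILITIES[plan["action"]](plan["folder"], title, content)
```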

Another question is, in order to collect more commonsense, can a computer, by itself, ask "why" questions, make a guess, and refine its own answer by finding cues in all the information it gets? If it could, then we might come up with a more efficient way of collecting commonsense knowledge from people's everyday lives.



Friday, January 14, 2005

Prototype

"My Cat" prototype

"This is my cat. I have a little cat, and her name is coco.
I love her so much.
Coco eats a lot.
Every time I put her food in her plate, it seems like she hasn't eaten for a month.
She is also very curious about a lot of things, especially butterflies.
When she sees a butterfly around, she chases it and never stops.
Sometimes she even jumps when chasing the butterfly."

What I wish to realize is such a scenario: the user only needs to draw the shapes, and the motions are performed by the system automatically as a result of commonsense-based recognition and generation of shape and motion.
Two of the main examples of how it might work are below (a toy code sketch of this inference chain follows the examples):
  • "Every time I put her food in her plate"
    • putting a cat's food into her plate
      • commonsense inferencing - it might be adding something gradually
      • motion recognition - (adding something : the animation of appearance)
      • shape recognition - it's putting "into", but it could be full and seem like the food is "above" the plate
    • her (a cat's) food
      • commonsense inferencing - a cat's food might be a bunch of grains/pieces
      • shape recognition - (a bunch of : the look of repetition)
      • shape recognition - (grains : small circles)
    • her (a cat's) plate
      • commonsense inferencing - a cat's plate might be on the ground
      • commonsense inferencing - a plate is a container, similar to a bowl or a dish
      • shape recognition (the plate : the shape of a plate)
  • "Coco eats a lot", "It seems like she hasn't eaten for month"
    • commonsense inferencing - someone who doesn't eat for a month tends to be very hungry
    • commonsense inferencing - someone who eats a lot gets hungry easily
    • commonsense inferencing - someone who seems very hungry usually eats very fast and leaves little food behind
    • commonsense inferencing - when people eat food, the food goes into their mouths and does not appear anymore
    • commonsense inferencing - people need to get the food before they eat it
    • commonsense inferencing - sometimes one has to approach a thing before he/she gets it
    • motion generation - (approaching : drawing near on the screen)
    • motion generation - (eat the food -> food disappearing on the screen)
Other examples of inferencing/recognition/generation include:
  • "sees a butterfly around" -> the butterfly might be flying in the air, might be staying on something -> flying means to move above
  • "jumps" -> go up and down
The main problems of using the current ShapeNet might be:
  1. Recognition of objects when they are connected - Because the shapes in ShapeNet are basically silhouettes, it is not trivial to recognize objects when they are connected. That is, the grains and the plate are connected to each other, so it would not be correct to use the whole silhouette. To get correct matches, we have to run the recognition as soon as each object is drawn, and it should be completed as soon as the drawing is finished.
  2. Utilizing the timing of the spoken sentences and the drawings - People's focus in speech and vision is typically on one thing at a time, so I think the timing is quite important.




Thursday, January 13, 2005

Commonsense Computing as a Non-Conventional Solution

You might find this title very weird.
Well, I just want to point out that a huge majority of the technology research community solves its problems by trying to find exact, precise solutions based on strong, general mathematical models.
Content analysis provides a great number of examples: speech recognition, natural language understanding, and categorization/retrieval/understanding of images, video, 3D models, and human motion/gestures, facial expression understanding, emotion detection, and so on. Almost all researchers try to solve these problems from the low levels, step by step, hoping to reach the higher levels gradually.
There's nothing wrong with that. But why, particularly for content analysis, are none of their achievements satisfying enough? I say we should take a look at how we humans reason about everything in our world:

"By experience we learn it."

Nobody needs to understand complicated mathematical models in order to understand what people are saying, what their gestures/motions mean, what their facial expressions mean, what is in an image, or what their emotions are. We just learn all these things by experience. If we can teach computers everything by experience, I say it's going to be easier for them to learn than with the original way researchers have been thinking about this problem. Of course we don't have 20 years to wait for a computer to grow from a "new-born computer" into an "experienced, grown-up computer", but we can try to collect all the experience we have and give the computer the whole bunch of it in a relatively short time - after all, it's got huge storage anyway. And that's how I view commonsense computing.

In other words, while the conventional thinking is quite valuable too, I think the experience-based approach is even more important for research on how people think about the world and themselves.

Affective Computing Based on Commonsense/Experience

I have this kind of feeling toward to-date affective computing too. Right now, people use as many sensors as they can, and try to conclude which sensor's data relates to emotions the most. In my opinion, however, we should take advantage of how people use their experiences to recognize emotions through different modalities. That way, we wouldn't need to construct an exact model of the relationship between a particular emotion and sensor data, but could still get good - even better - recognition results. For example, if a woman walks very fast in a dark street, alone, with her body stretched, we can easily infer from our commonsense that she might be afraid or nervous. Commonsense inferencing provides much more of the information we share than sensor data does. Without such inferencing, I don't think it would be easy to recognize emotions correctly, because the information provided by the physiological phenomena detected by sensors is too scarce.
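A toy sketch of that argument, assuming a hand-written table of shared assertions rather than any real affective-computing system: the observed cues of a situation are matched against commonsense preconditions instead of being fit to a sensor-data model.

```python
# Toy commonsense-based emotion inference. The assertion list is invented;
# each entry pairs a set of observable cues with the emotion they suggest.
ASSERTIONS = [
    ({"walking fast", "dark street", "alone"}, "afraid or nervous"),
    ({"walking fast", "late for meeting"},     "in a hurry"),
    ({"smiling", "with friends"},              "happy"),
]

def infer_emotion(observed_cues):
    """Return emotions whose commonsense preconditions are all observed."""
    return [emotion for cues, emotion in ASSERTIONS
            if cues.issubset(observed_cues)]

print(infer_emotion({"walking fast", "dark street", "alone", "body stretched"}))
# -> ['afraid or nervous']
```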





Wednesday, January 12, 2005

Beyond Textual Concepts - Commonsense in Spatial Form

Scenario
Motivation
The Problem

As long as we can link the ontology between the text "throw the bag on the ground" and the throwing motion we perform in physical space, software agents will be able to benefit us proactively in the world we live in. More precisely, I think the problem can be formulated as follows. Right now the nodes in ConceptNet are simply textual representations; if we can hook data about human motion, objects' 3D shapes, and other spatial information onto the nodes, then inferencing, learning, and other commonsense reasoning tasks will be able to proceed in different forms.
Similar to how commonsense was first brought into practice, three main problems need to be overcome:
  • How do we represent commonsense with spatial information (motion, 3D shape, etc)?
  • How do we acquire such information? Or how do we acquire such an ontology (between motion and text)?
  • How do we make use of such "commonsense in spatial form"?
Knowledge representation of what we do and how we move in the physical world

The first question is not easy to answer. I've surveyed some related literature, and it is clear that we can simply use the rotation angles of a set of joints to describe a person's motion. However, when it comes to manipulating an object, the knowledge representation becomes hard to deal with. For example, how do we describe the action of "opening a coke can"? Pulling the ring-pull involves the person's motion, the force he/she applies, and the deformation of the ring-pull and the can as time passes. If we can find a proper knowledge representation for such a situation, I think we can then hook it onto the nodes in the current ConceptNet, but we may still have to spend some time figuring that out.
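One possible representation - an assumption of mine, not an established one - would pair the existing textual ConceptNet node with time-indexed spatial data: the person's joint angles, a rough estimate of the applied force, and the coarse state of the manipulated object. A minimal sketch:

```python
# A sketch of one way to attach spatial data to a textual ConceptNet node.
# Field names and values are invented for illustration.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SpatialFrame:
    time: float                      # seconds from the start of the action
    joint_angles: Dict[str, float]   # e.g. {"right_elbow": 95.0}, in degrees
    applied_force: float             # newtons on the ring-pull (rough estimate)
    object_state: str                # coarse deformation state of the can

@dataclass
class SpatialConcept:
    text_node: str                   # the existing textual ConceptNet node
    frames: List[SpatialFrame] = field(default_factory=list)

open_can = SpatialConcept(
    text_node="open a coke can",
    frames=[
        SpatialFrame(0.0, {"right_elbow": 90.0, "right_index": 10.0}, 0.0, "sealed"),
        SpatialFrame(0.5, {"right_elbow": 95.0, "right_index": 40.0}, 8.0, "ring-pull lifted"),
        SpatialFrame(1.0, {"right_elbow": 100.0, "right_index": 70.0}, 15.0, "opened"),
    ],
)
```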

Acquiring the ontology between text and motion/3D shapes

Second, I think we can refer to how they teach computers the meanings of our motions or manipulations of certain objects at MIT CSAIL. In Charles Kemp's project, what he sees and how he moves are recorded by the camera and sensors he wears, and with offline annotations the computer eventually "learns" what such spatial data means. What we need to do may not be exactly the same, but basically this project provides the idea of how we might acquire such ontologies.





Another problem that has to be overcome after we acquire all the information is: how should we define the similarity between two different motions, tasks, or objects? Or, to ask the question another way, how do we categorize motions, tasks, or objects? I think the literature in computer graphics has more or less addressed this problem for motions and object shapes, but I'm not sure about manipulation tasks. This is something to be explored too.
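For motions at least, one common idea from the graphics literature is to compare joint-angle sequences with dynamic time warping; whether the same trick extends to manipulation tasks is exactly the open question. A minimal sketch with toy two-joint trajectories:

```python
# Dynamic time warping over joint-angle sequences as a motion similarity
# measure. The trajectories below are invented toy data.
def dtw_distance(motion_a, motion_b):
    """motion_a, motion_b: lists of per-frame joint-angle vectors (same joints)."""
    inf = float("inf")
    n, m = len(motion_a), len(motion_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Euclidean distance between the two frames
            frame_dist = sum((a - b) ** 2
                             for a, b in zip(motion_a[i - 1], motion_b[j - 1])) ** 0.5
            cost[i][j] = frame_dist + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

walk = [[10.0, 20.0], [15.0, 25.0], [20.0, 30.0]]
jump = [[10.0, 20.0], [40.0, 60.0], [10.0, 20.0]]
print(dtw_distance(walk, walk), dtw_distance(walk, jump))  # 0.0 vs. a larger value
```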

Making use of such "commonsense in spatial form"

Basically, I think reasoning, inferencing, or any other computation on this "commonsense in spatial form" is similar to what has been done for textual commonsense knowledge. But if we would like to solve more practical problems, then the motion data we have will be applied in the video-understanding domain, instead of understanding motion in the 3D world. In that case, human motion in 2D video will need to be recognized and mapped onto the motion in the commonsense database. Therefore, the lower-level technique of extracting human motion from videos will be important.

More Applications

What else can be achieved with this new form of commonsense? I think at least the following research topics may be pushed forward by spatial-form commonsense knowledge: video understanding, affective computing, 3D model retrieval, and cross-media information retrieval.
  1. Video Understanding - Content analysis researchers have been trying to make machines understand videos or even still images, so that multimedia retrieval, categorization, and so on can shift from lower-level feature extraction/processing to semantic-level computing. However, they just can't get rid of the problem that the video content has to be bound to a specific domain, e.g., sports (tennis, soccer, baseball, and so on), because computers need more information to process this video data. Conversation understanding was also a tough problem, because traditional natural language processing techniques reason about the passages without introducing the untold knowledge shared by the speakers. Commonsense computing has solved this problem to some degree (see Push's paper), and I think using commonsense to help solve the problem of video understanding is simply an analogy.
  2. Affective Computing - As illustrated in the scenario, affective computing in our physical world definitely needs such a technique. Only with spatial-form commonsense will computers be able to read people's motions (or even facial expressions), and then they can act as someone caring and thoughtful.
  3. 3D Model Retrieval - I worked on the 3D model retrieval project in my first year in the graphics group at NTU. I realized that almost all to-date techniques determine the similarity among different 3D models using the models' shapes. Lots of important things, such as the objects' functions, are missing from the descriptions of these objects. This is why people still can't find all the things they think they're looking for, even though researchers have been working hard on this. If we could hook the spatial information, including the 3D shapes of objects, onto the nodes in ConceptNet, then we could in turn utilize all the related information as descriptions of a 3D object and enhance the search engine. For example, when we look for a phone, the search result would include cell phones, phones we use indoors, public phones on the streets, etc., even if they aren't alike in shape. Why? Because the motions of pressing the buttons and leaning an ear on the phone are similar, these motions are linked with the term "making phone calls" in the current ConceptNet, and we've got the textual commonsense "a phone is a tool for someone to make a call."
  4. Cross-Media Information Retrieval - Actually, the idea that I want to push is this sentence: "A concept, substantial or intangible, is formed with multiple elements - what we see, what we hear, what we feel, and what it means to us." Once we can describe things with all these elements together, I think the boundaries among text, motion, image, video, audio - all media - will be much vaguer. In the past, the meanings of concepts were hard to form, but now I think commonsense computing has taken a step forward toward this end. If we can hook all the related media descriptions onto the nodes of ConceptNet, I think we will eventually realize cross-media information retrieval in practice. That is, I could find everything about a car, be it video, audio, or a text document, by simply typing the three letters - C, A, R. (A toy sketch of such a multi-media node follows this list.)
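A toy sketch of that last idea, with an invented node layout and invented file names: if every ConceptNet-style node carried pointers to media of all kinds, one keyword could return them all.

```python
# Toy cross-media retrieval over a single hand-written multi-media concept node.
CONCEPTS = {
    "car": {
        "text":   ["a car is used for driving", "a car has four wheels"],
        "image":  ["car_photo.jpg"],
        "video":  ["car_driving.mp4"],
        "audio":  ["engine_sound.wav"],
        "3d":     ["sedan.obj"],
        "motion": ["steering_wheel_turn.bvh"],
    },
}

def cross_media_search(query):
    """Return every linked media item, regardless of type, for a concept."""
    node = CONCEPTS.get(query.lower(), {})
    return [(media_type, item) for media_type, items in node.items() for item in items]

for media_type, item in cross_media_search("CAR"):
    print(media_type, ":", item)
```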


Thursday, January 06, 2005

"As Robots Learn to Imitate"IST Results (01/03/05)

The IST-funded MIRROR project has spent three years studying how people recognize and mimic gestures by transferring that ability to a robotic system and observing the results. In the first year of the project, researchers employed a "cyber-glove" to collate visual and motor data that was used to explore the link between vision and action in the identification of hand gestures; the second year involved experiments with monkeys and human infants to determine how visual and motor data can be employed to draw distinctions between grasping actions, and then applying that information to robotic mimicry of simple object-directed actions; the third and final year of MIRROR focused on building a humanoid robot by combining the principles outlined in the previous years. The robot is constructed out of a binocular head, an arm, and a hand with multiple digits; though the device is still incomplete, the researchers believe they have discovered many components of a biologically-interoperable architecture that can be robotically duplicated. "From the robotics point of view, we demonstrated that it is easier to interpret actions performed by others if the system has built a representation of the action during learning," explains MIRROR project coordinator Giulio Sandini. MIRROR consortium members are now working on the FP6 IST RobotCub project, a follow-up effort whose goal is to construct a humanoid platform to investigate how manipulation skills are developed.


"Perfect Profiles"Information Highways (ACM Technews 12/04)

Kennedy, Mary Lee
Companies are increasingly basing their interface design decisions on "personas," context-sensitive archetypes or profiles of natural groupings of real users that ensure that products will meet their requirements and goals. Personas are often assigned photos, names, and personalities to make them seem as real and credible as possible. Effective personas boost a company's opportunity to strengthen customer loyalty and make it more likely that every organizational member will do his or her utmost to fulfill user goals by determining what aspects should be considered as well as what aspects should be ignored; the overall result is shorter and less costly product development cycles, greater customer satisfaction and allegiance, and better understanding of user wants and needs. Project managers use personas to relate the project vision to senior executives and recognize the capabilities, features, and content that best suit the target audience. Expert usability reviews of existing products and usability testing scenarios can be influenced by personas, while marketing functions can employ personas to demonstrate their understanding of the target audience's goals via campaigns and materials; furthermore, personas' knowledge can be leveraged by support functions to organize effective knowledge bases and structure responses to assistance requests. Burgeoning Web and software use has certified the direct correlation between user satisfaction and user loyalty, and personas are playing an increasingly critical role in the development of software, Web applications, and other products. A 2001 Forrester Research study of Web redesign projects concluded that measurable user-experience goals are essential to successful Web sites, a notion that could be applied to any product development process. Using personas in conjunction with other user-oriented methods (task analysis, usability testing, etc.) increases the probability that a more usable design will emerge.




Wednesday, January 05, 2005

Rules for Expressing Experience?

James showed me this image he made today, and within a minute I found that it was made up of all the photos I took in Spain. Then I started to think: how is the life of the girl at the top of the picture going right now? Is she working? Getting married? What's her hair like? How's her family doing right now?
And as I look at the people I met at the Eurographics conference, similar thoughts came to mind too.
What are the guys doing recently? Are they coming up with new research topics?
Lots of things come and go in my mind, and I believe everyone is going to feel something different upon seeing such an image designed particularly for him/her.


The question here is,
Experience can be represented through text, images, video, audio, and so on.
If we were making new media for users to re-experience our experience, that is,

to make them recall their memories,
to make them associate these memories to something else in the past, present, or even future,
and even to create new memories for the past, present, or future,
  • Can we find a general process by which experience can be represented or expressed?
  • Can we formulate an environment in which these goals become tangible, or realized?

This invites even more fundamental questions:
  • How do humans memorize things?
  • How do they treat their memories? How do they interact with their memories?
Once I read something about memories of ours that are false. It pointed out that sometimes our brains make up new memories according to our experience. An illusion caused by this strange situation may be something like,
"Oh! I remember that I have been to this situation in my dream!"
I wonder whether, in our minds, what we experience and what we create are separated by merely a thin line. So another question might be,
  • What is the meaning of memories/experiences to us humans?
If we can find the answers to these questions, I think an extremely interesting story about new technology might follow.

