Storytelling is about making connections. That is, narration is really an endless series of decisions, each of which comes down to one question: "Which two story segments should be bridged?"
This question has to be answered against a huge number of criteria: the connection has to look smooth, it has to make sense in terms of causality, it has to be consistent with what the audience knows and doesn't know yet, and so on. Nevertheless, whatever media format is used (text, audio, video, and so on), the nature of the activity the storyteller is engaged in stays the same. It's all about making decisions about connections.
But the thing is, the granularity of both the story segments and the connections between them varies tremendously across media types. In textual storytelling, narration stays within the text domain, where abstraction or abbreviation is easy (since text expresses *semantics* but not *senses*), so the segments can be coarse. In other words, the storyteller can leave the details blank and the audience fills them in using their own imagination. For example, the sentence "The man in a black suit and a hat slowly walked in, and stepped on the old, wooden floor" can be shortened to "The man in a black suit and a hat walked in," or even simply to "The man came in." Information is lost whenever we abstract or abbreviate, but we often do it anyway because it lets us focus on the flow or evolution of the story. After all, it would be too tedious to detail everything in a story.
At the other extreme, using video as the medium for storytelling requires heavily detailed information, because video engages both sight and hearing. Video makers need to handle, at every single moment of the video, how it looks and how it sounds. As a result, the granularity of the story's building blocks and of the connections between them becomes much finer, and the criteria involved in each decision multiply as well (e.g., the correlation between the spoken words and the image shown, whether the video or the audio is carrying the plot, how to cut back and forth between two related scenes over the same background audio, etc.). A building block in this problem domain might therefore be, for instance, a 0.5-second video clip or a 2-second audio clip.
So what I'm trying to say here is: if we really want to deal with video-based storytelling, then we really need to look at this problem of making decisions about connecting fine-grained segments. Otherwise, the task is by no means different from simply dealing with textual stories.
But how are we going to use commonsense computing or any other AI technique to do it? Since the most advanced techniques right now all work in the text domain, one experiment we might try is to chop the material into very fine segments, split the video and audio tracks, and attach detailed annotations to all of these granular building blocks. I understand that it might look pretty stupid, because no one would ever do this kind of work in the real world for practical use. But ironically, there's something worth investigating here precisely because nobody has done this kind of thing before, and there's no way to tell yet how a computer could make use of this kind of material to benefit video-based storytelling.
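To make the idea of finely chopped, annotated building blocks a bit more concrete, here is a minimal sketch. Everything in it is hypothetical and for illustration only: the `Segment` class, the `chop` helper, the 4-second clip length, and the sample annotation are my own assumptions, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One fine-grained building block (hypothetical structure)."""
    track: str                      # "video" or "audio"
    start: float                    # seconds from the start of the clip
    end: float                      # seconds from the start of the clip
    annotations: list = field(default_factory=list)  # free-text notes


def chop(duration, step, track):
    """Split one track of `duration` seconds into blocks of `step` seconds."""
    segments = []
    n = 0
    while n * step < duration:
        start = n * step
        end = min(start + step, duration)  # the last block may be shorter
        segments.append(Segment(track, start, end))
        n += 1
    return segments


# Assumed example: a 4-second clip, with video and audio chopped separately.
video = chop(4.0, 0.5, "video")   # 0.5-second video blocks
audio = chop(4.0, 2.0, "audio")   # 2-second audio blocks

# Annotations are plain text, so text-domain techniques can work on them.
video[0].annotations.append("a man in a black suit and a hat walks in")
```

The point of keeping the annotations as plain text is that each block then lives in the text domain, where existing techniques can at least be tried.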
The other direction I may take is to draw on the experience I have with videos and work on something relatively easier: weblogs. A blog is a kind of textual story. It is organized as a succession of posts, so it's time-aware, just like story *progression* in video footage. A recent post may share a related mindset with earlier posts, so referring to them is analogous to juxtaposing semantically related footage. One of the major defects of today's blogging software is, in my personal opinion, that viewing can only follow one axis: the chronology of the posts. There's no sense of story progression in terms of other story elements such as emotions, topics, questions, characters, and so forth. Using commonsense computing technology, we may be able to come up with a novel *storied navigation* theme in the world of blogs.
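As a rough illustration of navigating along an axis other than chronology, the sketch below ranks posts by how many topic labels they share with the current one. Note the assumptions: the `Post` class, the hand-assigned `topics`, and the sample posts are all made up, and plain set overlap (Jaccard similarity) stands in for whatever a commonsense computing approach would actually provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """A blog post with hand-assigned topic labels (hypothetical structure)."""
    title: str
    day: int              # position on the chronological axis
    topics: frozenset     # stand-in for richer story elements


def chronological(posts):
    """The one axis today's blogging software offers."""
    return sorted(posts, key=lambda p: p.day)


def related(current, posts):
    """Rank the other posts by topic overlap with `current` (Jaccard)."""
    def overlap(p):
        union = current.topics | p.topics
        return len(current.topics & p.topics) / len(union) if union else 0.0

    others = [p for p in posts if p is not current]
    # Highest overlap first; ties fall back to chronological order.
    return sorted(others, key=lambda p: (-overlap(p), p.day))


# Made-up sample posts.
first_cut = Post("First cut", 1, frozenset({"video", "editing"}))
lunch = Post("Lunch", 2, frozenset({"food"}))
second_cut = Post("Second cut", 3, frozenset({"video", "editing", "audio"}))
posts = [second_cut, lunch, first_cut]

by_time = [p.title for p in chronological(posts)]
by_topic = [p.title for p in related(first_cut, posts)]
```

Here the chronological view puts "Lunch" right after "First cut", while the topic view jumps straight to "Second cut". The same ranking function could be swapped for one based on emotions, questions, or characters.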