---------------------------------------
ECCV 2016 (Accept - poster)
---------------------------------------


--------- Meta_Reviewer_1 ---------

* "Consolidation Report. Please summarize to the authors and your fellow ACs the rationale behind your recommendation. Please provide sufficient detail for your fellow ACs to understand your point of view. Please also remember to be polite and constructive."

Two of the reviewers recommend acceptance and one recommends rejection. The negative reviewer considers the absence of human-human interaction modeling a reason for rejection.
As stated by the authors, this is future work and does not limit the contribution of their approach.

The paper was discussed by the AC panel. The decision is to accept the paper as a poster.


--------- Meta_Reviewer_3 ---------

* "Consolidation Report. Please summarize to the authors and your fellow ACs the rationale behind your recommendation. Please provide sufficient detail for your fellow ACs to understand your point of view. Please also remember to be polite and constructive."

The authors provided a robust rebuttal, and there was some discussion on the BBS.
Competent reviewers feel strongly, but with opposing views about whether to accept or reject this paper. 

Overall, after discussions online with the reviewers and with the ACs, we are excited to accept this paper, with the revisions requested by the reviewers and promised by the authors, excluding human interactions. Integration with a tracker is not essential, but would be nice to see.


--------- Rebuttal ---------

* "Rebuttal (Visible to Reviewers, ACs and PCs)"

We thank the reviewers for their comments.

R3:LIMITED NOVELTY
As also noted by R1, a key novelty of our work "lies in transferring semantic context from training scenes into a new scene and mapping their behavioral characteristics". As noted by R5, this enables path prediction "to novel scenes that are similar in terms of functional properties". The main criticism of R3 is that "Walker [3] has already explored nearest-neighbors for agent interaction even without any semantic labeling". Our approach is significantly different. We use kNN to transfer the properties encoded in the navigation map. This is done by "automatic scene parsing plus using local semantic descriptors", and thus without the need for any additional ground-truth labels. Walker uses patch similarity in the image feature space, and that work also focuses on predicting visual appearance. Instead, we transfer from the training set the functional properties that allow our DBN model to drive prediction on novel scenes with similar semantics.

R3:DATASETS
R3 believes we don't show any more scene diversity than prior work. Kitani [2] uses only two scenes from VIRAT. Walker uses videos from just one VIRAT scene and requires an additional dataset to train mid-level patches. These VIRAT videos are mostly from a road scenario (approx 500m^2) and the variety of scene semantics is limited. Stanford-UAV has larger scenes (900m^2/scene) and more agent-scene interactions (see [42]). We use videos from 6 large physical areas corresponding to 15 different scenes (we'll make this point clearer in the paper). We demonstrate that our method can "generalize to an arbitrary class" by showing results on pedestrians and cyclists. These classes interact very differently with the same scenes, while all previous works reported results on only one class.

R5:MODEL
All reviewers appreciate the formulation of our model. However, R5 believes it is limited because it is "expressly suited for predicting human-scene interactions" and "does not consider the dynamics that hold when the scene is crowded". We tackle the problem of long-term path prediction by exploiting human-scene interactions, similarly to [2,3]. Other works investigate human-human interactions for short-term prediction in crowded scenes (e.g. [12]). We agree that a joint model would be an interesting avenue for future work, but we believe that modeling human-scene interactions is by itself an interesting and challenging problem. Moreover, we take a further step toward knowledge transfer by showing better performance than [2].

R1,R3,R5:EXPERIMENTS
R3:"Fig.7 suggest that the method is only learning texture information". As correctly noted by R1, the intent of Fig.7 is "illustrative to help the reader appreciate the method". It is just a qualitative visualization. Our method learns the navigation maps shown in Fig.8 and, although they "are much noisier than the directly observed ones", the "path prediction is very similar appearing to smooth out the map noise". Nevertheless, we ran a new experiment to see "how a patch matching scheme using simple texture descriptors compare as a baseline". We used the same visual features used for image parsing (lines 349-355); this gives an MHD of 21.67, versus 14.29 with the proposed context descriptors, confirming the effectiveness of our transfer procedure.
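To make the metric concrete, the following is a minimal sketch of how two paths could be scored, assuming that MHD here denotes the modified Hausdorff distance (the rebuttal does not expand the acronym). The function names and the toy coordinates are hypothetical and are not the authors' evaluation code.

```python
import math


def directed_mean_min_distance(path_a, path_b):
    """Mean, over the points of path_a, of the distance to the closest point of path_b."""
    total = 0.0
    for ax, ay in path_a:
        total += min(math.hypot(ax - bx, ay - by) for bx, by in path_b)
    return total / len(path_a)


def modified_hausdorff_distance(path_a, path_b):
    """Symmetric score: the larger of the two directed mean-min distances."""
    return max(
        directed_mean_min_distance(path_a, path_b),
        directed_mean_min_distance(path_b, path_a),
    )


# Toy paths (hypothetical coordinates, not data from the paper).
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
observed = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
mhd = modified_hausdorff_distance(predicted, observed)
print(mhd)
```

Under this definition a lower value means the predicted path stays closer to the observed one, which is the sense in which 14.29 is better than 21.67.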

R5 asks how long it takes to learn the DBN. We'll add a brief discussion about computational costs. Our approach is approx 4x faster than [2] and a full experiment on the UAV dataset takes around 2 hours (with unoptimized Matlab code).

R1 praised the extent of our experiments, describing them as "convincing and sufficient". (S)he asks for more details about the experimental protocol and the computation of the context descriptors. We thank R1 for these comments and will revise the paper to incorporate all this information. We did run experiments with larger descriptors and with a partitioning into both sectors and bands, but we reported only the scheme that provides the best results.


--------- Assigned_Reviewer_1 ---------

* "Paper Summary. Please summarize in your own words what the paper is about."

The paper presents an interesting problem that has been the subject of a few recent papers, predicting trajectories in a scene. Here, the primary novelty lies in transferring semantic context from training scenes into a new scene, and mapping their behavioral characteristics accordingly. This enables path prediction with no video observations on a new scene.

The formulation of scene context and matching is interesting and reasonable, relying on automatic scene parsing plus local semantic descriptors. The path prediction algorithm compensates for noisy data in the transfer.

The experiments are convincing and sufficient, using two previously-available datasets not created for this purpose. Results are compared to [2], which attempts the same problem without the same type of knowledge transfer. The proposed method achieves better performance. There are also useful, supporting experiments exploring the contribution of scene context.

* "Paper Strengths. Positive aspects of the paper. Be sure to comment on the paper's novelty, technical correctness, clarity and experimental evaluation. Notice that different papers may need different levels of evaluation: e.g., a theoretical paper vs. an application paper. "

The problem is quite interesting, and worthy of further study. There is a wealth of data in surveillance scenes that can be transferred, and little work in exploiting it as the large majority of methods perform scene-specific learning.

The paper is very clear, with a distinct statement of contributions framed against a detailed description of the most related previous work.

The features used to characterize the behavior at each grid cell are mostly reasonable. The "routing score" seems to be useful, but often would not correspond to "changes in behavior" as the authors suggest. For example, there may be an obstacle, such as a fence, that people walk around, inducing high curvature trajectories.

The primary novelty and most interesting part of the paper is the knowledge transfer formulation, including the local semantic descriptor. Incorporating scene type by discrete distance quanta has been useful in other works. Averaging the behavior scores from the matching patches makes sense and appears to work, given the limited dataset (see below).
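The averaging step mentioned above can be sketched as follows. This is only an illustration under assumptions: the Euclidean descriptor distance, the value of k, and the dictionary layout of a patch are all hypothetical choices, not details taken from the paper.

```python
import math


def descriptor_distance(d1, d2):
    """Euclidean distance between two descriptors (an assumed choice of metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(d1, d2)))


def transfer_scores(query_descriptor, training_patches, k=3):
    """Average the behavior scores of the k training patches nearest to the query."""
    ranked = sorted(
        training_patches,
        key=lambda p: descriptor_distance(query_descriptor, p["descriptor"]),
    )
    nearest = ranked[:k]
    names = nearest[0]["scores"].keys()
    return {n: sum(p["scores"][n] for p in nearest) / len(nearest) for n in names}


# Hypothetical training patches: a descriptor plus per-patch behavior scores.
training_patches = [
    {"descriptor": [1.0, 0.0], "scores": {"popularity": 0.9, "routing": 0.1}},
    {"descriptor": [0.9, 0.1], "scores": {"popularity": 0.7, "routing": 0.3}},
    {"descriptor": [0.0, 1.0], "scores": {"popularity": 0.1, "routing": 0.8}},
]
transferred = transfer_scores([1.0, 0.0], training_patches, k=2)
print(transferred)
```

With k=2 the dissimilar third patch is ignored, so the transferred scores are the mean of the first two patches.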

In fig. 8 the predicted maps are much noisier than the directly observed ones, but the path prediction is very similar, appearing to smooth out the map noise.

The figures showing the hallucinated scenes are illustrative and help the reader appreciate the method.

The experiments across different parameter settings, number of trajectories and computed vs. truth segmentations provide depth to the paper and evidence that the method is robust.

* "Paper Weaknesses. Discuss the negative aspects: lack of novelty or clarity, technical errors, insufficient experimental evaluation, etc. Please justify your comments in great detail. If you think the paper is not novel, explain why and provide evidences."

The proposed knowledge transfer paradigm is general, because it only depends on matching semantic classes and their spatial distributions across scenes (with the underlying assumption that similar classes have similar dynamic behaviors). However the experiments only attempt transfer between very similar scenes - those taken from the same UAV dataset, in the same area near Stanford. The opportunity for more interesting transfer exists even in the data used in the paper - why not try transfer between the Stanford and UCLA datasets? The semantic class lexicons would have to be unified but that should be straightforward.

The split between training and testing data is not specified except stating 70/30, line 473 in sec. 5.2. For the knowledge transfer experiments, what was this split? There are 6 scenes in the UAV dataset. Did you use the other 5 for transfer to the 6th? The paper should be very clear on this point. Experiments with this split would be illustrative.

It seems that, during training, annotated trajectories are required. Is that the case? If so this is a major limitation. Have you tried using computed tracks, as most other video scene understanding methods do?

The semantic context descriptor ignores absolute and relative orientation because of the histograms across the annular regions. It seems that this could discard useful information such as linear structure, e.g. along roads.
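The loss of orientation can be illustrated with a small sketch. It assumes a simplified descriptor made of one class histogram per ring around a patch center; the ring radii, the grid, and the function names are hypothetical and may differ from the descriptor actually used in the paper.

```python
def annular_histograms(labels, center, ring_edges, num_classes):
    """One normalized class histogram per ring around `center` given as (row, col)."""
    rings = [[0] * num_classes for _ in range(len(ring_edges) - 1)]
    cr, cc = center
    for r, row in enumerate(labels):
        for c, cls in enumerate(row):
            d2 = (r - cr) ** 2 + (c - cc) ** 2  # squared distance keeps the test exact
            for i in range(len(ring_edges) - 1):
                if ring_edges[i] ** 2 <= d2 < ring_edges[i + 1] ** 2:
                    rings[i][cls] += 1
                    break
    histograms = []
    for ring in rings:
        total = sum(ring)
        histograms.append([n / total if total else 0.0 for n in ring])
    return histograms


def rotate90(grid):
    """Rotate a square label grid by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


# Class 1 marks a vertical "road" through a 5x5 patch of class 0.
vertical_road = [[0, 0, 1, 0, 0] for _ in range(5)]
horizontal_road = rotate90(vertical_road)

edges = [0.0, 1.5, 3.0]
desc_vertical = annular_histograms(vertical_road, (2, 2), edges, num_classes=2)
desc_horizontal = annular_histograms(horizontal_road, (2, 2), edges, num_classes=2)
print(desc_vertical == desc_horizontal)
```

The two patches differ only in the direction of the road, yet their ring histograms coincide, which is the information loss this comment points at.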

The averaging across the three histograms also blurs their effect: did you consider a larger descriptor that does not collapse the histograms?

Taking the weighted sum (line 397) across a vector of 2d spatial distances and histogram bins should require some normalization. How was this done?
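One possible normalization, given here only as a sketch and not as the scheme the authors used, is to bring both parts to a comparable range before weighting: spatial distances are divided by an assumed maximum distance and histogram bins are divided by their sum. The parameter names `max_distance` and `weight` are hypothetical.

```python
def normalized_descriptor(spatial_distances, histogram, max_distance, weight=0.5):
    """Scale both parts to [0, 1], then weight them so neither dominates by units alone."""
    scaled_distances = [d / max_distance for d in spatial_distances]
    total = sum(histogram)
    scaled_histogram = [h / total if total else 0.0 for h in histogram]
    return (
        [weight * d for d in scaled_distances]
        + [(1.0 - weight) * h for h in scaled_histogram]
    )


# Hypothetical values: distances in pixels, histogram as raw counts.
descriptor = normalized_descriptor([50.0, 100.0], [3, 1], max_distance=200.0)
print(descriptor)
```

Without such a step, distances measured in pixels would typically be orders of magnitude larger than histogram bins and would dominate any weighted combination.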

* "Preliminary Rating. Please rate the paper according to the following choices. Oral: these are papers whose quality is in the top 10% of the papers at ECCV. Examples include a theoretical breakthrough with no experiments; an interesting solution to a new problem, etc."

Oral/Poster

* "Preliminary Evaluation. Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please summarize the key things you would like the authors to include in their rebuttals to facilitate your decision making."

The paper is tackling a more interesting problem than most. It is very clear, well grounded against the SOA, with an interesting method. The experiments are convincing, with exploration of various parameters and factors which many papers do not include.

* "Confidence. Write 'Very Confident' to stress you are absolutely sure about your conclusions (e.g., you are an expert working in the area), 'Confident' to stress you are mostly sure about your conclusions (e.g., you are not an expert but are knowledgeable). 'Not Confident' in all the other cases."

Very Confident

* "Final Recommendation. After reading the author's rebuttal and the discussion, please explain your final recommendation. Your explanation will be of highest importance for making acceptance decisions and for deciding between posters and orals."

While I understand R5's concern that human interactions are not considered in this work, I disagree that this is a basis for rejection. If so, then the most related works should not have been published either. Understanding the potential dynamics of the scene without crowds is important in its own right, as scenes often do not have crowds or even multiple people. It is unfair, particularly given the related work, to punish this paper for not taking on the additional scope of introducing people.

The rebuttal makes strong points to address R3's concerns about novelty, which I also disagreed with. The knowledge transfer elements here are a significant advance beyond [3].

The scene diversity in this paper is considerably greater than in previous works, as the authors point out. It would have been interesting to see transfer between the two datasets used in the paper, but that is not a significant drawback.

Apologies to the AC for making it difficult, but I stand by my score. I believe this paper will attract significant interest at the conference.

* "Final rating. After reading the author's rebuttal, please rate the paper according to the following choices."

Oral/Poster


--------- Assigned_Reviewer_3 ---------

* "Paper Summary. Please summarize in your own words what the paper is about."

The authors focus on trajectory prediction of agents such as pedestrians or cyclists in static outdoor scenes. Specifically, they transfer decision information such as movement direction, velocity, routing, and popularity on patches in a novel scene using a nearest-neighbor approach. With these matches, they construct a navigation map. With this navigation map, a Dynamic Bayesian Network is used to predict a distribution of possible trajectories for agents. They evaluate their method on the Stanford-UAV as well as the UCLA-courtyard dataset against Kitani et al. and a Linear Prediction baseline.

* "Paper Strengths. Positive aspects of the paper. Be sure to comment on the paper's novelty, technical correctness, clarity and experimental evaluation. Notice that different papers may need different levels of evaluation: e.g., a theoretical paper vs. an application paper. "

1. The authors show improved quantitative performance over Kitani et al.

2. The authors incorporate the novel concept of "routing points" which describe intermediate goals based on the structure of the environment.

3. The framework is described in rigorous mathematical language. The choice of distributions (such as Gamma for velocity) as well as the formulation for the conditional probability distributions in the DBN make intuitive sense.

4. The framework can generalize to an arbitrary object class.
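Regarding the choice of a Gamma distribution for velocity in point 3, a short sketch shows why it is a natural fit: samples are non-negative and the mean is shape times scale. The parameter values below are hypothetical and are not taken from the paper.

```python
import random


def sample_speed(shape, scale, rng):
    """Draw a non-negative speed from a Gamma(shape, scale) distribution."""
    return rng.gammavariate(shape, scale)


rng = random.Random(0)
shape, scale = 2.0, 0.7  # hypothetical parameters; theoretical mean is shape * scale
speeds = [sample_speed(shape, scale, rng) for _ in range(10000)]
mean_speed = sum(speeds) / len(speeds)
print(round(mean_speed, 2))
```

Unlike a Gaussian, this never produces negative speeds, and its right skew leaves room for the occasional fast agent.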

* "Paper Weaknesses. Discuss the negative aspects: lack of novelty or clarity, technical errors, insufficient experimental evaluation, etc. Please justify your comments in great detail. If you think the paper is not novel, explain why and provide evidences."

1. The authors claim on lines 88-89 that the results generalize to a large set of scenes. The UCLA dataset only uses 2 scenes, and the Stanford-UAV dataset only uses 6 different scenes. This is not significantly larger than the scene diversity in the VIRAT dataset which was used in Kitani et al. It is unclear to me how these datasets are superior to the VIRAT dataset. At the very least, it would help to compare on the VIRAT dataset in addition to these two. 

2. The knowledge transfer does not seem novel. Walker et al. transferred contextual scene information using a nearest-neighbor method as well. In addition, the contextual model of Walker et al. required no semantic labels and used scenes from low-quality YouTube videos. 

3. The scene reconstructions in figure 7 suggest that the method is only learning simple texture information (i.e green grass vs grey sidewalk). How would a patch matching scheme using simple texture descriptors compare as a baseline?

4. On lines 42-43 the authors note that previous works only focused on cars and people. However, the authors only test their method on pedestrians and cyclists.

* "Preliminary Rating. Please rate the paper according to the following choices. Oral: these are papers whose quality is in the top 10% of the papers at ECCV. Examples include a theoretical breakthrough with no experiments; an interesting solution to a new problem, etc."

Weak Reject

* "Preliminary Evaluation. Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please summarize the key things you would like the authors to include in their rebuttals to facilitate your decision making."

I weakly reject the paper as it stands due to novelty. The authors need to address the following issues:
1. The approach does not seem to use any more scene diversity than previous work, 
2. Previous work has already explored nearest neighbor approaches for agent interaction understanding - even without any semantic labeling. 
3. It is unclear that the model can outperform simple rules based on texture information (grass vs street). 

* "Confidence. Write 'Very Confident' to stress you are absolutely sure about your conclusions (e.g., you are an expert working in the area), 'Confident' to stress you are mostly sure about your conclusions (e.g., you are not an expert but are knowledgeable). 'Not Confident' in all the other cases."

Very Confident

* "Final Recommendation. After reading the author's rebuttal and the discussion, please explain your final recommendation. Your explanation will be of highest importance for making acceptance decisions and for deciding between posters and orals."

The authors have addressed most of my concerns. Walker et al. do in fact transfer functional information from scenes for interaction, but the information transferred is very basic: presence vs. absence of active agents. This paper focuses on transferring more complex information such as directions, speeds, etc. While I agree with R5 that human-human interactions are needed for correct path forecasting, I believe that they are beyond the scope of this paper. I upgrade my score to a poster.

* "Final rating. After reading the author's rebuttal, please rate the paper according to the following choices."

Poster


--------- Assigned_Reviewer_5 ---------

* "Paper Summary. Please summarize in your own words what the paper is about."

In this paper, a prediction model based on a Dynamic Bayesian Network formulation is presented, in which the target state is updated by using the statistics encoded in a navigation map of the scene. The approach works by tessellating the navigation map into patches, where each patch has a set of functional properties that help to infer the behavior of people who walk over it. Given a scene and a moving target, the approach generates a set of likely trajectories of the target by using some prior information, namely the initial state of the target (position + velocity) and the knowledge of the scene (the functional properties). These properties are: 1) how many times a patch has been explored with respect to the others, 2) the probability of that patch being a routing point (where the target is likely to change its dynamics, due to a bifurcation or an obstacle), 3) the probability of going in a particular direction, and 4) the probability of moving at a particular speed. All of these quantities are fed into a DBN, which predicts the next state of the target. The approach also allows one to perform knowledge transfer, essentially by propagating the properties of the scene on which the model was trained to scenes that are similar in terms of functional properties.
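The state update summarized above can be pictured with a toy rollout. This is a heavily simplified sketch under assumptions, not the DBN from the paper: it uses four compass directions, re-draws the heading with the patch's routing probability, samples speed from a Gamma distribution, and leaves out the popularity score. The map layout and every name in it are hypothetical.

```python
import random

DIRECTIONS = {"N": (0, -1), "E": (1, 0), "S": (0, 1), "W": (-1, 0)}


def sample_path(start, heading, nav_map, cell_size, steps, rng):
    """Roll out one path; each new state depends only on the previous state."""
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        cell = nav_map.get((int(x // cell_size), int(y // cell_size)))
        if cell is None:  # the target has left the mapped area
            break
        # At a likely routing point the heading is re-drawn from the patch statistics.
        if rng.random() < cell["routing"]:
            names = list(cell["direction"])
            weights = [cell["direction"][n] for n in names]
            heading = rng.choices(names, weights=weights)[0]
        speed = rng.gammavariate(cell["speed_shape"], cell["speed_scale"])
        dx, dy = DIRECTIONS[heading]
        x, y = x + speed * dx, y + speed * dy
        path.append((x, y))
    return path


# Hypothetical navigation map: a one-row corridor of five patches.
nav_map = {
    (i, 0): {
        "routing": 0.2,
        "direction": {"E": 0.8, "S": 0.2},
        "speed_shape": 2.0,
        "speed_scale": 0.5,
    }
    for i in range(5)
}
paths = [
    sample_path((0.5, 0.5), "E", nav_map, 1.0, 20, random.Random(seed))
    for seed in range(5)
]
print([len(p) for p in paths])
```

Sampling many such rollouts from the same initial state gives a set of likely trajectories, which is the kind of output the summary describes.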

* "Paper Strengths. Positive aspects of the paper. Be sure to comment on the paper's novelty, technical correctness, clarity and experimental evaluation. Notice that different papers may need different levels of evaluation: e.g., a theoretical paper vs. an application paper. "

The topic is interesting: having a prediction model which works robustly and on different situations would be of great benefit for a tracker.

* "Paper Weaknesses. Discuss the negative aspects: lack of novelty or clarity, technical errors, insufficient experimental evaluation, etc. Please justify your comments in great detail. If you think the paper is not novel, explain why and provide evidences."

Unfortunately, the prediction model has not been embedded in a tracker, but used as a simple path predictor. Seeing how it integrates with a tracker would be more satisfying for the community.

The approach is expressly suited for predicting human-scene interactions (where people go without colliding with static obstacles), so the study of how to predict trajectories in crowded scenarios (considering other moving people) is not taken into account.

* "Preliminary Rating. Please rate the paper according to the following choices. Oral: these are papers whose quality is in the top 10% of the papers at ECCV. Examples include a theoretical breakthrough with no experiments; an interesting solution to a new problem, etc."

Weak Reject

* "Preliminary Evaluation. Please indicate to the AC, your fellow reviewers, and the authors your current opinion on the paper. Please summarize the key things you would like the authors to include in their rebuttals to facilitate your decision making."

PROS
Dynamics forecasting is of great interest for the surveillance and robotics community.
The idea of having functional properties connected to local patches, and then merging all the local estimations in a joint trajectory, is carried out in a very elegant way.
CONS
The approach is limited, in the sense that it does not explicitly take into account the dynamics that hold when the scene is crowded. In other words, it does not account for the behavior of other people in the same navigation map. My doubt is that, given that the DBN is Markovian of order one, it cannot model long-term interactions among people. Unfortunately, this aspect is not adequately discussed in the paper. Having additional material showing the sequences would have been beneficial.

Another question, which is tightly connected with the previous comment, is: when the functional properties of a scene are studied, are the training data of the same kind as the testing data? In other words, if many people pass through a location only because people in other positions are blocking the flow, this would be absolutely misleading for the final model.

How long does it take to learn the DBN? A computational complexity analysis should be given.
The experiments show sufficient strength against the competitors, but the datasets are of limited interest (modeling in particular people-scene interaction, and not people-people interaction). Having some examples of the town center data for example would be very convincing.

* "Confidence. Write 'Very Confident' to stress you are absolutely sure about your conclusions (e.g., you are an expert working in the area), 'Confident' to stress you are mostly sure about your conclusions (e.g., you are not an expert but are knowledgeable). 'Not Confident' in all the other cases."

Confident

* "Final Recommendation. After reading the author's rebuttal and the discussion, please explain your final recommendation. Your explanation will be of highest importance for making acceptance decisions and for deciding between posters and orals."

The choice not to include a human-human interaction model in path forecasting approaches is too limiting, and may lead to very strange results. If the approach had incorporated a standard human-human interaction model into its engine, or discussed how the model would behave in crowded situations, it would have been more satisfying. After having read the comments of the other reviewers and the rebuttal, I downgrade the score.

* "Final rating. After reading the author's rebuttal, please rate the paper according to the following choices."

Strong Reject