--------------------------------------- CVIU 2015 (accepted) ---------------------------------------

------------------- Revision #2 -------------------

--------- Area_Editor (C. V. Jawahar) ---------

I am pleased to inform you that your manuscript referenced above has been accepted for publication in Computer Vision and Image Understanding.

------------------- Revision #1 -------------------

--------- Area_Editor (C. V. Jawahar) ---------

In the revised manuscript, the authors have done a good job of incorporating the suggestions from the first round. The paper is nearly ready for publication. Reviewer 4 had a couple of suggestions that would improve the readability of the paper. The authors are requested to address these.

Recommendation: Minor revision

--------- Assigned_Reviewer_1 ---------

The authors have addressed all my concerns.

--------- Assigned_Reviewer_4 ---------

This paper introduces a video tag processing scheme that is able to enrich video tags and localize them to keyframes. It is a revised version. Having read through the manuscript and the reviewers' comments, this reviewer finds that the authors have addressed most of the concerns well. This reviewer has only a few suggestions.

First, video tag localization is also a tag refinement task. For example, we can simply assign the video-level tags to all keyframes and then adopt existing image tag refinement methods to refine them (a minimal sketch of this baseline is given at the end of this review). The authors are suggested to discuss this fact and add a review of works related to tag refinement. Relevant references include: Image Retagging, ACM MM 2010; Assistive Tagging: A Survey of Multimedia Tagging with Human-Computer Joint Exploration, ACM Computing Surveys; ShotTagger: Tag Location for Internet Videos, ICMR 2011; Tag Ranking, WWW 2009; and the references therein.

Second, since the whole process is on the fly, the authors are suggested to discuss the computational cost of the approach for processing one video.

Finally, there are several notations without clear definitions. The notations and their definitions used throughout the paper should be tabulated.
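
To make the suggested baseline concrete, here is a minimal sketch; the function names and interfaces are this reviewer's illustration, not anything from the paper, and the refiner itself could be any of the cited image-level methods.

# Hypothetical sketch of the baseline described above: propagate the
# video-level tags to every keyframe, then run any off-the-shelf image
# tag refinement method (e.g., a neighbor-voting refiner from the cited
# works) on each frame. All names here are illustrative.

def localize_by_refinement(video_tags, keyframes, refine_image_tags):
    # video_tags: list of tag strings attached to the whole video.
    # keyframes: list of per-frame feature vectors (or frame objects).
    # refine_image_tags(frame, candidates) -> refined list of tags.
    frame_tags = {}
    for i, frame in enumerate(keyframes):
        # Every keyframe first inherits all video-level tags ...
        candidates = list(video_tags)
        # ... then an image-level refiner keeps only the tags with
        # enough visual evidence in this particular frame.
        frame_tags[i] = refine_image_tags(frame, candidates)
    return frame_tags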

------------------- Original submission -------------------

--------- Area_Editor (C. V. Jawahar) ---------

This paper investigates the important problem of annotating videos: tags are refined, and the corresponding visual content is localized. This has applications in many modern internet applications. The method is simple. The authors validate on DUT-WEBV. This work was reviewed by three reviewers, and all of them agree that the problem is interesting and that there are useful results. However, all of them also argue that this paper needs more work and better presentation to be accepted in CVIU. The authors need to clearly present the technical novelty and contributions by contrasting with past work in related areas. Experiments and applications should also be enhanced as the reviewers suggest. I hope that the authors will be able to come up with a revised version which addresses all the concerns of the reviewers. Along with the revised version, please also state clearly how the new version addresses these concerns.

Recommendation: Major revision

--------- Assigned_Reviewer_1 ---------

In this paper the authors present a method for automatic video annotation that refines and enriches the tags, as well as localizes them by associating them to keyframes. The method exploits the collective knowledge embedded in user-generated tags and the visual similarity of keyframes to images uploaded to social sites like YouTube and Flickr, as well as web image sources like Google and Bing. A large number of experimental results indicate the effectiveness of the authors' method. However, some steps of this method are brute-force, e.g., the construction of the dictionary D, which is risky. A serious problem is that the model in this paper is either existing or only simply extended.

Pros:
- Localizing tags by associating them to keyframes is meaningful, especially localizing and refining the tags jointly.
- Experimental results show that the authors' method achieves good performance.

Cons:
- The model in this paper is not novel.
- Some brute-force operations may affect the actual performance of the proposed method.

There are also some issues, as follows:
1. The Abstract does not present the authors' method for tag refinement and localization.
2. The Related work section is too long. Please condense it.
3. Page 2: "Let D ⊇ T_v be the union of all the tags of the m images in S, after that they have been filtered with the same approach used for video tags." Can this simple method filter the video tags? I suspect there is a lot of noise in the dictionary D. Please justify this.
4. Page 6: the formula is confusing; how is d(f, I_i) in Eq. (3) defined? (One possible reading is sketched below.)
5. I think Eq. (1) is not the best formulation. Why do the authors use a subtraction operation instead of a division? The latter is more common.
6. How is the parameter lambda of Eq. (4) set?
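
To make issue 4 concrete, here is one possible reading of Eq. (3) that the authors may confirm or correct: a neighbor-voting score in which frame f and tag t are connected only through the retrieved images. The distance function and all names below are this reviewer's assumptions, not definitions from the paper.

def relevance(f, t, neighbors, dist):
    # One reading of Eq. (3): the relevance R of tag t for keyframe f
    # accumulates an inverse-squared-distance vote from every retrieved
    # neighbor image I_i whose tag set T_i contains t.
    # neighbors: list of (features_i, tags_i) pairs; dist: the feature
    # distance d(f, I_i) that the paper leaves undefined (assumed here
    # to be, e.g., a Euclidean distance on visual features).
    score = 0.0
    for features_i, tags_i in neighbors:
        if t in tags_i:
            score += 1.0 / (dist(f, features_i) ** 2 + 1e-12)  # avoid /0
    return score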

--------- Assigned_Reviewer_2 ---------

This paper presents a data-driven approach for tag refinement and localization in web videos. This topic is interesting, yet only a few works have touched on it, owing to the characteristics of video tagging. The authors propose an effective scheme for annotation at a smaller granularity: frames sampled from videos. This reviewer suggests accepting the paper after major revision.

The strong points of this paper lie in three aspects:
1) The effectiveness of the scheme is verified with respect to precision and recall.
2) The motivation for this work is clear and well highlighted. The related work provides an in-depth analysis of the related topics.
3) The framework of the scheme is instructive and the formulation is sound.

However, there are two concerns on the reviewer's side:
1) The technical contribution of this paper is limited. It reads more like a technical report, or a dish made of many existing techniques.
2) The experiments need to be improved. First of all, DUT-WEBV alone is not enough to support the claims; more comparable algorithms are expected for performance comparison. Secondly, the reviewer suggests adding some experiments on real-world applications based on such frame-level annotation within videos.

--------- Assigned_Reviewer_3 ---------

This paper proposes an approach that expands video tags and assigns tags to keyframes in the video by exploiting web sources. Extensive experiments have been conducted. The approach mainly consists of the following steps: (1) Retrieve images from web sources using the existing tags of a video. (2) Associate each keyframe with the tags shared by its nearest neighbors in the retrieved set (see the sketch at the end of this review). (3) Refine the candidate tags. (4) Expand the current video tags, taking frame-level tags and temporal consistency into account.

In my opinion, this paper is a bit hard to follow and has limited novelty. My major concerns and suggestions are listed below:
1) To make the paper self-contained, it would be better for the authors to explain "lazy learning" below Figure 1 and "suggestion score" above Equation 4.
2) The notation in Sec. 3.1 is confusing: why are tags from Flickr and from (Google, Bing) denoted differently?
3) Equation 1 doesn't make much sense; how can it be used to support the claim above it? What is the trade-off between using the nearest neighbors and using the entire set?
4) Equation 3 is confusing: R is given by 1/d(f, I_i)^2 if t ∈ T_i. How are these two different things (f and t) connected?
5) In Sec. 4.1, the sentence "For each video is provided the ground truth annotation of the associated tag, that indicates the time extent in which the concept appears." is grammatically wrong.
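
To be explicit about my understanding of steps (1)-(2), here is a rough sketch. The retrieval sources are those named in the paper, but K, the vote threshold, the feature distance, and all function names are this reviewer's assumptions.

from collections import Counter

def transfer_tags(frame_feat, retrieved, dist, k=50, min_votes=3):
    # retrieved: list of (image_features, tag_set) pairs built from the
    # Flickr/Google/Bing results retrieved with the video's own tags.
    # dist: the (unspecified) feature distance; k and min_votes are
    # illustrative choices, not values from the paper.
    nearest = sorted(retrieved, key=lambda p: dist(frame_feat, p[0]))[:k]
    # A tag becomes a candidate for this frame when enough of the
    # frame's nearest retrieved images share it.
    votes = Counter(t for _, tags in nearest for t in tags)
    return {t for t, v in votes.items() if v >= min_votes}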