--------------------------------------- PR 2016 (Accept) ---------------------------------------

------------------- Revision #1 -------------------

--------- Editor ---------
Comments: Based on these comments, it seems likely that your manuscript can meet our criteria for publication with further minor revisions. Please prepare a revised version of the paper addressing the points raised by the referee(s) and resubmit it.
Recommendation: Minor revision

--------- Assigned_Reviewer_3 ---------
Comments: This paper proposes an approach based on KCCA to perform image annotation in a semantic space. The authors have taken into account my main concerns and some major improvements have been made. In particular, some experiments have been added to justify experimentally some choices of the proposed approach. Some comparisons with existing works, which were missing, have also been added. A remaining point is that the contribution is not a major one from a theoretical perspective, since the theoretical core was proposed in a previous paper by the authors. Nevertheless, this new paper adds some interesting features to the previous approach. The paper is well written and the experiments have shown the effectiveness of the approach for image annotation. I recommend the paper be accepted.

--------- Assigned_Reviewer_4 ---------
Comments: The authors have carefully addressed my concerns, and the paper is now clearer. I have the following points for the authors' next revision.

(1) For CCA algorithms, the related CCA variants should be surveyed; to list a few: "Canonical Correlation Analysis Networks for Two-view Image Recognition", "Tensor Canonical Correlation Analysis for Multi-View Dimension Reduction", etc.

(2) As the authors claim, one important contribution of this manuscript is that it employs multimodal extensions; hence, some multiview or multimodal algorithms should be mentioned. In particular, the differences between the proposed solution and the related works should be carefully discussed. Related multiview learning works include, but are not limited to, "Multiview Hessian regularized logistic regression for action recognition", "Multi-View Intact Space Learning", "Deep Multimodal Distance Metric Learning Using Click Constraints for Image Ranking", etc.

------------------- Original submission -------------------

--------- Editor ---------
Comments: The authors should carefully address the comments provided by the reviewers. In particular, the authors should better clarify the novelty of the proposed method in comparison with [39]. Moreover, the experiments should be extended as suggested by Reviewer 3.
Recommendation: Major revision

--------- Assigned_Reviewer_2 ---------
Comments: This well-written paper proposes to use KCCA to learn a semantic space that combines image and textual descriptors in such a way that image descriptors and related textual tags fall "close" together when projected into the learnt space. Image annotation is then performed by projecting an unseen test image into the semantic space (using its corresponding visual features) and propagating the textual tags from the nearest neighbouring images onto the test image. Several different schemes, drawn from existing successful image annotation models (e.g. Tagprop, Tag Relevance, 2PKNN), are explored for the label propagation.
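A minimal sketch of the pipeline just described, under stated assumptions: the variable names (proj_visual, train_embed, train_tags) are hypothetical, a plain linear projection stands in for the kernelized KCCA mapping, and a simple frequency vote stands in for the propagation schemes (Tagprop, 2PKNN, ...) actually used.

```python
import numpy as np

def annotate(test_feat, proj_visual, train_embed, train_tags, K=5, n_out=5):
    """Project an unseen image into the learnt semantic space and propagate
    tags from its K nearest training images.
    proj_visual : (d_vis, d_sem) projection, hypothetical stand-in for KCCA.
    train_embed : (n, d_sem) training images already embedded in the space.
    train_tags  : (n, n_tags) binary tag indicator matrix."""
    z = test_feat @ proj_visual                      # embed the test image
    dists = np.linalg.norm(train_embed - z, axis=1)  # distances in the space
    nn = np.argsort(dists)[:K]                       # K nearest neighbours
    scores = train_tags[nn].sum(axis=0)              # vote per tag
    return np.argsort(scores)[::-1][:n_out]          # indices of top tags
```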
While on the downside the novelty (KCCA, Tagprop, etc. are known methods, and this paper can therefore be thought of as an incremental contribution) and the impact of the paper are somewhat limited, overall I found the work to be a good contribution to the body of knowledge on image annotation. I believe that other researchers in the field would find the work sufficiently interesting. A good review is made of previous related work that touches on important prior contributions, e.g. 2PKNN and Tagprop. Two different evaluation paradigms were also explored: the case of expert labels, and the perhaps more common scenario of noisy labels, e.g. obtained from the web. I was particularly impressed by the ability of the semantic space to increase the performance of many different image annotation models, both with expert and noisy labels, hinting at the generality of the proposed method. I would be especially curious to see if any other non-linear method for learning a multimodal embedding space, e.g. a stacked denoising autoencoder, could improve on the presented results. The evaluation itself is sound and follows previously accepted work, allowing direct comparison to existing work: for example, the evaluation on ESPGame, using standard metrics (precision@5, recall@5). In all cases the proposed method significantly increases the image annotation effectiveness. I recommend the paper be accepted with no changes.
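For reference, precision@5 and recall@5 can be computed per image as in the sketch below. This assumes per-image averaging; the annotation literature also reports per-label averages, so this is one common convention rather than the paper's exact protocol.

```python
import numpy as np

def precision_recall_at_k(predicted, ground_truth, k=5):
    """Mean per-image precision@k and recall@k.
    predicted    : list of ranked tag lists, one per test image.
    ground_truth : list of ground-truth tag sets, one per test image."""
    precisions, recalls = [], []
    for ranked, truth in zip(predicted, ground_truth):
        hits = len(set(ranked[:k]) & truth)   # correct tags in the top k
        precisions.append(hits / k)
        recalls.append(hits / len(truth) if truth else 0.0)
    return float(np.mean(precisions)), float(np.mean(recalls))
```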
--------- Assigned_Reviewer_3 ---------
Comments: This well-written paper proposes a multimodal embedding based on KCCA to improve the image representation used in the context of nearest-neighbor-based image annotation approaches. This paper extends a previous contribution of the authors [39]. My main concern relates to the novelty of this paper with respect to that previous work. Indeed, the KCCA approach to project visual and textual information into a common semantic space is a contribution of [39], as is its use for image annotation by a nearest-neighbor scheme. In fact, the present paper extends [39] mainly with a more complete and detailed related work section and with one more experimental validation, on the NUS-WIDE dataset. As a consequence, I am not convinced that the originality and novelty of the proposed approach are sufficient for a publication in Pattern Recognition. I suggest that the authors discuss this issue in more detail than is done on page 5.

In the introduction, the authors first present the task of automatic image annotation and defend nearest-neighbor approaches, compared to machine learning and deep learning approaches, by their ability to scale well with the number of labels (which can change). Nevertheless, no experimental validation of this claim is given in the rest of the paper. They also claim that images that share the same label will be closer in the semantic space than in the original visual space. This point is illustrated with an example in Figure 1 and qualitatively in Figure 3, but one can regret the lack of a deeper discussion of this claim and of more quantitative results.

Moreover, some points have to be refined in the related work section. For instance, in Section 2.2, the authors conclude by saying that the approaches described in this section only focus on the visual modality. I believe this is not true; for instance, it is not the case for ref. [51]. Moreover, some references related to multimodal embedding are missing, for instance:

Jason Weston, Samy Bengio, and Nicolas Usunier. Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21-35, October 2010. DOI: http://dx.doi.org/10.1007/s10994-010-5198-3

A. Frome, G. Corrado, J. Shlens, S. Bengio, J. Dean, M.A. Ranzato, and T. Mikolov. DeViSE: A Deep Visual-Semantic Embedding Model. In Annual Conference on Neural Information Processing Systems (NIPS), 2013.

I also have some concerns about the proposed approach itself, in addition to the concern about its novelty. Indeed, except for the visual features and the tag-feature denoising step, the largest part of the proposed approach is described in [39]. Concerning the textual features, the authors propose to use a kind of bag-of-words representation for expert labels. This kind of representation considers the different words to be independent; in particular, semantic relations between labels are not considered (for instance, the relation between the label "dog" and the label "shepherd"). In the proposed framework, the weight associated with each label is either 0 or 1. Other weighting schemes should be studied and discussed.

From my point of view, the denoising step for user-generated tags should also be better justified, in particular the choice of using the visual neighbourhood and the bias it introduces. The benefits of this approach for image annotation should also be compared with other approaches that also tackle noisy tags, such as tag ranking or tag completion [15, 20…]:

Hamid Izadinia, Bryan C. Russell, Ali Farhadi, Matthew D. Hoffman, and Aaron Hertzmann. Deep Classifiers from Image Tags in the Wild. In Proceedings of the 2015 Workshop on Community-Organized Multimodal Mining: Opportunities for Novel Solutions (MMCommons '15), ACM, pages 13-18, 2015. DOI: http://dx.doi.org/10.1145/2814815.2814821

I also want to discuss the assumption, made in Section 3.3, that similar images share common labels. From my point of view, this can be considered true when the labels are what the authors call expert labels, but in the case of free labels or tags this point should be discussed. Indeed, it has been shown in many works that image description is not the only, nor the main, motivation for tagging. Similar images share common content-based tags, but common tags can also be shared between visually different images that share the same context. The preceding denoising step is one answer to this issue, but not the only one; other kinds of social information could be used. It would be interesting to have a deeper discussion of this assumption.

Concerning the relevance functions:
* What is k in equation (12)?

Section 4 describes the experimental work. Some first remarks:
* The statistics related to the different datasets given in Table 1 could be more precise, for instance with information on the user-tag vocabulary size or on the types of tags used.
* Extensive experiments have been done on various datasets; nevertheless, I believe that a comparison of the proposed approach with state-of-the-art image annotation approaches is missing. For instance, the method proposed in [14] has obtained better results on image annotation. This comparison with the best state-of-the-art approaches to image annotation has to be done, as well as a comparison with other semantic embedding approaches.
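To make the weighting point concrete, a minimal sketch: the binary scheme matches what the reviewer describes, while the idf-style variant is just one illustrative alternative, not taken from the paper, and the helper names (vocab_index, doc_freq) are hypothetical.

```python
import numpy as np

def binary_tag_vector(image_tags, vocab_index):
    """Binary bag-of-words over the tag vocabulary: weight 1 if the label
    is present on the image, 0 otherwise.
    vocab_index : dict mapping tag -> column index (hypothetical helper)."""
    v = np.zeros(len(vocab_index))
    for tag in image_tags:
        if tag in vocab_index:
            v[vocab_index[tag]] = 1.0
    return v

def idf_weighted_tag_vector(image_tags, vocab_index, doc_freq, n_images):
    """One possible alternative weighting (idf-style): rarer labels get
    higher weight. Capturing semantic relations between labels (e.g. "dog"
    vs. "shepherd") would instead need an external resource such as WordNet."""
    v = binary_tag_vector(image_tags, vocab_index)
    idf = np.log(n_images / (1.0 + np.asarray(doc_freq, dtype=float)))
    return v * idf
```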
--------- Assigned_Reviewer_4 ---------
Comments: This paper presents a label propagation framework based on Kernel Canonical Correlation Analysis (KCCA) for automatic image annotation. It builds a latent semantic space by canonical correlation analysis between visual and textual features. Extensive experiments are conducted to evaluate the proposed solution, and significant improvements are verified. However, as the authors themselves claim, the key contribution of the work is to improve image representations using a simple multimodal embedding based on KCCA. The employed algorithm, i.e. KCCA, is an existing and well-studied algorithm. Hence, the contribution of this article seems limited from a theoretical point of view. Therefore, I do not think the novelty is sufficient for publication in this top journal.
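For reference, the kind of regularized KCCA discussed throughout these reviews can be sketched in the dual form of Hardoon et al. (2004); this is an illustrative implementation under that standard formulation, not the paper's own code.

```python
import numpy as np

def kcca(Kx, Ky, reg=1e-3, n_components=10):
    """Minimal regularized kernel CCA in the dual (after Hardoon et al.).
    Kx, Ky : (n, n) centred kernel matrices on the visual and textual views.
    Returns dual coefficients for the visual view and canonical correlations."""
    n = Kx.shape[0]
    I = np.eye(n)
    # alpha solves (Kx + reg*I)^-1 Ky (Ky + reg*I)^-1 Kx alpha = rho^2 alpha
    A = np.linalg.solve(Kx + reg * I, Ky)
    B = np.linalg.solve(Ky + reg * I, Kx)
    vals, vecs = np.linalg.eig(A @ B)
    order = np.argsort(-vals.real)[:n_components]
    alphas = vecs[:, order].real                   # dual weights, visual view
    corrs = np.sqrt(np.clip(vals.real[order], 0.0, 1.0))
    return alphas, corrs

# A new image x is embedded as kx_row @ alphas, where kx_row holds the
# kernel evaluations k(x, x_i) against the n training images.
```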