Self-Supervised Visual Representation Learning
JULAIN journal club, 19 October 2020
A Simple Framework for Contrastive Learning of Visual Representations. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, 2020. http://arxiv.org/abs/2002.05709
--> already cited more than 200 times
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. L. Jing and Y. Tian, CVPR 2019. http://arxiv.org/abs/1902.06162
Christian Schiffer (CS):
- features in latent space, useful for further downstream tasks
- contrastive learning; the idea goes back to the 1990s (2001?)
- how to define similarity?
- take images without labels
- for each input, produce views (augmented versions of the original image)
- guess which views are related to the input and which are not (classification with pseudo-labels, see the sketch below)
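A minimal sketch of the view-generation and pseudo-label idea; the augment function and shapes are assumptions here, not the paper's exact pipeline:

```python
import torch

def augment(image):
    """Hypothetical stochastic augmentation (random crop, flip, color jitter, ...)."""
    ...

def make_views(images):
    """For each of the N input images, produce two augmented views.

    Views are stacked so that view i and view i + N come from the same image;
    the index of the partner view serves as the pseudo-label for view i.
    """
    views_a = torch.stack([augment(x) for x in images])
    views_b = torch.stack([augment(x) for x in images])
    views = torch.cat([views_a, views_b], dim=0)        # shape (2N, C, H, W)
    n = len(images)
    pseudo_labels = torch.cat([torch.arange(n) + n,     # partner of view i is i + N
                               torch.arange(n)])        # partner of view i + N is i
    return views, pseudo_labels
```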
Discussion and questions
Q Stefan Kesselheim (SK): Why does this approach work so well?
- avoids overfitting by avoiding a narrow task
- the choice of data augmentation seems to be key to performance, e.g. the example in Fig. 6 about color distributions (see the augmentation pipeline sketch below)
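A sketch of a SimCLR-style augmentation pipeline in torchvision; the parameter values roughly follow the paper's defaults but are assumptions here, and the color-distortion steps are the ones the Fig. 6 discussion points at:

```python
from torchvision import transforms

# Color distortion strength (s = 1 in the paper's ImageNet setup; value assumed here).
s = 1.0
color_jitter = transforms.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)

simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(224),                           # random crop + resize
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([color_jitter], p=0.8),               # color distortion: jitter ...
    transforms.RandomGrayscale(p=0.2),                           # ... and random grayscale
    transforms.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),   # needs torchvision >= 0.8
    transforms.ToTensor(),
])
```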
Stephan Bialonski (SB) is interested in self-supervised learning for time series, where a pretext task is hard to define properly
Q Max Riedel (MR): projection head vs. ResNet representation
- Sec. 4.2 provides the intuition: the projection head may remove information present in the hidden layer before it
- Jenia Jitsev (JJ) comment: h contains information about the applied transformations, g(h) removes it; from the paper: "In particular, z = g(h) is trained to be invariant to data transformation. Thus, g can remove information that may be useful for the downstream task, such as the color or orientation of objects. By leveraging the nonlinear transformation g(·), more information can be formed and maintained in h." and "Table 3 shows h contains much more information about the transformation applied, while g(h) loses information."
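A small sketch of this h vs. z = g(h) distinction: the projection head is just a 2-layer MLP on top of the encoder output, the contrastive loss is applied to z, and downstream tasks use h. Dimensions and names are assumptions (2048 matches ResNet-50's pooled output).

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Nonlinear projection g(.) mapping the encoder output h to z = g(h).

    The contrastive loss operates on z; downstream tasks keep h, which retains
    more information about the applied transformations (Sec. 4.2 / Table 3).
    """
    def __init__(self, in_dim=2048, hidden_dim=2048, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)
```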
Q Scarlet Stadtler (SS): Section 2.2: "a batch size of 8192 gives us 16382 negative examples per positive pair from both augmentation views" - isn't this very imbalanced? Maybe I do not understand correctly, but I had the impression that it is rather difficult for NNs to learn from highly imbalanced data. How come it works so well here?
- an earlier paper (2006, see below) provides the intuition why many negative examples pin down the definition of similarity in feature space ==> negative examples are very useful to completely define the data distribution in feature space (see the NT-Xent sketch after the references)
- the more negative examples, the stronger the gradients and the more stable the training (paper on supervised contrastive learning below)
- JJ comment: Losses for positive and negative examples are different and also reweight the examples (see Table 2, page 6 for negative losses; page 2 eq. 1 for positive)
- Dimensionality Reduction by Learning an Invariant Mapping, 2006 (http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf)
- Supervised Contrastive Learning (https://arxiv.org/abs/2004.11362)
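A compact sketch of the NT-Xent loss with in-batch negatives; the paper's Eq. 1 is the reference, temperature and details are simplified assumptions:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss on 2N projected views z, where views i and i + N form a positive pair.

    All other views in the batch act as negatives, i.e. 2(N - 1) negatives per positive
    pair; with N = 8192 that is 2 * (8192 - 1) = 16382, as quoted in the question above.
    """
    two_n, n = z.shape[0], z.shape[0] // 2
    z = F.normalize(z, dim=1)                               # cosine similarity via dot products
    sim = z @ z.t() / temperature                           # (2N, 2N) similarity matrix
    self_mask = torch.eye(two_n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)                  # a view is never its own candidate
    targets = torch.cat([torch.arange(n) + n,               # pseudo-labels: index of the partner view
                         torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```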
CS reports on experience with contrastive learning and microscopic image data from brain sections - characterised more by texture than by object structure
- did not work at all, i.e. the network converges but the features learned in the hidden layer are not useful for downstream tasks at all
- instead: supervised contrastive learning (see paper above); idea: two images are similar if they come from the same class (here: coming from the same brain region), requires some labels (see the sketch below)
- here, again, many negative examples help
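A rough sketch of that supervised contrastive idea (positives = samples with the same label, here the same brain region), simplified from Khosla et al.; names, temperature, and the handling of anchors without positives are assumptions:

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Contrastive loss where all same-class samples in the batch are positives.

    z: (B, D) projected features; labels: (B,) class ids (e.g. brain region labels).
    """
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                                 # (B, B) similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim.masked_fill(self_mask, -1e9).log_softmax(dim=1)
    # average log-probability over the positives of each anchor (anchors without
    # positives in the batch simply contribute zero here)
    mean_log_prob_pos = (log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```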
SB reports on a successful student project: time series classification in sleep research with contrastive predictive coding (CPC paper: https://arxiv.org/abs/1807.03748)