A Year in Computer Vision: The M Tank, 2017
Edited for The M Tank by
Benjamin F. Duffy
Daniel R. Flynn
The M Tank
Computer Vision typically refers to the scientific discipline of giving machines the ability of sight, or perhaps more colourfully, enabling machines to visually analyse their environments and the stimuli within them. This process typically involves the evaluation of an image, images or video. The British Machine Vision Association (BMVA) defines Computer Vision as “the automatic extraction, analysis and understanding of useful information from a single image or a sequence of images.”
The term understanding provides an interesting counterpoint to an otherwise mechanical definition of vision, one which serves to demonstrate both the significance and complexity of the Computer Vision field. True understanding of our environment is not achieved through visual representations alone. Rather, visual cues travel through the optic nerve to the primary visual cortex and are interpreted by the brain, in a highly stylised sense. The interpretations drawn from this sensory information encompass the near-totality of our natural programming and subjective experiences, i.e. how evolution has wired us to survive and what we learn about the world throughout our lives.
In this respect, vision only relates to the transmission of images for interpretation; while computing said images is more analogous to thought or cognition, drawing on a multitude of the brain’s faculties. Hence, many believe that Computer Vision, a true understanding of visual environments and their contexts, paves the way for future iterations of Strong Artificial Intelligence, due to its cross-domain mastery.
However, put down the pitchforks, as we’re still very much in the embryonic stages of this fascinating field. This piece simply aims to shed some light on 2016’s biggest Computer Vision advancements and, hopefully, to ground some of these advancements in a healthy mix of expected near-term societal interactions and, where applicable, tongue-in-cheek prognostications of the end of life as we know it.
While our work is always written to be as accessible as possible, sections within this particular piece may be oblique at times due to the subject matter. We do provide rudimentary definitions throughout; however, these only convey a facile understanding of key concepts. In keeping our focus on work produced in 2016, omissions are often made in the interest of brevity.
One such glaring omission relates to the functionality of Convolutional Neural Networks (hereafter CNNs or ConvNets), which are ubiquitous within the field of Computer Vision. The success of AlexNet in 2012, a CNN architecture which blindsided ImageNet competitors, proved the instigator of a de facto revolution within the field, with numerous researchers adopting neural network-based approaches as part of Computer Vision’s new period of ‘normal science’.
Over four years later and CNN variants still make up the bulk of new neural network architectures for vision tasks, with researchers reconstructing them like legos; a working testament to the power of both open source information and Deep Learning. However, an explanation of CNNs could easily span several postings and is best left to those with a deeper expertise on the subject and an affinity for making the complex understandable.
For casual readers who wish to gain a quick grounding before proceeding we recommend the first two resources below. For those who wish to go further still, we have ordered the resources below to facilitate that:
For those wishing to understand more about Neural Networks and Deep Learning in general we suggest:
As a whole this piece is disjointed and spasmodic, a reflection of the authors’ excitement and the spirit in which it was intended to be utilised, section by section. Information is partitioned using our own heuristics and judgements, a necessary compromise due to the cross-domain influence of much of the work presented.
We hope that readers benefit from our aggregation of the information here to further their own knowledge, regardless of previous experience.
From all our contributors,
The M Tank
The task of classification, when it relates to images, generally refers to assigning a label to the whole image, e.g. ‘cat’. Assuming this, Localisation may then refer to finding where the object is in said image, usually denoted by the output of some form of bounding box around the object. Current classification/localisation techniques on ImageNet have likely surpassed an ensemble of trained humans. For this reason, we place greater emphasis on subsequent sections of the blog.
Figure 1: Computer Vision Tasks
Source: Fei-Fei Li, Andrej Karpathy & Justin Johnson (2016) cs231n, Lecture 8 - Slide 8, Spatial Localization and Detection (01/02/2016). Available: http://cs231n.stanford.edu/slides/2016/winter1516_lecture8.pdf
However, the introduction of larger datasets with an increased number of classes will likely provide new metrics for progress in the near future. On that point, François Chollet, the creator of Keras, has applied new techniques, including the popular architecture Xception, to an internal Google dataset with over 350 million multi-label images containing 17,000 classes.
Figure 2: Classification/Localisation results from ILSVRC (2010-2016)
Note: ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The change in results from 2011 to 2012 results from the AlexNet submission. For a review of the challenge requirements relating to Classification and Localization see: http://www.image-net.org/challenges/LSVRC/2016/index#comp
Source: Jia Deng (2016). ILSVRC2016 object localisation: introduction, results. Slide 2. Available: http://image-net.org/challenges/talks/2016/ILSVRC2016_10_09_clsloc.pdf
Interesting takeaways from the ImageNet LSVRC (2016):
As one can imagine, the process of Object Detection does exactly that: it detects objects within images. The definition provided for object detection by the ILSVRC 2016 includes outputting bounding boxes and labels for individual objects. This differs from the classification/localisation task by applying classification and localisation to many objects instead of just a single dominant object.
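To make the "bounding boxes and labels" output concrete, the following is a minimal sketch of how detections might be represented and compared against an annotated box. The `Detection` class, the box coordinates and the scores are illustrative assumptions, and intersection over union (IoU) is used here simply because it is a common way to measure box overlap; it is not presented as the challenge's exact scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: a class label, a confidence score and a box."""
    label: str
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical detector output for one image, with face as the only class.
detections = [
    Detection("face", 0.92, (10, 10, 50, 50)),
    Detection("face", 0.40, (200, 80, 230, 120)),
]
ground_truth_box = (20, 20, 60, 60)  # hypothetical annotation

# Overlap of every detection with the annotated box.
overlaps = [iou(d.box, ground_truth_box) for d in detections]
```

A detection is usually counted as correct only when its label matches and its overlap exceeds a chosen threshold; the threshold is a benchmark-specific choice.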
Figure 3: Object Detection With Face as the Only Class
Note: Picture is an example of face detection, Object Detection of a single class. The authors cite one of the persistent issues in Object Detection to be the detection of small objects. Using small faces as a test class they explore the role of scale invariance, image resolution, and contextual reasoning.
Source: Hu and Ramanan (2016, p. 1)
One of 2016’s major trends in Object Detection was the shift towards a quicker, more efficient detection system. This was visible in approaches like YOLO, SSD and R-FCN as a move towards sharing computation on a whole image, differentiating them from the costly subnetworks associated with Fast/Faster R-CNN techniques. This is typically referred to as ‘end-to-end training/learning’ and features throughout this piece.
The rationale generally is to avoid having separate algorithms focus on their respective subproblems in isolation as this typically increases training time and can lower network accuracy. That being said this end-to-end adaptation of networks typically takes place after initial sub-network solutions and, as such, is a retrospective optimisation. However, Fast/Faster R-CNN techniques remain highly effective and are still used extensively for object detection.
YOLO9000 implements a joint training method for detection and classification, extending its prediction capabilities beyond the labelled detection data available, i.e. it is able to detect objects for which it has never seen labelled detection data. The YOLO9000 model provides real-time object detection across 9000+ categories, closing the dataset size gap between classification and detection. Additional details, pre-trained models and a video showing it in action are available here.
Figure 4: Accuracy tradeoffs in Object Detection
Note: Y-axis displays mAP (mean Average Precision) and the X-axis displays meta-architecture variability across each feature extractor (VGG, MobileNet...Inception ResNet V2). Additionally, mAP small, medium and large describe the average precision for small, medium and large objects, respectively. As such accuracy is “stratified by object size, meta-architecture and feature extractor” and “image resolution is fixed to 300”. While Faster R-CNN performs comparatively well in the above sample, it is worth noting that the meta-architecture is considerably slower than more recent approaches, such as R-FCN.
Source: Huang et al. (2016, p. 9)
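Since mAP is the headline metric in the figure above, a small sketch of how it can be computed may help. This is a simplified, non-interpolated version under stated assumptions: each detection has already been marked as a true or false positive, and the class names and scores are invented for illustration. Benchmarks such as COCO use their own interpolated variants, so the numbers here would not match an official evaluation.

```python
def average_precision(scored_hits, num_ground_truth):
    """Area under the precision-recall curve for one class.

    scored_hits: list of (confidence, is_true_positive) pairs, one per detection.
    num_ground_truth: number of annotated objects of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        if is_hit:
            true_pos += 1
            # Each hit raises recall by 1/num_ground_truth; weight that step
            # by the precision achieved at this rank.
            ap += (true_pos / rank) / num_ground_truth
    return ap


def mean_average_precision(per_class):
    """per_class maps class name -> (scored_hits, num_ground_truth)."""
    aps = [average_precision(hits, n) for hits, n in per_class.values()]
    return sum(aps) / len(aps)


# Hypothetical evaluation results for two classes.
results = {
    "face": ([(0.9, True), (0.8, False), (0.7, True)], 2),
    "car": ([(0.6, False), (0.5, True)], 1),
}
map_score = mean_average_precision(results)
```

Stratifying by object size, as the figure does, amounts to running the same computation separately on the small, medium and large subsets of the annotations.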
Huang et al. (2016) present a paper which provides an in-depth performance comparison between R-FCN, SSD and Faster R-CNN. Given the difficulty of comparing Machine Learning (ML) techniques accurately, we’d like to point to the merits of producing a standardised approach here. They view these architectures as ‘meta-architectures’ since they can be combined with different kinds of feature extractors such as ResNet or Inception.
The authors study the trade-off between accuracy and speed by varying meta-architecture, feature extractor and image resolution. The choice of feature extractor for example produces large variations between meta-architectures.
The trend of making object detection cheap and efficient while still retaining the accuracy required for real-time commercial applications, notably in autonomous driving applications, is also demonstrated by the SqueezeDet and PVANet papers. Meanwhile a Chinese company, DeepGlint, provides a good example of object detection in operation as a CCTV integration, albeit in a vaguely Orwellian manner: Video.
Results from ILSVRC and COCO Detection Challenge
COCO (Common Objects in Context) is another popular image dataset. However, it is comparatively smaller and more curated than alternatives like ImageNet, with a focus on object recognition within the broader context of scene understanding. The organizers host a yearly challenge for Object Detection, segmentation and keypoints. Detection results from both the ILSVRC and the COCO Detection Challenge are:
In review of the detection results for 2016, ImageNet stated that the ‘MSRAVC 2015 set a very high bar for performance [introduction of ResNets to competition]. Performance on all classes has improved across entries. Localization improved greatly in both challenges. High relative improvement on small object instances’ (ImageNet, 2016).
Figure 5: ILSVRC detection results from images (2013-2016)
Note: ILSVRC Object Detection results from images (DET) (2013-2016).
Source: ImageNet. 2016. [Online] Workshop Presentation, Slide 2. Available: http://image-net.org/challenges/talks/2016/ECCV2016_ilsvrc_coco_detection_segmentation.pdf
Object Tracking refers to the process of following a specific object of interest, or multiple objects, in a given scene. It traditionally has applications in video and real-world interactions where observations are made following an initial object detection; the process is crucial to autonomous driving systems, for example.
Video of GOTURN (Generic Object Tracking Using Regression Networks)
“This paper presents an investigation of the impact of deep motion features in a tracking-by-detection framework. We further show that hand-crafted, deep RGB, and deep motion features contain complementary information. To the best of our knowledge, we are the first to propose fusing appearance information with deep motion features for visual tracking. Comprehensive experiments clearly suggest that our fusion approach with deep motion features outperforms standard methods relying on appearance information alone.”
Central to Computer Vision is the process of Segmentation, which divides whole images into pixel groupings which can then be labelled and classified. Moreover, Semantic Segmentation goes further by trying to semantically understand the role of each pixel in the image e.g. is it a cat, car or some other type of class? Instance Segmentation takes this even further by segmenting different instances of classes e.g. labelling three different dogs with three different colours. It is one of a barrage of Computer Vision applications currently employed in autonomous driving technology suites.
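The difference between semantic and instance segmentation can be illustrated with label maps. In the sketch below, the class ids and the tiny mask are invented for illustration, and instances are recovered with a naive connected-components pass. That is only a stand-in: learned instance segmentation models must also separate objects that touch or overlap, which this toy approach cannot do.

```python
# Semantic mask: every pixel holds a class id (0 = background, 1 = dog).
semantic_mask = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]


def label_instances(mask, target_class):
    """Split one class of a semantic mask into 4-connected instances.

    Returns (instance_mask, count), where instance_mask holds 0 for pixels
    outside the class and 1..count for each separate region.
    """
    height, width = len(mask), len(mask[0])
    instances = [[0] * width for _ in range(height)]
    next_id = 0
    for row in range(height):
        for col in range(width):
            if mask[row][col] != target_class or instances[row][col]:
                continue
            # Found an unlabelled pixel of the class: flood-fill its region.
            next_id += 1
            instances[row][col] = next_id
            stack = [(row, col)]
            while stack:
                r, c = stack.pop()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and mask[nr][nc] == target_class
                            and instances[nr][nc] == 0):
                        instances[nr][nc] = next_id
                        stack.append((nr, nc))
    return instances, next_id


instance_mask, dog_count = label_instances(semantic_mask, target_class=1)
```

The semantic mask only says "these pixels are dog"; the instance mask additionally says which of the three dogs each pixel belongs to, which is what allows each one to be drawn in its own colour.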
Perhaps some of the best improvements in the area of segmentation come courtesy of FAIR, who continue to build upon their DeepMask work from 2015. DeepMask generates rough ‘masks’ over objects as an initial form of segmentation. In 2016, FAIR introduced SharpMask, which refines the ‘masks’ provided by DeepMask, correcting the loss of detail and improving semantic segmentation. In addition to this, MultiPathNet identifies the objects delineated by each mask.
“To capture general object shape, you have to have a high-level understanding of what you are looking at (DeepMask), but to accurately place the boundaries you need to look back at lower-level features all the way down to the pixels (SharpMask).” - Piotr Dollar, 2016.
Figure 6: Demonstration of FAIR techniques in action
Note: The above pictures demonstrate the segmentation techniques employed by FAIR. These include the application of DeepMask, SharpMask and MultiPathNet techniques which are applied in that order. This process allows accurate segmentation and classification in a variety of scenes.
Source: Dollar (2016).
Video Propagation Networks attempt to create a simple model to propagate accurate object masks, assigned at first frame, through the entire video sequence along with some additional information.
In 2016, researchers worked on finding alternative network configurations to tackle the aforementioned issues of scale and localisation. DeepLab is one such example, achieving encouraging results for semantic image segmentation tasks. Khoreva et al. (2016) build on DeepLab’s earlier work (circa 2015) and propose a weakly supervised training method which achieves comparable results to fully supervised networks.
Computer Vision further refined the approach of sharing useful information across a network through the use of end-to-end networks, which reduce the computational requirements of multiple omni-directional subtasks for classification. Two key papers using this approach are:
While ENet, a DNN architecture for real-time semantic segmentation, is not of this category, it does demonstrate the commercial merits of reducing computation costs and giving greater access to mobile devices.
Our work wishes to relate as many of these advancements back to tangible public applications as possible. With this in mind, the following contains some of the most interesting healthcare applications of segmentation in 2016:
One of our favourite quasi-medical segmentation applications is FusionNet, a deep fully residual convolutional neural network for image segmentation in connectomics benchmarked against SOTA electron microscopy (EM) segmentation methods.
Not all research in Computer Vision serves to extend the pseudo-cognitive abilities of machines, and often the fabled malleability of neural networks, as well as other ML techniques, lend themselves to a variety of other novel applications that spill into the public space. Last year’s advancements in Super-resolution, Style Transfer & Colourisation occupied that space for us.
Super-resolution refers to the process of estimating a high resolution image from a low resolution counterpart, and also the prediction of image features at different magnifications, something which the human brain can do almost effortlessly. Originally super-resolution was performed by simple techniques like bicubic interpolation and nearest neighbours. In terms of commercial applications, the desire to overcome low-resolution constraints stemming from source quality, and to realise ‘CSI Miami’ style image enhancement, has driven research in the field. Here are some of the year’s advances and their potential impact:
Figure 7: Super-resolution SRGAN example
Note: From left to right: bicubic interpolation (the objective worst performer for focus), Deep residual network optimised for MSE, deep residual generative adversarial network optimized for a loss more sensitive to human perception, original High Resolution (HR) image. Corresponding peak signal to noise ratio (PSNR) and structural similarity (SSIM) are shown in two brackets. [4 x upscaling] The reader may wish to zoom in on the middle two images (SRResNet and SRGAN) to see the difference between image smoothness vs more realistic fine details.
Source: Ledig et al. (2017)
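As a point of reference for the PSNR figures quoted in such comparisons, here is a minimal sketch of the simplest baseline (nearest-neighbour upscaling) together with a PSNR calculation. The tiny grayscale images are invented for illustration, and an 8-bit pixel range is assumed; neither is taken from the paper.

```python
import math


def upscale_nearest(image, factor):
    """Nearest-neighbour upscaling of a 2D grayscale image (list of rows)."""
    upscaled = []
    for row in image:
        # Repeat every pixel `factor` times horizontally...
        wide_row = [pixel for pixel in row for _ in range(factor)]
        # ...and every widened row `factor` times vertically.
        upscaled.extend(list(wide_row) for _ in range(factor))
    return upscaled


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    squared_errors = [
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 2x2 low-resolution input and its 4x4 high-resolution original.
low_res = [[0, 100], [200, 50]]
high_res_truth = [
    [0, 10, 100, 90],
    [0, 10, 100, 90],
    [200, 190, 50, 60],
    [200, 190, 50, 60],
]
estimate = upscale_nearest(low_res, 2)
quality_db = psnr(high_res_truth, estimate)
```

Because PSNR is driven purely by mean squared error, it rewards smooth, averaged outputs, which is one reason a perceptually sharper result can score lower on it.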
The use of Generative Adversarial Networks (GANs) represents current SOTA for Super-resolution:
Qualitatively, SRGAN performs the best. Although SRResNet performs best on the peak signal-to-noise ratio (PSNR) metric, SRGAN recovers the finer texture details and achieves the best Mean Opinion Score (MOS). “To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4× upscaling factors.” All previous approaches fail to recover the finer texture details at large upscaling factors.
Figure 8: Style Transfer from Nikulin & Novak
Note: Transferring different styles to a photo of a cat (original top left).
Source: Nikulin & Novak (2016)
Undoubtedly, Style Transfer epitomises a novel use of neural networks that has ebbed into the public domain, specifically through last year’s Facebook integrations and companies like Prisma and Artomatix. Style transfer is an older technique, but it was adapted to neural networks in 2015 with the publication of A Neural Algorithm of Artistic Style. Since then, the concept of style transfer has been expanded upon by Nikulin and Novak and also applied to video, as is the common progression within Computer Vision.
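In the neural formulation, "style" is commonly summarised by Gram matrices of a network's feature maps, i.e. the correlations between channels with spatial layout discarded. The sketch below shows that idea on hand-written feature values; the numbers are invented, and a real system would take the features from the layers of a trained CNN rather than from literals.

```python
def gram_matrix(features):
    """Channel-by-channel correlations of a set of feature maps.

    features: list of channels, each a flat list of activations over the
    same spatial positions.
    """
    return [
        [sum(a * b for a, b in zip(chan_i, chan_j)) for chan_j in features]
        for chan_i in features
    ]


def style_distance(features_a, features_b):
    """Mean squared difference between the two Gram matrices."""
    gram_a, gram_b = gram_matrix(features_a), gram_matrix(features_b)
    n = len(gram_a)
    return sum(
        (gram_a[i][j] - gram_b[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)


# Two channels over three spatial positions (hypothetical activations).
features = [[1, 0, 2], [0, 1, 1]]
# The same activations with the spatial positions shuffled consistently.
shuffled = [[2, 1, 0], [1, 0, 1]]
# A genuinely different set of activations.
other = [[1, 1, 1], [0, 0, 0]]

same_style = style_distance(features, shuffled)
different_style = style_distance(features, other)
```

Shuffling positions leaves the Gram matrix unchanged, which is why this statistic captures texture and colour relationships rather than the arrangement of content.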
Figure 9: Further examples of Style Transfer
Note: The top row (left to right) represents the artistic styles which are transposed onto the original images, which are displayed in the first column (Woman, Golden Gate Bridge and Meadow Environment). Using conditional instance normalisation, a single style transfer network can capture 32 styles simultaneously, five of which are displayed here. The full suite of images is available in the source paper’s appendix. This work will feature in the International Conference on Learning Representations (ICLR) 2017.
Source: Dumoulin et al. (2017, p. 2)
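Conditional instance normalisation can be sketched in a few lines: activations are normalised per image and per channel, then scaled and shifted with parameters selected by the style index, so that only those two small parameter sets differ between styles. The code below handles a single channel, and the feature values, `gammas` and `betas` are hypothetical rather than learned.

```python
import math


def conditional_instance_norm(feature_map, style_index, gammas, betas, eps=1e-5):
    """Normalise one channel of one image, then scale and shift per style.

    feature_map: 2D list of activations for a single channel.
    gammas, betas: per-style scale and shift values for this channel.
    """
    values = [v for row in feature_map for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance + eps)
    gamma, beta = gammas[style_index], betas[style_index]
    return [[gamma * (v - mean) / std + beta for v in row] for row in feature_map]


feature_map = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical activations
gammas = [1.0, 2.0]                     # one scale per style
betas = [0.0, 0.5]                      # one shift per style

styled_a = conditional_instance_norm(feature_map, 0, gammas, betas)
styled_b = conditional_instance_norm(feature_map, 1, gammas, betas)
```

The convolutional weights are shared across all styles in this scheme, which is what lets one network hold many styles at little extra cost.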
Style transfer as a topic is fairly intuitive once visualised; take an image and imagine it with the stylistic features of a different image, for example in the style of a famous painting or artist. This year Facebook released Caffe2Go, their deep learning system which integrates into mobile devices. Google also released some interesting work which sought to blend multiple styles to generate entirely unique image styles: Research blog and full paper.
Besides mobile integrations, style transfer has applications in the creation of game assets. Members of our team recently saw a presentation by the Founder and CTO of Artomatix, Eric Risser, who discussed the technique’s novel application for content generation in games (texture mutation, etc.), which dramatically reduces the work of a conventional texture artist.
Colourisation is the process of changing monochrome images to new full-colour versions. Originally this was done manually by people who painstakingly selected colours to represent specific pixels in each image. In 2016, it became possible to automate this process while maintaining the appearance of realism indicative of the human-centric colourisation process. While humans may not accurately represent the true colours of a given scene, their real world knowledge allows the application of colours in a way which is consistent with the image and another person viewing said image.
The process of colourisation is interesting in that the network assigns the most likely colouring for images based on its understanding of object location, textures and environment, e.g. it learns that skin is pinkish and the sky is blueish.
Three of the most influential works of the year are as follows:
Figure 10: Comparison of Colourisation Research
Note: From top to bottom - column one contains the original monochrome image input which is subsequently colourised through various techniques. The remaining columns display the results generated by other prominent colourisation research in 2016. When viewed from left to right, these are Larsson et al. 2016 (column two), Zhang et al. 2016 (column three), and Iizuka, Simo-Serra and Ishikawa 2016, also referred to as “ours” by the authors (column four). The quality difference in colourisation is most evident in row three (from the top), which depicts a group of young boys. We believe Iizuka et al.’s work to be qualitatively superior (column four).
Source: Iizuka et al. 2016
“Furthermore, our architecture can process images of any resolution, unlike most existing approaches based on CNN.”
In a test to see how natural their colourisation was, users were given a random image from their models and were asked, "does this image look natural to you?"
Their approach achieved 92.6%, the baseline achieved roughly 70%, and the ground truth (the actual colour photos) was considered natural 97.7% of the time.
The task of action recognition refers to both the classification of an action within a given video frame and, more recently, algorithms which can predict the likely outcomes of interactions given only a few frames before the action takes place. In this respect we see recent research attempt to embed context into algorithmic decisions, similar to other areas of Computer Vision. Some key papers in this space are:
“We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 (92.7%) and HMDB51 (67.2%).”
“Each stream initially performs video recognition on its own and for final classification, softmax scores are combined by late fusion. To date, this approach is the most effective approach of applying deep learning to action recognition, especially with limited training data. In our work we directly convert image ConvNets into 3D architectures and show greatly improved performance over the two-stream baseline.” - 94% on UCF101 and 70.6% on HMDB51. Feichtenhofer et al. made improvements over traditional improved dense trajectory (iDT) methods and generated better results through use of both techniques.
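The late fusion step described in the quote can be sketched as follows: each stream produces its own class scores, these are turned into softmax probabilities, and the two probability vectors are averaged. The class names, logits and equal weighting are illustrative assumptions and are not values from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to one."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the two streams' softmax scores."""
    spatial = softmax(spatial_logits)
    temporal = softmax(temporal_logits)
    w = temporal_weight
    return [(1 - w) * s + w * t for s, t in zip(spatial, temporal)]


classes = ["walk", "run", "jump"]
spatial_logits = [2.0, 1.9, 0.1]   # hypothetical appearance (RGB) stream
temporal_logits = [0.5, 3.0, 0.2]  # hypothetical motion (optical flow) stream

fused = late_fusion(spatial_logits, temporal_logits)
predicted = classes[fused.index(max(fused))]
spatial_only = classes[spatial_logits.index(max(spatial_logits))]
```

In this toy case the appearance stream alone slightly prefers "walk", while the motion stream tips the fused decision to "run", which is the kind of complementary evidence that combining streams is meant to exploit.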
"The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects and actions".
“A key goal of Computer Vision is to recover the underlying 3D structure from 2D observations of the world.” - Rezende et al. (2016, p. 1)
In Computer Vision, the classification of scenes, objects and activities, along with the output of bounding boxes and image segmentation is, as we have seen, the focus of much new research. In essence, these approaches apply computation to gain an ‘understanding’ of the 2D space of an image. However, detractors note that a 3D understanding is imperative for systems to successfully interpret, and navigate, the real world.
For instance, a network may locate a cat in an image, colour all of its pixels and classify it as a cat. But does the network fully understand where the cat in the image is, in the context of the cat’s environment?
One could argue that the computer learns very little about the 3D world from the above tasks. Contrary to this, humans understand the world in 3D even when examining 2D pictures, i.e. perspective, occlusion, depth, how objects in a scene are related, etc. Imparting these 3D representations and their associated knowledge to artificial systems represents one of the next great frontiers of Computer Vision. A major reason for thinking this is that, generally:
“the 2D projection of a scene is a complex function of the attributes and positions of the camera, lights and objects that make up the scene. If endowed with 3D understanding, agents can abstract away from this complexity to form stable, disentangled representations, e.g., recognizing that a chair is a chair whether seen from above or from the side, under different lighting conditions, or under partial occlusion.”
However, 3D understanding has traditionally faced several impediments. The first concerns the problem of both ‘self and normal occlusion’ along with the numerous 3D shapes which fit a given 2D representation. Understanding problems are further compounded by the inability to map different images of the same structures to the same 3D space, and in the handling of the multi-modality of these representations. Finally, ground-truth 3D datasets were traditionally quite expensive and difficult to obtain which, when coupled with divergent approaches for representing 3D structures, may have led to training limitations.
We feel that the work being conducted in this space is important to be mindful of, from the embryonic, albeit titillating, early theoretical applications for future AGI systems and robotics to the immersive, captivating applications in augmented, virtual and mixed reality which will affect our societies in the near future. We cautiously predict exponential growth in this area of Computer Vision, as a result of lucrative commercial applications, which means that soon computers may start reasoning about the world rather than just about pixels.
This first section is a tad scattered, acting as a catch-all for computation applied to objects represented with 3D data, inference of 3D object shape from 2D images and Pose Estimation, i.e. determining the transformation of an object’s 3D pose from 2D images. The process of reconstruction also creeps in ahead of the following section which deals with it explicitly. However, with these points in mind, we present the work which excited our team the most in this general area:
Figure 11: Example of 3D-R2N2 functionality
Note: Images taken from Ebay (left) and an overview of the functionality of 3D-R2N2 (right).
Note from source: Some sample images of the objects we [the authors] wish to reconstruct - notice that views are separated by a large baseline and objects’ appearance shows little texture and/or are non-lambertian. (b) An overview of our proposed 3D-R2N2: The network takes a sequence of images (or just one image) from arbitrary (uncalibrated) viewpoints as input (in this example, 3 views of the armchair) and generates voxelized 3D reconstruction as an output. The reconstruction is incrementally refined as the network sees more views of the object.
Source: Choy et al. (2016, p. 3)
3D-R2N2 generates ‘rendered images and voxelized models’ using ShapeNet models and facilitates 3D object reconstruction where structure from motion (SfM) and simultaneous localisation and mapping (SLAM) approaches typically fail:
“Our extensive experimental analysis shows that our reconstruction framework i) outperforms the state-of-the-art methods for single view reconstruction, and ii) enables the 3D reconstruction of objects in situations when traditional SFM/SLAM methods fail.”
Figure 12: PrGAN architecture segment
Note from source: The PrGAN architecture for generating 2D images of shapes. A 3D voxel representation (32³) and viewpoint are independently generated from the input z (201-d vector). The projection module renders the voxel shape from a given viewpoint (θ, φ) to create an image. The discriminator consists of 2D convolutional and pooling layers and aims to classify if the input image is generated or real.
Source: Gadelha et al. (2016, p. 3)
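A projection module of this kind can be sketched as a function that collapses a voxel grid into a 2D image. The version below is a deliberately simplified assumption: it always looks straight along the depth axis and omits the viewpoint rotation by (θ, φ), and the exponential line-integral formula is one simple smooth choice rather than a confirmed detail of the paper.

```python
import math


def project_voxels(voxels):
    """Collapse a D x H x W occupancy grid along depth into an H x W image.

    Each pixel is 1 - exp(-sum of occupancies along its ray), so empty rays
    give 0 and densely occupied rays approach 1.
    """
    depth = len(voxels)
    height, width = len(voxels[0]), len(voxels[0][0])
    return [
        [
            1.0 - math.exp(-sum(voxels[d][r][c] for d in range(depth)))
            for c in range(width)
        ]
        for r in range(height)
    ]


# Hypothetical 2 x 2 x 2 occupancy grid (1 = filled voxel).
voxels = [
    [[1, 0], [0, 0]],  # front slice
    [[1, 0], [0, 1]],  # back slice
]
silhouette = project_voxels(voxels)
```

Because the output varies smoothly with the occupancy values, a discriminator's judgement of the 2D image can be propagated back to the 3D shape, which is what allows training from images alone.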
In this way the inference ability is learned through an unsupervised environment:
“The addition of a projection module allows us to infer the underlying 3D shape distribution without using any 3D, viewpoint information, or annotation during the learning phase.”
Additionally, the internal representation of the shapes can be interpolated, meaning discrete commonalities in voxel shapes allow transformations from object to object, e.g. from car to aeroplane.
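Interpolating between internal representations usually means walking along a line between two latent vectors and decoding each point. A minimal sketch follows, with two-dimensional codes invented for illustration; the decoding step, which would pass each point through the trained generator, is assumed and not shown.

```python
def interpolate(latent_a, latent_b, steps):
    """Evenly spaced points on the line from latent_a to latent_b, inclusive."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(latent_a, latent_b)])
    return path


car_code = [0.0, 1.0]     # hypothetical latent vector that decodes to a car
plane_code = [1.0, -1.0]  # hypothetical latent vector that decodes to a plane
path = interpolate(car_code, plane_code, steps=3)
```

Feeding each point of `path` to the generator would yield a sequence of voxel shapes that morphs from the first object into the second.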
DeepMind’s strong generative model runs on both volumetric and mesh-based representations. The use of Mesh-based representations with OpenGL allows more knowledge to be built in, e.g. how light affects the scene and the materials used. “Using a 3D mesh-based representation and training with a fully-fledged black-box renderer in the loop enables learning of the interactions between an object’s colours, materials and textures, positions of lights, and of other objects.”
The models are of high quality, capture uncertainty and are amenable to probabilistic inference, allowing for applications in 3D generation and simulation. The team achieve the first quantitative benchmark for 3D density modelling on 3D MNIST and ShapeNet. This approach demonstrates that models may be trained end-to-end unsupervised on 2D images, requiring no ground-truth 3D labels.
Human Pose Estimation attempts to find the orientation and configuration of human body parts. 2D Human Pose Estimation, or Keypoint Detection, generally refers to localising body parts of humans, e.g. finding the 2D location of the knees, eyes, feet, etc.
However, 3D Pose Estimation takes this even further by finding the orientation of the body parts in 3D space and then an optional step of shape estimation/modelling can be performed. There has been a tremendous amount of improvement across these sub-domains in the last few years.
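For the 2D case, many keypoint detectors output one heatmap per body part and read the keypoint off its peak. The sketch below assumes that output format, and the part names and heatmap values are invented for illustration.

```python
def heatmap_to_keypoint(heatmap):
    """Return ((x, y), confidence) of the strongest response in a 2D heatmap."""
    best_value, best_x, best_y = float("-inf"), 0, 0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_x, best_y = value, x, y
    return (best_x, best_y), best_value


# Hypothetical per-part heatmaps for one person (rows are y, columns are x).
heatmaps = {
    "left_knee": [[0.0, 0.1, 0.0], [0.2, 0.9, 0.1]],
    "right_eye": [[0.7, 0.1, 0.0], [0.0, 0.0, 0.0]],
}
keypoints = {part: heatmap_to_keypoint(hm) for part, hm in heatmaps.items()}
```

The confidence value lets downstream code discard parts that are occluded or outside the frame; a 3D method would then lift these 2D locations into 3D space.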
In terms of competitive evaluation, “the COCO 2016 Keypoint Challenge involves simultaneously detecting people and localizing their keypoints”. The European Conference on Computer Vision (ECCV) provides more extensive literature on these subjects; however, we would like to highlight:
“We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D”.
As mentioned, a previous section presented some examples of reconstruction but with a general focus on objects, specifically their shape and pose. While some of this is technically reconstruction, the field itself comprises many different types of reconstruction, e.g. scene reconstruction, multi-view and single view reconstruction, structure from motion (SfM), SLAM, etc. Furthermore, some reconstruction approaches leverage additional (and multiple) sensors and equipment, such as Event or RGB-D cameras, and can often layer multiple techniques to drive progress.
The result? Whole scenes can be reconstructed non-rigidly and change spatio-temporally, e.g. a high-fidelity reconstruction of yourself, and your movements, updated in real-time.
As identified previously, issues persist around the mapping of 2D images to 3D space. The following papers present a plethora of approaches to create high-fidelity, real-time reconstructions:
Figure 13: Fusion4D examples from real-time feed
Note from source: “We present a new method for real-time high quality 4D (i.e. spatio-temporally coherent) performance capture, allowing for incremental non-rigid reconstruction from noisy input from multiple RGBD cameras. Our system demonstrates unprecedented reconstructions of challenging non-rigid sequences, at real-time rates, including robust handling of large frame-to-frame motions and topology changes.”
Source: Dou et al. (2016, p. 1)
Fusion4D creates real-time, high fidelity voxel representations which have impressive applications in virtual reality, augmented reality and telepresence. This work from Microsoft will likely revolutionise motion capture, possibly for live sports. An example of the technology in real-time use is available here: Video 
Figure 14: Examples of the Real-Time 3D Reconstruction
Note from source: Demonstrations in various settings of the different aspects of our joint estimation algorithm. (a) visualisation of the input event stream; (b) estimated gradient keyframes; (c) reconstructed intensity keyframes with super resolution and high dynamic range properties; (d) estimated depth maps; (e) semi-dense 3D point clouds.
Source: Kim et al. (2016, p. 12)
The Event camera is gaining favour with researchers in Computer Vision due to its reduced latency, lower power consumption and higher dynamic range when compared to traditional cameras. Instead of the sequence of frames output by a regular camera, the event camera outputs “a stream of asynchronous spikes, each with pixel location, sign and precise timing, indicating when individual pixels record a threshold log intensity change.”
This approach is incredibly impressive when one considers the real-time image rendering and depth estimation involved using a single view-point:
“We propose a method which can perform real-time 3D reconstruction from a single hand-held event camera with no additional sensing, and works in unstructured scenes of which it has no prior knowledge.”
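The event stream described above can be sketched as a simple data structure. The following is a minimal illustration only, not the method of Kim et al.: the `Event` class, the `accumulate` helper and the sample values are hypothetical, and the naive per-pixel summation merely shows what "pixel location, sign and precise timing" look like in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One asynchronous spike from a (hypothetical) event camera."""
    x: int         # pixel column
    y: int         # pixel row
    polarity: int  # +1 if log intensity rose past the threshold, -1 if it fell
    t: float       # timestamp in seconds


def accumulate(events, width, height, t_start, t_end):
    """Sum event polarities per pixel over the half-open window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for e in events:
        if t_start <= e.t < t_end:
            frame[e.y][e.x] += e.polarity
    return frame


# Illustrative values only: four events on a 2x2 sensor.
events = [
    Event(x=0, y=0, polarity=+1, t=0.001),
    Event(x=1, y=0, polarity=-1, t=0.002),
    Event(x=0, y=0, polarity=+1, t=0.003),
    Event(x=1, y=1, polarity=+1, t=0.020),
]

frame = accumulate(events, width=2, height=2, t_start=0.0, t_end=0.010)
print(frame)  # [[2, -1], [0, 0]]
```

Note that, unlike a regular camera, nothing is emitted for pixels whose brightness does not change, which is one intuition for the lower latency and power consumption mentioned above.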
“Given a single photo of a room and a large database of furniture CAD models, our goal is to reconstruct a scene that is as similar as possible to the scene depicted in the photograph, and composed of objects drawn from the database.”
The authors present an automatic system which ‘iteratively optimizes object placements and scales’ to best match input from real images. The rendered scenes are validated against the original images using metrics trained with deep CNNs.
Figure 15: Example of IM2CAD rendering bedroom scene
Note: Left: input image. Right: automatically created CAD model from input.
Note from source: The reconstruction results. In each example the left image is the real input image and the right image is the rendered 3D CAD model produced by IM2CAD.
Source: Izadinia et al. (2016, p. 10) 
Why care about IM2CAD?
The authors’ work is one of the first meaningful advancements on the techniques demonstrated by Lawrence Roberts in 1963, which allowed inference of a 3D scene from a photo using a known-object database, albeit in the very simple case of line drawings.
“While Robert’s method was visionary, more than a half century of subsequent research in Computer Vision has still not yet led to practical extensions of his approach that work reliably on realistic images and scenes.”
The authors introduce a variant of the problem, aiming to reconstruct a high fidelity scene from a photo using ‘objects taken from a database of 3D object models’ for reconstruction.
The process behind IM2CAD is quite involved and includes:
Again in this domain, ShapeNet proves invaluable:
“First, we leverage ShapeNet, which contains millions of 3D models of objects, including thousands of different chairs, tables, and other household items. This dataset is a game changer for 3D scene understanding research, and was key to enabling our work.”
The term homography comes from projective geometry and refers to a type of transformation that maps one plane to another. ‘Estimating a 2D homography from a pair of images is a fundamental task in computer vision, and an essential part of monocular SLAM systems’.
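As a small illustration of what "maps one plane to another" means, a 2D homography can be written as a 3x3 matrix applied to points in homogeneous coordinates. The sketch below is generic projective geometry rather than anything specific to the paper; the function name and the example matrices are our own.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (given as nested lists)."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by w to return from homogeneous to ordinary image coordinates.
    return (xh / w, yh / w)


# A pure translation is a special case of a homography.
H_translate = [[1, 0, 5],
               [0, 1, -2],
               [0, 0, 1]]

# A non-trivial bottom row introduces perspective distortion.
H_projective = [[1, 0, 0],
                [0, 1, 0],
                [0.5, 0, 1]]

print(apply_homography(H_translate, (1, 1)))   # (6.0, -1.0)
print(apply_homography(H_projective, (2, 4)))  # (1.0, 2.0)
```

The matrix has nine entries but is only defined up to scale, which is why a homography has eight degrees of freedom and can be pinned down by four point correspondences.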
The authors also provide a method for producing a “seemingly infinite dataset” from existing datasets of real images such as MS-COCO, which offsets some of the data requirements of deeper networks. They manage to create “a nearly unlimited number of labeled training examples by applying random projective transformations to a large image dataset”.
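One way to picture this kind of data generation is sketched below. It assumes a particular recipe, namely perturbing the four corners of an image patch by random offsets and using those eight offsets as the training label; the function name, patch size and shift range are hypothetical choices for illustration, and the actual image warping step (which would need an image library) is left out.

```python
import random


def random_corner_perturbation(top_left, patch_size, max_shift, rng):
    """Return patch corners, randomly perturbed corners and the 8-value label."""
    x0, y0 = top_left
    corners = [
        (x0, y0),
        (x0 + patch_size, y0),
        (x0 + patch_size, y0 + patch_size),
        (x0, y0 + patch_size),
    ]
    # Each corner moves independently by up to max_shift pixels per axis.
    offsets = [
        (rng.randint(-max_shift, max_shift), rng.randint(-max_shift, max_shift))
        for _ in corners
    ]
    warped = [(cx + dx, cy + dy) for (cx, cy), (dx, dy) in zip(corners, offsets)]
    # Flatten the four (dx, dy) pairs into one label vector of length 8.
    label = [value for offset in offsets for value in offset]
    return corners, warped, label


rng = random.Random(0)  # seeded only so the sketch is repeatable
corners, warped, label = random_corner_perturbation(
    top_left=(32, 32), patch_size=128, max_shift=16, rng=rng
)
print(len(label))  # 8
```

Because the four corner correspondences determine a homography, every call yields a fresh labelled example from the same source image, which is the sense in which the supply of training data becomes nearly unlimited.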
“In this work, we build upon the 2D transformation layers originally proposed in the spatial transformer networks and provide various novel extensions that perform geometric transformations which are often used in geometric computer vision.”
“This opens up applications in learning invariance to 3D geometric transformation for place recognition, end-to-end visual odometry, depth estimation and unsupervised learning through warping with a parametric transformation for image reconstruction error.”
Throughout this section we cut a swath across the field of 3D understanding, focusing primarily on the areas of Pose Estimation, Reconstruction, Depth Estimation and Homography. But there is considerably more superb work which will go unmentioned by us, constrained as we are by volume. And so, we hope to have provided the reader with a valuable starting point, though by no means an exhaustive one.
A large portion of the highlighted work may be classified under Geometric Vision, which generally deals with measuring real-world quantities like distances, shapes, areas and volumes directly from images. Our heuristic is that recognition-based tasks focus more on higher level semantic information than typically concerns applications in Geometric Vision. However, we often find that many of these different areas of 3D understanding are inextricably linked.
One of the largest Geometric problems is that of simultaneous localisation and mapping (SLAM), with researchers considering whether SLAM will be among the next problems tackled by Deep Learning. Skeptics of the so-called ‘universality’ of deep learning, of which there are many, point to the importance and functionality of SLAM as an algorithm:
“Visual SLAM algorithms are able to simultaneously build 3D maps of the world while tracking the location and orientation of the camera.” The geometric estimation portion of the SLAM approach is not currently suited to deep learning approaches and end-to-end learning remains unlikely. SLAM represents one of the most important algorithms in robotics and was designed with large input from the Computer Vision field. The technique has found its home in applications like Google Maps, autonomous vehicles, AR devices like Google Tango and even the Mars Rover.
That being said, Tomasz Malisiewicz delivers the anecdotal aggregate opinion of some prominent researchers on the issue, who agree “that semantics are necessary to build bigger and better SLAM systems.” This potentially shows promise for future applications of Deep Learning in the SLAM domain.
We reached out to Mark Cummins, co-founder of Plink and Pointy, who provided us with his thoughts on the issue. Mark completed his PhD on SLAM techniques:
“The core geometric estimation part of SLAM is pretty well solved by the current approaches, but the high-level semantics and the lower-level system components can all benefit from deep learning. In particular:
Overall the structure of SLAM solvers probably remains the same, but the components improve. It is possible to imagine doing something radically new with deep learning, like throwing away the geometry entirely and have a more recognition-based navigation system. But for systems where the goal is a precise geometric map, deep learning in SLAM is likely more about improving components than doing something completely new.”
In summation, we believe that SLAM is not likely to be completely replaced by Deep Learning. However, it is entirely likely that the two approaches may become complements to each other going forward. If you wish to learn more about SLAM, and its current SOTA, we wholeheartedly recommend Tomasz Malisiewicz’s blog for that task: The Future of Real-Time SLAM and Deep Learning vs SLAM
“DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters”.
Figure 16: Example of DenseNet Architecture
Note: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input.
Source: Huang et al. (2016)
The model was evaluated on CIFAR-10, CIFAR-100, SVHN and ImageNet; it achieved SOTA on a number of them. Impressively, DenseNets achieve these results while using less memory and with reduced computational requirements. Multiple implementations (Keras, TensorFlow, etc.) are available.
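The dense connectivity in Figure 16 implies some simple channel arithmetic: if every layer adds k feature-maps and receives all preceding ones, the input to each successive layer grows linearly. The sketch below works this out for a growth rate of k = 4 as in the figure; the function names and the initial channel count of 6 are assumptions chosen purely for illustration.

```python
def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Input channel count seen by each layer of a dense block, in order."""
    # Layer i receives the block input plus the outputs of the i layers before it.
    return [initial_channels + growth_rate * i for i in range(num_layers)]


def dense_block_output_channels(initial_channels, growth_rate, num_layers):
    """Channels leaving the block: the input plus every layer's new feature-maps."""
    return initial_channels + growth_rate * num_layers


# Growth rate k = 4 as in Figure 16; 6 input channels is an arbitrary assumption.
print(dense_block_input_channels(6, 4, 5))   # [6, 10, 14, 18, 22]
print(dense_block_output_channels(6, 4, 5))  # 26
```

Each layer only has to produce k new feature-maps because earlier ones are reused rather than relearned, which is one intuition behind the reduced parameter count claimed in the quote above.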
“FractalNets repeatedly combine several parallel layer sequences with different numbers of convolutional blocks to obtain a large nominal depth, while maintaining many short paths in the network”.
The network achieved SOTA performance on CIFAR and ImageNet, while demonstrating some additional properties. For instance, the authors call into question the role of residuals in the success of extremely deep ConvNets, while also providing insight into the nature of answers attained by various subnetwork depths.
“In this work, we present a very simple fully convolutional network architecture of 13 layers, with minimum reliance on new features which outperforms almost all deeper architectures with 2 to 25 times fewer parameters. Our architecture can be a very good candidate for many scenarios, especially for use in embedded devices.”
“It can be furthermore compressed using methods such as DeepCompression and thus its memory consumption can be decreased drastically. We intentionally tried to create a mother architecture with minimum reliance on new features proposed recently, to show the effectiveness of a well-crafted yet simple convolutional architecture which can then later be enhanced with existing or new methods presented in the literature.”
Here are some additional techniques which complement ConvNet Architectures:
The Rectified Linear Unit (ReLU) is currently the dominant activation function in deep Neural Networks. However, here are some recent alternatives:
Moving towards equivariance in ConvNets
ConvNets are translation invariant, meaning they can identify the same features in multiple parts of an image. However, the typical CNN isn’t rotation invariant, meaning that if a feature or the whole image is rotated then the network’s performance suffers. Usually ConvNets learn to (sort of) deal with rotation through data augmentation (e.g. purposefully rotating the images by small random amounts during training). The network thereby gains slight rotation-invariant properties without rotation invariance being specifically designed into it, which means that rotation invariance is fundamentally limited in networks using current techniques. This is an interesting parallel with humans, who also typically fare worse at recognising characters upside down, although there is no reason for machines to suffer this limitation.
The following papers tackle rotation-invariant ConvNets. While each approach has novelties, they all improve rotation invariance through more efficient parameter usage leading to eventual global rotation equivariance:
“To improve the statistical efficiency of machine learning methods, many have sought to learn invariant representations. In deep learning, however, intermediate layers should not be fully invariant, because the relative pose of local features must be preserved for further layers. Thus, one is led to the idea of equivariance: a network is equivariant if the representations it produces transform in a predictable linear manner under transformations of the input. In other words, equivariant networks produce representations that are steerable. Steerability makes it possible to apply filters not just in every position (as in a standard convolution layer), but in every pose, thus allowing for increased parameter sharing.”
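The notion of equivariance in the quote can be checked on a toy example. The sketch below is a 1D analogy of our own, not code from any of the papers: a circular cross-correlation stands in for a convolution layer, a circular shift stands in for translation, and reversing the signal stands in for a 180° rotation. All function names and values are illustrative.

```python
def circular_correlate(signal, kernel):
    """1D cross-correlation with wrap-around, as a toy convolution layer."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i + j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]


def shift(seq, s):
    """Circularly shift a list to the right by s positions."""
    s %= len(seq)
    return seq[-s:] + seq[:-s] if s else list(seq)


signal = [0, 1, 3, 2, 0, 0]
kernel = [1, 0, -1]  # a simple asymmetric edge-like filter

# Translation: filtering a shifted input equals shifting the filtered output.
lhs = circular_correlate(shift(signal, 2), kernel)
rhs = shift(circular_correlate(signal, kernel), 2)
print(lhs == rhs)  # True

# "Rotation" (here a flip): the same property does not hold for this filter.
flipped_then_filtered = circular_correlate(signal[::-1], kernel)
filtered_then_flipped = circular_correlate(signal, kernel)[::-1]
print(flipped_then_filtered == filtered_then_flipped)  # False
```

The first check is the "predictable" transformation the quote refers to. The second shows why an ordinary filter bank needs either augmented data or a purpose-built architecture to cope with rotated inputs.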
Figure 17: Test-Error Rates on CIFAR Datasets
Note: Yellow highlight indicates that these papers feature within this piece. Pre-resnet refers to "Identity Mappings in Deep Residual Networks" (see following section). Furthermore, while not included in the table we believe that “Learning Identity Mappings with Residual Gates” produced some of the lowest error rates of 2016 with 3.65% and 18.27% on CIFAR-10 and CIFAR-100, respectively.
Source: Abdi and Nahavandi (2016, p. 6)
Residual Networks and their variants became incredibly popular in 2016, following the success of Microsoft’s ResNet, with many open source versions and pre-trained models now available. In 2015, ResNet won 1st place in ImageNet’s Detection, Localisation and Classification tasks as well as in COCO’s Detection and Segmentation challenges. Although questions still abound about depth, ResNets’ tackling of the vanishing gradient problem provided more impetus for the “increased depth produces superior abstraction” philosophy which underpins much of Deep Learning at present.
ResNets are often conceptualised as an ensemble of shallower networks, which somewhat counteract the hierarchical nature of Deep Neural Networks (DNNs) by running shortcut connections parallel to their convolutional layers. These shortcuts or skip connections mitigate vanishing/exploding gradient problems associated with DNNs, by allowing easier back-propagation of gradients throughout the network layers. For more information, see the Quora thread “What is an intuitive explanation of Deep Residual Networks?” (Quora, 2017).
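The shortcut idea can be reduced to a single line: the block outputs its input plus whatever the layers inside it compute. The sketch below is a deliberately stripped-down illustration on plain lists; the function names and the stand-in transforms are hypothetical and bear no relation to any specific ResNet implementation.

```python
def residual_block(x, transform):
    """Return x + F(x), where F is the block's learned transform."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("shortcut requires matching input and output sizes")
    return [xi + fi for xi, fi in zip(x, fx)]


def zero_transform(x):
    """Stand-in for layers that have learned nothing: F(x) = 0."""
    return [0.0 for _ in x]


def scale_and_relu(x):
    """Stand-in for a small learned layer: scale by 0.5, then apply ReLU."""
    return [max(0.0, 0.5 * v) for v in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_transform))  # [1.0, -2.0, 3.0]
print(residual_block(x, scale_and_relu))  # [1.5, -2.0, 4.5]
```

With a zero transform the block is exactly the identity, so stacking more blocks cannot make the representation worse by default. Since the output is x + F(x), the gradient with respect to x always contains an identity term, which is the "easier back-propagation" described above.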
Residual Learning, Theory and Improvements
Other residual theory and improvements
Although a relatively recent idea, there is quite a considerable body of work being created around ResNets presently. The following represents some additional theories and improvements which we wished to highlight for interested readers:
The significance of rich datasets for all facets of machine learning cannot be overstated. Hence, we feel it is prudent to include some of the largest advancements in this domain. To paraphrase Ben Hamner, the CTO and co-founder of Kaggle, ‘a new dataset can make a thousand papers flourish’, that is to say the availability of data can promote new approaches, as well as breathe new life into previously ineffectual techniques.
In 2016, traditional datasets such as ImageNet, Common Objects in Context (COCO), the CIFARs and MNIST were joined by a host of new entries. We also noted the rise of synthetic datasets spurred on by progress in graphics. Synthetic datasets are an interesting workaround for the large data requirements of Artificial Neural Networks (ANNs). In the interest of brevity, we have selected our (subjective) most important new datasets for 2016:
Figure 18: Examples from SceneNet RGB-D
Note: Examples taken from SceneNet RGB-D, a dataset with 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth. The photo (a) is rendered through computer graphics with available ground truth for specific tasks from (b) to (e). Creation of synthetic datasets should aid the process of domain adaptation. Synthetic datasets are somewhat pointless if the knowledge learned from them cannot be applied to the real world. This is where domain adaptation comes in, which refers to the transfer learning process of moving knowledge from one domain to another, e.g. from synthetic to real-world environments. Domain adaptation has been improving very rapidly of late, again highlighting the recent efforts in transfer learning. Columns (c) vs (d) show the difference between instance and semantic/class segmentation.
Source: McCormac et al. (2017)
Figure 19: CMPlaces cross-modal scene representations
Note: Taken from the CMPlaces paper showing two examples, bedrooms and kindergarten classrooms, across different modalities. Conventional Neural Network approaches learn representations that don’t transfer well across modalities and this paper attempts to generate a shared representation “agnostic of modality”.
Source: Aytar et al. (2016)
In CMPlaces we see explicit mention of transfer learning, domain invariant representations, domain adaptation and multi-modal learning, all of which serve to demonstrate further the current undertow of Computer Vision research. The authors focus on trying to find “domain/modality-independent representations”, which could correspond to the higher level abstractions from which humans draw their unified representations. For instance, take ‘cat’ across its various modalities: humans may see the word ‘cat’ in writing, a picture drawn in a sketchbook or a real-world image, or hear it mentioned in speech, but we still have the same unified representation abstracted at a higher level above these modalities.
“Humans are able to leverage knowledge and experiences independently of the modality they perceive it in, and a similar capability in machines would enable several important applications in retrieval and recognition”.
That being said, advancements in image understanding, such as segmentation, object classification and detection have brought video understanding to the fore of research. However, prior to this dataset release there was a real lack in the variety and scale of real-world video datasets available. Furthermore, this dataset was just recently updated, and this year in association with Kaggle, Google is organising a video understanding competition as part of CVPR 2017.
As this piece draws to a close, we lament the limitations under which we had to construct it. Indeed, the field of Computer Vision is too expansive to cover in any real, meaningful depth, and as such many omissions were made. One such omission is, unfortunately, almost everything that didn’t use Neural Networks. We know there is great work outside of NNs, and we acknowledge our own biases, but we feel that the impetus lies with these approaches currently, and our subjective selection of material for inclusion was predominantly based on the reception received from the research community at large (and the results speak for themselves).
We would also like to stress that there are hundreds of other papers in the above topics, and this amalgam of topics is not curated as a definitive collection, but rather hopes to encourage interested parties to read further along the entrances we provide. As such, this final section acts as a catch-all for some of the other applications we loved, trends we wished to highlight and justifications we wanted to make to the reader.
However, while work continues on improving the error rates of these algorithms, their value as a tool for medical practitioners appears increasingly evident. This is particularly striking when we consider the performance improvements in breast cancer detection achieved by combining AI systems with medical specialists. In this instance, robot-human symbiosis produces accuracy far greater than the sum of its parts, at 99.5%.
This is just one example of the torrent of medical applications currently being pursued by the deep learning/machine learning communities. Some cynical members of our team jokingly make light of these attempts as a means to ingratiate society to the idea of AI research as a ubiquitous, benevolent force. But as long as the technology helps the healthcare industry, and it is introduced in a safe and considered manner, we wholeheartedly welcome such advances.
The Movidius Fathom stick, which also uses the Myriad2’s technology, allows users to add SOTA Computer Vision performance to consumer devices. The Fathom stick, which has the physical properties of a USB stick, brings the power of a Neural Network to almost any device: Brains on a stick.
Corporate partner Lenovo brought affordable Tango-enabled phones to market in 2016, allowing hundreds of developers to begin creating applications for the platform. Tango employs the following software technologies: Motion Tracking, Area Learning, and Depth Perception.
Omissions based on forthcoming publications
There is also considerable, and increasing overlap between Computer Vision techniques and other domains in Machine Learning and Artificial Intelligence. These other domains and hybrid use cases are the subject of The M Tank’s forthcoming publications and, as with the whole of this piece, we partitioned content based on our own heuristics.
For instance, we decided to place two integral Computer Vision tasks, Image Captioning and Visual Question Answering, in our forthcoming NLP piece along with Visual Speech Recognition because of the combination of CV and NLP involved, whereas we place the application of Generative Models to images in our work on Generative Models. Examples included in these future works are:
*Disclaimer: The team wishes to mention that they do not condone Network on Network (NoN) violence in any form and are sympathisers to the movement towards Generative Unadversarial Networks (GUNs).
In the final section, we’ll offer some concluding remarks and a recapitulation of some of the trends we identified. We would hope that we were comprehensive enough to show a bird’s-eye view of where the Computer Vision field is loosely situated and where it is headed in the near-term. We also would like to draw particular attention to the fact that our work does not cover January-August 2017. The blistering pace of research output means that much of this work could be outdated already; we encourage readers to go and find out whether it is for themselves. But this rapid pace of growth also brings with it lucrative opportunities as the Computer Vision hardware and software markets are expected to reach $48.6 Billion by 2022.
Figure 20: Computer Vision Revenue by Application Market
Note: Estimation of Computer Vision revenue by application market spanning the period from 2015-2022. The largest growth is forecasted to come from applications within the automotive, consumer, robotics and machine vision sectors.
Source: Tractica (2016)
In conclusion we’d like to highlight some of the trends and recurring themes that cropped up repeatedly throughout our research review process. First and foremost, we’d like to draw attention to the Machine Learning research community’s voracious pursuit of optimisation. This is most notable in the year on year changes in accuracy rates, but especially in the intra-year changes in accuracy. We’d like to underscore this point and return to it in a moment.
Error rates are not the only fanatically optimised parameter, with researchers working on improving speed, efficiency and even the algorithm’s ability to generalise to other tasks and problems in completely new ways. We are acutely aware of the research coming to the fore with approaches like one-shot learning, generative modelling, transfer learning and, as of recently, evolutionary learning, and we feel that these research principles are gradually exerting greater influence on the approaches of the best performing work.
While this last point is unequivocally meant in commendation for, rather than denigration of, this trend, one can’t help but cast one’s mind toward the (very) distant spectre of Artificial General Intelligence, whether it merits a thought or not. Far from being alarmist, we just wish to highlight to both experts and laypersons that this concern arises from here, from the startling progress that’s already evident in Computer Vision and other AI subfields. Properly articulated concerns from the public can only come through education about these advancements and their impacts in general. This may then in turn quell the power of media sentiment and misinformation in AI.
We chose to focus on a one year timeline for two reasons. The first relates to the sheer volume of work being produced. Even for people who follow the field very closely, it is becoming increasingly difficult to remain abreast of research as the number of publications grows exponentially. The second brings us back to our point on intra-year changes.
In taking a single year snapshot of progress, the reader can begin to comprehend the pace of research at present. We see improvement after improvement in such short time spans, but why? Researchers have cultivated a global community where building on previous approaches (architectures, meta-architectures, techniques, ideas, tips, wacky hacks, results, etc.), and infrastructures (libraries like Keras, TensorFlow and PyTorch, GPUs, etc.), is not only encouraged but also celebrated. The result is a predominantly open source community with few parallels, which is continuously attracting new researchers and having its techniques reappropriated by fields like economics, physics and countless others.
It’s important for those who have yet to notice to understand that, among the already frantic chorus of divergent voices proclaiming divine insight into the true nature of this technology, there is at least agreement: agreement that this technology will alter the world in new and exciting ways. However, much disagreement remains over the timeline on which these alterations will unfold.
Until such a time as we can accurately model the progress of these developments we will continue to provide information to the best of our abilities. With this resource we hoped to cater to the spectrum of AI experience, from researchers playing catch-up to anyone who simply wishes to obtain a grounding in Computer Vision and Artificial Intelligence. With this our project hopes to have added some value to the open source revolution that quietly hums beneath the technology of a lifetime.
The M Tank
 Krizhevsky, A., Sutskever, I. and Hinton, G. E. 2012. ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada. Available: http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf
 Kuhn, T. S. 1962. The Structure of Scientific Revolutions. 4th ed. United States: The University of Chicago Press.
 Quora. 2016. What is a convolutional neural network? [Online] Available: https://www.quora.com/What-is-a-convolutional-neural-network [Accessed: 21/12/2016]
 Goodfellow et al. 2016. Deep Learning. MIT Press. [Online] http://www.deeplearningbook.org/ [Accessed: 21/12/2016] Note: Chapter 9, Convolutional Networks [Available: http://www.deeplearningbook.org/contents/convnets.html]
 Nielsen, M. 2017. Neural Networks and Deep Learning. [Online] EBook. Available: http://neuralnetworksanddeeplearning.com/index.html [Accessed: 06/03/2017].
 ImageNet refers to a popular image dataset for Computer Vision. Each year entrants compete in a series of different tasks called the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Available: http://image-net.org/challenges/LSVRC/2016/index
 See “What I learned from competing against a ConvNet on ImageNet” by Andrej Karpathy. The blog post details the author’s journey to provide a human benchmark against the ILSVRC 2014 dataset. The error rate was approximately 5.1% versus a then state-of-the-art GoogLeNet classification error of 6.8%. Available: http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/
 See new datasets later in this piece.
 Hikvision. 2016. Hikvision ranked No.1 in Scene Classification at ImageNet 2016 challenge. [Online] Security News Desk. Available: http://www.securitynewsdesk.com/hikvision-ranked-no-1-scene-classification-imagenet-2016-challenge/ [Accessed: 20/03/2017].
 See Residual Networks in Part Four of this publication for more details.
 Details available under team information Trimps-Soushen from: http://image-net.org/challenges/LSVRC/2016/results
 YOLO stands for “You Only Look Once”.
 Facebook’s Artificial Intelligence Research
 Common Objects in Context (COCO) image dataset
 Wu et al. 2016. SqueezeDet: Unified, Small, Low Power Fully Convolutional Neural Networks for Real-Time Object Detection for Autonomous Driving. [Online] arXiv: 1612.01051. Available: arXiv:1612.01051v2
 ILSVRC results taken from: ImageNet. 2016. Large Scale Visual Recognition Challenge 2016. [Website] Object Detection. Available: http://image-net.org/challenges/LSVRC/2016/results [Accessed: 04/01/2017].
 COCO Detection Challenge results taken from: COCO - Common Objects in Context. 2016. Detections Leaderboard [Website] mscoco.org. Available: http://mscoco.org/dataset/#detections-leaderboard [Accessed: 05/01/2017].
 ImageNet. 2016. [Online] Workshop Presentation, Slide 31. Available: http://image-net.org/challenges/talks/2016/ECCV2016_ilsvrc_coco_detection_segmentation.pdf [Accessed: 06/01/2017].
 Dollar, P. 2016. Segmenting and refining images with SharpMask. [Online] Facebook Code. Available: https://code.facebook.com/posts/561187904071636/segmenting-and-refining-images-with-sharpmask/
 Dasgupta and Singh. 2016. A Fully Convolutional Neural Network based Structured Prediction Approach Towards the Retinal Vessel Segmentation. [Online] arXiv: 1611.02064. Available: arXiv:1611.02064v2
 Connectomics refers to the mapping of all connections within an organism’s nervous system, i.e. neurons and their connections.
 Milanfar, P. 2016. Enhance! RAISR Sharp Images with Machine Learning. [Blog] Google Research Blog. Available: https://research.googleblog.com/2016/11/enhance-raisr-sharp-images-with-machine.html [Accessed: 20/03/2017].
 Ledig et al. 2017. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. [Online] arXiv: 1609.04802
 Jia and Vajda. 2016. Delivering real-time AI in the palm of your hand. [Online] Facebook Code. Available: https://code.facebook.com/posts/196146247499076/delivering-real-time-ai-in-the-palm-of-your-hand/ [Accessed: 20/01/2017].
 Dumoulin et al. 2016. Supercharging Style Transfer. [Online] Google Research Blog. Available: https://research.googleblog.com/2016/10/supercharging-style-transfer.html [Accessed: 20/01/2017].
 Iizuka, Simo-Serra and Ishikawa. 2016. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. [Online] ACM Transactions on Graphics (Proc. of SIGGRAPH), 35(4):110. Available: http://hi.cs.waseda.ac.jp/~iizuka/projects/colorization/en/
 Conner-Simons, A., Gordon, R. 2016. Teaching machines to predict the future. [Online] MIT NEWS. Available: https://news.mit.edu/2016/teaching-machines-to-predict-the-future-0621 [Accessed: 03/02/2017].
 Pose Estimation can refer to either just an object’s orientation, or both orientation and position in 3D space.
 Xiang et al. 2016. ObjectNet3D: A Large Scale Database for 3D Object Recognition. [Online] Computer Vision and Geometry Lab, Stanford University (cvgl.stanford.edu). Available from: http://cvgl.stanford.edu/projects/objectnet3d/
 Colyer, A. 2017. Unsupervised learning of 3D structure from images. [Blog] the morning paper. Available: https://blog.acolyer.org/2017/01/05/unsupervised-learning-of-3d-structure-from-images/ [Accessed: 04/03/2017].
 COCO. 2016. Welcome to the COCO 2016 Keypoint Challenge! [Online] Common Objects in Context (mscoco.org). Available: http://mscoco.org/dataset/#keypoints-challenge2016 [Accessed: 27/01/2017].
 Zhe Cao. 2016. Realtime Multi-Person 2D Human Pose Estimation using Part Affinity Fields, CVPR 2017 Oral. [Online] YouTube.com. Available: https://www.youtube.com/watch?v=pW6nZXeWlGM [Accessed: 04/03/2017].
 Microsoft Research. 2016. Fusion4D: Real-time Performance Capture of Challenging Scenes. [Online] YouTube.com. Available: https://www.youtube.com/watch?v=2dkcJ1YhYw4&feature=youtu.be [Accessed: 04/03/2017].
 I3D Past Projects. 2016. holoportation: virtual 3D teleportation in real-time (Microsoft Research). [Online] YouTube.com. Available: https://www.youtube.com/watch?v=7d59O6cfaM0&feature=youtu.be [Accessed: 03/03/2017].
 Kim et al. 2016. Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera. [Online] Department of Computing, Imperial College London (www.doc.ic.ac.uk). Available: https://www.doc.ic.ac.uk/~ajd/Publications/kim_etal_eccv2016.pdf
 Kim et al. 2014. Simultaneous Mosaicing and Tracking with an Event Camera. [Online] Department of Computing, Imperial College London (www.doc.ic.ac.uk). Available: https://www.doc.ic.ac.uk/~ajd/Publications/kim_etal_bmvc2014.pdf
 Yet more neural network spillover
 Malisiewicz. 2016. The Future of Real-Time SLAM and Deep Learning vs SLAM. [Blog] Tombone's Computer Vision Blog. Available: http://www.computervisionblog.com/2016/01/why-slam-matters-future-of-real-time.html [Accessed: 01/03/2017].
 Quora. 2017. What is an intuitive explanation of Deep Residual Networks? [Website] www.quora.com. Available: https://www.quora.com/What-is-an-intuitive-explanation-of-Deep-Residual-Networks [Accessed: 03/04/2017].
 Ben Hamner. 2016. Twitter Status. [Online] Twitter. Available: https://twitter.com/benhamner/status/789909204832227329
 Natsev, P. 2017. An updated YouTube-8M, a video understanding challenge, and a CVPR workshop. Oh my! [Online] Google Research Blog. Available: https://research.googleblog.com/2017/02/an-updated-youtube-8m-video.html [Accessed: 26/02/2017].
 YouTube-8M. 2017. CVPR'17 Workshop on YouTube-8M Large-Scale Video Understanding. [Online] Google Research. Available: https://research.google.com/youtube8m/workshop.html [Accessed: 26/02/2017].
 Wu, Pique & Wieland. 2016. Using Artificial Intelligence to Help Blind People ‘See’ Facebook. [Online] Facebook Newsroom. Available: http://newsroom.fb.com/news/2016/04/using-artificial-intelligence-to-help-blind-people-see-facebook/ [Accessed: 02/03/2017].
 Metz. 2016. Artificial Intelligence Finally Entered Our Everyday World. [Online] Wired. Available: https://www.wired.com/2016/01/2015-was-the-year-ai-finally-entered-the-everyday-world/ [Accessed: 02/03/2017].
 Doerrfeld. 2015. 20+ Emotion Recognition APIs That Will Leave You Impressed, and Concerned. [Online] Nordic Apis. Available: http://nordicapis.com/20-emotion-recognition-apis-that-will-leave-you-impressed-and-concerned/ [Accessed: 02/03/2017].
 Johnson, A. 2016. Trailbehind/DeepOSM - Train a deep learning net with OpenStreetMap features and satellite imagery. [Online] Github.com. Available: https://github.com/trailbehind/DeepOSM [Accessed: 29/03/2017].
 Gros and Tiecke. 2016. Connecting the world with better maps. [Online] Facebook Code. Available: https://code.facebook.com/posts/1676452492623525/connecting-the-world-with-better-maps/ [Accessed: 02/03/2017].
 Reisinger, D. 2017. Amazon’s Cashier-Free Store Might Be Easy to Break. [Online] Fortune Tech. Available: http://fortune.com/2017/03/28/amazon-go-cashier-free-store/ [Accessed: 29/03/2017].
 Mueller-Freitag, M. 2017. Germany asleep at the wheel? [Blog] Twenty Billion Neurons - Medium.com. Available: https://medium.com/twentybn/germany-asleep-at-the-wheel-d800445d6da2
 Rosenfeld, J. 2016. AI Achieves Near-Human Detection of Breast Cancer. [Online] Mentalfloss.com. Available: http://mentalfloss.com/article/82415/ai-achieves-near-human-detection-breast-cancer [Accessed: 27/03/2017].
 Sato, K. 2016. How a Japanese cucumber farmer is using deep learning and TensorFlow. [Blog] Google Cloud Platform. Available: https://cloud.google.com/blog/big-data/2016/08/how-a-japanese-cucumber-farmer-is-using-deep-learning-and-tensorflow
 Banerjee, P. 2016. The Rise of VPUs: Giving eyes to machines. [Online] www.digit.in. Available: http://www.digit.in/general/the-rise-of-vpus-giving-eyes-to-machines-29561.html [Accessed: 22/03/2017].
 Movidius. 2017. Embedded Neural Network Compute Framework: Fathom. [Online] Movidius.com. Available: https://www.movidius.com/solutions/machine-vision-algorithms/machine-learning [Accessed: 03/03/2017].
 Dzyre, N. 2016. 10 Forthcoming Augmented Reality & Smart Glasses You Can Buy. [Blog] Hongkiat. Available: http://www.hongkiat.com/blog/augmented-reality-smart-glasses/ [Accessed: 03/03/2017].
 Tractica. 2016. Computer Vision Hardware and Software Market to Reach $48.6 Billion by 2022. [Website] www.tractica.com. Available: https://www.tractica.com/newsroom/press-releases/computer-vision-hardware-and-software-market-to-reach-48-6-billion-by-2022/ [Accessed: 12/03/2017].