Land cover mapping via classification is one of the most pervasive applications of hyperspectral imagery. Increased availability of hyperspectral data via new sensors and enabling technologies, as well as anticipation of upcoming satellite missions, is inspiring classification research focused on the opportunities and challenges inherent in hyperspectral imagery. This chapter provides an overview of recent advances in classification methods for mapping vegetation using hyperspectral data. High dimensionality and redundant features are longstanding problems, typically addressed by the introduction of front-end feature selection or extraction approaches. A review of traditional methods for feature selection and linear feature extraction is provided, and nonlinear manifold learning is introduced to accommodate nonlinear responses exhibited in hyperspectral data. Spatial variability at multiple scales commonly occurs in both natural vegetation and agricultural croplands, so strategies for accommodating spatial context are reviewed. Non-Gaussian inputs are a common problem in the application of traditional parametric classifiers, motivating the development of alternative models and nonparametric classifiers. A summary of Gaussian mixture models and popular nonparametric classification strategies, such as support vector machines and deep learning via convolutional neural networks, is provided, with illustrative examples. Finally, new frameworks for multisource, multitemporal, and multiscale classification of hyperspectral data are explored. Multikernel learning has been introduced and successfully demonstrated for both single-source and multisource inputs with various back-end classifiers. Approaches for domain adaptation and transfer learning for classification of multitemporal and multiscale data are introduced and illustrated.
Finally, active and metric learning strategies are outlined as flexible frameworks for mitigating the impact of limited training data on supervised classifiers.
Hyperspectral data are becoming more widely available via sensors on airborne and unmanned aerial vehicle (UAV) platforms, as well as proximal platforms. While space-based hyperspectral data continue to be limited in availability, multiple spaceborne Earth-observing missions on traditional platforms are scheduled for launch, and companies are experimenting with small satellites for constellations to observe the Earth, as well as for planetary missions. Land cover mapping via classification is one of the most important applications of hyperspectral remote sensing and will increase in significance as time series of imagery become more readily available.
However, while the narrow bands of hyperspectral data provide new opportunities for chemistry-based modeling and mapping, challenges remain. Hyperspectral data are high dimensional, and many bands are highly correlated or irrelevant for a given classification problem. For supervised classification methods, the quantity of training data is typically limited relative to the dimension of the input space. The resulting Hughes phenomenon [1], often referred to as the curse of dimensionality, increases the potential for unstable parameter estimates, overfitting, and poor generalization of classifiers [2]. This is particularly problematic for parametric approaches such as Gaussian maximum likelihood–based classifiers that have been the backbone of pixel-based multispectral classification methods. This issue has motivated investigation of alternatives, including regularization of the class covariance matrices [3], ensembles of weak classifiers [4,5], development of feature selection and extraction methods [6], adoption of nonparametric classifiers, and exploration of methods to exploit unlabeled samples via semi-supervised [7] and active learning [8,9]. Data sets are also quite large, motivating computationally efficient algorithms and implementations.
This chapter provides an overview of recent advances in classification methods for mapping vegetation using hyperspectral data. Three data sets that are used in the hyperspectral classification literature (e.g., Botswana Hyperion satellite data and AVIRIS airborne data over both Kennedy Space Center and Indian Pines) are described in Section 3.2 and used to illustrate methods described in the chapter. An additional high-resolution hyperspectral data set acquired by a SpecTIR sensor on an airborne platform over the Indian Pines area is included to exemplify the use of new deep learning approaches, and a multi-platform example of airborne hyperspectral data is provided to demonstrate transfer learning in hyperspectral image classification.
Classical approaches for supervised and unsupervised feature selection and extraction are reviewed in Section 3.3. In particular, nonlinearities exhibited in hyperspectral imagery have motivated development of nonlinear feature extraction methods in manifold learning, which are outlined in Section 3.3.1.4. Spatial context is also important in classification of both natural vegetation with complex textural patterns and large agricultural fields with significant local variability within fields. Approaches to exploit spatial features at both the pixel level (e.g., co-occurrence-based texture and extended morphological attribute profiles [EMAPs]) and integration of segmentation approaches (e.g., HSeg) are discussed in this context in Section 3.3.2.
Recently, classification methods that leverage nonparametric methods originating in the machine learning community have grown in popularity. An overview of both widely used and newly emerging approaches, including support vector machines (SVMs), Gaussian mixture models, and deep learning based on convolutional neural networks, is provided in Section 3.4. Strategies to exploit unlabeled samples, including active learning and metric learning, which combine feature extraction and augmentation of the pool of training samples in an active learning framework, are outlined in Section 3.5. Integration of image segmentation with classification to accommodate the spatial coherence typically observed in vegetation is also explored, including as an integrated active learning system. Exploitation of multisensor strategies for augmenting the pool of training samples is investigated via a transfer learning framework in Section 3.5.1.2. Finally, we look to the future, considering opportunities soon to be provided by new paradigms, as hyperspectral sensing is becoming common at multiple scales from ground-based and airborne autonomous vehicles to manned aircraft and space-based platforms.
Five publicly available hyperspectral benchmark data sets, which have been used to evaluate classification algorithms for vegetation-based studies in the literature, are included to illustrate the methodology presented in this chapter. The data were acquired by space-based and airborne sensors covering the visible and shortwave infrared portions of the spectrum at spatial resolutions ranging from 2 to 30 m. The higher spatial resolution acquisitions allow both discrimination of smaller objects and utilization of texture information by the classification algorithms. Spectral signatures of the classes are complex and often overlapping, and spatial patterns include agricultural fields with regular boundaries and natural vegetation where classes are often fragmented or mixed. Characteristics of the data sets are listed in Table 3.1.
Data Set          Scene Description          Sensor         Platform     Spectral Range  Spatial Resolution  No. of Bands  No. of Classes
BOT               Vegetation and flooding    Hyperion       Satellite    0.4–2.5 μm      30 m                220           9
KSC               Wetland/upland vegetation  AVIRIS         Airborne     0.4–2.5 μm      18 m                176           13
Indian Pine 1992  Early season agriculture   AVIRIS         Airborne     0.4–2.5 μm      18 m                176           16
Indian Pine 2010  Early season agriculture   SpecTIR        Airborne     0.4–2.5 μm      2 m                 360           12
Galveston, Texas  Wetland vegetation         SpecTIR        Airborne     0.4–2.5 μm      1 m                 360           12
Galveston, Texas  Wetland vegetation         Headwall Nano  Terrestrial  0.4–1.0 μm      Variable            274           12
The NASA EO-1 satellite was launched in November 2000, with Hyperion as an "auxiliary" hyperspectral sensor. The EO-1 platform was designed for one year of operation but was finally decommissioned nearly two decades later, in 2017. Hyperion data were acquired in 7.7-km strips at 30 m spatial resolution for a multiyear study of flooding in the Okavango Delta, Botswana. After removing water absorption bands, as well as uncalibrated, noisy, and overlapping bands in the visible and near-infrared (VNIR) and shortwave infrared (SWIR) sensors, 145 bands of the May and July 2001 images were retained as candidate features for classification. Nine classes of complex natural vegetation were identified by researchers at the Okavango Research Center. Class groupings include seasonal swamps, occasional swamps, and woodlands, and are distributed in fragmented patterns over a large area. RGB images of the area, maps of the ground reference data, and a class legend are included in Figure 3.1. As shown in the figure, the pointing angle of the satellite changed after the May acquisition, necessitating development of knowledge transfer models. Signatures of several classes overlap spectrally, resulting in a challenging data set for classification; Class 3 (riparian) and Class 6 (woodlands) are particularly difficult to discriminate.
Figure 3.1 Botswana (BOT) Hyperion data. True-color composites with corresponding ground reference data for (a) May, (c) June, and (e) July 2001, with class labels and # labeled samples. The pointing angle changed after the May acquisition, and then remained the same for subsequent dates.
Airborne hyperspectral data were acquired by NASA AVIRIS at 18 m spatial resolution and 10 nm spectral resolution over a natural wetland/upland environment adjacent to the Kennedy Space Center, Florida, in March 1996, to evaluate the impact of drainage management practices on the incursion of invasive species into an endangered species habitat. Figure 3.2 includes an RGB image of the area, ground reference data, and class reference information. Noisy and water absorption bands were removed from the reflectance data, leaving 176 features for 13 wetland and upland classes. The spectral signatures of multiple classes are mixed and often exhibit only subtle differences. Cabbage Palm Hammock (Class 3) and Broad Leaf/Oak Hammock (Class 6) are upland trees; Willow Swamp (Class 2), Hardwood Swamp (Class 7), Graminoid Marsh (Class 8), and Spartina Marsh (Class 9) are trees and grasses in transition wetlands. Classification results for all 13 classes and for these difficult classes are reported in several publications.
Figure 3.2 Kennedy Space Center (KSC) AVIRIS data. RGB true-color composite and corresponding ground reference map, class labels, and # labeled samples.
The historical data acquired by the NASA AVIRIS sensor in June 1992 over a central Indiana farming area have been widely used to evaluate classification methods that exploit spatial information. After removing 20 water absorption bands, 200 bands are used for analysis. The scene is composed of agricultural fields with regular geometry, providing an opportunity to evaluate the impact of within-class and between-class variability at medium spatial resolution. The spectral signatures of corn and soybean fields, which were planted only a short time prior to the acquisition, illustrate the impact of tillage management practices. The 16 classes of labeled reference data are reported at the field scale for crops, but significant within-field variability resulted in heterogeneous spectral responses. Although labeled as vegetation, the spectral responses of many classes are dominated by soil and residue signatures from the previous year. Figure 3.3 includes an RGB image of the area, class legends, and the corresponding labeled data.
Figure 3.3 Indian Pine 1992 AVIRIS data. RGB true-color composite and corresponding ground reference map, class labels, and # labeled samples.
Additional hyperspectral imagery was acquired by the airborne ProSpecTIR VS2 VNIR/SWIR sensor in June 2010 for a study of residue cover estimates over an area near the location of the original Indian Pine AVIRIS data. The data were collected at 2 m spatial resolution in 360 channels at 5 nm spectral resolution over the range of 390–2450 nm. Bands were aggregated to 10 nm, and 178 spectral bands were used for analysis. Nineteen classes of crops, residue, and buildings were identified. A true-color composite is shown in Figure 3.4 with associated ground reference data and class information.
Figure 3.4 Indian Pine 2010 SpecTIR data. RGB true-color composite and corresponding ground reference map, class labels, and # labeled samples.
A heterogeneous hyperspectral data set composed of airborne imagery and terrestrial (street-view) imagery collected over wetland vegetation is included to illustrate domain adaptation (transfer learning) (Figure 3.5). Changes in the distribution of wetland vegetation species can have profound impacts on the coastal economy and ecology; hence, studying wetland vegetation through remote sensing is of great importance, particularly over extended geographic areas. Marshes in the Mission-Aransas estuary, which were previously dominated by smooth cordgrass (Spartina alterniflora), have been replaced by black mangroves (Avicennia germinans).
Figure 3.5 Galveston, Texas, data. True-color images of the aerial view (target domain) and street view (source domain) wetland data. The location of the study site is indicated by the red box in a Google Earth screenshot image.
Hyperspectral imagery was acquired by the airborne ProSpecTIR VS sensor with a spatial coverage of 3462 × 5037 pixels at a 1 m spatial resolution. A field survey was undertaken on September 16, 2016. Figure 3.6 depicts the mean spectral signatures of the 12 identified species/classes (Table 3.2). As part of this field survey, side-looking hyperspectral imagery (called "street view" imagery in this chapter) was collected over the same area using a Headwall Nano-Hyperspec sensor. This sensor has 274 bands spanning 400–1000 nm at a 3 nm spectral resolution. This resulted in a unique domain adaptation problem in which models were trained from very high-resolution street-view (terrestrial) imagery and transferred to aerial imagery. It is also very challenging because the two domains (street view and aerial) differ in many ways, including viewpoints, illumination conditions, atmospheric conditions, and so on.
Figure 3.6 Galveston, Texas, SpecTIR airborne and Headwall handheld camera data. Mean spectral signatures of the (a) aerial view (target domain) and (b) street view (source domain) wetland data.
Class                                  # Labeled Samples (Aerial View)  # Labeled Samples (Street View)
C1: Upland grass                       794                              1463
C2: St. Augustine grass                100                              1009
C3: Sesbania                           294                              1021
C4: Upland tree                        426                              1040
C5: Phragmites australis               780                              1029
C6: Sabal mexicana                     74                               1189
C7: Spartina alterniflora              733                              1152
C8: Juncus roemerianus                 202                              1264
C9: Batis maritima/Distichlis spicata  596                              1106
C10: Distichlis spicata                1197                             1087
C11: Baccharis halimifolia             360                              1017
C12: Avicennia germinans               1663                             1119
Identifying features that are effective for modeling data class characteristics is a critical preprocessing step for hyperspectral image classification. Apart from computational requirements, classifiers tend to have low generalization capability when data are characterized by high dimensionality, especially when the number of training samples is limited with respect to the number of features. A traditional solution to this problem is feature selection (FS), which aims to reduce the dimensionality of the original feature space by choosing the best—and ideally the minimum—subset of features. Numerous FS approaches have been proposed in the last few decades [6] and can be grouped into two main categories: filter and wrapper methods. Filter methods perform FS as a preprocessing step in which the selection criterion is independent of the classifiers used to subsequently perform classification of the data [10]. Wrapper strategies perform FS based on the performance of a given classifier [11]. These techniques are generally applied to the original spectral bands, although they can be extended to newly generated [12] or extracted features [13]. Selection from the original feature space is advantageous in the sense that the resulting features retain a physical relationship to the original process. We focus in the following on filter methods, which can be categorized as supervised and unsupervised approaches.
Supervised FS can be further subdivided into two major groups: parametric and nonparametric methods. Parametric supervised FS involves class modeling using training data. A widely adopted method is based on the Jeffries–Matusita (JM) distance, which measures the separability of two class density functions [14]. When classes are assumed to be Gaussian distributed [15], computation of the JM distance is based on the Bhattacharyya distance. This approach performs well when the Gaussian distribution assumption is valid. It is used to find a subset of features that best accommodates class data variations at multiple sites/locations and to generate a visual representation of the separation capability provided by each band, which then leads to quantitative band selection [16]. Other distance measures can also be used, including spectral angle, Euclidean distance, and Mahalanobis distance. Apart from distances between classes, the ratio of within-class and between-class variance can be used as well to define separability [17]. Nonparametric supervised FS considers the information provided by the training data directly, without requiring class data modeling. As an example, overlapping and noisy bands can be removed using a canonical correlation measure to obtain an optimal subset of features that provide the best estimate of the center of classes [18]. Information theory can also be applied to perform nonparametric supervised FS. Mutual information (MI) provides a measure of linear and nonlinear dependency between two variables [19] and is suitable for general cases, since no assumptions about the shape of the class data density functions must be made.
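As an illustration, the Gaussian-based JM distance described above can be computed directly from class means and covariances. The following is a minimal NumPy sketch (the function names are ours, not from a specific library); in a filter-style FS loop, the JM distance would be evaluated for candidate band subsets and the most separable subset retained.

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussian class models."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def jm_distance(mu1, cov1, mu2, cov2):
    """Jeffries-Matusita distance; saturates at 2 for fully separable classes."""
    b = bhattacharyya(mu1, cov1, mu2, cov2)
    return 2.0 * (1.0 - np.exp(-b))
```

The saturation at 2 is what makes JM preferable to the raw Bhattacharyya distance for ranking band subsets: already-separable class pairs stop dominating the criterion.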
Unsupervised FS aims at reducing feature redundancy. For example, features that are dissimilar to those already selected can be chosen one by one via linear prediction error analysis [20]. Other methods use similarity measures to partition the original set of features into a number of homogeneous clusters and then select a representative feature from each cluster [21]. MI is used to find the subset of features with minimum dependency [22]. Subsets of representative features can also be selected simultaneously through geometrybased FS methods [23].
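A simple unsupervised selection of this kind can be sketched by clustering bands and keeping one representative per cluster. The sketch below uses k-means over standardized band vectors as a stand-in for the similarity-based partitioning methods cited above; the function name and the clustering choice are illustrative assumptions, not a specific published algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_band_selection(X, n_select):
    """Partition bands into homogeneous clusters and keep the band
    closest to each cluster centroid. X: (n_pixels, n_bands)."""
    # Standardize so clustering reflects band similarity, not scale.
    Xs = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    km = KMeans(n_clusters=n_select, n_init=10, random_state=0).fit(Xs.T)
    selected = []
    for k in range(n_select):
        members = np.flatnonzero(km.labels_ == k)
        d = np.linalg.norm(Xs.T[members] - km.cluster_centers_[k], axis=1)
        selected.append(int(members[np.argmin(d)]))
    return sorted(selected)
```

Because highly correlated bands fall into the same cluster, the selected subset retains physically interpretable original bands while suppressing redundancy.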
After defining the selection criterion, adopting an appropriate search strategy is a challenging task. One option is exhaustive search, in which all possible combinations of features are evaluated. Exhaustive search is often impractical, so suboptimal alternatives, commonly referred to as greedy search methods, are usually adopted. These optimization problems are usually not convex, and heuristic strategies are needed. In the case of a monotonic criterion, the branch and bound method [24] can be applied to avoid exhaustive search in a moderate-sized search space. Sequential forward selection and backward elimination are fast approaches but do not allow feedback to revise previously selected features. Improvements are represented by sequential forward floating selection and sequential backward floating selection [25], in which the selected features are reconsidered for inclusion or deletion at each iteration. Combinatorial optimization approaches use heuristic methods to reduce the number of features; proposed solutions include methods based on genetic algorithms [26], particle swarm optimization [27], and clonal selection [28].
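Sequential forward selection itself is compact enough to sketch generically. The greedy loop below assumes a caller-supplied `score_fn` (e.g., a JM distance or cross-validated classifier accuracy on the candidate subset); note that the additive toy score in the test makes SFS trivially optimal, which real criteria are not, and it is exactly this lack of feedback on earlier choices that the floating variants address.

```python
def sequential_forward_selection(score_fn, n_features, n_select):
    """Greedy SFS: at each step, add the feature whose inclusion
    maximizes score_fn(subset). score_fn takes a list of feature indices."""
    selected = []
    remaining = list(range(n_features))
    while len(selected) < n_select:
        # Evaluate each remaining feature appended to the current subset.
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected
```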
Spectral indices, computed as ratios of broad spectral bands or as normalized differences between pairs of bands, based on multispectral radiance and reflectance data, have been used for nearly half a century in studies of vegetation. More recently, narrowband indices based on spectral bands and derivatives of the reflectance spectrum have been explored in vegetation studies for their value in characterizing biophysical properties, predicting variables of interest, and mapping or unmixing land cover. The continuous spectrum provided by hyperspectral sensors, and the relationship of spectral absorption features in spectral signatures to chemistry-based properties, provides a more robust capability to target explicit characteristics than can be achieved by broadband multispectral indices. Thenkabail et al. [29] conducted a comprehensive review of spectral indices derived from the EO-1 Hyperion hyperspectral sensor using data acquired for a wide range of agricultural applications and geographic sites. Recently, narrowband indices have also become a focus of high-throughput phenotyping by plant breeders seeking to map and relate phenotypic traits to genotypes [30].
Spectral indices also provide an appealing, physically based capability for reducing the dimensionality and redundancy that are inherent in hyperspectral signatures over vegetation. Bridging band-specific feature selection and feature extraction approaches that seek to represent relevant information in the spectrum via global transformations, hyperspectral indices provide robust, local transformations that are useful for a wide range of applications in remote sensing–based studies of vegetation, including classification.
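A narrowband index is simply a local, physically interpretable transformation of two bands. The sketch below computes a normalized-difference index (an NDVI-style ratio) from the narrow bands nearest two requested wavelengths; the function name and the step-shaped toy spectrum in the test are illustrative assumptions.

```python
import numpy as np

def narrowband_index(spectrum, wavelengths, band_a, band_b):
    """Normalized-difference index (R_a - R_b) / (R_a + R_b) using the
    narrow bands closest to the requested wavelengths (in nm).
    spectrum: reflectance with bands on the last axis."""
    ia = int(np.argmin(np.abs(wavelengths - band_a)))
    ib = int(np.argmin(np.abs(wavelengths - band_b)))
    ra, rb = spectrum[..., ia], spectrum[..., ib]
    return (ra - rb) / (ra + rb)
```

Because the bands are addressed by wavelength rather than index, the same definition transfers across sensors with different band sets, which is one practical appeal of index-based features.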
Feature extraction (FE) seeks to project the original feature space onto a smaller number of features. While extracted features lose their direct relationship to the original physical phenomena, they provide a compressed representation of the original feature set. The contribution of each feature in the original hyperspectral data is determined by the transformation matrix associated with the chosen extraction method. FE techniques can be categorized as unsupervised (global data oriented) or supervised (class data oriented), and as linear or nonlinear.
In supervised FE methods, new discriminative features can be obtained by combining the original features into groups. As an example, adjacent correlated features can be combined into a smaller number of features retaining the original spectral interpretation [31]. The Bhattacharyya distance can be adopted as a grouping criterion before creating new features as a weighted sum of the features in each group [32]. In [33], contiguous groups of features are averaged based on JM distance. Linear discriminant analysis (LDA) and canonical analysis are traditional parametric FE techniques that are based on the mean vector and covariance matrix of each class. The ratio of within-class to between-class scatter matrices is used to formulate an effective criterion of class separability. Limitations of LDA include its dependence on the distributions of classes being approximately Gaussian and its inability to handle cases where class data do not form a single cluster. Further, the maximum rank of the between-class scatter matrix is the number of classes (M) minus one. Decision boundary FE (DBFE), an early method developed specifically for hyperspectral data [34], aims at finding new features that are normal to class decision boundaries. Nonparametric FE via regularization techniques has also been proposed to overcome the limitations of LDA and obtain more stable results [35]. Nonparametric discriminant analysis (NDA) [36] defines a nonparametric between-class scatter matrix based on a critical finding that samples close to the boundary are more relevant than those far from it. Nonparametric weighted FE (NWFE) was developed in light of NDA, introducing regularization techniques to achieve better performance for hyperspectral image classification than LDA and NDA. Double nearest proportion (DNP) FE builds new scatter matrices based on a double nearest proportion structure [37].
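The M − 1 rank limit of LDA mentioned above is easy to see in practice: with three classes, at most two discriminant features can be extracted regardless of the input dimensionality. A minimal scikit-learn sketch with synthetic three-class, 20-band data (all names and values are illustrative):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, d = 150, 20                                   # 150 samples/class, 20 "bands"
means = np.zeros((3, d))
means[0, 0] = means[1, 1] = means[2, 2] = 4.0    # separate classes along 3 axes
X = np.vstack([rng.normal(m, 1.0, size=(n, d)) for m in means])
y = np.repeat([0, 1, 2], n)

# For M = 3 classes, LDA can extract at most M - 1 = 2 features.
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
Z = lda.transform(X)                             # (450, 2) discriminant features
```

Requesting `n_components=3` here would raise an error, since the between-class scatter matrix has rank at most M − 1.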
Unsupervised FE is usually obtained by combining the original group of features via an average or weighted sum operation. For example, top-down and bottom-up decompositions can be adopted to merge highly correlated adjacent features and then project them onto their Fisher directions [31]. Apart from combining groups of contiguous features, a more general FE approach consists of mapping the original high-dimensional space into a low-dimensional one via a data transform. Two typical data transformation methods are principal component analysis (PCA), which reduces the original set of features into a smaller set of orthogonal ones computed as linear combinations of the original features with maximum variance [38], and independent component analysis (ICA), a statistical technique for separating independent signals from overlapping signals [39]. Other techniques seek projections via different optimization models. Projection pursuit (PP) methods search for projections that optimize certain projection indexes [40]. In both supervised and unsupervised FE, the kernel trick is an easy way to extend linear models to nonlinear ones, as we describe in more detail in the next section. In [41–43], angular distance–based supervised, unsupervised, and semi-supervised discriminant analysis and their kernel variants were proposed and shown to outperform Euclidean distance–based variants of such algorithms for hyperspectral classification.
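PCA, the workhorse unsupervised transform above, reduces to a singular value decomposition of the mean-centered data; a minimal NumPy sketch (function name ours):

```python
import numpy as np

def pca_extract(X, n_components):
    """Project spectra onto the top principal components (unsupervised FE).
    X: (n_samples, n_bands). Returns scores and the loading matrix."""
    Xc = X - X.mean(axis=0)                      # center each band
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # orthonormal directions
    scores = Xc @ components.T                   # low-dimensional features
    return scores, components
```

For hyperspectral cubes, the first handful of components typically captures most of the variance because adjacent bands are highly correlated, which is why PCA is a common front end for EMAPs and other spatial feature extractors.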
Although hyperspectral data are typically modeled assuming that the data originate from linear processes, and linear feature extraction approaches are simple and straightforward to implement, nonlinearities associated with physical processes are often exhibited in the narrowband data [44]. Nonlinear feature extraction techniques assume that the high-dimensional data inherently lie on a lower-dimensional manifold, as shown in Figure 3.7.
Figure 3.7 Nonlinearity in the spectral data is exhibited in a plot of bands 13, 65, and 31 for the Kennedy Space Center data set.
The machine learning community initially demonstrated the potential of manifold-based approaches for nonlinear dimensionality reduction and modeling of nonlinear structures [46–50], and their application to classification of hyperspectral data has been demonstrated over the past decade for unsupervised, supervised, and semi-supervised learning [45,51], as well as for multitemporal scenarios [52,53]. Nonlinear manifold learning methods are broadly characterized as globally or locally based approaches. Global manifold methods seek to maintain the fidelity of the overall topology of the data set at multiple scales, while local methods preserve local geometry and are computationally efficient because they only require sparse matrix computations. Global manifolds are generally less susceptible to overfitting the data, which is beneficial for generalization in classification, but local methods potentially provide opportunities for better representation of heterogeneous data with submanifolds. Traditional manifold learning methods whose theoretical foundation is associated with the eigenspectrum and kernel framework include isometric feature mapping (ISOMAP) [46], kernel principal component analysis (KPCA) [47], and locally linear embedding (LLE) [48].
Manifold learning is classically represented in a graph-embedding framework, both for efficiency and computational simplicity. A Laplacian regularizer is employed to constrain the classification function to be smooth with respect to the data manifold, and the resulting manifold coordinates are obtained from the eigenvectors of a graph Laplacian matrix. Different manifold learning methods correspond to specific graph structures. Compared to global methods that represent distances across the full manifold, local manifold learning evaluates the local geometry in each neighborhood, and data points that fall within a neighborhood are connected in the graph. The graph can either be developed from the combined training and testing data (unsupervised), or manifold learning can be applied to the training data and an out-of-sample extension method employed to incorporate the testing data (supervised) [54,55]. The first strategy can provide more accurate manifold coordinates, while the latter is advantageous when the quantity of testing data is large.
In the general formulation of the graph Laplacian–based framework, samples are defined in a data matrix X = [x_1, x_2, …, x_n], x_i ∈ ℝ^m, where n is the number of samples and m is the feature dimension. The dimensionality reduction problem seeks a set of manifold coordinates Z = [z_1, z_2, …, z_n], z_i ∈ ℝ^p, where typically m ≫ p, through a feature mapping Φ: x → z, which may be analytical (explicit) or data driven (implicit), and linear or nonlinear. Assuming an undirected weighted graph G = {X, W} with data samples X and algorithm-dependent similarity matrix W, the graph Laplacian is L = D − W, with diagonal degree matrix D_ii = Σ_j W_ij, ∀i. Given labeled data X_l = [x_1, x_2, …, x_l] and unlabeled data X_u = [x_{l+1}, x_{l+2}, …, x_{l+u}], the classes are Ω = [Ω_1, …, Ω_C], where C is the number of classes. Class labels of X_l are denoted Y_l ∈ ℝ^{C×l}, with Y_ij = 1 if x_j ∈ Ω_i. The data points X = [X_l, X_u] produce a weighted graph G = {X, W}, where X consists of N = l + u data points and W ∈ ℝ^{N×N} is the adjacency matrix of the connected edges between data points. For a directed graph, a symmetrized adjacency matrix W ← W + W^T can be assumed. The dimensionality reduction criterion can then be written as

Z* = arg min_{Z B Z^T = I} tr(Z L Z^T),

where B is a constraint matrix that depends on the dimensionality reduction method.
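The graph-embedding recipe (build a neighborhood graph, form W and L = D − W, take eigenvectors of L) can be sketched in a few lines. The version below is a simplified Laplacian eigenmaps-style embedding with a binary kNN adjacency and the unnormalized Laplacian; practical implementations typically use heat-kernel weights and solve the generalized eigenproblem Lv = λDv instead.

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=5, n_components=2):
    """Simplified Laplacian eigenmaps: binary kNN graph -> W -> L = D - W,
    manifold coordinates from the smallest nontrivial eigenvectors of L."""
    n = X.shape[0]
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(dist[i])[1:n_neighbors + 1]  # skip self at index 0
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)               # symmetrize the adjacency matrix
    L = np.diag(W.sum(axis=1)) - W       # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)       # eigenvalues in ascending order
    return vecs[:, 1:n_components + 1]   # drop the constant eigenvector
```

The dense pairwise-distance matrix is only for clarity; the sparsity of the kNN graph is precisely what makes local methods scale to full hyperspectral scenes.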
Plots of coordinates obtained using PCA, ISOMAP, and LLE for the nine-class NASA Hyperion BOT data, with optimal parameters for each manifold embedding, are shown in Figure 3.8. Differences in the separation of classes, which relate to potential discrimination via classification, are shown for the three projections. The objective functions of linear projections such as PCA and of manifold learning methods are not related to the classification objective, so the resulting projections may or may not provide improved separation of classes.
Figure 3.8 Plots of coordinates obtained from (first line) PCA, (second line) ISOMAP, and (third line) LLE for the nine Botswana classes (C1–C9). (a) bands 1–2, (b) bands 3–4, (c) bands 5–6, (d) bands 7–8.
For many applications, a key discriminative component that makes hyperspectral imaging appealing is the richness of the spectral reflectance profiles. Nevertheless, there is much to be gained for most applications, including vegetation classification, by leveraging the spatial context in the imagery. A common approach to incorporating spatial context for classification is to extract meaningful morphological and textural features [56,57] and then learn the classifier in the resulting feature space. Within this space, a choice that has been demonstrated to be particularly successful for hyperspectral classification tasks is extended morphological attribute profile (EMAP) features, which represent profiles created by removing connected components that do not meet criteria specified a priori. Where the criteria are satisfied, the regions are kept intact; otherwise, they are set to the gray level of a darker or brighter surrounding region. Attributes (features) can represent the morphological/geometric properties of the objects (e.g., image moments and shape) or the textural information about the objects (e.g., range, standard deviation, and entropy). EMAP features have found significant success in hyperspectral image analysis—in the context of multichannel images, such features are often computed from a subset of features extracted from the hypercube (e.g., selected bands or the first few principal components of the cube). They have been applied to a wide variety of remote sensing applications to extract spatial context for image analysis [58,59].
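Full EMAPs are built with attribute filters on a max-tree representation; as a simplified illustration of the same idea, the sketch below stacks morphological openings and closings of increasing size on a single base image (e.g., a principal component), yielding a classical morphological profile rather than a true attribute profile.

```python
import numpy as np
from scipy import ndimage

def morphological_profile(band, sizes=(3, 5, 7)):
    """Simplified morphological profile for one base image: stack grayscale
    openings and closings with structuring elements of increasing size.
    band: 2-D array (e.g., a principal component of the hypercube)."""
    feats = [band]
    for s in sizes:
        feats.append(ndimage.grey_opening(band, size=(s, s)))  # removes bright detail
        feats.append(ndimage.grey_closing(band, size=(s, s)))  # removes dark detail
    return np.stack(feats, axis=-1)   # (rows, cols, 1 + 2 * len(sizes))
```

Each pixel thus carries a multiscale spatial signature that can be concatenated with spectral features before classification; replacing the fixed-size structuring elements with area or shape attribute filters recovers the attribute-profile formulation.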
Within the realm of geospatial image analysis in general and hyperspectral image analysis in particular, a significant focus of the research community has been the design of feature reduction and analysis algorithms (classification, change and anomaly detection, target recognition, spectral unmixing, etc.) [60–63]. As with feature extraction and dimension reduction, classifiers that utilize domain-specific properties of hyperspectral data have emerged as popular choices [64–68]. Over the past decade, the choice of classifiers for hyperspectral classification has shifted from traditional approaches such as K-nearest neighbors, Gaussian maximum-likelihood classification, and random forests to more advanced approaches that offer greater capacity for modeling nonlinear decision surfaces in the feature space. K-nearest neighbor classifiers assume that data can be classified by surveying the K training points nearest the test point in the feature space, an assumption that is effective when coupled with local manifold learning approaches but far too simplistic as a classification scheme in itself. Likewise, Gaussian maximum-likelihood classifiers assume that class-conditional likelihoods are Gaussian distributions in the spectral reflectance parameter space, with the underlying parametrization learned via maximum likelihood. This approach has several shortcomings, including the need for feature reduction as a preprocessing step prior to inferring the high-dimensional parameters, and the departure of class-conditional distributions from unimodal Gaussian behavior under a variety of practical remote sensing scenarios (such as the same class being present in a well-illuminated part of the scene and under cloud shadows). More advanced classification approaches have hence been proposed and utilized in recent years for hyperspectral classification.
These include multikernel learning and variants of support vector machines [69,70], as well as model-based classifiers such as spectrally constrained implementations of Gaussian mixture models (GMMs) [66], sparse representation classification (SRC) [71,72], and local approaches [73].
Given a training data set $\mathbf{X}={\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{n}$ in ${\mathbb{R}}^{d}$ with class labels $y_i \in \{+1, -1\}$ and a nonlinear feature map ϕ(·) (whose inner products define the kernel function), an SVM [74] classifies data by learning the optimal separating hyperplane in the Hilbert space induced by the kernel function:
$$\min_{\mathbf{w},b,\boldsymbol{\xi}}\ \frac{1}{2}{\left\Vert \mathbf{w}\right\Vert}^{2}+C\sum_{i=1}^{n}\xi_i \quad \text{subject to} \quad y_i\left(\left\langle \mathbf{w},\phi(\mathbf{x}_i)\right\rangle + b\right)\ge 1-\xi_i,\ \ \xi_i \ge 0.$$
The SVM has become one of the most widely used nonparametric classifiers for hyperspectral data, in part due to its relative insensitivity to data dimensionality, and it is now included in many software libraries.
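A minimal sketch of RBF-kernel SVM classification of pixel spectra, using scikit-learn on synthetic two-class "spectra" (all data and parameter values here are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two synthetic "classes" of 50-band spectra (hypothetical reflectance values).
class_a = rng.normal(0.2, 0.05, size=(40, 50))
class_b = rng.normal(0.5, 0.05, size=(40, 50))
X = np.vstack([class_a, class_b])
y = np.array([-1] * 40 + [+1] * 40)

# RBF-kernel SVM; gamma and C would normally be tuned by cross-validation.
clf = SVC(kernel="rbf", gamma="scale", C=10.0)
clf.fit(X, y)
acc = clf.score(X, y)
```

In practice, the kernel width and regularization constant are data-dependent and are selected by cross-validation on held-out labeled pixels.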
Bayesian classification entails modeling the class-conditional likelihoods $p(\mathbf{x}\,|\,y_i),\ i \in \{1, \dots, c\}$, through an underlying probability model learned in the feature space (or some other appropriate subspace) from the training data. The state of the art in such methods for hyperspectral classification assumes that class-conditional likelihoods are best modeled as mixtures (weighted linear combinations) of "basis" Gaussian density functions. Depending on whether one assumes a finite or an infinite number of Gaussian components in the mixture, the resulting model is a traditional Gaussian mixture model (GMM) or an infinite Gaussian mixture model (IGMM), respectively.
A GMM is a weighted linear combination of a finite number (K) of Gaussian components, such that the resulting likelihood function of $\mathbf{X}={\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{n}$ in ${\mathbb{R}}^{d}$ is
$$p(\mathbf{X}\,|\,\Theta)=\prod_{i=1}^{n}\sum_{k=1}^{K}\alpha_k\,\mathcal{N}(\mathbf{x}_i\,;\,\mu_k,\Sigma_k),\qquad \alpha_k\ge 0,\ \ \sum_{k=1}^{K}\alpha_k=1.$$
GMMs are hence a parametric representation of the underlying probability model of the data, where the parametrization comprises the parameters of each of the K Gaussian components (mean vector μ_{k} and covariance matrix Σ_{k}) together with the weight α_{k} of each component in the mixture: $\Theta ={\{{\alpha}_{k},{\mu}_{k},{\Sigma}_{k}\}}_{k=1}^{K}$.
Training a GMM (i.e., given X, estimating $\Theta ={\{{\alpha}_{k},{\mu}_{k},{\Sigma}_{k}\}}_{k=1}^{K}$) is typically undertaken using the expectation maximization (EM) algorithm. A crucial design choice when training such models is the number of mixture components, K. This is often learned empirically using an information-theoretic measure, such as the Akaike information criterion (AIC) [75].
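A minimal sketch of EM-based GMM fitting with AIC-driven selection of K, using scikit-learn on synthetic bimodal data (the data and candidate K values are hypothetical):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic bimodal data standing in for a multimodal class-conditional density.
X = np.vstack([rng.normal(-4.0, 0.5, size=(200, 2)),
               rng.normal(4.0, 0.5, size=(200, 2))])

# Fit a GMM via EM for each candidate K and keep the AIC-minimizing model.
aic = {}
for k in (1, 2, 3, 4):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    aic[k] = gmm.aic(X)
best_k = min(aic, key=aic.get)
```

On clearly bimodal data such as this, a single Gaussian incurs a large AIC penalty, while models with too many components gain little likelihood relative to their added parameter count.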
Such a Bayesian model can be used for both unsupervised and supervised learning. In an unsupervised framework, GMMs are often used as a robust Bayesian paradigm for clustering data; for instance, given an imagery data set (a set of pixels, X) with no labels, the data are clustered into K constituent clusters, where K is the underlying number of components (e.g., the number of classes in the scene, assumed to be known a priori). In a supervised framework, GMMs can be used to learn class-conditional likelihood functions, under the assumption that each class may have a likelihood function that requires a mixture of multiple Gaussians to represent it effectively. In the context of remotely sensed imagery, such multimodal class-conditional distributions may arise from practical effects (e.g., the same class appearing in the image under different illumination conditions, or variations in the spectral response of vegetation due to varying vegetation stress within the same class in different parts of the scene).
A fundamental challenge of traditional GMMs is that the number of mixture components, K, may be difficult to ascertain a priori, and empirical approaches to estimating K may result in under- or overestimation of the required number of components in the mixture model. A recent development in the field of Bayesian nonparametrics, the IGMM, addresses this issue very effectively. Unlike traditional GMMs, the number of mixture components in an IGMM is ascertained as part of the Bayesian inference process. IGMMs are a specific variant [76] of Dirichlet process mixture models (DPMMs). In DPMM formulations, a stick-breaking construction [76] is often used to generate mixture/component weights on which a Dirichlet process prior is placed. Unlike traditional GMMs, IGMMs do not assume that the number of mixture components is known a priori and instead infer the components via Bayesian inference strategies, such as Markov chain Monte Carlo sampling [76].
A key point with respect to Bayesian inference approaches is that their success is contingent on the training sample size relative to the dimensionality of the feature space. For example, with a K-component GMM and full covariance matrices, one must estimate K(1 + d(d + 1)/2) + Kd parameters from the training data. Learning such high-dimensional parametrizations with limited training data is highly impractical; hence, feature extraction (dimensionality reduction) is often a critical preprocessing step for such Bayesian inference strategies. Figure 3.9 illustrates the performance of GMM-based classification coupled with a locality-preserving feature reduction approach for classification of the Indian Pine 1992 data set. For this data set, LFDA-SVM and LFDA-GMM consistently resulted in the highest overall accuracy. In our recent work [77], we found that for hyperspectral image analysis, including vegetation analysis, locality-preserving approaches to subspace learning are the most effective feature extraction/preprocessing for Bayesian inference, as they preserve the locality (neighborhood structure) of the feature space in the embedded subspace.
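The parameter count can be checked by counting the mixture weights, mean vectors, and entries of the symmetric full covariance matrices directly; for example, a 3-component model on 200-band spectra already requires tens of thousands of parameters:

```python
def gmm_param_count(K, d):
    """Mixture weights (K) + means (K*d) + symmetric covariances (K*d*(d+1)/2)."""
    return K + K * d + K * d * (d + 1) // 2

# A 3-component, full-covariance GMM on 200-band spectra:
n_params = gmm_param_count(3, 200)   # 60,903 parameters to estimate
```

This dwarfs typical hyperspectral training set sizes, which is precisely why dimensionality reduction precedes the density estimation step.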
Figure 3.9 Classification accuracy as a function of training sample size using GMM classifiers coupled with LFDA-based feature reduction for the Indian Pine 1992 data set. Comparisons to baseline methods are also provided: linear discriminant analysis–maximum likelihood estimation (LDA-MLE); regularized linear discriminant analysis–maximum likelihood estimation (RLDA-MLE); subspace LDA–Gaussian mixture model (Subspace LDA-GMM); recursive feature elimination–SVM (RFE-SVM); kernel local Fisher discriminant analysis–SVM (KLFDA-SVM); local Fisher discriminant analysis–SVM (LFDA-SVM); and local Fisher discriminant analysis–Gaussian mixture model (LFDA-GMM).
Deep learning has emerged as a very effective approach for contemporary computer vision and signal analysis tasks, driven in part by the increased availability of high-performance computing infrastructure and rich libraries for training models. Deep learning algorithms include deep convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep belief networks (DBNs), which have been successfully deployed for speech recognition, computer vision, natural language processing, and, more recently, remote sensing applications [79–81]. CNNs and their variants have been used successfully for tasks such as large-scale object detection and transfer learning/domain adaptation. RNNs and their variants have also been demonstrated to be useful for temporal modeling in applications such as speech recognition [82,83].
Within the context of pixel-level hyperspectral image analysis, we have observed that the spectral reflectance characteristics of the data can be modeled quite effectively through what we call convolutional recurrent neural networks (CRNNs). The approach entails a series of convolutional and pooling layers followed by recurrent neural network layers, as shown in Figure 3.10. The convolutional layers are adept at extracting stable, locally invariant features from the spectral reflectance profiles; each convolutional layer has multiple 1D convolutional filters whose support is a data-dependent parameter. The pooling layers subsample to reduce the dimensionality of the network, which reduces computation and helps control overfitting. The recurrent layers extract the interchannel relationships in the spectral reflectance profiles. A recurrent layer has a recursive function f that takes as input one input vector $\mathbf{x}_t$ and the previous hidden state $\mathbf{h}_{t-1}$, and returns the new hidden state as
$$\mathbf{h}_t = f\left(\mathbf{x}_t, \mathbf{h}_{t-1}\right).$$
Figure 3.10 CRNN architecture for hyperspectral data classification.
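The recurrent update h_t = f(x_t, h_{t−1}) can be illustrated with a minimal NumPy sketch (a plain tanh recurrence with random weights; in the actual CRNN, these weights are learned and the inputs are features produced by the convolutional/pooling layers):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_hid = 4, 3                        # hypothetical input/hidden sizes
W_x = rng.normal(0.0, 0.1, (d_hid, d_in))
W_h = rng.normal(0.0, 0.1, (d_hid, d_hid))
b = np.zeros(d_hid)

# Scan a short sequence (e.g., grouped spectral responses) through the layer.
h = np.zeros(d_hid)
for x_t in rng.normal(0.0, 1.0, (6, d_in)):
    h = rnn_step(x_t, h, W_x, W_h, b)
```

Because the same hidden state is carried across the sequence of band groups, the final state summarizes relationships across the whole spectral profile.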
             |              # Training Samples
Classifier   |   1900           3800           5700
RBF-SVM      |   92.82 (±1.07)  94.36 (±0.93)  95.13 (±0.64)
CNN          |   93.11 (±0.95)  94.53 (±0.39)  95.84 (±0.31)
RNN          |   84.83 (±1.62)  89.74 (±0.98)  91.86 (±0.77)
CRNN         |   94.43 (±1.01)  96.24 (±0.60)  96.83 (±0.47)

Source: Adapted from H. Wu and S. Prasad. Remote Sensing, vol. 9, no. 3, p. 298, 2017. [86]
Segmentation is often an effective precursor to classification, providing the capability to identify homogeneous regions that are classified as objects. A myriad of image segmentation approaches have been proposed and developed over the years (see, e.g., [87]). These approaches can be adapted to segmenting hyperspectral imagery as a preprocessing stage to classification by first reducing the data dimensionality through feature selection and extraction (see previous section) or by utilizing an appropriate region similarity (or dissimilarity) criterion.
A dissimilarity criterion designed for hyperspectral data is the spectral angle mapper (SAM) criterion [88]. An important property of the SAM criterion is that poorly illuminated and brightly illuminated pixels of the same material are mapped to the same spectral angle despite the difference in illumination.
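A minimal sketch of the SAM criterion, illustrating its illumination invariance (the spectra below are hypothetical reflectance values):

```python
import numpy as np

def spectral_angle(a, b):
    """Spectral Angle Mapper: angle (radians) between two spectra."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.arccos(np.clip(cos, -1.0, 1.0))

spectrum = np.array([0.12, 0.30, 0.45, 0.40])   # hypothetical reflectances
shaded = 0.4 * spectrum                         # same material, dimmer
other = np.array([0.40, 0.35, 0.15, 0.10])      # different spectral shape

angle_same = spectral_angle(spectrum, shaded)   # ~0: illumination-invariant
angle_diff = spectral_angle(spectrum, other)    # clearly nonzero
```

Scaling a spectrum by a positive constant does not change its direction, so the angle between a spectrum and its shaded version is zero, while spectra of different shape produce a nonzero angle.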
The most successful approaches to spatial–spectral image segmentation are based on best merge region growing. An early version of best merge region growing, hierarchical stepwise optimization (HSWO), is an iterative form of region growing in which each iteration finds the best segmentation with one region fewer than the current segmentation [89]. The HSWO approach can be summarized as follows: (1) initialize the segmentation with each pixel (or an initial presegmentation) as a separate region; (2) compute the dissimilarity between all pairs of spatially adjacent regions; (3) merge the pair of regions with the smallest dissimilarity; and (4) repeat steps 2 and 3 until a convergence or stopping criterion is met.
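A minimal sketch of best merge region growing on a 1-D signal, using the increase in squared error as the dissimilarity criterion (a toy stand-in for HSWO, which operates on spatially adjacent image regions and typically uses criteria such as SAM):

```python
def best_merge_1d(values, n_regions):
    """HSWO-style best merge region growing on a 1-D signal.

    Dissimilarity between adjacent regions = increase in total squared error
    when both are represented by the merged region's mean.
    """
    regions = [[v] for v in values]          # start: one region per pixel

    def sse(region, mean):
        return sum((v - mean) ** 2 for v in region)

    def merge_cost(a, b):
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        m = (sum(a) + sum(b)) / (len(a) + len(b))
        return sse(a + b, m) - sse(a, ma) - sse(b, mb)

    while len(regions) > n_regions:
        # Find and merge the most similar pair of adjacent regions.
        i = min(range(len(regions) - 1),
                key=lambda k: merge_cost(regions[k], regions[k + 1]))
        regions[i:i + 2] = [regions[i] + regions[i + 1]]
    return regions

segments = best_merge_1d([0.1, 0.1, 0.2, 0.9, 1.0, 1.0], n_regions=2)
```

Recording the segmentation after every merge, rather than only at the end, yields the full segmentation hierarchy discussed next.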
HSWO naturally produces a segmentation hierarchy consisting of the entire sequence of segmentations. For practical applications, however, a subset of segmentations must be selected from this exhaustive hierarchy. A portion of such a segmentation hierarchy is illustrated in Figure 3.11 (the selection of a single segmentation from a segmentation hierarchy is discussed in a later section).
Figure 3.11 The last four levels of an n-level segmentation hierarchy produced by a region growing segmentation process. Note that when depicted in this manner, the region growing process is a "bottom-up" approach.
A unique feature of the segmentation hierarchy produced by HSWO and related region growing segmentation approaches is that the segment or region boundaries are maintained at the full image spatial resolution for all levels of the segmentation hierarchy. This is important for classification problems.
Many variations on best merge region growing have been described in the literature. As early as 1994, [90] described an implementation of the HSWO form of best merge region growing that utilized a heap data structure [91] for efficient determination of best merges, with a dissimilarity criterion based on minimizing the mean squared error between the region mean image and the original image. The main differences among these region growing approaches are the dissimilarity criterion employed and, perhaps, some control logic designed to remove small regions or otherwise tailor the segmentation output. In complex scenes, such as remotely sensed images of the Earth, objects with similar spectral signatures (e.g., lakes, agricultural fields, buildings) appear in spatially separated locations. In such cases, it is useful to aggregate these spectrally similar but spatially disjoint region objects into groups of region objects, or region classes. This is the basis of HSeg (hierarchical segmentation), a hybridization of HSWO best merge region growing with spectral clustering [92,93]. HSeg adds to HSWO a step, following each step of adjacent region merges, in which all pairs of spatially nonadjacent regions with dissimilarity ≤ S_{w}T_{merge} are merged, where 0.0 ≤ S_{w} ≤ 1.0 is a factor that sets the priority between spatially adjacent and nonadjacent region merges. Note that when S_{w} = 0.0, HSeg reduces to HSWO.
A recursive divide-and-conquer approximation of HSeg (called RHSeg) [94] was developed to enable a straightforward parallel implementation. The computational requirements of HSeg were further reduced by refinements discussed in [93], where HSeg segmentation results are also compared with those of other region growing–based segmentation approaches.
Determining the optimal level of segmentation detail for a particular application is a challenge for all image segmentation approaches. The level of detail produced can usually be specified by adjusting one or more of an approach's parameters. Additionally, segmentation approaches based on region growing can easily be designed to output a hierarchical set of image segmentations that can be examined later to select an optimal level of segmentation detail. In the case of HSeg, a particular set of hierarchical segmentations can be specified by providing a set of merge thresholds or numbers of region classes. HSeg can also automatically select a hierarchical set of image segmentations over a specified range of the number of region classes. This automatic approach outputs the image segmentation at the region growing iteration just before any region class would become involved in a second merge since the last segmentation output.
The following subsections describe various approaches for selecting an optimal level of segmentation detail from an image segmentation hierarchy, such as that produced by HSeg.
Figure 3.12 Example of using HSegLearn to generate a ground reference map of vegetation. (a) RGB image panel displaying a subsection of a hyperspectral image over a portion of downtown Bowie, MD. (b) HSegLearn Current Region Labels Panel after selecting some positive example region classes (vegetation, yellow) and some negative example region classes (nonvegetation, white). (c) HSegLearn Current Region Labels Panel showing the positive example regions (vegetation, green) and negative example regions (nonvegetation, red), with region labeling obtained by finding the coarsest levels in the HSeg segmentation hierarchy that do not conflict with the analyst's labeling. These hyperspectral data were obtained by NASA Goddard's LiDAR, Hyperspectral and Thermal (G-LiHT) Airborne Imager on June 28, 2012, at a nominal 1-m spatial resolution. The data set has 114 spectral bands over the spectral range of 418–918 nm.
It is well understood that with traditional SVM classifiers, the choice of kernel function and the associated kernel parameter (e.g., the width of the RBF kernel) have a significant impact on classification performance. A key practical aspect of using SVM classifiers with hyperspectral data is the need to "tune" the kernel parameters to find the appropriate (data-dependent) values that yield optimal decision boundaries. Additionally, when dealing with multisource data (e.g., data from different sensors), traditional single-kernel SVMs do not offer a compelling approach to fusing such data into a unified classification product. In recent work, multikernel learning methods have been shown to fill these gaps. Multikernel learning can be thought of as learning a traditional SVM, except that instead of a single Mercer kernel, one uses a mixture of "basis" kernels, where the mixing weights are learned as part of the SVM optimization.
In the multisource scenario, for a specific source p, the combined kernel function K between two pixels ${\mathbf{x}}_{i}^{p}$ and ${\mathbf{x}}_{j}^{p}$ can be represented as
$$K\left({\mathbf{x}}_{i}^{p},{\mathbf{x}}_{j}^{p}\right)=\sum_{m=1}^{M} d_m\,K_m\left({\mathbf{x}}_{i}^{p},{\mathbf{x}}_{j}^{p}\right),\qquad d_m \ge 0,\ \ \sum_{m=1}^{M} d_m = 1,$$
where the $K_m$ are the basis kernels and the $d_m$ are the mixture weights learned during training.
As with traditional SVMs, SimpleMKL also has a dual representation:
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{n}\alpha_i-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \sum_{m=1}^{M} d_m K_m(\mathbf{x}_i,\mathbf{x}_j)\quad \text{subject to}\quad 0\le \alpha_i \le C,\ \ \sum_{i=1}^{n}\alpha_i y_i = 0.$$
An alternative use of the MKL framework entails learning source-specific optimal MKL classifiers, which are then fused through decision fusion strategies such as majority voting or linear opinion pools. In this construction, the mixture of kernels per SVM yields a more discriminative and more linearly separable Hilbert space than a traditional single-kernel SVM. We used multikernel SVMs to systematically fuse multisensor (hyperspectral and LiDAR) geospatial data in an active learning paradigm [100].
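A minimal sketch of classifying with a fixed convex combination of two RBF "basis" kernels via a precomputed-kernel SVM in scikit-learn (the weights below are hand-chosen for illustration; MKL would learn them jointly with the SVM):

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic two-class 20-band "spectra" (hypothetical values).
X = np.vstack([rng.normal(0.2, 0.05, size=(30, 20)),
               rng.normal(0.6, 0.05, size=(30, 20))])
y = np.array([0] * 30 + [1] * 30)

# Convex combination of two "basis" RBF kernels at different widths. MKL
# would optimize these weights within the SVM training; here they are fixed.
w1, w2 = 0.7, 0.3
K = w1 * rbf_kernel(X, X, gamma=1.0) + w2 * rbf_kernel(X, X, gamma=0.1)

clf = SVC(kernel="precomputed", C=10.0).fit(K, y)
acc = clf.score(K, y)
```

For multisource fusion, each basis kernel would instead be computed on a different source (e.g., one on hyperspectral features and one on LiDAR features), with the learned weights setting each source's contribution.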
A common image analysis scenario when dealing with multisource data involves transferring knowledge/information between sources. For instance, given a source data set rich in labeled data, from which one can learn a robust classifier, can this knowledge/model be transferred to a (different) target data set for which there is insufficient training data to train an effective classifier? Such scenarios are common with multiscale, multisensor, and multitemporal data sets. Here, we refer to recent efforts [53,101,102] aimed at transferring knowledge across sources (e.g., across different sensors, or across different viewpoints or times for the same sensor).
Denote the source domain as $D_S$ with samples $\{\mathbf{x}_1,\dots,\mathbf{x}_n\} \in {\mathbb{R}}^{d_s}$ and corresponding class labels $\left\{l_{x_1},\dots,l_{x_n}\right\}$. Similarly, denote the target domain as $D_T$ with labeled samples $\{\mathbf{y}_1,\dots,\mathbf{y}_m\} \in {\mathbb{R}}^{d_t}$, $d_s \neq d_t$, and corresponding class labels $\left\{l_{y_1},\dots,l_{y_m}\right\}$, $m \ll n$. Given vectors $\mathbf{x} \in D_S$ and $\mathbf{y} \in D_T$, the goal is to project these data to a latent space ${\mathbb{R}}^{d_c}$ that is discriminative for the underlying classification task.
In recent work, we developed mappings A and B that maximize the overlap of within-class samples in the latent space by ensuring that within-class samples from $D_S$ and $D_T$ are located in the same cluster/region of the latent feature space. Table 3.4 reports results of this domain adaptation based on transformation learning (DATL) approach for the Botswana data and compares it to traditional approaches to domain adaptation, including semisupervised transfer component analysis (SSTCA) and kernel principal component analysis (KPCA). The class-specific accuracies reported in Table 3.5 result from training on a large pool of source data (May Botswana Hyperion image) and transferring the learned models to classify a different image acquired over the same region in July (target image), using between 1 and 15 target training samples per class to facilitate the alignment.
Table 3.4

            # Target Domain Training Samples per Class
Algorithm   1            3            5            7            9            11           13           15
KPCA        34.0 (4.5)   35.1 (4.6)   37.5 (4.8)   38.5 (4.5)   40.5 (4.6)   40.8 (4.8)   41.2 (4.6)   42.0 (5.5)
SSTCA       47.7 (8.4)   47.5 (7.2)   50.5 (9.1)   49.6 (7.6)   52.4 (7.5)   53.0 (7.8)   52.6 (7.3)   51.7 (6.5)
DATL        46.6 (12.2)  63.6 (13.4)  68.2 (8.5)   71.3 (10.0)  73.3 (7.9)   76.9 (5.9)   76.8 (7.9)   77.6 (6.7)

Source: Adapted from X. Zhou and S. Prasad. IEEE Transactions on Computational Imaging, vol. 3, no. 4, pp. 822–836, December 2017. [102]
Table 3.5

            Class Index
Algorithm   1      2     3     4     5     6     7     8     9
KPCA        100.0  13.4  50.1  97.9  0.6   14.0  0.5   1.8   15.1
SSTCA       100.0  54.4  61.1  99.8  10.0  32.0  15.4  18.1  37.8
DATL        68.5   50.4  47.2  94.4  77.3  42.8  59.2  89.5  84.5
Accuracies in Tables 3.6 and 3.7 are for the coastal wetland ecosystem monitoring application, where the source domain is street-view (side-looking, terrestrial) hyperspectral imagery captured during our field campaigns to identify vegetation cover, and the target domain is aerial imagery.
Table 3.6

            # Target Domain Training Samples per Class
Algorithm   1           3           5           7           9           11          13          15
KPCA        26.4 (7.6)  46.8 (9.5)  59.9 (3.3)  64.4 (3.4)  67.0 (5.0)  70.0 (5.3)  72.5 (5.6)  73.6 (6.1)
SSTCA       25.4 (9.8)  51.0 (8.3)  65.1 (3.2)  70.0 (3.0)  74.3 (2.4)  77.9 (3.2)  80.9 (3.5)  82.7 (4.2)
DATL        42.0 (5.4)  59.8 (6.6)  68.2 (7.8)  70.6 (8.0)  74.5 (7.0)  75.6 (4.7)  78.7 (5.9)  81.0 (4.9)

Source: Adapted from X. Zhou and S. Prasad. IEEE Transactions on Computational Imaging, vol. 3, no. 4, pp. 822–836, December 2017. [102]
Table 3.7

            Class Index
Algorithm   1     2     3     4     5     6     7     8     9     10    11    12
KPCA        64.7  48.5  57.5  29.7  67.5  23.5  45.8  72.5  88.0  78.7  75.3  67.3
SSTCA       65.0  63.3  59.8  30.8  75.0  32.7  49.2  73.2  91.8  86.5  81.2  72.5
DATL        71.0  66.5  69.3  41.6  67.4  25.2  47.5  71.5  82.0  91.2  70.6  62.6

Source: Adapted from X. Zhou and S. Prasad. IEEE Transactions on Computational Imaging, vol. 3, no. 4, pp. 822–836, December 2017. [102]
The goal of such a framework is to learn models from the rich source domain data (because it is imaged at close range, there is an abundance of pixels per class to train models in the source domain) and transfer this knowledge to undertake classification in the target domain. For these data, with as few as 5–15 samples per class from the target domain, very effective classification accuracies are achieved by leveraging source domain information. The results are particularly promising because the domains differ not only in sun-sensor-canopy geometry but also in sensors (the source spans the VNIR range, while the target spans the VNIR-SWIR range).
Semisupervised learning (SSL) approaches, in which unlabeled samples are classified and some are incorporated into the training set, have been investigated by many researchers for classification of high-dimensional data. SSL directly utilizes the unlabeled data to facilitate the learning process without requiring any human labeling effort. In addition to SSL approaches, many researchers have explored active learning (AL) as an alternative strategy for leveraging unlabeled samples to mitigate the impact of limited training samples on supervised classification of remotely sensed hyperspectral data. Unlike SSL approaches, AL heuristics focus on identifying the most "informative" unlabeled samples, obtaining the corresponding labels, and incorporating the newly labeled data into the training pool.
In traditional supervised classification, as shown in the box in Figure 3.13, labeled samples/features are presented to the classifier for training, and unlabeled samples from the data pool are then classified using the resulting model. AL involves an iterative training/evaluation phase, whereby an initial model is developed using a small number of samples, and the classification results are evaluated. Samples are then selected from the unlabeled pool, evaluated according to “query” criteria, and ranked. Sample(s) are chosen from the list for labeling according to specified selection criteria, labeled, and incorporated into the training set. The remaining candidates are returned to the unlabeled pool. The classifier is updated using the augmented training set, and the process is repeated until a stopping criterion (e.g., number of iterations, number of samples, classification accuracy) is achieved.
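A minimal sketch of this pool-based AL loop, with a logistic-regression stand-in classifier, a least-confidence query criterion, and the true labels acting as the "oracle" (all data and settings are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic two-class pool of 200 samples in a 5-D feature space.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(2.0, 1.0, size=(100, 5))])
y = np.array([0] * 100 + [1] * 100)

labeled = [0, 100, 1, 101]               # tiny initial training set
pool = [i for i in range(200) if i not in labeled]

clf = LogisticRegression(max_iter=1000)
for _ in range(10):                      # stopping criterion: 10 iterations
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    # Query: the least confident pool sample (posterior closest to 0.5).
    q = int(np.argmin(np.abs(proba[:, 1] - 0.5)))
    labeled.append(pool.pop(q))          # the "oracle" supplies y for it
final_acc = clf.fit(X[labeled], y[labeled]).score(X, y)
```

In a real campaign, the queried label comes from an analyst or field visit rather than a stored ground truth array, and the stopping criterion may instead be a labeling budget or a target accuracy.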
Figure 3.13 Flow chart for active and passive learning.
AL strategies typically result in biased sampling of classes, with more samples being selected from classes that are more difficult to discriminate, as shown in the example of Figure 3.14. AL strategies are generally characterized by the criteria used to select candidate samples for labeling. The approaches include (1) margin sampling (MS)–based approaches, where the samples closest to the separating hyperplane of the classifier, such as support vector machines, are considered as the most uncertain ones [103,104]; (2) committee of learners–based approaches, where those samples exhibiting maximal disagreement between the different classification models are selected [8,105,106]; and (3) class probability distribution–based approaches, where breaking ties (BT) is a representative method in which the difference between the two highest posterior probabilities is used to quantify the uncertainty of a pixel [107,108]. Algorithms are typically implemented with spectral data as the singlesource inputs, although multiplesource inputs have also been studied [109,110]. Active learning for classification has been used in multiple applications [111,112] including largescale scenarios [113]. We refer to [9,114] for surveys on AL methods.
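The breaking ties criterion from (3) can be sketched directly from class posterior probabilities (the values below are hypothetical):

```python
import numpy as np

def breaking_ties(proba):
    """BT uncertainty: gap between the two largest class posteriors per sample.

    `proba` has shape (n_samples, n_classes); a smaller gap = more uncertain.
    """
    top_two = np.sort(proba, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]

proba = np.array([[0.50, 0.45, 0.05],    # ambiguous between two classes
                  [0.90, 0.07, 0.03]])   # confidently classified
gaps = breaking_ties(proba)
query_idx = int(np.argmin(gaps))         # pixel selected for labeling
```

Pixels whose two leading posteriors nearly tie sit close to a decision boundary, which is why BT preferentially samples the classes that are hardest to discriminate.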
Figure 3.14 Distribution of labeled class samples for Kennedy Space Center data. (a) True distribution, (b) AL sampling, (c) random sampling.
Results from [106] are included to illustrate AL based on a committee of learners. Views can be composed of multiple classifiers resulting from different methods, different inputs, and so on. Here, multiple views of the problem were derived by segmenting the spectral data into disjoint contiguous subband sets generated by (1) correlation of contiguous bands, (2) k-means–based band clustering, (3) deterministic selection of every kth band (band slicing), and (4) random sampling. Views generated by correlation and clustering are diverse but may differ in their discriminative ability for individual classes, so there is a risk of insufficiency for the classification, while views obtained from band slicing may be redundant but are sufficient for covering the full space of inputs. Finally, random sampling provides diverse views, although they are not guaranteed to be either sufficient or accurate. Figure 3.15 illustrates multiview subsetting of the input space of the KSC data based on interband correlation. A two-stage query and sample selection process was used, where the first stage ranked samples by the maximum disagreement of the classification results across views, and the second invoked an entropy criterion to increase the diversity of the selected samples. Figures 3.16 and 3.17 show the accuracy curves as AL progresses. Margin sampling had the highest overall accuracy, although the multiview methods converged to approximately the same overall accuracy.
Figure 3.15 Correlation matrix for Kennedy Space Center AVIRIS data. Each block corresponds to a view.
Figure 3.16 Overall classification accuracy for SVM classification of Kennedy Space Center data with multiview active learning based on band correlation (Cr), k-means clustering (Ck), uniform band slicing (Us), and random view generation (RU), compared to random sampling (RS) and margin sampling (SVM-MS).
Figure 3.17 The overall classification accuracy and view performance derived from correlationbased view generation Cr for Kennedy Space Center data.
Most of the AL methods proposed in the literature deal solely with spectral information, but improvements in classification accuracy can be obtained by also exploiting the spatial dimension. Only recently have researchers started to integrate spatial information with spectral features in the AL framework [109,115–117]. A recent strategy [98] that incorporates AL, semisupervised learning, and segmentation into a unified framework is shown in Figure 3.18 and detailed in the following.
Figure 3.18 Flow chart illustrating the segmentation-based AL framework that incorporates AL, semisupervised learning, and segmentation into a unified framework.
The approach relies on the HSeg algorithm introduced earlier in this chapter, whose output is a segmentation hierarchy that contains different levels of detail of the image. The appropriate level can be selected in a supervised way by quantitatively evaluating the segmentation results at each hierarchical level (as discussed previously in this chapter and in [97]) or in an unsupervised way by interactively using the HSegViewer tool. Both approaches assume that the best segmentation corresponds to a single level of the hierarchy. However, coherent objects may be found at different levels [118], and therefore the best segmentation should be defined by selecting regions from different levels. This process is usually referred to as pruning, in which subtrees of the hierarchy that are homogeneous with respect to some defined homogeneity criteria are removed. Although pruning strategies based on supervised [119], semisupervised [120], and unsupervised [121] approaches have been proposed, they are not suitable for use within an AL framework. While AL and SSL have different workflows, both aim to make the best use of unlabeled data and to reduce human labeling effort, so it is natural to combine the two strategies and take advantage of both paradigms in the classification task. The main idea of the strategy is to find an optimal segmentation map from the segmentation hierarchy by means of a novel supervised pruning strategy, which removes from the HSeg output redundant subtrees composed of nodes that are homogeneous with respect to some criteria of interest. As a result, it generates an optimal segmentation map that can provide spatial information for the classification. The best segmentation does not correspond to one of the actual levels of the hierarchy but incorporates regions selected from potentially different levels. Two merging criteria, based on node size and the Bhattacharyya coefficient, are considered.
The whole pruning process is integrated within the AL framework. Compared to the unsupervised pruning strategy proposed in [121], the new method can exploit the labeled information provided by the user (which grows through the AL process) and thus update the best segmentation map at each AL iteration. At the end of the pruning, a single segmentation map is obtained, which is exploited to incorporate spatial information into the framework in two ways: (1) spatial features (e.g., mean and standard deviation) are extracted from each segment and concatenated with the original spectral features into a single stacked feature vector, and an SVM is adopted as the backend classifier and trained on the enriched feature space; and (2) the continuously updated segmentation map is used to expand the training set by employing both AL and self-learning–based SSL approaches. At each iteration, both labeled and pseudolabeled samples are added to the training set and used jointly to train the classifier. The most uncertain samples are selected using the BT criterion and labeled by the human expert. Pseudolabels are assigned automatically by taking advantage of spatial information; such pseudolabeled samples help increase the size of the training set when the available labeled training set is small. For this purpose, the set of candidate unlabeled samples is first defined as the samples that belong to regions containing identically labeled samples in the current segmentation map. The framework is validated experimentally on the Indian Pine 1992 data set. Improvements in overall classification accuracy are exhibited by the new framework in comparison with other spectral–spatial strategies, as reported in Figures 3.19 through 3.21.
Figure 3.19 Learning curve (overall accuracy vs. number of labeled samples) on the Indian Pine 1992 data set. The methods Adseg_AddFeat and Adseg_AddFeat+Addsamp are compared with other spectral–spatial strategies. Win_AddFeat: spectral and spatial features extracted from a 3 × 3 window; Win_SumKer: spectral and spatial features extracted from a 3 × 3 window with a kernel-composite approach; Fixseg_AddFeat: spatial features extracted from a fixed segmentation map; Fixseg_Reg: the fixed segmentation map is used as a regularizer; EMP: morphological features extracted from the first two PCs. All methods adopt BT as the AL query criterion.
Figure 3.20 Normalized confusion matrices (values in percentage) achieved on the Indian Pine 1992 data set. (a) Noseg, (b) Win_AddFeat, (c) EMP, (d) Adseg_AddFeat + AddSamp.
Figure 3.21 Classification maps achieved on the Indian Pine 1992 data set. (a) Noseg, (b) Win_AddFeat, (c) EMP, (d) Adseg_AddFeat + AddSamp.
Feature extraction and AL problems for hyperspectral image classification have usually been investigated independently. In a traditional AL-based hyperspectral image classification chain, feature extraction is usually executed first in the original high-dimensional feature space as a preprocessing step to obtain an optimally reduced feature space. This can be accomplished in an unsupervised way using manifold learning strategies or in a supervised way by exploiting the limited labeled information available at the beginning of the process. An AL algorithm is then applied in the reduced feature space to increase the number of points in the training set. However, the feature space extracted earlier may be suboptimal relative to the resulting training set: the unsupervised feature extraction step has no connection with the classification problem, while the supervised step is performed using the few, potentially nonrepresentative, initial labeled samples. In both cases, the extracted feature space is fixed and does not interact in any way with the additional information provided by the user during the AL process. Therefore, even an optimal AL strategy cannot guarantee maximization of the classification accuracy, since it is applied to a suboptimal feature space.
Novel solutions have recently been proposed in the literature with the aim of combining feature extraction and AL into a unique framework [122–124], as summarized in the flowchart of Figure 3.22. The overall idea is to learn and update a reduced feature space in a supervised way at each iteration of the AL process, thus exploiting the increasing labeled information provided by the user. In particular, the computation of the reduced feature space is based on the large-margin nearest neighbor (LMNN) metric learning principle [125]. The metric learning strategy is applied in conjunction with kNN classification and novel sample selection criteria.
Figure 3.22 Flow chart illustrating the active-metric learning approach for supervised classification.
Specifically, consider a hyperspectral image $X={\left\{{\mathbf{x}}_{i}\right\}}_{i=1}^{n}$, where ${\mathbf{x}}_{i}=[{x}_{i,1},\dots,{x}_{i,d}]$ is a sample in the original high $d$-dimensional feature space and $n$ is the total number of pixels in $X$. Define ${\mathbf{x}}_{i}^{\prime}=[{x}_{i,1}^{\prime},\dots,{x}_{i,r}^{\prime}]$ as the same sample in the reduced $r$-dimensional feature space obtained by adopting a feature extraction strategy. A training set $L={\left\{({\mathbf{x}}_{i},{y}_{i})\right\}}_{i=1}^{l}$ is constructed by selecting $l$ samples from $X$ and assigning corresponding discrete labels ${y}_{i}$ (where ${y}_{i}\in \{1,\dots,\Omega\}$, and $\Omega$ is the number of thematic classes). $U={\left\{{\mathbf{x}}_{i}\right\}}_{i=l+1}^{l+u}$ is defined as the set of $u$ remaining unlabeled samples, that is, $U = X - L$ and $n = u + l$. The supervised dimensionality reduction strategy exploits the training set $L$ to generate a low-dimensional feature space $X^{\prime}$ from $X$, and an AL strategy, where the most uncertain samples are selected and labeled, is then applied on $X^{\prime}$. The process is iterated until a convergence criterion is satisfied.
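Using this notation, the iterative procedure might be sketched as a toy loop (assuming NumPy; this is a deliberately simplified stand-in, not the method of [122–124]): the reduced space is recomputed from the current labeled set via a PCA-style projection, the most uncertain unlabeled sample under a centroid-distance margin is queried, and its true label is assumed to come from the oracle.

```python
import numpy as np

def reduce_space(X, L_idx, r=2):
    """Toy supervised reduction: project all of X onto the top-r principal
    components computed from the currently labeled samples only."""
    Xl = X[L_idx]
    mean = Xl.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xl - mean, full_matrices=False)
    return (X - mean) @ Vt[:r].T

def active_learning(X, y_true, L_idx, n_iter=3, r=2):
    """Toy AL loop in the X / L / U notation: reduce, query the most
    uncertain unlabeled sample, label it (oracle = y_true), repeat."""
    L_idx = list(L_idx)
    for _ in range(n_iter):
        Xr = reduce_space(X, L_idx, r)
        classes = np.unique(y_true[L_idx])
        centroids = np.array(
            [Xr[[i for i in L_idx if y_true[i] == c]].mean(axis=0)
             for c in classes])
        U_idx = [i for i in range(len(X)) if i not in L_idx]
        # Distance of each unlabeled sample to every class centroid.
        d = np.linalg.norm(Xr[U_idx, None, :] - centroids[None, :, :], axis=2)
        d.sort(axis=1)
        margin = d[:, 1] - d[:, 0] if d.shape[1] > 1 else d[:, 0]
        q = U_idx[int(np.argmin(margin))]  # most uncertain sample
        L_idx.append(q)                    # oracle provides y_true[q]
    return L_idx
```

The key structural point is that the projection is refit inside the loop, so the reduced space tracks the growing labeled set rather than staying fixed after initialization.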
The dimensionality reduction step associated with the LMNN algorithm [125], which is schematically represented in Figure 3.23, is implemented in conjunction with the Mahalanobis distance [126] and extended to improve classification performance in terms of accuracy and computational time: (1) dimensionality reduction is incorporated directly into the objective function optimization process, (2) the optimization process is iterated through a multi-pass approach, and (3) kNN search is accelerated through a ball tree formulation.
Figure 3.23 Schematic representation of LMNN metric learning method. (a) Original feature space. (b) Feature space after training. After training, K similar samples (in green) are separated from the dissimilar ones by a unit margin. Local neighborhood in gray.
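As a sketch of the distance computation at the core of LMNN (not the LMNN optimizer itself): a Mahalanobis metric parameterized as $M = \mathbf{L}^{\top}\mathbf{L}$ equals the Euclidean distance after applying the linear map $\mathbf{L}$, which is why kNN search under the learned metric can be accelerated by building a standard spatial index such as a ball tree on the transformed points. The snippet below assumes NumPy and uses an arbitrary illustrative transform; the brute-force search stands in for the ball tree.

```python
import numpy as np

def mahalanobis_dist(x, y, L):
    """d_M(x, y) = sqrt((x-y)^T M (x-y)) with M = L^T L, which equals
    the Euclidean distance between Lx and Ly."""
    diff = L @ (x - y)
    return float(np.sqrt(diff @ diff))

def knn_in_learned_space(X, query, L, k=3):
    """kNN under the learned metric: transform the data once, then run a
    plain Euclidean search (a ball tree would replace this for large n)."""
    Z = X @ L.T
    q = L @ query
    d = np.linalg.norm(Z - q, axis=1)
    return np.argsort(d)[:k]
```

Because the transform is applied once up front, every subsequent query costs the same as ordinary Euclidean kNN, which is what makes the ball-tree acceleration mentioned above applicable.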
The original method can only handle single input features (e.g., pure spectral features) and does not specifically accommodate multiple-feature scenarios. It is extended in [123], where multiple feature types are accommodated by extending LMNN to heterogeneous multi-metric learning (HMML) [127]. A reduced feature space is obtained for each feature type by adopting a modified version of the HMML algorithm, and AL is then applied in the resulting single feature space in conjunction with kNN classification. Further improvements were obtained in [124] via a regularized multi-metric AL framework that jointly learns distinct metrics for different feature types. The regularizer incorporates unlabeled data based on the neighborhood relationship, which also helps avoid overfitting at early AL stages. As the iterative process proceeds, the regularizer is updated through similarity propagation, thus taking advantage of informative labeled samples. Finally, multiple features are projected into a common feature space, in which a new batch-mode selection strategy that incorporates uncertainty and diversity criteria is used. A comparison of the different active-metric learning methods on the Indian Pine 1992 data set is reported in Figure 3.24.
Figure 3.24 Comparisons among different active-metric learning methods on the Indian Pine 1992 data set. (a) Learning curve (overall accuracy as a function of the number of labeled samples) for the strategies UpdRegHMML+Unc [124], HMML+Unc [123], and LMNN_Stack+Unc [122], in addition to SVM + MS and SVM + MS + ABD [8]. Class-specific accuracies as a function of the number of labeled samples for (b) SVM + MS and (c) UpdRegHMML + UncDiv.
This chapter has addressed key issues in classification of hyperspectral data and provided an overview of strategies to address these problems. As noted in the introduction, availability of hyperspectral imagery and associated ground reference data, including class labels, has been a significant hurdle to advancing classification methods focused on vegetation and agricultural croplands. This problem will be addressed in part by the upcoming launches of combinations of traditional near-polar-orbiting satellite missions and constellations of small satellites carrying hyperspectral cameras. The resulting improvement in the temporal resolution of data will be particularly relevant to applications in agriculture and vegetation. Hyperspectral camera and global navigation satellite system (GNSS) technologies are also advancing, including miniaturization, making them viable sensors for UAVs and ground-based platforms that provide higher spatial resolution data that can be collected on demand at reduced cost. The resulting data sets will be enormous, motivating the need for advances in data processing and management, including onboard processing and analysis.
The high dimensionality of hyperspectral data, coupled with band redundancy, is a well-known obstacle for traditional parametric classifiers, particularly when the quantity of labeled data for training is limited. Dimensionality reduction as a front-end processing step is both the most direct and most widely used approach to address this problem. Traditional methods, including direct feature selection posed as an optimization problem in either a supervised or an unsupervised manner (Section 3.3.1.1) and the use of narrow-band indices (Section 3.3.1.2), continue to be popular for classification, disease detection, and yield prediction in agriculture, as they are often correlated with plant vigor or specific chemistry-based responses at various stages of crop development. Both feature selection and vegetation indices have the advantage of being interpretable, but potentially ignore useful information in the rest of the spectrum. Alternatively, feature extraction approaches, which can exploit the full set of bands and inputs from other sources, are more popular for classification of single images, but suffer in terms of interpretability and generalizability across multiple images. Traditional linear feature extraction methods (Section 3.3.1.3) are still widely used as inputs for classification and for other applications of hyperspectral data such as unmixing, in part because of their availability in commercial software. However, as discussed and illustrated in Section 3.3.1.4, nonlinear extraction approaches, particularly local graph-based methods, are promising for increasing class separation and thereby classification accuracy, although at increased computational cost.
Inclusion of spatial information can be extremely beneficial for classification of scenes containing natural vegetation or agriculture. Combining high spectral resolution data with spatial information has made it possible to discriminate classes that are spectrally quite similar, even at the relatively coarse spatial resolution of AVIRIS and Hyperion, as the impact of within-class spatial variability is reduced. Higher spatial resolution data obtained by low-altitude aircraft, UAVs, and ground-based systems provide the capability to actually exploit higher-frequency spatial components in the data, which may be useful for improved class discrimination. Incorporation of texture measures as inputs (Section 3.3.2) and use of hierarchical segmentation approaches (Section 3.4.4) and new classification strategies such as deep learning (Section 3.4.3) all leverage multiscale spatial information for classification, as illustrated in Section 3.4.3 using the 2010 Indian Pine data. While no single strategy dominates as the best approach for inclusion of spatial information in a classification problem, example results in this chapter, as well as the literature, clearly demonstrate the importance of spatial context in classification of natural vegetation and agricultural images.
New models of inputs, as well as nonparametric data-driven approaches, are also receiving significant attention for classification of hyperspectral data. SVM classifiers (Section 3.4.1.1), which directly avoid the issue of non-Gaussian class-conditional density functions and high-dimensional inputs, are now widely implemented, although proper parameter tuning is challenging and requires adequate quantities of training data. Multikernel extensions of traditional single-kernel SVMs (Section 3.5.1.1) are especially promising for multisource inputs, as demonstrated by the examples in this chapter and the literature. When coupled with feature extraction for dimensionality reduction, Gaussian mixture models implemented in a Bayesian framework can also provide an effective strategy for classifying hyperspectral data (Section 3.4.2). Deep learning approaches (Section 3.4.3), which are now being widely explored for classification of hyperspectral data, have the capability to exploit nonlinear relationships and interactions, but require very large data sets for training. The potential of deep learning for classification of hyperspectral data is still in its infancy. Early results have stimulated significant research in both algorithm development and applications, including approaches for leveraging the structure of hyperspectral spectra and new strategies for addressing the limited-training-data issue. The chapter also included examples of newly developed approaches for addressing challenges and new opportunities in hyperspectral data classification. Relative to training and application of classifiers, methods to tackle domain adaptation and transfer learning will be necessary as more hyperspectral data become available. We presented one strategy for feature alignment that yielded promising results, and referenced others (Section 3.5.1.2).
We also included a novel example where training data acquired by a ground-based system were used to train a classifier that was applied over an extended area. It provided not only an opportunity for multiscale learning, but also a strategy for expanding the limited training data set.
Finally, we addressed the issue of limited training data through active and metric learning in Section 3.5.2. Active learning provides a flexible framework for initiating the classification process with a small number of training samples and augmenting the training set with samples selected from the unlabeled data. The framework can be implemented with front-end feature extraction from single or multiple sources and with appropriate back-end classifiers. An example based on the Kennedy Space Center AVIRIS data illustrates the potential of the approach, including targeted learning related to classes that are difficult to discriminate (Section 3.5.2.1). One limitation of active learning is that the true labels of pixels identified for inclusion in the training set must be determined. One strategy for addressing this problem is illustrated using the original Indian Pine data, where active learning is coupled with hierarchical segmentation to leverage spatially homogeneous areas and semi-supervised learning (Section 3.5.2.2). As shown in the quantitative and qualitative results, this is a particularly effective approach for the testbed data. Another potential problem of active learning is that features extracted from the initial small training set may not be optimal as learning progresses, necessitating updates that can be computationally intensive, particularly for nonlinear feature extractors. Metric learning (Section 3.5.2.3) provides a new strategy that naturally integrates updates to the reduced feature space into the active learning framework using the large-margin nearest neighbor principle, which couples naturally with a kNN classifier (which can also be considered a limitation). Early results from active-metric learning are promising for both single- and multiple-source input data.
While the current strategies for classification of hyperspectral data build on a rich foundation, significant advances are still needed. New classification algorithms whose data structures and architectures exploit hyperspectral data for multitemporal, multiscale studies of agriculture and vegetation are particularly needed to effectively utilize hyperspectral imagery, both as standalone methods and in conjunction with biophysical models. Classification methods have traditionally been developed in a stovepipe fashion by algorithm developers working in isolation from the application domain and with limited understanding of it. The next generation of classification systems must leverage the knowledge of collaborative, multidisciplinary research teams composed of both methodologically focused researchers and partners from the basic and applied earth and life sciences.
The authors would like to acknowledge the graduate students and postdoctoral scholars at Purdue University and University of Houston for setting up the experiments and generating results with methods reviewed in this overview chapter. We would also like to thank Farideh Foroozandeh Shahraki at the University of Houston for her help with formatting the chapter.