Extracting effective features remains a challenging problem in texture classification because of the uncertainty of scales and the clutter of textural patterns. Traditionally, texture classification has relied on spectral analysis in the frequency domain. Recent studies have shown the potential of convolutional neural networks (CNNs) for texture classification in the spatial domain. In this article, we combine the two approaches to exploit the richer information available across both domains and propose a novel network architecture named contourlet CNN (C-CNN). The network aims to learn sparse and effective feature representations for images. First, the contourlet transform is applied to extract spectral features from an image. Second, a spatial-spectral feature fusion strategy is designed to incorporate the spectral features into the CNN architecture. Third, statistical features are integrated into the network through statistical feature fusion. Finally, the fused features are classified to obtain the results. We also investigate the behavior of the parameters in the contourlet decomposition. Experiments on three widely used texture data sets (KTH-TIPS2-b, DTD, and CUReT) and five remote sensing data sets (UCM, WHU-RS, AID, RSSCN7, and NWPU-RESISC45) demonstrate that the proposed approach outperforms several well-known classification methods in terms of classification accuracy while using fewer trainable parameters.
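The data flow of the four steps above can be illustrated with a minimal NumPy sketch. It is not the C-CNN implementation and rests on several assumptions. A simple average-pooling pyramid stands in for the contourlet transform: it mimics only a multiscale band-pass stage and omits the directional filter bank. The statistical features are assumed to be the per-subband mean and variance. Fusion is assumed to be plain concatenation. The CNN features are a random placeholder vector. All function names are hypothetical.

```python
import numpy as np


def avg_pool2(x):
    """Downsample by 2 with 2x2 average pooling (height and width must be even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample2(x):
    """Upsample by 2 using nearest-neighbour repetition."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)


def pyramid_subbands(img, levels=3):
    """Stand-in for the spectral decomposition (NOT a contourlet transform).

    Returns `levels` band-pass residuals followed by the final low-pass image.
    """
    subbands = []
    current = img.astype(np.float64)
    for _ in range(levels):
        low = avg_pool2(current)
        subbands.append(current - upsample2(low))  # detail lost by pooling
        current = low
    subbands.append(current)
    return subbands


def subband_statistics(subbands):
    """Assumed statistical features: mean and variance of each subband."""
    return np.array([s for band in subbands for s in (band.mean(), band.var())])


def fuse(spatial_features, stats):
    """Assumed fusion rule: concatenate the two feature vectors."""
    return np.concatenate([spatial_features, stats])


rng = np.random.default_rng(0)
image = rng.random((64, 64))              # toy grayscale texture patch
subbands = pyramid_subbands(image, levels=3)
stats = subband_statistics(subbands)      # 4 subbands x 2 statistics
spatial_features = rng.random(128)        # placeholder for CNN features
fused = fuse(spatial_features, stats)
print(fused.shape)                        # (136,)
```

In an actual network, the fused vector would feed a classification layer, and the spectral subbands would also be merged with intermediate feature maps rather than only summarized by statistics.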