This study proposes a novel, to the best of our knowledge, transformer-based end-to-end network (TDNet) for point cloud denoising based on encoder-decoder architecture. The encoder is based on the structure of a transformer in natural language processing (NLP). Even though points and sentences are different types of data, the NLP transformer can be improved to be suitable for a point cloud because the point can be regarded as a word. The improved model facilitates point cloud feature extraction and transformation of the input point cloud into the underlying high-dimensional space, which can characterize the semantic relevance between points. Subsequently, the decoder learns the latent manifold of each sampled point from the high-dimensional features obtained by the encoder, finally achieving a clean point cloud. An adaptive sampling approach is introduced during denoising to select points closer to the clean point cloud to reconstruct the surface. This is based on the view that a 3D object is essentially a 2D manifold. Extensive experiments demonstrate that the proposed network is superior in terms of quantitative and qualitative results for synthetic data sets and real-world terracotta warrior fragments.