<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Monocular depth estimation

Monocular depth estimation is a computer vision task that involves predicting the depth information of a scene from a
single image. In other words, it is the process of estimating the distance of objects in a scene from
a single camera viewpoint.

Monocular depth estimation has various applications, including 3D reconstruction, augmented reality, autonomous driving,
and robotics. It is a challenging task as it requires the model to understand the complex relationships between objects
in the scene and the corresponding depth information, which can be affected by factors such as lighting conditions,
occlusion, and texture.

<Tip>

The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[DPT](../model_doc/dpt), [GLPN](../model_doc/glpn)

<!--End of the generated tip-->

</Tip>

In this guide you'll learn how to:

* create a depth estimation pipeline
* run depth estimation inference by hand

Before you begin, make sure you have all the necessary libraries installed:

```bash
pip install -q transformers
```
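The code examples below also use PyTorch, Pillow, Requests, and NumPy. If they are not already in your environment, you can install them as well (a minimal sketch; your setup may already provide them, for example via `transformers[torch]`):

```bash
pip install -q torch Pillow requests numpy
```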
## Depth estimation pipeline

The simplest way to try out inference with a model supporting depth estimation is to use the corresponding [`pipeline`].
Instantiate a pipeline from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads):

```py
>>> from transformers import pipeline

>>> checkpoint = "vinvino02/glpn-nyu"
>>> depth_estimator = pipeline("depth-estimation", model=checkpoint)
```
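By default the pipeline runs on CPU. If you have a GPU available, you can optionally pass a `device` index when creating the pipeline (a minimal sketch, assuming a single CUDA device; inference works on CPU as well):

```py
>>> depth_estimator = pipeline("depth-estimation", model=checkpoint, device=0)  # assumption: GPU 0 is available
```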
Next, choose an image to analyze:

```py
>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-estimation-example.jpg" alt="Photo of a busy street"/>
</div>

Pass the image to the pipeline.

```py
>>> predictions = depth_estimator(image)
```

The pipeline returns a dictionary with two entries. The first one, called `predicted_depth`, is a tensor with the values
being the depth expressed in meters for each pixel.
The second one, `depth`, is a PIL image that visualizes the depth estimation result.
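You can quickly confirm this structure yourself (a small sanity check, not required for the rest of the guide):

```py
>>> list(predictions.keys())
['predicted_depth', 'depth']
```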
Let's take a look at the visualized result:

```py
>>> predictions["depth"]
```

<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-visualization.png" alt="Depth estimation visualization"/>
</div>
## Depth estimation inference by hand

Now that you've seen how to use the depth estimation pipeline, let's see how we can replicate the same result by hand.

Start by loading the model and associated processor from a [checkpoint on the Hugging Face Hub](https://huggingface.co/models?pipeline_tag=depth-estimation&sort=downloads).
Here we'll use the same checkpoint as before:

```py
>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> checkpoint = "vinvino02/glpn-nyu"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint)
```
Prepare the image input for the model using the `image_processor`, which takes care of the necessary image transformations
such as resizing and normalization:

```py
>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values
```
Pass the prepared inputs through the model:

```py
>>> import torch

>>> with torch.no_grad():
...     outputs = model(pixel_values)
...     predicted_depth = outputs.predicted_depth
```
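Before visualizing, it can help to glance at the raw prediction. `outputs.predicted_depth` is a tensor of shape `(batch_size, height, width)` at the resolution the model works at, which is why the next step interpolates it back to the original image size (a quick sanity check; the exact spatial size depends on the checkpoint and its image processor):

```py
>>> predicted_depth.shape  # (batch_size, height, width) at the model's working resolution
```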
Visualize the results:

```py
>>> import numpy as np

>>> # interpolate to original size
>>> prediction = torch.nn.functional.interpolate(
...     predicted_depth.unsqueeze(1),
...     size=image.size[::-1],  # PIL size is (width, height); interpolate expects (height, width)
...     mode="bicubic",
...     align_corners=False,
... ).squeeze()
>>> output = prediction.numpy()

>>> # scale the depth values to the 0-255 range and convert to a PIL image for display
>>> formatted = (output * 255 / np.max(output)).astype("uint8")
>>> depth = Image.fromarray(formatted)
>>> depth
```
<div class="flex justify-center">
    <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/depth-visualization.png" alt="Depth estimation visualization"/>
</div>
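If you want to keep the result around, you can save the visualization like any other PIL image (an optional follow-up; the filename is just an example):

```py
>>> depth.save("depth_estimation.png")  # example filename
```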