Using WebHDFS REST API
Apache Hadoop provides native libraries for accessing HDFS, but many users want to access HDFS remotely without installing heavy client-side native libraries. For example, some applications need to load data into and out of the cluster, or to interact with HDFS data from external systems. WebHDFS addresses these needs by providing a fully functional HTTP REST API for accessing HDFS.
WebHDFS provides the following features:
Provides read and write access. Supports all HDFS operations (such as granting permissions, configuring the replication factor, and accessing block locations).
Supports all HDFS parameters with defaults.
Permits clients to access Hadoop from multiple languages without actually installing Hadoop. You can also use common tools like curl/wget to access HDFS.
Uses the full bandwidth of the Hadoop cluster for streaming data: The file read and file write calls are redirected to the corresponding datanodes.
Uses Kerberos (SPNEGO) and Hadoop delegation tokens for authentication.
WebHDFS is completely Apache open source. Pivotal contributed the code to Apache Hadoop as a first-class, built-in Hadoop component.
Requires no additional servers. However, a WebHDFS proxy (for example, HttpFS) is useful in certain cases and is complementary to WebHDFS.
In this section:
WebHDFS User Guide
The following examples use the curl command-line tool to access HDFS through the WebHDFS REST API.
To read a file (for example, /foo/bar):
curl -i -L "http://$<Host_Name>:$<Port>/webhdfs/v1/foo/bar?op=OPEN"
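The -L flag matters here: the NameNode answers an OPEN request with an HTTP 307 Temporary Redirect to the DataNode that holds the data, and -L tells curl to follow it. A sketch of both steps, using illustrative host and port values (namenode.example.com and 50070 are placeholders, not values from this document):

```shell
# Illustrative addresses; substitute your own NameNode host and port.
NN_HOST=namenode.example.com
NN_PORT=50070
URL="http://$NN_HOST:$NN_PORT/webhdfs/v1/foo/bar?op=OPEN"

# Without -L, curl stops at the NameNode's 307 Temporary Redirect,
# whose Location header points at the DataNode serving the file bytes:
curl -i "$URL"

# With -L, curl follows the redirect and streams the file contents:
curl -L "$URL"
```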
To list a directory (for example, /foo/):
curl -i "http://$<Host_Name>:$<Port>/webhdfs/v1/foo/?op=LISTSTATUS"
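LISTSTATUS returns a JSON FileStatuses object. As a sketch of consuming it in a script (host names are illustrative, and python3 on the PATH is an assumption), the entry names and types can be pulled out of the response like this:

```shell
NN_HOST=namenode.example.com   # illustrative; use your NameNode
NN_PORT=50070

# List /foo/ and print each entry's name and type from the JSON reply,
# which has the shape {"FileStatuses":{"FileStatus":[...]}}.
curl -s "http://$NN_HOST:$NN_PORT/webhdfs/v1/foo/?op=LISTSTATUS" \
  | python3 -c '
import json, sys
for st in json.load(sys.stdin)["FileStatuses"]["FileStatus"]:
    print(st["pathSuffix"], st["type"])
'
```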
To list the status of a file (for example, /foo/bar) or a directory:
curl -i "http://$<Host_Name>:$<Port>/webhdfs/v1/foo/bar?op=GETFILESTATUS"
To write a local file (for example, newFile) into HDFS as /foo/newFile:
curl -i -X PUT -L "http://$<Host_Name>:$<Port>/webhdfs/v1/foo/newFile?op=CREATE" -T newFile
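Under the hood, CREATE is a two-step exchange: the NameNode replies with a 307 Temporary Redirect whose Location header names a DataNode, and the file body is then sent to that DataNode. The -L -T combination above performs both steps, but the handshake can also be done explicitly; a sketch with illustrative host names:

```shell
NN_HOST=namenode.example.com   # illustrative; use your NameNode
NN_PORT=50070

# Step 1: ask the NameNode where to write. No data is sent yet; the
# reply is a 307 whose Location header is the DataNode write URL.
LOCATION=$(curl -s -i -X PUT \
    "http://$NN_HOST:$NN_PORT/webhdfs/v1/foo/newFile?op=CREATE" \
  | awk -F': ' 'tolower($1) == "location" {print $2}' | tr -d '\r')

# Step 2: send the file body to the DataNode named in the redirect.
curl -i -X PUT -T newFile "$LOCATION"
```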
To rename the file /foo/bar to /foo/bar2:
curl -i -X PUT "http://$<Host_Name>:$<Port>/webhdfs/v1/foo/bar?op=RENAME&destination=/foo/bar2"
To make a new directory (for example, /foo2 with permission 711):
curl -i -X PUT "http://$<Host_Name>:$<Port>/webhdfs/v1/foo2?op=MKDIRS&permission=711"
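MKDIRS answers with a small JSON boolean body. A sketch of checking it in a script, assuming illustrative host names:

```shell
NN_HOST=namenode.example.com   # illustrative; use your NameNode
NN_PORT=50070

# The response body is {"boolean":true} on success; grep for it so the
# if-test reflects whether the directory was actually created.
if curl -s -X PUT \
    "http://$NN_HOST:$NN_PORT/webhdfs/v1/foo2?op=MKDIRS&permission=711" \
  | grep -Eq '"boolean" *: *true'; then
  echo "directory created"
fi
```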
When security is enabled, authentication is performed by either Hadoop delegation token or Kerberos SPNEGO. If a token is set in the delegation query parameter, the authenticated user is the user encoded in the token. If the delegation parameter is not set, the user is authenticated by Kerberos SPNEGO.
Below are examples using the curl command-line tool.
Login to the Key Distribution Center (KDC).
Provide any arbitrary user name and a null password.
Execute the following commands:
curl -i --negotiate -u:anyUser "http://$<Host_Name>:$<Port>/webhdfs/v1/foo/bar?op=OPEN"
curl -i --negotiate -u:anyUser -b ~/cookies.txt -c ~/cookies.txt "http://$<Host_Name>:$<Port>/webhdfs/v1/foo/bar?op=OPEN"
The --negotiate option enables SPNEGO in curl.
The -u:anyUser option is mandatory, but the user name supplied is not actually used; instead, the Kerberos user established via kinit is used. (Provide any user name, and enter a null password when prompted.)
The -b and -c options are used for storing and sending HTTP cookies.
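A token for the delegation query parameter can itself be fetched over WebHDFS with the GETDELEGATIONTOKEN operation once a Kerberos session exists. A sketch, with illustrative host and renewer names and python3 assumed for JSON parsing:

```shell
NN_HOST=namenode.example.com   # illustrative; use your NameNode
NN_PORT=50070

# Fetch a delegation token using the existing Kerberos (SPNEGO) session.
# The reply has the shape {"Token":{"urlString":"<opaque token>"}}.
TOKEN=$(curl -s --negotiate -u:anyUser \
    "http://$NN_HOST:$NN_PORT/webhdfs/v1/?op=GETDELEGATIONTOKEN&renewer=anyUser" \
  | python3 -c 'import json, sys; print(json.load(sys.stdin)["Token"]["urlString"])')

# Later requests can authenticate with the token instead of SPNEGO:
curl -i -L "http://$NN_HOST:$NN_PORT/webhdfs/v1/foo/bar?op=OPEN&delegation=$TOKEN"
```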
The HTTP REST API supports the complete FileSystem interface for HDFS. For more information, see the following sections in the WebHDFS REST API documentation: