Class NewHadoopRDD<K,V>

Object
org.apache.spark.rdd.RDD<scala.Tuple2<K,V>>
org.apache.spark.rdd.NewHadoopRDD<K,V>
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging

public class NewHadoopRDD<K,V> extends RDD<scala.Tuple2<K,V>> implements org.apache.spark.internal.Logging
Developer API An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the new MapReduce API (org.apache.hadoop.mapreduce).

param: sc The SparkContext to associate the RDD with.
param: inputFormatClass Storage format of the data to be read.
param: keyClass Class of the key associated with the inputFormatClass.
param: valueClass Class of the value associated with the inputFormatClass.
param: ignoreCorruptFiles Whether to ignore corrupt files.
param: ignoreMissingFiles Whether to ignore missing files.

Note:
Instantiating this class directly is not recommended; use org.apache.spark.SparkContext.newAPIHadoopRDD() instead, as in the sketch below.
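A minimal sketch of the recommended entry point: reading text files through the new-API TextInputFormat via SparkContext.newAPIHadoopRDD(). The application name and input path are placeholders; the Configuration key shown is the standard Hadoop property for the input directory.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("new-hadoop-rdd-sketch").setMaster("local[*]"))

  val hadoopConf = new Configuration()
  // Placeholder path: point this at real input data.
  hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/logs")

  // Recommended way to obtain a NewHadoopRDD-backed RDD.
  val rdd = sc.newAPIHadoopRDD(
    hadoopConf,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text])

  rdd.map { case (_, line) => line.toString }.take(5).foreach(println)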
  • Constructor Details

    • NewHadoopRDD

      public NewHadoopRDD(SparkContext sc, Class<? extends org.apache.hadoop.mapreduce.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, org.apache.hadoop.conf.Configuration _conf, boolean ignoreCorruptFiles, boolean ignoreMissingFiles)
    • NewHadoopRDD

      public NewHadoopRDD(SparkContext sc, Class<? extends org.apache.hadoop.mapreduce.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, org.apache.hadoop.conf.Configuration _conf)
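If direct construction is unavoidable (again, sc.newAPIHadoopRDD() is preferred), the five-argument constructor takes the same pieces. A hedged sketch, reusing sc and hadoopConf from the example above:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.rdd.NewHadoopRDD

  // Direct instantiation; discouraged outside Spark internals.
  val directRdd = new NewHadoopRDD(
    sc,                       // SparkContext to associate the RDD with
    classOf[TextInputFormat], // storage format of the data to be read
    classOf[LongWritable],    // key class of the InputFormat
    classOf[Text],            // value class of the InputFormat
    hadoopConf)               // Hadoop Configuration, including the input path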
  • Method Details

    • CONFIGURATION_INSTANTIATION_LOCK

      public static Object CONFIGURATION_INSTANTIATION_LOCK()
Configuration's constructor is not thread-safe (see SPARK-1097 and HADOOP-10456), so we synchronize on this lock before calling new Configuration().
Returns:
the lock object used to guard instantiation of Configuration
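Since the lock is public, code that constructs Configuration objects from many tasks at once can reuse the same guard. A small sketch of the pattern described above, assuming the companion-object member is accessible from user code:

  import org.apache.hadoop.conf.Configuration
  import org.apache.spark.rdd.NewHadoopRDD

  // Serialize Configuration construction on the shared lock, mirroring
  // what NewHadoopRDD itself does internally.
  val safeConf: Configuration =
    NewHadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      new Configuration()
    }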
    • getConf

      public org.apache.hadoop.conf.Configuration getConf()
    • getPartitions

      public Partition[] getPartitions()
    • compute

      public InterruptibleIterator<scala.Tuple2<K,V>> compute(Partition theSplit, TaskContext context)
      Description copied from class: RDD
      Developer API Implemented by subclasses to compute a given partition.
      Specified by:
      compute in class RDD<scala.Tuple2<K,V>>
Parameters:
theSplit - the partition to compute
context - the TaskContext of the running task
Returns:
an iterator over the key-value records of the given partition
    • mapPartitionsWithInputSplit

      public <U> RDD<U> mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapreduce.InputSplit,scala.collection.Iterator<scala.Tuple2<K,V>>,scala.collection.Iterator<U>> f, boolean preservesPartitioning, scala.reflect.ClassTag<U> evidence$1)
      Maps over a partition, providing the InputSplit that was used as the base of the partition.
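This is handy for tagging records with the file they came from. A sketch, assuming rdd is the RDD built in the first example (its concrete class there is NewHadoopRDD, so the cast holds) and that every split is a FileSplit:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.FileSplit
  import org.apache.spark.rdd.NewHadoopRDD

  val hadoopRdd = rdd.asInstanceOf[NewHadoopRDD[LongWritable, Text]]

  // Pair each record with the path of the file its partition was read from.
  val withFileNames = hadoopRdd.mapPartitionsWithInputSplit(
    (split, iter) => {
      val path = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (path, line.toString) }
    },
    preservesPartitioning = true)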
    • getPreferredLocations

      public scala.collection.immutable.Seq<String> getPreferredLocations(Partition hsplit)
    • persist

      public NewHadoopRDD<K,V> persist(StorageLevel storageLevel)
      Description copied from class: RDD
      Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. Local checkpointing is an exception.
      Overrides:
      persist in class RDD<scala.Tuple2<K,V>>
Parameters:
storageLevel - the storage level to assign
Returns:
this RDD, with its storage level set
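Because Hadoop's RecordReader reuses the same Writable instance for every record, caching a NewHadoopRDD in deserialized form can leave many references to one object; the usual advice is to copy records out with a map before persisting. A sketch, continuing from hadoopRdd above:

  import org.apache.spark.storage.StorageLevel

  // Copy each value out of the reused Writable, then cache the copies.
  val cached = hadoopRdd
    .map { case (_, text) => text.toString }
    .persist(StorageLevel.MEMORY_AND_DISK)

  cached.count() // the first action computes and materializes the cache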