Class NewHadoopRDD<K,V>

Object
org.apache.spark.rdd.RDD<scala.Tuple2<K,V>>
org.apache.spark.rdd.NewHadoopRDD<K,V>
All Implemented Interfaces:
Serializable, org.apache.spark.internal.Logging

public class NewHadoopRDD<K,V> extends RDD<scala.Tuple2<K,V>> implements org.apache.spark.internal.Logging
Developer API An RDD that provides core functionality for reading data stored in Hadoop (e.g., files in HDFS, sources in HBase, or S3), using the new MapReduce API (org.apache.hadoop.mapreduce).

param: sc The SparkContext to associate the RDD with.
param: inputFormatClass Storage format of the data to be read.
param: keyClass Class of the key associated with the inputFormatClass.
param: valueClass Class of the value associated with the inputFormatClass.
param: ignoreCorruptFiles Whether to ignore corrupt files.
param: ignoreMissingFiles Whether to ignore missing files.

Note:
Instantiating this class directly is not recommended; use org.apache.spark.SparkContext.newAPIHadoopRDD() instead, as in the sketch below.
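A minimal sketch of the recommended entry point: reading text files through the new-API TextInputFormat via SparkContext.newAPIHadoopRDD(). The application name and input path are placeholders; the Configuration key shown is the standard Hadoop property for the input directory.

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf().setAppName("new-hadoop-rdd-sketch").setMaster("local[*]"))

  val hadoopConf = new Configuration()
  // Placeholder path: point this at real input data.
  hadoopConf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/logs")

  // Recommended way to obtain a NewHadoopRDD-backed RDD.
  val rdd = sc.newAPIHadoopRDD(
    hadoopConf,
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text])

  rdd.map { case (_, line) => line.toString }.take(5).foreach(println)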
  • Constructor Details

    • NewHadoopRDD

      public NewHadoopRDD(SparkContext sc, Class<? extends org.apache.hadoop.mapreduce.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, org.apache.hadoop.conf.Configuration _conf, boolean ignoreCorruptFiles, boolean ignoreMissingFiles)
    • NewHadoopRDD

      public NewHadoopRDD(SparkContext sc, Class<? extends org.apache.hadoop.mapreduce.InputFormat<K,V>> inputFormatClass, Class<K> keyClass, Class<V> valueClass, org.apache.hadoop.conf.Configuration _conf)
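If direct construction is unavoidable (again, sc.newAPIHadoopRDD() is preferred), the five-argument constructor takes the same pieces. A hedged sketch, reusing sc and hadoopConf from the example above:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
  import org.apache.spark.rdd.NewHadoopRDD

  // Direct instantiation; discouraged outside Spark internals.
  val directRdd = new NewHadoopRDD(
    sc,                       // SparkContext to associate the RDD with
    classOf[TextInputFormat], // storage format of the data to be read
    classOf[LongWritable],    // key class of the InputFormat
    classOf[Text],            // value class of the InputFormat
    hadoopConf)               // Hadoop Configuration, including the input path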
  • Method Details

    • CONFIGURATION_INSTANTIATION_LOCK

      public static Object CONFIGURATION_INSTANTIATION_LOCK()
Configuration's constructor is not thread-safe (see SPARK-1097 and HADOOP-10456), so we synchronize on this lock before calling new Configuration().
Returns:
the lock object used to guard instantiation of Configuration
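Since the lock is public, code that constructs Configuration objects from many tasks at once can reuse the same guard. A small sketch of the pattern described above, assuming the companion-object member is accessible from user code:

  import org.apache.hadoop.conf.Configuration
  import org.apache.spark.rdd.NewHadoopRDD

  // Serialize Configuration construction on the shared lock, mirroring
  // what NewHadoopRDD itself does internally.
  val safeConf: Configuration =
    NewHadoopRDD.CONFIGURATION_INSTANTIATION_LOCK.synchronized {
      new Configuration()
    }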
    • getConf

      public org.apache.hadoop.conf.Configuration getConf()
    • getPartitions

      public Partition[] getPartitions()
    • compute

      public InterruptibleIterator<scala.Tuple2<K,V>> compute(Partition theSplit, TaskContext context)
      Description copied from class: RDD
      Developer API Implemented by subclasses to compute a given partition.
      Specified by:
      compute in class RDD<scala.Tuple2<K,V>>
Parameters:
theSplit - the partition to compute
context - the TaskContext of the running task
Returns:
an iterator over the key-value records of the given partition
    • mapPartitionsWithInputSplit

      public <U> RDD<U> mapPartitionsWithInputSplit(scala.Function2<org.apache.hadoop.mapreduce.InputSplit,scala.collection.Iterator<scala.Tuple2<K,V>>,scala.collection.Iterator<U>> f, boolean preservesPartitioning, scala.reflect.ClassTag<U> evidence$1)
      Maps over a partition, providing the InputSplit that was used as the base of the partition.
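This is handy for tagging records with the file they came from. A sketch, assuming rdd is the RDD built in the first example (its concrete class there is NewHadoopRDD, so the cast holds) and that every split is a FileSplit:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.FileSplit
  import org.apache.spark.rdd.NewHadoopRDD

  val hadoopRdd = rdd.asInstanceOf[NewHadoopRDD[LongWritable, Text]]

  // Pair each record with the path of the file its partition was read from.
  val withFileNames = hadoopRdd.mapPartitionsWithInputSplit(
    (split, iter) => {
      val path = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (path, line.toString) }
    },
    preservesPartitioning = true)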
    • getPreferredLocations

      public scala.collection.immutable.Seq<String> getPreferredLocations(Partition hsplit)
    • persist

      public NewHadoopRDD<K,V> persist(StorageLevel storageLevel)
      Description copied from class: RDD
      Set this RDD's storage level to persist its values across operations after the first time it is computed. This can only be used to assign a new storage level if the RDD does not have a storage level set yet. Local checkpointing is an exception.
      Overrides:
      persist in class RDD<scala.Tuple2<K,V>>
Parameters:
storageLevel - the storage level to assign
Returns:
this RDD, with its storage level set
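Because Hadoop's RecordReader reuses the same Writable instance for every record, caching a NewHadoopRDD in deserialized form can leave many references to one object; the usual advice is to copy records out with a map before persisting. A sketch, continuing from hadoopRdd above:

  import org.apache.spark.storage.StorageLevel

  // Copy each value out of the reused Writable, then cache the copies.
  val cached = hadoopRdd
    .map { case (_, text) => text.toString }
    .persist(StorageLevel.MEMORY_AND_DISK)

  cached.count() // the first action computes and materializes the cache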