Routines in support of tiled segmentation of very large rasters.

Main entry routine is doTiledShepherdSegmentation(). See that function for further details.

The broad idea is that the Shepherd segmentation algorithm, as implemented in the shepseg module, runs entirely in memory. For larger raster files, it is more efficient to divide the raster into tiles, segment each tile individually, and stitch the results together to create a segmentation of the whole raster.

The main caveats arise from the fact that the initial clustering is performed on a uniform subsample of the whole image, in order to give consistent segment boundaries at tile intersections. This means that for a larger raster, with a greater range of spectra, one may wish to increase the number of clusters in order to allow sufficient initial segments to characterize the variation.

Related to this, one may also consider reducing the percentile used for automatic estimation of maxSpectralDiff (see shepseg.doShepherdSegmentation() and shepseg.autoMaxSpectralDiff() for further details).

Because of these caveats, one should be very cautious about segmenting something like a continental-scale image. There is a lot of spectral variation across an area like a whole continent, and it may be unwise to use all the same parameters for the whole area.

exception pyshepseg.tiling.PyShepSegSubsetError
exception pyshepseg.tiling.PyShepSegTilingError
class pyshepseg.tiling.RatPage(numIntCols, numFloatCols, startSegId, numSeg)

Hold a single page of the paged RAT

class pyshepseg.tiling.SegmentStats(segmentHistDict, missingStatsValue)

Manage statistics for a single segment

class pyshepseg.tiling.TileInfo

Class that holds the pixel coordinates of the tiles within an image.

addTile(xpos, ypos, xsize, ysize, col, row)
getTile(col, row)
class pyshepseg.tiling.TiledSegmentationResult

Result of tiled segmentation


Largest segment ID used in final segment image


Number of rows of tiles used


Number of columns of tiles used


Percentage of image subsampled for clustering


The value used to limit segment merging (in all tiles)


The sklearn KMeans object, after fitting


Boolean flag, this is an error condition

hasEmptySegments = None
kmeans = None
maxSegId = None
maxSpectralDiff = None
numTileCols = None
numTileRows = None
subsamplePcnt = None
pyshepseg.tiling.accumulateSegDict(segDict, noDataDict, imgNullVal, tileSegments, tileImageData)

Accumulate per-segment histogram counts for all pixels in the given tile. Updates segDict entries in-place.

pyshepseg.tiling.calcHistogramTiled(segfile, maxSegId, writeToRat=True)

Calculate a histogram of the given segment image file.

Note that we need this function because GDAL’s GetHistogram function does not seem to work when attempting a histogram with very large numbers of entries. We want an entry for every segment, rather than an approximate count for a range of segment values, and the number of segments is very large. So, we need to write our own routine.

It works in tiles across the image, so that it can process very large images in a memory-efficient way.

For a raster which can easily fit into memory, a histogram can be calculated directly using


Once completed, the histogram can be written to the image file’s raster attribute table, if writeToRat is True). It will also be returned as a numpy array, indexed by segment ID.

segfile can be either a filename string, or an open gdal.Dataset object. If writeToRat is True, then a Dataset object should be opened for update.

pyshepseg.tiling.calcPerSegmentStatsTiled(imgfile, imgbandnum, segfile, statsSelection, missingStatsValue=-9999)

Calculate selected per-segment statistics for the given band of the imgfile, against the given segment raster file. Calculated statistics are written to the segfile raster attribute table (RAT), so this file format must support RATs.

Calculations are carried out in a memory-efficient way, allowing very large rasters to be processed. Raster data is handled in small tiles, attribute table is handled in fixed-size chunks.

The statsSelection parameter is a list of tuples, one for each statistics to be included. Each tuple is either 2 or 3 elements,

(columnName, statName) or (columnName, statName, parameter)

The 3-element form is used for any statistic which requires a parameter, which currently is only the percentile.

The columnName is a string, used to name the column in the output RAT. The statName is a string used to identify which statistic is to be calculated. Available options are:

‘min’, ‘max’, ‘mean’, ‘stddev’, ‘median’, ‘mode’, ‘percentile’, ‘pixcount’.

The ‘percentile’ statistic requires the 3-element form, with the 3rd element being the percentile to be calculated.

For example
[(‘Band1_Mean’, ‘mean’),

(‘Band1_stdDev’, ‘stddev’), (‘Band1_LQ’, ‘percentile’, 25), (‘Band1_UQ’, ‘percentile’, 75)


would create 4 columns, for the per-segment mean and standard deviation of the given band, and the lower and upper quartiles, with corresponding column names.

Any pixels that are set to the nodata value of imgfile (if set) are ignored in the stats calculations. If there are no pixels that aren’t the nodata value then the value passed in as missingStatsValue is put into the RAT for the requested statistics.

The ‘pixcount’ statName can be used to find the number of valid pixels (not nodata) that were used to calculate the statistics.

pyshepseg.tiling.calcStatsForCompletedSegs(segDict, noDataDict, missingStatsValue, pagedRat, statsSelection_fast, segSize, numIntCols, numFloatCols)

Calculate statistics for all complete segments in the segDict. Update the pagedRat with the resulting entries. Completed segments are then removed from segDict.

pyshepseg.tiling.checkForEmptySegments(outfile, maxSegId, overlapSize)

Check the final segmentation for any empty segments. These can be problematic later, and should be avoided.


Check for the Histogram column in the attribute table. Return its column number, and raise an exception if it is not present

pyshepseg.tiling.checkSegComplete(segDict, noDataDict, segSize, segId)

Return True if the given segment has a complete entry in the segDict, meaning that the pixel count is equal to the segment size

pyshepseg.tiling.copyColumns(inRat, outRat)

Copies columns between RATs. Note that empty columns will be created - no data is copied.

Also return numIntCols and NumFloatCols for RatPage() processing.

pyshepseg.tiling.copySubsettedSegmentsToNew(inPage, outPagedRat, recodeDict)

Using the recodeDict, copy across the rows inPage to outPage.

inPage is processed and (taking into account of inPage.startSegId) the original input row found. This value is then looked up in recodeDict to find the row in the output RAT to copy the row from the input to.


Create the dictionary that holds counts of nodata seen for each segment. The key is the segId, value is the count of nodata seen for that segment in the image data.


Create the dictionary for the paged RAT. Each element is a page of the RAT, with entries for a range of segment IDs. The key is the segment ID of the first entry in the page.

The returned dictionary is initially empty.


Create the Dict of Dicts for handling per-segment histograms. Each entry is a dictionary, and the key is a segment ID. Each dictionary within this is the per-segment histogram for a single segment. Each of its entries is for a single value from the imagery, the key is the pixel value, and the dictionary value is the number of times that pixel value appears in the segment.

pyshepseg.tiling.createStatColumns(statsSelection, attrTbl, existingColNames)

Create requested statistic columns on the segmentation image RAT. Statistic columns are of type gdal.GFT_Real for mean and stddev, and gdal.GFT_Integer for all other statistics.

Return the column indexes for all requested columns, in the same order.

pyshepseg.tiling.crossesMidline(overlap, segLoc, orientation)

Return True if the given segment crosses the midline of the overlap array. Orientation of the midline is either


segLoc is the segment location entry for the segment in question

pyshepseg.tiling.doTiledShepherdSegmentation(infile, outfile, tileSize=4096, overlapSize=1024, minSegmentSize=50, numClusters=60, bandNumbers=None, subsamplePcnt=None, maxSpectralDiff='auto', imgNullVal=None, fixedKMeansInit=False, fourConnected=True, verbose=False, simpleTileRecode=False, outputDriver='KEA', creationOptions=[], spectDistPcntile=50, kmeansObj=None, tempfilesDriver='KEA', tempfilesExt='kea', tempfilesCreationOptions=[])

Run the Shepherd segmentation algorithm in a memory-efficient manner, suitable for large raster files. Runs the segmentation on separate tiles across the raster, then stitches these together into a single output segment raster.

The initial spectral clustering is performed on a sub-sample of the whole raster (using fitSpectralClustersWholeFile), to create consistent clusters. These are then used as seeds for all individual tiles. Note that subsamplePcnt is used at this stage, over the whole raster, and is not passed through to shepseg.doShepherdSegmentation() for any further sub-sampling.

The tileSize is the minimum width/height of the tiles (in pixels). These tiles are overlapped by overlapSize (also in pixels), both horizontally and vertically. Tiles on the right and bottom edges of the input image may end up slightly larger than tileSize to ensure there are no small tiles.

outputDriver is a string of the name of the GDAL driver to use for the output file. creationOptions is the list of creation options to use for this driver.

tempfilesDriver is a string of the name of the GDAL driver to use for temporary files. tempfilesExt the the extension to use and tempfilesCreationOptions is the list of creation options to use for this driver.

Most of the arguments are passed through to shepseg.doShepherdSegmentation, and are described in the docstring for that function.

Return an instance of TiledSegmentationResult class.

pyshepseg.tiling.doTiledShepherdSegmentation_doOne(inDs, filename, tileInfo, col, row, bandNumbers, imgNullVal, kmeansObj, minSegmentSize=50, maxSpectralDiff='auto', verbose=False, spectDistPcntile=50, fourConnected=True, tempfilesDriver='KEA', tempfilesCreationOptions=[])

Called from doTiledShepherdSegmentation(). Does a single tile and split out here as a seperate function so it can be called from in parallel with other tiles if desired.

tileInfo is object returned from doTiledShepherdSegmentation_prepare() and col, row describe the tile that this call will process.

Return value is that from shepseg.doShepherdSegmentation().

pyshepseg.tiling.doTiledShepherdSegmentation_finalize(inDs, outfile, tileFilenames, tileInfo, overlapSize, tempDir, simpleTileRecode=False, outputDriver='KEA', creationOptions=[], verbose=False)

Do the stitching of tiles and check for empty segments. Call after every doTiledShepherdSegmentation_doOne() has completed for a given tiled segmentation.

Returns a tuple with (axSegId, hasEmptySegments).

pyshepseg.tiling.doTiledShepherdSegmentation_prepare(infile, tileSize=4096, overlapSize=1024, bandNumbers=None, imgNullVal=None, kmeansObj=None, verbose=False, numClusters=60, subsamplePcnt=None, fixedKMeansInit=False)

Do all the preparation for the tiled segmentation. Call this first if creating a parallel implementation, then call doTiledShepherdSegmentation_doOne() for each tile in the returned TileInfo object.

Returns a tuple with: (datasetObj, bandNumbers, kmeansObj, subsamplePcnt, imgNullVal, tileInfo)

pyshepseg.tiling.equalProjection(proj1, proj2)

Returns True if the proj1 is the same as proj2

Stolen from rios/pixelgrid.py

pyshepseg.tiling.fitSpectralClustersWholeFile(inDs, bandNumbers, numClusters=60, subsamplePcnt=None, imgNullVal=None, fixedKMeansInit=False)

Given a raster filename, read a selected sample of pixels and use these to fit a spectral cluster model. Uses GDAL to read the pixels, and shepseg.fitSpectralClusters() to do the fitting.

If bandNumbers is not None, this is a list of band numbers (1 is 1st band) to use in fitting the model.

If subsamplePcnt is not None, this is the percentage of pixels sampled. If it is None, then a suitable subsample is calculated such that around one million pixels are sampled (Note - this would include null pixels, so if the image is dominated by nulls, this would undersample.) No further subsampling is carried out by fitSpectralClusters().

If imgNullVal is None, the file is queried for a null value. If none is defined there, then no null value is used. If each band has a different null value, an exception is raised.

fixedKMeansInit is passed to fitSpectralClusters(), see there for details.

Returns a tuple

(kmeansObj, subsamplePcnt, imgNullVal)

where kmeansObj is the fitted object, subsamplePcnt is the subsample percentage actually used, and imgNullVal is the null value used (perhaps from the file).

pyshepseg.tiling.getImgNullValue(inDs, bandNumbers)

Return the null value for the given dataset. Raises an error if not all bands have the same null value.


For the given segment ID, return the page ID. This is the segment ID of the first segment in the page.


The given dictionary is keyed by pixel values from the imagery, and the values are counts of occurences of the corresponding pixel value. This function returns a pair of numpy arrays (as a tuple), one for the list of pixel values, and one for the corresponding counts. The arrays are sorted in increasing order of pixel value.

pyshepseg.tiling.getTilesForFile(ds, tileSize, overlapSize)

Return a TileInfo object for a given file and input parameters.

pyshepseg.tiling.makeFastStatsSelection(colIndexList, statsSelection)

Make a fast version of the statsSelection data structure, combined with the global column index numbers.

Return a tuple of

(statsSelection_fast, numIntCols, numFloatCols)

The statsSelection_fast is a single array, of shape (numStats, 5). The first index corresponds to the sequence in statsSelection. The second index corresponds to the STATSEL_* values.

Everything is encoded as an integer value in a single numpy array, suitable for fast access within numba njit-ed functions.

This is all a bit ugly and un-pythonic. Not sure if there is a better way.

pyshepseg.tiling.overlapFilename(col, row, edge, tempDir)

Return the filename used for the overlap array

pyshepseg.tiling.processSubsetTile(tile, recodeDict, histogramDict, maskData)

Process a tile of the subset area. Returns a new tile with the new codes. Fills in the recodeDict as it goes and also updates histogramDict.

if maskData is not None then only pixels != 0 are considered. Assumed that maskData is for the same area/size as tile.

pyshepseg.tiling.readColDataIntoPage(page, data, idx, colType, minVal)

Numba function to quickly read a column returned by rat.ReadAsArray() info a RatPage.

pyshepseg.tiling.readRATIntoPage(rat, numIntCols, numFloatCols, minVal, maxVal)

Create a new RatPage() object that represents the section of the RAT for a tile of an image. The part of the RAT between minVal and maxVal is read in and a RatPage() instance returned with the startSegId param set to minVal.

pyshepseg.tiling.readSubsampledImageBand(bandObj, subsampleProp)

Read in a sub-sampled copy of the whole of the given band.

bandObj is an open gdal.Band object. subsampleProp is the proportion by which to sub-sample (i.e. a value between zero and 1, applied to rows and columns separately)

Returns a numpy array of the image data, equivalent to gdal.Band.ReadAsArray().

Note that one can, in principle, do this directly using GDAL. However, if overview layers are present in the file, it will use these, and so is dependent on how these were created. Since these are often created just for display purposes, past experience has shown that they are not always to be trusted as data, so we have chosen to always go directly to the full resolution image.

pyshepseg.tiling.recodeSharedSegments(tileData, overlapA, overlapB, orientation, recodeDict)

Work out a mapping which recodes segment ID numbers from the tile in tileData. Segments to be recoded are those which are in the overlap with an earlier tile, and which cross the midline of the overlap, which is where the stitchline between the tiles will fall.

Updates recodeDict, which is a dictionary keyed on the existing segment ID numbers, where the value of each entry is the segment ID number from the earlier tile, to be used to recode the segment in the current tile.

overlapA and overlapB are numpy arrays of the overlap region in question, giving the segment ID numbers is the two tiles. The values in overlapB are from the earlier tile, and those in overlapA are from the current tile.

It is critically important that the overlapping region is either at the top or the left of the current tile, as this means that the row and column numbers of pixels in the overlap arrays match the same pixels in the full tile. This cannot be used for overlaps on the right or bottom of the current tile.

The orientation parameter defines whether we are dealing with overlap at the top (orientation == HORIZONTAL) or the left (orientation == VERTICAL).

pyshepseg.tiling.recodeTile(tileData, maxSegId, tileRow, tileCol, overlapSize, tempDir, top, bottom, left, right)

Adjust the segment ID numbers in the current tile, to make them globally unique across the whole mosaic.

Make use of the overlapping regions of tiles above and left, to identify shared segments, and recode those to segment IDs from the adjacent tiles (i.e. we change the current tile, not the adjacent ones). Non-shared segments are increased so they are larger than previous values.

tileData is the array of segment IDs for a single tile. maxSegId is the current maximum segment ID for all preceding tiles. tileRow, tileCol are the row/col numbers of this tile, within the whole-mosaic tile numbering scheme.

Return a copy of tileData, with new segment ID numbers.

pyshepseg.tiling.relabelSegments(tileData, recodeDict, maxSegId, top, bottom, left, right)

Recode the segment IDs in the given tileData array.

For segment IDs which are keys in recodeDict, these are replaced with the corresponding entry. For all other segment IDs, they are replaced with sequentially increasing ID numbers, starting from one more than the previous maximum segment ID (maxSegId).

A re-coded copy of tileData is created, the original is unchanged.

Return value is a tuple

(newTileData, newMaxSegId)

pyshepseg.tiling.setHistogramFromDictionary(dictn, histArray)

Given a dictionary of pixel counts keyed on index, write these values to the array.

pyshepseg.tiling.setSubsetRecodeFromDictionary(dictn, array)

Given the recodeDict write the original values to the array at the new indices.

pyshepseg.tiling.stitchTiles(inDs, outfile, tileFilenames, tileInfo, overlapSize, tempDir, simpleTileRecode, outputDriver, creationOptions, verbose)

Recombine individual tiles into a single segment raster output file. Segment ID values are recoded to be unique across the whole raster, and contiguous.

outfile is the name of the final output raster. tileFilenames is a dictionary of the individual tile filenames, keyed by a tuple of (col, row) defining which tile it is. tileInfo is the object returned by getTilesForFile. overlapSize is the number of pixels in the overlap between tiles. outputDriver is a string of the name of the GDAL driver to use for the output file.

If simpleTileRecode is True, a simpler method will be used to recode segment IDs, using just a block offset to shift ID numbers. If it is False, then a more complicated method is used which recodes and merges segments which cross the boundary between tiles.

Return the maximum segment ID used.

pyshepseg.tiling.subsetImage(inname, outname, tlx, tly, newXsize, newYsize, outformat, creationOptions=[], origSegIdColName=None, maskImage=None)

Subset an image and “compress” the RAT so only values that are in the new image are in the RAT. Note: the image values will be recoded in the process.

gdal_translate seems to have a problem with files that have large RAT’s so while that gets fixed do the subsetting in this function.

tlx and tly are the top left of the image to extract in pixel coords of the input image. newXSize and newYSize are the size of the subset to extract in pixels.

outformat is the GDAL driver name to use for the output image and creationOptions is a list of creation options to use for creating the output.

If origSegIdColName is not None, a column of this name will be created in the output file that has the original segment ids so they can be linked back to the input file.

If maskFile is not None then only pixels != 0 in this image are considered. It is assumed that the top left of this image is at tlx, tly on the input image and its size is (newXsize, newYsize).

pyshepseg.tiling.updateCounts(tileData, hist)

Fast function to increment counts for each segment ID

pyshepseg.tiling.writeCompletePages(pagedRat, attrTbl, statsSelection_fast)

Check for completed pages, and write them to the attribute table. Remove them from the pagedRat after writing.

pyshepseg.tiling.writeCompletedPagesForSubset(inRAT, outRAT, outPagedRat)

For the subset operation. Writes out any completed pages to outRAT using the inRAT as a template.