tilingstats

Routines to support calculation of statistics on large rasters. The statistics are calculated per segment so require two input rasters - the segmentation output, and another image to gather statistics from, for each segment in the first image.

These are optimised to work on one tile of the images at a time so should be efficient in terms of memory use.

The main functions in this module are calcPerSegmentStatsTiled() and calcPerSegmentSpatialStatsTiled().

exception pyshepseg.tilingstats.PyShepSegStatsError
class pyshepseg.tilingstats.SegPoint(x, y, val)

Class for handling a given data point and it’s location in pixel space (within the whole image, not a tile).

Used so that all the data for a given segment can be collected together even if the segment straddles a tile.

class pyshepseg.tilingstats.SegmentStats(segmentHistDict, missingStatsValue)

Manage statistics for a single segment

getPercentile(percentile)

Return the pixel value for the given percentile, e.g. getPercentile(50) would return the median value of the segment

getStat(statID, param)

Return the requested statistic

pyshepseg.tilingstats.accumulateSegDict(segDict, noDataDict, imgNullVal, tileSegments, tileImageData)

Accumulate per-segment histogram counts for all pixels in the given tile. Updates segDict entries in-place.

Parameters:
segDictnumba.typed.Dict

Dictionary of segments keyed on segment id. Values are histograms for the segment

noDataDictnumba.typed.Dict

Dictionary of nodata values for each segment.

imgNullValint

No data value for image

tileSegmentsint ndarray of shape (nRows, nCols)

Contains the tile of segments currently being processed.

tileImageDataint ndarray of shape (nRows, nCols)

Contains the tile of the image data currently being processed.

pyshepseg.tilingstats.accumulateSegSpatial(segDict, noDataDict, imgNullVal, tileSegments, tileImageData, topLine, leftPix)

Accumulates data for each segment for the given tile. The data is put into segDict.

Parameters:
segDictnumba.typed.Dict

Dictionary of segments keyed on segment id. Values are a list of SegPoint objects.

noDataDictnumba.typed.Dict

Dictionary of nodata values for each segment.

imgNullValint

No data value for image

tileSegmentsint ndarray of shape (nRows, nCols)

Contains the tile of segments currently being processed.

tileImageDataint ndarray of shape (nRows, nCols)

Contains the tile of the image data currently being processed.

topLineint

row of the top of this tile in the full image

leftPixint

col of the left of this tile in the full image

pyshepseg.tilingstats.calcPerSegmentSpatialStatsRIOS(imgfile, imgbandnum, segfile, colNamesAndTypes, userFunc, userParam=None, concurrencyStyle=None, missingStatsValue=-9999)

Similar to the calcPerSegmentStatsTiledRIOS() function but allows the user to calculate spatial statistics on the data for each segment. This is done by recording the location and value for each pixel within the segment. Once all the pixels are found for a segment the userFunc is called with the following parameters:

pts, imgNullVal, intArr, floatArr, userParam

where pts is List of SegPoint objects. If 2D Numpy tile is prefered the userFunc can call convertPtsInto2DArray(). intArray is a 1D numpy array which all the integer output values are to be put (in the same order given in colNamesAndTypes). floatArr is a 1D numpy array which all the floating point output values are to be put (in the same order given in colNamesAndTypes). These arrays are initialised with missingStatsValue so the function can skip any that it doesn’t have values for. userParam is the same value passed to this function and needs to be a type understood by Numba.

This function uses RIOS to perform the reading so it works in a memory-efficient way. Also, the number of read workers can be varied by passing an instance of rios.applier.ConcurrencyStyle which may be helpful when reading from data sources with high latency (ie S3). RIOS timeouts can also be changed using this method. Note that only 0 compute workers is supported and computeWorkerKind must be set to CW_NONE.

Parameters:
imgfilestring

Path to input file for collecting statistics from

imgbandnumint

1-based index of the band number in imgfile to use for collecting stats

segfilestr or gdal.Dataset

Path to segmented file or an open GDAL Dataset. Will collect stats in imgfile for each segment in this file.

colNamesAndTypeslist of (colName, colType) tuples

This defines the names, types and order of the output RAT columns. colName should be a string containing the name of the RAT column to be created and colType should be one of gdal.GFT_Integer or gdal.GFT_Real and this controls the type of the column to be created. Note that the order of columns given in this parameter is important as this dicates the order of the intArray and floatArr parameters to userFunc.

userFunca Numba function (ie decorated with @jit or @njit).

See above for description

userParamanything that can be passed to a Numba function

This includes: arrays, scalars and @jitclass decorated classes.

concurrencyStylerios.applier.ConcurrencyStyle

Concurrency parameters for RIOS

missingStatsValueint

The value to fill in for segments that have no data.

pyshepseg.tilingstats.calcPerSegmentSpatialStatsTiled(imgfile, imgbandnum, segfile, colNamesAndTypes, userFunc, userParam=None, missingStatsValue=-9999)

Similar to the calcPerSegmentStatsTiled() function but allows the user to calculate spatial statistics on the data for each segment. This is done by recording the location and value for each pixel within the segment. Once all the pixels are found for a segment the userFunc is called with the following parameters:

pts, imgNullVal, intArr, floatArr, userParam

where pts is List of SegPoint objects. If 2D Numpy tile is prefered the userFunc can call convertPtsInto2DArray(). intArray is a 1D numpy array which all the integer output values are to be put (in the same order given in colNamesAndTypes). floatArr is a 1D numpy array which all the floating point output values are to be put (in the same order given in colNamesAndTypes). These arrays are initialised with missingStatsValue so the function can skip any that it doesn’t have values for. userParam is the same value passed to this function and needs to be a type understood by Numba.

Parameters:
imgfilestring

Path to input file for collecting statistics from

imgbandnumint

1-based index of the band number in imgfile to use for collecting stats

segfilestr or gdal.Dataset

Path to segmented file or an open GDAL Dataset. Will collect stats in imgfile for each segment in this file.

colNamesAndTypeslist of (colName, colType) tuples

This defines the names, types and order of the output RAT columns. colName should be a string containing the name of the RAT column to be created and colType should be one of gdal.GFT_Integer or gdal.GFT_Real and this controls the type of the column to be created. Note that the order of columns given in this parameter is important as this dicates the order of the intArray and floatArr parameters to userFunc.

userFunca Numba function (ie decorated with @jit or @njit).

See above for description

userParamanything that can be passed to a Numba function

This includes: arrays, scalars and @jitclass decorated classes.

missingStatsValueint

The value to fill in for segments that have no data.

pyshepseg.tilingstats.calcPerSegmentSpatialStats_riosFunc(info, inputs, outputs, otherArgs)

Called by RIOS from inside calcPerSegmentSpatialStatsRIOS. Do accumulation of statistics

pyshepseg.tilingstats.calcPerSegmentStatsRIOS(imgfile, imgbandnum, segfile, statsSelection, concurrencyStyle=None, missingStatsValue=-9999)

Calculate selected per-segment statistics for the given band of the imgfile, against the given segment raster file. Calculated statistics are written to the segfile raster attribute table (RAT), so this file format must support RATs.

This function uses RIOS to perform the reading so it works in a memory-efficient way. Also, the number of read workers can be varied by passing an instance of rios.applier.ConcurrencyStyle which may be helpful when reading from data sources with high latency (ie S3). RIOS timeouts can also be changed using this method. Note that only 0 compute workers is supported and computeWorkerKind must be set to CW_NONE.

Parameters:
imgfilestring

Path to input file for collecting statistics from

imgbandnumint

1-based index of the band number in imgfile to use for collecting stats

segfilestr

Path to segmented file. Will collect stats in imgfile for each segment in this file.

statsSelectionlist of tuples.

One tuple for each statistic to be included. Each tuple is either 2 or 3 elements:

(columnName, statName)

or

(columnName, statName, parameter)

The columnName is a string, used to name the column in the output RAT. The statName is a string used to identify which statistic is to be calculated. Available options are:

'min', 'max', 'mean', 'stddev', 'median', 'mode', 'percentile', 'pixcount'

The ‘percentile’ statistic requires the 3-element form, with the 3rd element being the percentile to be calculated.

For example:

[('Band1_Mean', 'mean'),
 ('Band1_stdDev', 'stddev'),
 ('Band1_LQ', 'percentile', 25),
 ('Band1_UQ', 'percentile', 75)]

would create 4 columns, for the per-segment mean and standard deviation of the given band, and the lower and upper quartiles, with corresponding column names.

Any pixels that are set to the nodata value of imgfile (if set) are ignored in the stats calculations. If there are no pixels that aren’t the nodata value then the value passed in as missingStatsValue is put into the RAT for the requested statistics. The ‘pixcount’ statName can be used to find the number of valid pixels (not nodata) that were used to calculate the statistics.

concurrencyStylerios.applier.ConcurrencyStyle

Concurrency parameters for RIOS

missingStatsValueint or float

What to set for segments that have no valid pixels in imgile

pyshepseg.tilingstats.calcPerSegmentStatsTiled(imgfile, imgbandnum, segfile, statsSelection, missingStatsValue=-9999)

Calculate selected per-segment statistics for the given band of the imgfile, against the given segment raster file. Calculated statistics are written to the segfile raster attribute table (RAT), so this file format must support RATs.

Calculations are carried out in a memory-efficient way, allowing very large rasters to be processed. Raster data is handled in small tiles, attribute table is handled in fixed-size chunks.

Parameters:
imgfilestring

Path to input file for collecting statistics from

imgbandnumint

1-based index of the band number in imgfile to use for collecting stats

segfilestr or gdal.Dataset

Path to segmented file or an open GDAL dataset. Will collect stats in imgfile for each segment in this file.

statsSelectionlist of tuples.

One tuple for each statistic to be included. Each tuple is either 2 or 3 elements:

(columnName, statName)

or

(columnName, statName, parameter)

The columnName is a string, used to name the column in the output RAT. The statName is a string used to identify which statistic is to be calculated. Available options are:

'min', 'max', 'mean', 'stddev', 'median', 'mode', 'percentile', 'pixcount'

The ‘percentile’ statistic requires the 3-element form, with the 3rd element being the percentile to be calculated.

For example:

[('Band1_Mean', 'mean'),
 ('Band1_stdDev', 'stddev'),
 ('Band1_LQ', 'percentile', 25),
 ('Band1_UQ', 'percentile', 75)]

would create 4 columns, for the per-segment mean and standard deviation of the given band, and the lower and upper quartiles, with corresponding column names.

Any pixels that are set to the nodata value of imgfile (if set) are ignored in the stats calculations. If there are no pixels that aren’t the nodata value then the value passed in as missingStatsValue is put into the RAT for the requested statistics. The ‘pixcount’ statName can be used to find the number of valid pixels (not nodata) that were used to calculate the statistics.

missingStatsValueint or float

What to set for segments that have no valid pixels in imgile

pyshepseg.tilingstats.calcPerSegmentStats_riosFunc(info, inputs, outputs, otherArgs)

Called by RIOS from inside calcPerSegmentStatsRIOS. Do accumulation of statistics

pyshepseg.tilingstats.calcStatsForCompletedSegs(segDict, noDataDict, missingStatsValue, pagedRat, statsSelection_fast, segSize, numIntCols, numFloatCols)

Calculate statistics for all complete segments in the segDict. Update the pagedRat with the resulting entries. Completed segments are then removed from segDict.

Parameters:
segDictnumba.typed.Dict

Dictionary of segments keyed on segment id. Values are histograms for the segment

noDataDictnumba.typed.Dict

Dictionary of nodata values for each segment.

missingStatsValueint

value to insert into the RAT where no valid pixels were found

pagedRatnumba.typed.Dict

The RAT as a paged data structure

statsSelection_fastint ndarray of shape (numStats, 5)

Allows quick access to the types of stats required

segSizeint indarray of shape (numSegments+1, )

Array containing the histograms of the segment file

numIntColsint

Number of Integer RAT cols to be created

numFloatColsint

Number of Float RAT cols to be created

pyshepseg.tilingstats.calcStatsForCompletedSegsSpatial(segDict, noDataDict, missingStatsValue, pagedRat, segSize, userFunc, userParam, statsSelection_fast, intArr, floatArr, imgNullVal)

Calls the userFunc on data for completed segs and saves the results into the pagedRat.

Parameters:
segDictnumba.typed.Dict

Dictionary of segments keyed on segment id. Values are a numba.typed.List containing instances of SegPoint.

noDataDictnumba.typed.Dict

Dictionary of nodata values for each segment.

missingStatsValueint

Value to fill the intArr and floatArr parameters with so this valu gets written to the RAT if no valid pixels found

pagedRatnumba.typed.Dict

The RAT as a paged data structure

segSizeint indarray of shape (numSegments+1, )

Array containing the histograms of the segment file

userFuncNumba function

This is the user defined function to call for each completed segment

userParamNumba compatible argument

This is passed to the userFunc

statsSelection_fastint ndarray of shape (numStats, 5)

Allows quick access to the types of stats required

intArrint indarray of shape (numIntCols,)

Spaced used for storing the integer outputs for this segment

floatArrfloat indarray of shape (numFloatCols,)

Spaced used for storing the float outputs for this segment

imgNullValint

No data value for image

pyshepseg.tilingstats.checkHistColumn(existingColNames)

Check for the Histogram column in the attribute table. Return its column number, and raise an exception if it is not present

Parameters:
existingColNameslist of strings

The existing column names in the segment file

Returns:
histColNdxint

The index of the histogram column

pyshepseg.tilingstats.checkSegComplete(segDict, noDataDict, segSize, segId)

Return True if the given segment has a complete entry in the segDict, meaning that the pixel count is equal to the segment size

Parameters:
segDictnumba.typed.Dict

Dictionary of segments keyed on segment id. Values are histograms for the segment

noDataDictnumba.typed.Dict

Dictionary of nodata values for each segment.

segSizeint indarray of shape (numSegments+1, )

Array containing the histograms of the segment file

segIdshepseg.SegIdType

Segment to check for completeness

Returns:
completebool

Whether the segId is complete or not

pyshepseg.tilingstats.checkSegCompleteSpatial(segDict, noDataDict, segSize, segId)

Return True if the given segment has a complete entry in the segDict, meaning that the pixel count is equal to the segment size

Note: this is distinct from checkSegComplete() as that function is working with a dictionary of histograms

Parameters:
segDictnumba.typed.Dict

Dictionary of segments keyed on segment id. Values are a list of SegPoint objects.

noDataDictnumba.typed.Dict

Dictionary of nodata values for each segment.

segSizeint indarray of shape (numSegments+1, )

Array containing the histograms of the segment file

segIdshepseg.SegIdType

Segment to check for completeness

Returns:
completebool

Whether the segId is complete or not

pyshepseg.tilingstats.convertPtsInto2DArray(pts, imgNullVal)

Given a list of points for a segment turn this back into a 2D array where the value for each pixel is the value of the pixel in the original (data) image.

The tile is created just large enough for the shape of the segment. Areas of the tile not within a segment is given the value of imgNullVal.

Parameters:
ptsnumba.typed.List containing SegPoint objects

This is the list passed to the userFunc.

imgNullValint

No data value for image

Returns:
tiletiling.numbaTypeForImageType ndarray of shape (ysize, xsize)

Where ysize and xsize are the total extent of the tile

pyshepseg.tilingstats.convertPtsInto2DMaskArray(pts, imgNullVal)

Similar to convertPtsInto2DArray() but burns in the value 1 where each pixel is within a segment.

Given a list of points for a segment turn this back into a 2D array for passing to the user function.

The tile is created just large enough for the shape of the segment. Areas of the tile not within a segment is given the value of imgNullVal.

Parameters:
ptsnumba.typed.List containing SegPoint objects

This is the list passed to the userFunc.

imgNullValint

No data value for image

Returns:
tilenumpy.uint8 ndarray of shape (ysize, xsize)

Where ysize and xsize are the total extent of the tile

pyshepseg.tilingstats.createNoDataDict()

Create the dictionary that holds counts of nodata seen for each segment. The key is the segId, value is the count of nodata seen for that segment in the image data.

Returns:
nodataDictnumba.typed.Dict

Dictionary of nodata counts for each segment

pyshepseg.tilingstats.createSegDict()

Create the Dict of Dicts for handling per-segment histograms. Each entry is a dictionary, and the key is a segment ID. Each dictionary within this is the per-segment histogram for a single segment. Each of its entries is for a single value from the imagery, the key is the pixel value, and the dictionary value is the number of times that pixel value appears in the segment.

Returns:
segDictnumba.typed.Dict

Dictionary of histograms used for calculating statistics.

pyshepseg.tilingstats.createSegSpatialDataDict()

Create a dictionary where the key is the segment ID and the value is a List of SegPoint objects.

Returns:
segDictnumba.typed.Dict

Dictionary containing lists of SegPoint objects.

pyshepseg.tilingstats.createStatColumns(statsSelection, attrTbl, existingColNames)

Create requested statistic columns on the segmentation image RAT. Statistic columns are of type gdal.GFT_Real for mean and stddev, and gdal.GFT_Integer for all other statistics.

Return the column indexes for all requested columns, in the same order.

Parameters:
statsSelectionlist of tuples

Same as passed to calcPerSegmentStatsTiled()

attrTblgdal.RasterAttributeTable

The Raster Attribute Table object for the file

existingColNameslist of strings

A list of the existing column names

Returns:
colIndexListlist of ints

A list of the indexes of each of the requested new columns in the same order as statsSelection.

pyshepseg.tilingstats.createUserColumnsSpatial(colNamesAndTypes, attrTbl, existingColNames)

Used by calcPerSegmentSpatialStatsTiled() to create columns specified in the colNamesAndTypes structure.

Returns a tuple with number of integer columns, number of float columns and a statsSelection_fast array for use by calcStatsForCompletedSegsSpatial().

Parameters:
colNamesAndTypeslist of (colName, colType) tuples

Same as passed to calcPerSegmentSpatialStatsTiled().

attrTblgdal.RasterAttributeTable

The Raster Attribute Table object for the file

existingColNameslist of strings

A list of the existing column names

Returns:
numIntColsint

The number of new Integer columns

numFloatColsint

The number of new Float columns

statsSelection_fastint ndarray of shape (numStats, 5)

Allows quick access to the types of stats required

pyshepseg.tilingstats.doImageAlignmentChecks(segfile, imgfile, imgbandnum)

Do the checks that the segment file and image file that is being used to collect the stats actually align. We refuse to process the files if they don’t as it is not clear how they should be made to line up - this is up to the user to get right. Also checks that imgfile is not a float image.

Parameters:
segfilestr or gdal.Dataset

Path to segmented file or an open GDAL dataset.

imgfilestring

Path to input file for collecting statistics from

imgbandnumint

1-based index of the band number in imgfile to use for collecting stats

Returns:
segds: gdal.Dataset

Opened GDAL datset for the segments file

segband: gdal.Band

First Band of the segds

imgds: gdal.Dataset

Opened GDAL dataset for the image data file

imgband: gdal.Band

Requested band for the imgds

pyshepseg.tilingstats.equalProjection(proj1, proj2)

Returns True if the proj1 is the same as proj2

Stolen from rios/pixelgrid.py

Parameters:
proj1string

WKT string for the first projection

proj2string

WKT string for the second projection

Returns:
equalbool

Whether the projections are equal or not

pyshepseg.tilingstats.getSortedKeysAndValuesForDict(d)

The given dictionary is keyed by pixel values from the imagery, and the values are counts of occurences of the corresponding pixel value. This function returns a pair of numpy arrays (as a tuple), one for the list of pixel values, and one for the corresponding counts. The arrays are sorted in increasing order of pixel value.

Parameters:
ddictionary of int

Counts of each pixel value.

Returns:
keysSortedtiling.numbaTypeForImageType ndarray of shape (numValues,)

Pixel values sorted

valuesSortedint ndarray of shape (numValues,)

Counts of each pixel sorted by pixel value

pyshepseg.tilingstats.makeFastStatsSelection(colIndexList, statsSelection)

Make a fast version of the statsSelection data structure, combined with the global column index numbers.

Return a tuple of

(statsSelection_fast, numIntCols, numFloatCols)

The statsSelection_fast is a single array, of shape (numStats, 5). The first index corresponds to the sequence in statsSelection. The second index corresponds to the STATSEL_* values.

Everything is encoded as an integer value in a single numpy array, suitable for fast access within numba njit-ed functions.

This is all a bit ugly and un-pythonic. Not sure if there is a better way.

Parameters:
colIndexListlist of ints

The column indexes for all requested columns

statsSelectionlist of tuples

See tilingstats.calcPerSegmentStatsTiled() for a complete description of this parameter.

Returns:
statsSelection_fastint ndarray of shape (numStats, 5)

The statsSelection_fast structure

intCountint

Number of int columns

floatCountint

Number of float columns

pyshepseg.tilingstats.userFuncMeanCoord(pts, imgNullVal, intArr, floatArr, transform)

Calculates the mean coordinate of each segment. This function is intended to be passed in as the userFunc parameter to calcPerSegmentSpatialStatsTiled() if the mean coordinates of each segment are required.

The transform of the image (ie. returned by ds.GetGeoTransform() where ds is a GDAL dataset) is to be passed as userParam but converted to an array so it works with Numba.

The result will be written to floatArr (2 values - easting then northing). It is expected that at least 2 float columns are available.

Parameters:
ptsnumba.typed.List containing SegPoint objects

This is the list passed to the userFunc.

imgNullValint

The nodata value for the imagery

intArrint ndarray of shape (numIntCols, )

The integer columns - not used in this function

floatArrint ndarray of shape (numFloatCols, )

The float columns - output written here

transformfloat ndarray of shape (6,)

The GDAL transform array. Passed in as a userParam to tilingstats.calcPerSegmentSpatialStatsTiled().

pyshepseg.tilingstats.userFuncNumEdgePixels(pts, imgNullVal, intArr, floatArr, fourConnected)

Calculates the number of ‘edge’ pixels for each segment. Edge pixels are pixels that touch either another segment or the edge of the image. This function is intended to be passed in as the userFunc parameter to calcPerSegmentSpatialStatsTiled() if the number of edge pixels are required.

The result will be written to intArr (1 value - number of edge pixels).

Parameters:
ptsnumba.typed.List containing SegPoint objects

This is the list passed to the userFunc.

imgNullValint

The nodata value for the imagery

intArrint ndarray of shape (numIntCols, )

The integer columns - output written here

floatArrint ndarray of shape (numFloatCols, )

The float columns - not used in this function

fourConnectedbool

If True, use four-way connectedness to judge neighbours, otherwise use eight-way.

pyshepseg.tilingstats.userFuncVariogram(pts, imgNullVal, intArr, floatArr, maxDist)

Calculates the variogram at the given distance for the segment contained in the tile. This function is intended to be passed in as the userFunc parameter to calcPerSegmentSpatialStatsTiled() if variograms are to be calculated.

maxDist is the number of variograms to calculate and should be passed in as the userParam argument to calcPerSegmentSpatialStatsTiled().

It is assumed that floatArr has enough space for the number of variograms requested (this is calculated from the colNamesAndTypes parameter to calcPerSegmentSpatialStatsTiled()).

Parameters:
ptsnumba.typed.List containing SegPoint objects

This is the list passed to the userFunc.

imgNullValint

The nodata value for the imagery

intArrint ndarray of shape (numIntCols, )

The integer columns - not used in this function

floatArrint ndarray of shape (numFloatCols, )

The float columns - output written here

maxDistint

Number of variograms to calculate

pyshepseg.tilingstats.writeCompletePages(pagedRat, attrTbl, statsSelection_fast)

Check for completed pages, and write them to the attribute table. Remove them from the pagedRat after writing.

Parameters:
pagedRatnumba.typed.Dict

The RAT as a paged data structure

attrTblgdal.RasterAttributeTable

The Raster Attribute Table object for the file

statsSelection_fastint ndarray of shape (numStats, 5)

Allows quick access to the types of stats required