Initial data collection


This is the first task outlined on the Gantt chart following the project specification submission. This task involves collecting, formatting, and cleaning the data necessary for the foundational model.


Action points

  • Selected GSP ABHA1, Latitude: 50.47108, Longitude: -3.72977, PVLive ID: 1
  • Collected NSRDB data for the corresponding coordinates, at a temporal resolution of 30 minutes
  • The Excel data actually lists the coordinates as 50.49, -3.74
  • NSRDB data will be split 80:10:10 for training, validating, and testing
    • 17,520 total data entries
    • 14,016 for training
    • 1,752 each for validating and testing
    • Date range: 2019-01-01 00:00 to 2019-12-31 23:30
  • Collecting PVLive data for the corresponding GSP over the specified date range
    • The time ordering of the data is inconsistent, but can be fixed with a simple A-to-Z Excel sort
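Once the export is sorted chronologically, the 80:10:10 split can be taken as contiguous slices. A minimal sketch with pandas, assuming the NSRDB export has a single `timestamp` column (the column name is an assumption):

```python
import pandas as pd

def split_80_10_10(df: pd.DataFrame):
    """Split a chronologically sorted frame into train/validation/test slices."""
    n = len(df)
    n_train, n_val = int(n * 0.8), int(n * 0.1)
    return (
        df.iloc[:n_train],
        df.iloc[n_train:n_train + n_val],
        df.iloc[n_train + n_val:],
    )

# 2019 at 30-minute resolution: 365 days x 48 half-hours = 17,520 entries.
times = pd.date_range("2019-01-01 00:00", "2019-12-31 23:30", freq="30min")
df = pd.DataFrame({"timestamp": times}).sort_values("timestamp")

train, val, test = split_80_10_10(df)
print(len(train), len(val), len(test))  # 14016 1752 1752
```

A chronological (rather than shuffled) split keeps whole stretches of the year together, which matters for time-series data.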


  • Install TensorFlow locally
  • Work out how to provide test data to a TensorFlow model, and how to add labels
    • Do I want the NSRDB and PVLive data in separate CSVs or one combined CSV?


Action points

  • Installed TensorFlow locally
    • Will my laptop be powerful enough, or should I use DCS instead? This wasn’t mentioned in my specification, but it was considered. I don’t think the foundational model will require it, but once satellite imagery is involved it may become more intensive
  • The data can stay in separate files then be accessed separately within the code
    • It can then be merged easily in code if necessary
  • Data formatted and is accessible from the code for use in the model
  • Working out how to provide the test data, which has led to reading into ANNs
    • Perceptron models receive multiple numerical inputs and produce a single binary output
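On the separate-vs-combined question above: the two files can stay apart and be joined on timestamp in code. A sketch with pandas, where the toy frames (and their column names) are assumptions standing in for `pd.read_csv` on the real exports:

```python
import pandas as pd

# Toy stand-ins for the two separate CSVs (column names are assumptions);
# in practice these would come from pd.read_csv(..., parse_dates=["timestamp"]).
nsrdb = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-01-01 00:00", "2019-01-01 00:30"]),
    "ghi": [0.0, 12.0],
})
pvlive = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-01-01 00:30", "2019-01-01 00:00"]),
    "generation_mw": [1e-5, 0.0],
})

# An inner join on timestamp keeps inputs and labels aligned even when the
# two files happen to be ordered differently.
merged = nsrdb.merge(pvlive, on="timestamp", how="inner").sort_values("timestamp")
print(merged)
```

The inner join also silently drops any half-hour present in one file but missing from the other, which is worth checking with a length comparison afterwards.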


Action points

  • Since I’ve got all the data, I’m slightly ahead of schedule, so I’m looking into models and trying to build the foundational model
  • First attempt: the mean squared error loss is NaN, and accuracy increases slightly once then stays constant, so something is wrong
    • Since I had to gather sections of the data in the code, the ordering differs again, so the times aren’t matching up
  • Sort the data within the code so it’s always consistent
    • The data now looks correct and is temporally matching
  • The model itself still shows the same issue


  • Potentially normalise the values for inputs and labels
    • NSRDB currently has large value ranges, and the range differs for each measurement
    • PVLive values span a large range, from ~50 down to ~1e-5
  • Which activation function should I be using? Sigmoid is for classification, so it doesn’t apply here
    • I think it should be linear
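A per-column min-max normalisation handles the differing ranges by scaling each measurement to [0, 1] independently. A sketch with made-up values; and for the output layer, a linear (identity) activation is indeed the standard choice for regression with MSE loss:

```python
import pandas as pd

def min_max_normalise(df: pd.DataFrame) -> pd.DataFrame:
    """Scale every column to [0, 1] using its own min and max."""
    return (df - df.min()) / (df.max() - df.min())

# Made-up values illustrating columns with very different ranges.
features = pd.DataFrame({"ghi": [0.0, 400.0, 800.0], "temp": [-2.0, 8.0, 18.0]})
print(min_max_normalise(features))
# Each column now runs from 0.0 to 1.0.
```

The min and max should be computed on the training split only and then reused for the validation and test splits, so no information leaks across the 80:10:10 boundary.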


  • Attended webinar
    • Not too applicable, but it was interesting to hear how the system works in California
  • Normalised the data
    • This still hasn’t improved the model
  • I think the abundance of all-zero rows is causing issues
    • I need to remove the all-zero rows consistently across both NSRDB and PVLive


  • Correct the bug in the zero-row removal
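The bug itself isn’t recorded here, but a common pitfall when removing zero rows is filtering the two datasets independently, which desynchronises inputs and labels. A sketch that drops a row only when both sides are zero, using one shared mask (toy data, hypothetical column names):

```python
import pandas as pd

# Toy, already-aligned features and labels (column names are assumptions).
features = pd.DataFrame({"ghi": [0.0, 0.0, 350.0], "dni": [0.0, 0.0, 500.0]})
labels = pd.Series([0.0, 0.0, 4.2], name="generation_mw")

# Build ONE mask across both datasets, then apply it to both, so the
# row-for-row correspondence between inputs and labels survives.
all_zero = (features == 0).all(axis=1) & (labels == 0)
features, labels = features[~all_zero], labels[~all_zero]

print(len(features), len(labels))  # 1 1
```

Filtering each file with its own mask (e.g. dropping zero rows from NSRDB and zero rows from PVLive separately) can leave the two at different lengths or shifted relative to one another.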