Introduction
This project is concerned with predicting wind and solar generation from weather data using simple linear regression. Along the way we practice several data science concepts and algorithms.
To implement the project, we first looked for sources whose weather data sets are clean and ready to be used for machine learning.
Second, the wind generation, solar generation and weather data come in hourly resolution, meaning that the wind and solar conditions of the area are recorded once every hour. This allowed us to deal with time series in a real-world context.
Third, although we do not use sophisticated models or algorithms, we work with real, authentic data that were previously observed in the area for which we want to predict the solar and wind generation.
It is also a good exercise to work with data connected to renewable energy.
We can imagine how important it is to predict the future generation of renewable sources such as wind and solar.
Main Body
The data used in the analysis of solar and wind generation come from a free source named Open Power System Data, which is organized by country and as time series.
The platform contains data for 37 European countries, but in this project we only use the data for Germany, and we concentrate on two of its data sets:
- Time series
- Weather data
Time series with load, wind and solar generation, and prices in hourly resolution. The CSV file contains data for all 37 countries from 2006 to 2017.
Weather Data
A large amount of data is provided, with many parameters such as wind speed, temperature, solar radiation, precipitation and many more measurements.
Wind and solar generation
We now have a CSV file with time series for 37 European countries covering the 12 years from 2006 to 2017, but, as mentioned earlier, we are only concerned with the data for Germany.
import pandas as pd

# Load only the timestamp column and the columns for Germany (prefix "DE")
production = pd.read_csv("data/time_series_60min_singleindex.csv",
                         usecols=(lambda s: s.startswith("utc") or
                                  s.startswith("DE")),
                         parse_dates=[0], index_col=0)
In the code above, the options parse_dates=[0] and index_col=0 guarantee that the date and time column is stored as the DatetimeIndex of the DataFrame.
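The effect of these options can be tried on a few lines of made-up data. The sketch below is only an illustration: the CSV text, its values, the column name utc_timestamp and the FR_load_actual column are assumptions, not content of the real file.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for the real file (illustrative values only).
csv_text = (
    "utc_timestamp,DE_solar_generation_actual,"
    "DE_wind_generation_actual,FR_load_actual\n"
    "2016-01-01 00:00:00,0,8000,60000\n"
    "2016-01-01 01:00:00,0,8100,59000\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),
    # Keep the timestamp column and every column for Germany.
    usecols=lambda s: s.startswith("utc") or s.startswith("DE"),
    parse_dates=[0],  # parse the first column as dates
    index_col=0,      # and use it as the row index
)

print(type(sample.index).__name__)
print(list(sample.columns))
```

The French column is dropped by the usecols filter, and the index of the resulting frame is a DatetimeIndex rather than plain strings.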
Finally, we filter the rows for the year 2016 and obtain a DataFrame with 8784 entries and 48 columns, where the columns relate to different quantities such as wind capacity and solar capacity.
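One possible way to do this filtering is sketched below on a hypothetical hourly frame with placeholder values. The frame and its name are assumptions for illustration. The count of 8784 rows follows from 2016 being a leap year (366 days of 24 hours).

```python
import pandas as pd

# Hypothetical hourly frame spanning 2015 to 2017 with placeholder values.
index = pd.date_range("2015-01-01 00:00", "2017-12-31 23:00", freq="h")
frame = pd.DataFrame({"DE_wind_generation_actual": 0.0}, index=index)

# Keep only the rows whose timestamp falls in 2016.
frame_2016 = frame.loc[frame.index.year == 2016]

print(len(frame_2016))  # 366 days * 24 hours
```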
As the name and goal of the project make clear, we are concerned with only two columns, the wind generation and the solar generation:
'DE_solar_generation_actual',
'DE_wind_generation_actual',
Fortunately, there are no missing values in these two columns, so we can plot the wind generation and the solar generation directly.
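A check for missing values could look like the following sketch. The small frame is hypothetical and contains one gap inserted on purpose, so that the count is visibly non-zero for one column.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with one gap inserted on purpose.
demo = pd.DataFrame({
    "DE_solar_generation_actual": [0.0, 120.0, np.nan],
    "DE_wind_generation_actual": [8000.0, 8100.0, 7900.0],
})

# Number of missing entries in each column.
missing_per_column = demo.isna().sum()
print(missing_per_column)
```

On the real data, a count of zero for both columns would confirm that plotting can proceed without filling or dropping rows.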
From the plot of the wind generation in Germany above, no clear pattern can be recognized, so this plot alone tells us little about how the wind generation behaves.
Below, a second plot shows the solar generation in Germany in MW.
From this plot, one main observation is that the solar generation is much greater in the middle of the year.
Weather data
weather = pd.read_csv("data/weather_data_GER_2016.csv",
                      parse_dates=[0], index_col=0)
If we call the info method of the weather DataFrame, we obtain:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2248704 entries, 2016-01-01 00:00:00 to 2016-12-31 23:00:00
Data columns (total 14 columns):
cumulated hours    int64
lat                float64
lon                float64
v1                 float64
v2                 float64
v_50m              float64
h1                 int64
h2                 int64
z0                 float64
SWTDN              float64
SWGDN              float64
T                  float64
rho                float64
p                  float64
dtypes: float64(11), int64(3)
memory usage: 257.3 MB
Apart from cumulated hours, lat and lon, the columns are as follows:
Parameters of wind:
v1: velocity [m/s] at height h1 (2 meters above displacement height);
v2: velocity [m/s] at height h2 (10 meters above displacement height);
v_50m: velocity [m/s] at 50 meters above ground;
h1: height above ground [m] (h1 = displacement height +2m);
h2: height above ground [m] (h2 = displacement height +10m);
z0: roughness length [m];
Solar parameters:
SWTDN: total top-of-the-atmosphere horizontal radiation [W/m²];
SWGDN: total ground horizontal radiation [W/m²];
Temperature data:
T: Temperature [K] at 2 meters above displacement height (see h1);
Air data:
rho: air density [kg/m³] at surface;
p: air pressure [Pa] at surface.
One main limitation of the analysis is that we do not know the locations of the wind turbines and solar panels in Germany. We therefore simply group the weather data by hour and take the average over each group.
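The hourly averaging can be sketched on a miniature, made-up version of the weather table. Note that 2248704 rows divided by the 8784 hours of 2016 gives 256 rows per hour, which suggests (as an assumption here) 256 grid points per timestamp. The example below uses only 2 grid points and 2 hours, and all values are illustrative.

```python
import pandas as pd

# Hypothetical miniature of the weather table: 2 hours x 2 grid points.
times = pd.to_datetime(["2016-01-01 00:00"] * 2 + ["2016-01-01 01:00"] * 2)
weather_demo = pd.DataFrame(
    {
        "lat": [47.5, 54.0, 47.5, 54.0],
        "lon": [6.0, 14.0, 6.0, 14.0],
        "v1": [2.0, 4.0, 3.0, 5.0],
        "SWGDN": [0.0, 0.0, 10.0, 30.0],
    },
    index=times,
)

# One row per timestamp: the mean over all grid points at that hour.
weather_by_hour = weather_demo.groupby(weather_demo.index).mean()

print(weather_by_hour[["v1", "SWGDN"]])
```

After this step the weather table has one row per hour, the same resolution as the generation data.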
First, we look at plots of the average weather behavior in Germany in 2016.
In the plot of the wind velocity above, as with the wind generation, the wind velocity does not follow a specific pattern over the year, although it was larger in February, November and December.
In the summer months (June, July and August), the horizontal radiation at ground level is larger than in the other months, as expected.
The plots above for the wind and solar generation and for the weather quantities already suggest some correlation between the generation and the weather.
Further insight can be obtained from the plots below, which show the wind and solar generation as a function of some of the weather parameters.
As mentioned earlier, there must be a relation between the generation and some of the weather quantities. The plots above make clear that the wind generation is correlated with the velocities v1, v2 and v_50m.
The meaning of v1, v2 and v_50m is described in detail in the column list above.
The plots above also show an approximately linear relation between the solar generation and the top-of-the-atmosphere and ground horizontal radiation (SWTDN and SWGDN).
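A visual impression of a linear relation can be backed by a number such as the Pearson correlation coefficient. The sketch below uses synthetic stand-in data, not the real measurements: the generation is built as a multiple of the radiation plus noise, and the factor 30 and the noise level are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: generation = factor * radiation + noise.
radiation = rng.uniform(0.0, 800.0, size=500)
generation = 30.0 * radiation + rng.normal(0.0, 1000.0, size=500)
demo = pd.DataFrame(
    {"SWGDN": radiation, "DE_solar_generation_actual": generation}
)

# Pearson correlation coefficient between the two columns.
corr = demo["SWGDN"].corr(demo["DE_solar_generation_actual"])
print(round(corr, 3))
```

A value close to 1 indicates a strong positive linear relation, which is the situation in which linear regression is a reasonable first model.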
Based on these observations, we now use linear regression to predict the solar and wind generation from some of the quantities above.
Predicting the wind and solar generation using linear regression
Linear regression is a linear approach to modelling the relationship between a dependent variable and independent variables.
The output of a linear regression algorithm is a linear function of the input.
The equation of linear regression is:
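In a common notation (the symbols are chosen here for illustration and are an assumption, not taken from the source), with n input features x_1, ..., x_n, an intercept w_0 and weights w_1, ..., w_n, the prediction \hat{y} can be written as:

```latex
\hat{y} = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n
```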
The objective is to find the parameters which minimize the mean squared error:
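In a common notation (an assumption made here for illustration), with m samples, observed values y_i and predictions \hat{y}_i, the mean squared error reads:

```latex
\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2
```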
This can be achieved using the LinearRegression class from the scikit-learn library.
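As a minimal sketch of how LinearRegression recovers the parameters, the example below fits synthetic data that follow a known linear function exactly. The data and the coefficients 1.0, 2.0 and -3.0 are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data following y = 1.0 + 2.0*x1 - 3.0*x2 exactly (no noise).
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

# Ordinary least squares fit.
model = LinearRegression().fit(X, y)

print(model.intercept_, model.coef_)
```

Because the data contain no noise, the fitted intercept and coefficients match the values used to generate the data up to floating-point precision.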
Wind generation
To predict the wind generation, we build a feature matrix X_wind with the features v1, v2 and v_50m, and a target y_wind with the actual wind generation.
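The construction of X_wind and y_wind is not shown in the text. One possible way, sketched here under the assumption that the hourly weather averages and the generation data share the same hourly index, is to join the two frames and select the columns. The frame names weather_by_hour, production_wind_solar and combined, and all values, are hypothetical.

```python
import pandas as pd

# Hypothetical hourly frames with placeholder values.
hours = pd.date_range("2016-01-01 00:00", periods=3, freq="h")
weather_by_hour = pd.DataFrame(
    {
        "v1": [2.0, 3.0, 4.0],
        "v2": [3.0, 4.0, 5.0],
        "v_50m": [5.0, 6.0, 7.0],
        "T": [275.0, 276.0, 277.0],
    },
    index=hours,
)
production_wind_solar = pd.DataFrame(
    {"DE_wind_generation_actual": [8000.0, 9000.0, 10000.0]}, index=hours
)

# Align both frames on their shared hourly index before splitting them.
combined = weather_by_hour.join(production_wind_solar, how="inner")

X_wind = combined[["v1", "v2", "v_50m"]]
y_wind = combined["DE_wind_generation_actual"]
```

Joining first keeps features and target aligned row by row, so each set of velocities is paired with the generation of the same hour.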
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

lr = LinearRegression()
# R² score on each of the 5 cross-validation folds
scores_wind = cross_val_score(lr, X_wind, y_wind, cv=5)
print(scores_wind, " average =", np.mean(scores_wind))
Let us go through what happens in this code.
First, we import LinearRegression from sklearn.linear_model, which performs ordinary least squares regression.
After fitting the regression, the next task is to evaluate the performance of the algorithm, and for this purpose we use a procedure named cross-validation (CV). In K-fold cross-validation, the data are divided into K smaller sets (folds); in each of K experiments the model is trained on K-1 folds and the resulting model is validated on the remaining fold, as illustrated in the figure below.
The performance measure provided by the CV is then the average of the performance measure computed in each experiment. In the code above, we use cross_val_score from sklearn.model_selection, with number of folds cv=5.
Example of 5-fold CV. Source: https://www.kaggle.com/dansbecker/cross-validation
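To make the procedure concrete, the following sketch writes the five experiments out as an explicit loop and compares the result with cross_val_score. It uses synthetic data as a placeholder for X_wind and y_wind. For a regressor and an integer cv, scikit-learn uses KFold without shuffling, so both approaches split the data in the same way.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic regression data (placeholder for X_wind / y_wind).
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=100)

manual_scores = []
for train_idx, test_idx in KFold(n_splits=5).split(X):
    # Fit on four folds, then compute R² on the held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))

library_scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(np.allclose(manual_scores, library_scores))
```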
The performance measure that LinearRegression gives by default is the coefficient of determination R² of the prediction. It measures how well the predictions approximate the true values. A value close to 1 means that the regression makes predictions which are close to the true values. It is formally computed using the formula:
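In its standard definition, written here with symbols chosen for illustration (y_i are the true values, \hat{y}_i the predictions and \bar{y} the mean of the true values), the coefficient of determination is:

```latex
R^2 = 1 - \frac{\sum_{i} \left( y_i - \hat{y}_i \right)^2}{\sum_{i} \left( y_i - \bar{y} \right)^2}
```

A model that always predicts \bar{y} obtains R² = 0, and a perfect prediction obtains R² = 1.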
The output of the code above for our case is:
[0.88261401 0.88886305 0.83623262 0.88974363 0.85338174]
average = 0.870167010172279
The first line contains the R² score for each of the five folds, and the second line contains the average of those scores.
Solar generation
Here again, we build a feature matrix X_solar with the features SWTDN, SWGDN and T, and a target y_solar with the actual solar generation, and then apply the same algorithm.
scores_solar = cross_val_score(lr, X_solar, y_solar, cv=5)
print(scores_solar, " average =", np.mean(scores_solar))
The output is:
[0.8901974 0.95027431 0.95982151 0.95090201 0.8715077 ]
average = 0.9245405855731855
As before, the first line contains the R² score for each of the five folds, and the second line contains their average.
We conclude that we did a good job using real-world weather data and very simple algorithms to predict the wind and solar generation. Of course, more sophisticated and better-performing algorithms and analyses are possible, but those are beyond our scope.
Conclusion
Through this code, we predicted wind and solar generation using the machine learning technique of linear regression. The predictions appear to be quite accurate, since the R² values come out close to 1.
We found it quite easy and impressive to predict the results with this simple regression technique, which reaches an average R² of about 0.87 for wind generation and 0.92 for solar generation.
However, more complex models could be used to improve the accuracy even further.