Reporting: wrangle_report

Gathering

I downloaded the WeRateDogs Twitter archive file from my classroom and read it into a pandas DataFrame. I programmatically downloaded the image predictions file using the get method of the requests library and saved it into a second pandas DataFrame. Finally, I queried the Twitter API for the retweet and favorite counts of each tweet in the Twitter archive DataFrame, using the tweet ID as the identifier, and saved the results into a text file, which I read into a third pandas DataFrame.
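
A minimal sketch of the gathering step, assuming the file names shown and that the API results were saved as one JSON object per line (the predictions URL is the one provided in the Udacity classroom):

```python
import pandas as pd
import requests

# Read the WeRateDogs Twitter archive into the first DataFrame.
archives = pd.read_csv('twitter-archive-enhanced.csv')

# Programmatically download the image predictions file, then read it;
# the file is tab-separated.
url = ('https://d17h27t6h515a5.cloudfront.net/topher/2017/August/'
       '599fd2ad_image-predictions/image-predictions.tsv')
with open('image_predictions.tsv', 'wb') as f:
    f.write(requests.get(url).content)
predictions = pd.read_csv('image_predictions.tsv', sep='\t')

# Read the saved API results (JSON lines) into a third DataFrame,
# keeping only the fields this report uses.
api_data = (pd.read_json('tweet_json.txt', lines=True)
              [['id', 'retweet_count', 'favorite_count']]
              .rename(columns={'id': 'tweet_id'}))
```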


Assessing

I assessed all three DataFrames visually and programmatically to spot quality and tidiness issues. Visual assessment was done in Google Sheets; some of the issues discovered include:

  • Invalid ratings in the twitter_archives DataFrame.
  • Retweeted tweets in the twitter_archives DataFrame.
  • Irrelevant columns such as 'source', 'expanded_urls', 'name', 'jpg_url', 'in_reply_to_status_id' and 'in_reply_to_user_id' in the twitter_archives DataFrame.
  • The dog stage recorded as four columns instead of one.

Programmatic assessment was carried out by getting an overview of all three DataFrames and exploring them with the pandas library, checking for basic quality issues such as erroneous datatypes and missing values. As a result, I discovered the following quality and tidiness issues:
  • The tweet_id column is stored as an integer but should be a string.
  • Some tweets don't have image predictions.
  • Some tweet_ids weren't found using the Twitter API.
  • Erroneous datatypes for the timestamp column in the twitter_archives DataFrame and the retweet and favorite count columns in the API DataFrame, among others.
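
A sketch of the kind of programmatic checks involved (the exact calls are illustrative; column names come from the project's data):

```python
# Overview of structure, datatypes and missing values for each DataFrame.
archives.info()
predictions.info()
api_data.info()

# Spot-check specific quality issues.
archives['rating_denominator'].value_counts()    # invalid ratings
archives['retweeted_status_id'].notnull().sum()  # retweets to remove
archives['tweet_id'].duplicated().sum()          # duplicate tweets

# Tweets with no image prediction: archive IDs absent from `predictions`.
(~archives['tweet_id'].isin(predictions['tweet_id'])).sum()
```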

Cleaning

The first step in my data cleaning process was to create copies of the DataFrames, so that the originals would remain intact after all the data wrangling.
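
For example (variable names are illustrative):

```python
# Clean the copies, never the originals.
archives_clean = archives.copy()
predictions_clean = predictions.copy()
api_data_clean = api_data.copy()
```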

After creating copies of all three DataFrames, the next step was to ensure that every variable was a column. To achieve this, I used the pandas 'melt' method to combine the four dog stage columns into a single column named 'dog_stages'. I realised, however, that the original entry method was repetitive: instead of one stage value and three nulls per record, the unused stage columns held the string 'None', which left four candidate values per row after melting. Converting the 'None' strings to NaN values fixed this and made the melt effective.
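
A sketch of that reshaping, assuming the four stage columns are 'doggo', 'floofer', 'pupper' and 'puppo' (it ignores the rare tweets tagged with two stages):

```python
import numpy as np

# Turn the 'None' placeholder strings into real missing values first.
stage_cols = ['doggo', 'floofer', 'pupper', 'puppo']
archives_clean[stage_cols] = archives_clean[stage_cols].replace('None', np.nan)

# Melt the four stage columns into one 'dog_stages' column.
id_cols = [c for c in archives_clean.columns if c not in stage_cols]
archives_clean = archives_clean.melt(
    id_vars=id_cols, value_vars=stage_cols,
    var_name='stage_source', value_name='dog_stages')

# Each tweet now appears four times with at most one non-null stage;
# sorting puts the non-null row first so drop_duplicates keeps it.
archives_clean = (archives_clean.sort_values('dog_stages')
                                .drop_duplicates(subset='tweet_id')
                                .drop(columns='stage_source'))
```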

I then merged the three DataFrames into one master DataFrame with the pandas 'merge' method, joining on the tweet ID as the identifier, since there was no need to analyse them as independent DataFrames.
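
A sketch of the join (the join type is an assumption; a left join keeps every archive tweet so the later filtering steps can decide what to drop):

```python
# Combine all three tables on tweet_id into one master DataFrame.
master = (archives_clean
          .merge(predictions_clean, on='tweet_id', how='left')
          .merge(api_data_clean, on='tweet_id', how='left'))
```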

Next I corrected the column datatypes, using the pandas 'info' method to inspect them. The timestamp column was changed from object to datetime, the p1_dog, p2_dog and p3_dog columns were changed from object to boolean, and the retweet count and favorite count columns were changed from object to integer.
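
Roughly (the nullable 'Int64' dtype is used here because some counts are still missing at this point; those rows are dropped in the next step):

```python
# Inspect the current dtypes, then convert the offending columns.
master.info()

master['timestamp'] = pd.to_datetime(master['timestamp'])
for col in ['p1_dog', 'p2_dog', 'p3_dog']:
    # Note: rows without predictions are dropped later; their NaNs
    # here would otherwise silently coerce to True.
    master[col] = master[col].astype(bool)
for col in ['retweet_count', 'favorite_count']:
    master[col] = master[col].astype('Int64')
```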

I queried the DataFrame to find and drop all retweeted entries, leaving only original tweets to draw conclusions and insights from. I also dropped all the rows with missing values in the retweet and favorite count columns; those values were missing because their tweet IDs weren't linked to any WeRateDogs tweets when I queried the Twitter API.
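
A sketch of that filtering:

```python
# Original tweets have no retweeted_status_id; keep only those rows.
master = master[master['retweeted_status_id'].isnull()]

# Drop tweets the Twitter API returned no counts for.
master = master.dropna(subset=['retweet_count', 'favorite_count'])
```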

The next data wrangling step was to drop the columns of the Twitter archive data that I deemed irrelevant (see the sketch after the list). The following columns were dropped:

  • 'tweet_id'
  • 'in_reply_to_status_id'
  • 'in_reply_to_user_id'
  • 'timestamp'
  • 'source'
  • 'text'
  • 'retweeted_status_id'
  • 'retweeted_status_user_id'
  • 'retweeted_status_timestamp'
  • 'expanded_urls'
  • 'rating_numerator'
  • 'rating_denominator'
  • 'name'
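
A sketch of the drop; 'tweet_id' is left out here because the deduplication described next still needs it:

```python
# Drop the irrelevant columns; errors='ignore' tolerates any that
# were already removed in an earlier step.
cols_to_drop = [
    'in_reply_to_status_id', 'in_reply_to_user_id', 'timestamp',
    'source', 'text', 'retweeted_status_id', 'retweeted_status_user_id',
    'retweeted_status_timestamp', 'expanded_urls', 'rating_numerator',
    'rating_denominator', 'name']
master = master.drop(columns=cols_to_drop, errors='ignore')
```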

I then checked for duplicate values in the tweet ID column and dropped all such duplicates. All the tweets that lacked image predictions were also dropped. Finally, the tweet ID column was converted to the object datatype, because an identifier is not a statistical variable and shouldn't remain an integer.
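
A sketch of these final steps (using the 'p1' column from the image predictions file to detect missing predictions is an assumption):

```python
# Remove duplicate tweets, then tweets with no image prediction.
master = master.drop_duplicates(subset='tweet_id')
master = master.dropna(subset=['p1'])

# tweet_id is an identifier, not a quantity: store it as a string.
master['tweet_id'] = master['tweet_id'].astype(str)
```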


Storing

Once I was satisfied with the master DataFrame, I saved it as a CSV file named 'twitter_archive_master.csv'.
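
That is a one-liner, assuming the index isn't needed:

```python
# Persist the cleaned master DataFrame; the index adds nothing useful.
master.to_csv('twitter_archive_master.csv', index=False)
```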